[PROPOSAL] Effective storage of duplicates in B-tree index.
Hi, hackers!
I'm going to begin work on effective storage of duplicate keys in B-tree
indexes.
The main idea is to implement posting lists and posting trees for B-tree
index pages, as is already done for GIN.
In a nutshell, effective storage of duplicates in GIN is organised as
follows.
The index stores a single index tuple for each unique key. That index tuple
points to a posting list, which contains pointers to heap tuples (TIDs). If
too many rows have the same key, multiple pages are allocated for the
TIDs, and these constitute a so-called posting tree.
You can find wonderful detailed descriptions in the GIN README
<https://github.com/postgres/postgres/blob/master/src/backend/access/gin/README>
and in articles <http://www.cybertec.at/gin-just-an-index-type/>.
It also makes it possible to apply a compression algorithm to the posting
list/tree and significantly decrease index size. Read more in the
presentation (part 1)
<http://www.pgcon.org/2014/schedule/attachments/329_PGCon2014-GIN.pdf>.
Currently, a new B-tree index tuple must be inserted for each table row that we
index.
This can cause a page split. Because of MVCC, even a unique index
can contain duplicates.
Storing duplicates in a posting list/tree helps to avoid superfluous splits.
So it seems to be a very useful improvement. Of course, it requires a lot
of changes in the B-tree implementation, so I need approval from the community.
1. Compatibility.
It's important to preserve compatibility with older index versions.
I'm going to change BTREE_VERSION to 3
and use the new (posting) features for v3, keeping the old implementation for v2.
Any objections?
2. There are several tricks to handle non-unique keys in B-tree.
More info is in the btree README
<https://github.com/postgres/postgres/blob/master/src/backend/access/nbtree/README>
(section "Differences to the Lehman & Yao algorithm").
In the new version they'll become useless. Am I right?
3. Microvacuum.
Killed items are marked LP_DEAD and can be deleted from an individual page
at insertion time.
Currently that works fine, because each item corresponds to a separate TID. But
a posting list implementation requires another approach. I've got two ideas:
The first is to mark LP_DEAD only those tuples in which all TIDs are not visible.
The second is to add an LP_DEAD flag to each TID in the posting list (tree). This
way requires a bit more space, but allows microvacuum of the posting
list/tree.
Which one is better?
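A minimal sketch of the two microvacuum ideas above, not taken from any patch. SketchTid, posting_all_dead and posting_prune_dead are hypothetical names, and the per-TID dead flag is an assumption standing in for whatever visibility hint the index would keep. The first function is the check that idea 1 needs (the whole tuple may be marked LP_DEAD only when every TID is dead); the second shows what idea 2 makes possible (dropping individual dead TIDs from a posting list).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a heap TID plus a per-TID "dead" hint. */
typedef struct SketchTid
{
    uint32_t block;
    uint16_t offset;
    bool     dead;      /* assumption: one hint flag per TID */
} SketchTid;

/*
 * Idea 1: the posting tuple as a whole may be marked dead only if
 * every TID in it is dead. An empty list is reported as not dead.
 */
static bool
posting_all_dead(const SketchTid *tids, size_t n)
{
    assert(tids != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
    {
        if (!tids[i].dead)
            return false;
    }
    return n > 0;
}

/*
 * Idea 2: compact the list in place, keeping only live TIDs in their
 * original order. Returns the number of TIDs that remain.
 */
static size_t
posting_prune_dead(SketchTid *tids, size_t n)
{
    size_t kept = 0;

    assert(tids != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
    {
        if (!tids[i].dead)
            tids[kept++] = tids[i];
    }
    return kept;
}
```

In this model, idea 1 leaves a posting list with a single live TID untouched, while idea 2 can shrink it at the cost of one flag per TID.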
--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Hi,
On 08/31/2015 09:41 AM, Anastasia Lubennikova wrote:
Hi, hackers!
I'm going to begin work on effective storage of duplicate keys in B-tree
index.
The main idea is to implement posting lists and posting trees for B-tree
index pages as it's already done for GIN.
In a nutshell, effective storing of duplicates in GIN is organised as
follows.
Index stores single index tuple for each unique key. That index tuple
points to posting list which contains pointers to heap tuples (TIDs). If
too many rows having the same key, multiple pages are allocated for the
TIDs and these constitute so called posting tree.
You can find wonderful detailed descriptions in gin readme
<https://github.com/postgres/postgres/blob/master/src/backend/access/gin/README>
and articles <http://www.cybertec.at/gin-just-an-index-type/>.
It also makes possible to apply compression algorithm to posting
list/tree and significantly decrease index size. Read more in
presentation (part 1)
<http://www.pgcon.org/2014/schedule/attachments/329_PGCon2014-GIN.pdf>.
Now new B-tree index tuple must be inserted for each table row that we
index.
It can possibly cause page split. Because of MVCC even unique index
could contain duplicates.
Storing duplicates in posting list/tree helps to avoid superfluous splits.
So it seems to be very useful improvement. Of course it requires a lot
of changes in B-tree implementation, so I need approval from community.
In general, index size is often a serious issue - cases where indexes
need more space than tables are not uncommon in my experience. So
I think the efforts to lower space requirements for indexes are good.
But if we introduce posting lists into btree indexes, how different are
they from GIN? It seems to me that if I create a GIN index (using
btree_gin), I do get mostly the same thing you propose, no?
Sure, there are differences - GIN indexes don't handle UNIQUE indexes,
but the compression can only be effective when there are duplicate rows.
So either the index is not UNIQUE (so the b-tree feature is not needed),
or there are many updates.
Which brings me to the other benefit of btree indexes - they are
designed for high concurrency. How much is this going to be affected by
introducing the posting lists?
kind regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Hi, Tomas!
On Mon, Aug 31, 2015 at 6:26 PM, Tomas Vondra <tomas.vondra@2ndquadrant.com>
wrote:
On 08/31/2015 09:41 AM, Anastasia Lubennikova wrote:
I'm going to begin work on effective storage of duplicate keys in B-tree
index.
The main idea is to implement posting lists and posting trees for B-tree
index pages as it's already done for GIN.
In a nutshell, effective storing of duplicates in GIN is organised as
follows.
Index stores single index tuple for each unique key. That index tuple
points to posting list which contains pointers to heap tuples (TIDs). If
too many rows having the same key, multiple pages are allocated for the
TIDs and these constitute so called posting tree.
You can find wonderful detailed descriptions in gin readme
<https://github.com/postgres/postgres/blob/master/src/backend/access/gin/README>
and articles <http://www.cybertec.at/gin-just-an-index-type/>.
It also makes possible to apply compression algorithm to posting
list/tree and significantly decrease index size. Read more in
presentation (part 1)
<http://www.pgcon.org/2014/schedule/attachments/329_PGCon2014-GIN.pdf>.
Now new B-tree index tuple must be inserted for each table row that we
index.
It can possibly cause page split. Because of MVCC even unique index
could contain duplicates.
Storing duplicates in posting list/tree helps to avoid superfluous splits.
So it seems to be very useful improvement. Of course it requires a lot
of changes in B-tree implementation, so I need approval from community.
In general, index size is often a serious issue - cases where indexes need
more space than tables are not quite uncommon in my experience. So I think
the efforts to lower space requirements for indexes are good.
But if we introduce posting lists into btree indexes, how different are
they from GIN? It seems to me that if I create a GIN index (using
btree_gin), I do get mostly the same thing you propose, no?
Yes. In general, GIN is a btree with effective duplicate handling plus support
for splitting single datums into multiple keys.
This proposal is mostly about porting duplicate handling from GIN to btree.
Sure, there are differences - GIN indexes don't handle UNIQUE indexes,
The difference between btree_gin and btree is not only the UNIQUE feature.
1) There is no gingettuple in GIN. GIN supports only bitmap scans, and it's
not feasible to add gingettuple to GIN, at least not with the same semantics as
in btree.
2) GIN doesn't support multicolumn indexes in the way btree does.
Multicolumn GIN is more like a set of separate single-column GINs: it doesn't
have composite keys.
3) btree_gin can't effectively handle range searches. "a < x < b" would be
handled as "a < x" intersect "x < b". That is extremely inefficient. It is
possible to fix. However, there is no clear proposal yet for how to fit this case
into the GIN interface.
but the compression can only be effective when there are duplicate rows.
So either the index is not UNIQUE (so the b-tree feature is not needed), or
there are many updates.
From my observations, users can use btree_gin only in some cases. They like
the compression, but mostly can't use btree_gin because of #1.
Which brings me to the other benefit of btree indexes - they are designed
for high concurrency. How much is this going to be affected by introducing
the posting lists?
I'd note that the current duplicate handling in PostgreSQL is a hack over
the original btree. It is designed that way in PostgreSQL's btree access method, not
in btree in general.
Posting lists shouldn't change concurrency much. Currently, in btree you
have to lock one page exclusively when you're inserting a new value.
When the posting list is small and fits on one page, you have to do a similar thing:
take an exclusive lock on one page to insert a new value.
When you have a posting tree, you have to take an exclusive lock on one page of
the posting tree.
One can say that concurrency would become worse because the index would become
smaller and the number of pages would become smaller too. Since the number of pages
would be smaller, backends are more likely to contend for the same page. But
this argument can be used against any compression and for any bloat.
------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On 09/01/2015 11:31 AM, Alexander Korotkov wrote:
...
Yes, In general GIN is a btree with effective duplicates handling +
support of splitting single datums into multiple keys.
This proposal is mostly porting duplicates handling from GIN to btree.
Sure, there are differences - GIN indexes don't handle UNIQUE indexes,
The difference between btree_gin and btree is not only UNIQUE feature.
1) There is no gingettuple in GIN. GIN supports only bitmap scans. And
it's not feasible to add gingettuple to GIN. At least with same
semantics as it is in btree.
2) GIN doesn't support multicolumn indexes in the way btree does.
Multicolumn GIN is more like set of separate singlecolumn GINs: it
doesn't have composite keys.
3) btree_gin can't effectively handle range searches. "a < x < b" would
be handled as "a < x" intersect "x < b". That is extremely inefficient.
It is possible to fix. However, there is no clear proposal how to fit
this case into GIN interface, yet.
but the compression can only be effective when there are duplicate
rows. So either the index is not UNIQUE (so the b-tree feature is
not needed), or there are many updates.
From my observations users can use btree_gin only in some cases. They
like compression, but can't use btree_gin mostly because of #1.
Thanks for the explanation! I'm not that familiar with GIN internals,
but this mostly matches my understanding. I have only mentioned UNIQUE
because the lack of gettuple() method seems obvious - and it works fine
when GIN indexes are used as "bitmap indexes".
But you're right - we can't do index only scans on GIN indexes, which is
a huge benefit of btree indexes.
Which brings me to the other benefit of btree indexes - they are
designed for high concurrency. How much is this going to be affected
by introducing the posting lists?
I'd notice that current duplicates handling in PostgreSQL is hack over
original btree. It is designed so in btree access method in PostgreSQL,
not btree in general.
Posting lists shouldn't change concurrency much. Currently, in btree you
have to lock one page exclusively when you're inserting new value.
When posting list is small and fits one page you have to do similar
thing: exclusive lock of one page to insert new value.
When you have posting tree, you have to do exclusive lock on one page of
posting tree.
OK.
One can say that concurrency would become worse because index would
become smaller and number of pages would become smaller too. Since
number of pages would be smaller, backends are more likely to contend for
the same page. But this argument can be used against any compression and
for any bloat.
Which might be a problem for some use cases, but I assume we could add
an option disabling this per-index. Probably having it "off" by default,
and only enabling the compression explicitly.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Mon, Aug 31, 2015 at 12:41 AM, Anastasia Lubennikova
<a.lubennikova@postgrespro.ru> wrote:
Now new B-tree index tuple must be inserted for each table row that we
index.
It can possibly cause page split. Because of MVCC even unique index could
contain duplicates.
Storing duplicates in posting list/tree helps to avoid superfluous splits.
I'm glad someone is thinking about this, because it is certainly
needed. I thought about working on it myself, but there is always
something else to do. I should be able to assist with review, though.
So it seems to be very useful improvement. Of course it requires a lot of
changes in B-tree implementation, so I need approval from community.
1. Compatibility.
It's important to save compatibility with older index versions.
I'm going to change BTREE_VERSION to 3.
And use new (posting) features for v3, saving old implementation for v2.
Any objections?
It might be better to just have a flag bit for pages that are
compressed -- there are IIRC 8 free bits in the B-Tree page special
area flags variable. But no real opinion on this from me, yet. You
have plenty of bitspace to work with to mark B-Tree pages, in any
case.
2. There are several tricks to handle non-unique keys in B-tree.
More info in btree readme (chapter - Differences to the Lehman & Yao
algorithm).
In the new version they'll become useless. Am I right?
I think that the L&Y algorithm makes assumptions for the sake of
simplicity, rather than because they really believed that there were
real problems. For example, they say that deletion can occur offline
or something along those lines, even though that's clearly
impractical. They say that because they didn't want to write a paper
about deletion within B-Trees, I suppose.
See also my opinion of how they claim to not need read locks [1].
Also, note that despite the fact that the GIN README mentions "Lehman
& Yao style right links", it doesn't actually do the L&Y trick of
avoiding lock coupling -- the whole point of L&Y -- so that remark is
misleading. This must be why B-Tree has much better concurrency than
GIN in practice.
Anyway, the way that I always imagined this would work is a layer
"below" the current implementation. In other words, you could easily
have prefix compression with a prefix that could end at a point within
a reference IndexTuple. It could be any arbitrary point in the second
or subsequent attribute, and would not "care" about the structure of
the IndexTuple when it comes to where attributes begin and end, etc
(although, in reality, it probably would end up caring, because of the
complexity -- not caring is the ideal only, at least to me). As
Alexander pointed out, GIN does not care about composite keys.
That seems quite different to a GIN posting list (something that I
know way less about, FYI). So I'm really talking about a slightly
different thing -- prefix compression, rather than handling
duplicates. Whether or not you should do prefix compression instead of
deduplication is certainly not clear to me, but it should be
considered. Also, I always imagined that prefix compression would use
the highkey as the thing that is offset for each "real" IndexTuple,
because it's there anyway, and that's simple. However, I suppose that
that means that duplicate handling can't really work in a way that
makes duplicates have a fixed cost, which may be a particularly
important property to you.
3. Microvacuum.
Killed items are marked LP_DEAD and could be deleted from separate page at
time of insertion.
Now it's fine, because each item corresponds with separate TID. But posting
list implementation requires another way. I've got two ideas:
First is to mark LP_DEAD only those tuples where all TIDs are not visible.
Second is to add LP_DEAD flag to each TID in posting list(tree). This way
requires a bit more space, but allows to do microvacuum of posting
list/tree.
No real opinion on this point, except that I agree that doing
something is necessary.
Couple of further thoughts on this general topic:
* Currently, B-Tree must be able to store at least 3 items on each
page, for the benefit of the L&Y algorithm. You need room for 1
"highkey", plus 2 downlink IndexTuples. Obviously an internal B-Tree
page is redundant if you cannot get to any child page based on the
scanKey value differing one way or the other (so 2 downlinks are a
sensible minimum), plus a highkey is usually needed (just not on the
rightmost page). As you probably know, we enforce this by making sure
every IndexTuple is no more than 1/3 of the size that will fit.
You should start thinking about how to deal with this in a world where
the physical size could actually be quite variable. The solution is
probably to simply pretend that every IndexTuple is its original size.
This applies to both prefix compression and duplicate suppression, I
suppose.
* Since everything is aligned within B-Tree, it's probably worth
considering the alignment boundaries when doing prefix compression, if
you want to go that way. We can probably imagine a world where
alignment is not required for B-Tree, which would work on x86
machines, but I can't see it happening soon. It isn't worth
compressing unless it compresses enough to cross an "alignment
boundary", where we're not actually obliged to store as much data on
disk. This point may be obvious, not sure.
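The alignment point can be made concrete with a little arithmetic. The sketch below assumes 8-byte maximum alignment (typical of 64-bit builds); SK_MAXALIGN and sketch_aligned_saving are illustrative names, not PostgreSQL macros. It reports how many on-disk bytes are actually saved when a tuple shrinks, which is zero unless the new size falls below an alignment boundary.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: 8-byte maximum alignment, as on common 64-bit platforms. */
#define SK_MAXALIGN(x) (((size_t) (x) + 7) & ~(size_t) 7)

/*
 * On-disk bytes saved when a tuple shrinks from "orig" to "compressed"
 * bytes, given that both sizes are rounded up to the alignment boundary.
 */
static size_t
sketch_aligned_saving(size_t orig, size_t compressed)
{
    assert(compressed <= orig);
    return SK_MAXALIGN(orig) - SK_MAXALIGN(compressed);
}
```

Under these assumptions, shrinking a 20-byte tuple to 17 bytes saves nothing, because both round up to 24, while shrinking it to 16 bytes saves a full 8 bytes.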
[1]: /messages/by-id/CAM3SWZT-T9o_dchK8E4_YbKQ+LPJTpd89E6dtPwhXnBV_5NE3Q@mail.gmail.com
--
Peter Geoghegan
01.09.2015 21:23, Peter Geoghegan:
On Mon, Aug 31, 2015 at 12:41 AM, Anastasia Lubennikova
<a.lubennikova@postgrespro.ru> wrote:
Now new B-tree index tuple must be inserted for each table row that we
index.
It can possibly cause page split. Because of MVCC even unique index could
contain duplicates.
Storing duplicates in posting list/tree helps to avoid superfluous splits.
I'm glad someone is thinking about this, because it is certainly
needed. I thought about working on it myself, but there is always
something else to do. I should be able to assist with review, though.
Thank you)
So it seems to be very useful improvement. Of course it requires a lot of
changes in B-tree implementation, so I need approval from community.
1. Compatibility.
It's important to save compatibility with older index versions.
I'm going to change BTREE_VERSION to 3.
And use new (posting) features for v3, saving old implementation for v2.
Any objections?
It might be better to just have a flag bit for pages that are
compressed -- there are IIRC 8 free bits in the B-Tree page special
area flags variable. But no real opinion on this from me, yet. You
have plenty of bitspace to work with to mark B-Tree pages, in any
case.
Hmm. If we are talking about storing duplicates in posting lists (and
trees) as in GIN, I don't see how to apply it to some pages
while not applying it to others. See the notes below.
2. There are several tricks to handle non-unique keys in B-tree.
More info in btree readme (chapter - Differences to the Lehman & Yao
algorithm).
In the new version they'll become useless. Am I right?
I think that the L&Y algorithm makes assumptions for the sake of
simplicity, rather than because they really believed that there were
real problems. For example, they say that deletion can occur offline
or something along those lines, even though that's clearly
impractical. They say that because they didn't want to write a paper
about deletion within B-Trees, I suppose.
See also, my opinion of how they claim to not need read locks [1].
Also, note that despite the fact that the GIN README mentions "Lehman
& Yao style right links", it doesn't actually do the L&Y trick of
avoiding lock coupling -- the whole point of L&Y -- so that remark is
misleading. This must be why B-Tree has much better concurrency than
GIN in practice.
Yes, thanks for the extensive explanation.
I mean tricks such as moving right in _bt_findinsertloc(), for example:
/*----------
 * If we will need to split the page to put the item on this page,
 * check whether we can put the tuple somewhere to the right,
 * instead. Keep scanning right until we
 * (a) find a page with enough free space,
 * (b) reach the last page where the tuple can legally go, or
 * (c) get tired of searching.
 * (c) is not flippant; it is important because if there are many
 * pages' worth of equal keys, it's better to split one of the early
 * pages than to scan all the way to the end of the run of equal keys
 * on every insert. We implement "get tired" as a random choice,
 * since stopping after scanning a fixed number of pages wouldn't work
 * well (we'd never reach the right-hand side of previously split
 * pages). Currently the probability of moving right is set at 0.99,
 * which may seem too high to change the behavior much, but it does an
 * excellent job of preventing O(N^2) behavior with many equal keys.
 *----------
 */
If there are no multiple tuples with the same key, we don't need to care
about this at all. It would be possible to skip these steps in the "effective
B-tree implementation". That's why I want to change btree_version.
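The "move right" rule described in that source comment can be sketched as a small loop. This only illustrates the three stop conditions and is not the code of _bt_findinsertloc(): SketchPage, choose_insert_page and the caller-supplied random source are hypothetical, and pages are modelled as an array rather than as buffers linked by right-links.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical summary of one leaf page in a run of equal keys. */
typedef struct SketchPage
{
    size_t free_bytes;  /* free space on the page */
    bool   last_legal;  /* true if the tuple may not go further right */
} SketchPage;

/* Caller-supplied random source returning a value in [0, 1). */
typedef double (*SketchRand) (void *state);

/* Probability of moving right, taken from the quoted comment. */
#define SKETCH_MOVE_RIGHT_PROB 0.99

/* Deterministic source for experiments: always returns *(double *) state. */
static double
sketch_rand_fixed(void *state)
{
    return *(const double *) state;
}

/*
 * Starting at pages[start], return the index of the page to insert on
 * (a split is still needed there if the tuple does not fit).
 */
static size_t
choose_insert_page(const SketchPage *pages, size_t npages, size_t start,
                   size_t need, SketchRand rnd, void *state)
{
    size_t cur = start;

    assert(pages != NULL && start < npages);
    while (pages[cur].free_bytes < need)    /* (a) stop when it fits */
    {
        /* (b) reached the last page where the tuple can legally go */
        if (pages[cur].last_legal || cur + 1 >= npages)
            break;
        /* (c) "get tired" with probability 1 - 0.99 at each step */
        if (rnd(state) >= SKETCH_MOVE_RIGHT_PROB)
            break;
        cur++;
    }
    return cur;
}
```

Passing the random source as a parameter keeps the sketch testable; with posting lists, a run of equal keys would usually not span many pages, which is the reason given above for skipping this step.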
So I'm really talking about a slightly
different thing -- prefix compression, rather than handling
duplicates. Whether or not you should do prefix compression instead of
deduplication is certainly not clear to me, but it should be
considered. Also, I always imagined that prefix compression would use
the highkey as the thing that is offset for each "real" IndexTuple,
because it's there anyway, and that's simple. However, I suppose that
that means that duplicate handling can't really work in a way that
makes duplicates have a fixed cost, which may be a particularly
important property to you.
You're right, these are two different techniques.
1. Effective storage of duplicates, which I propose, works with equal
keys and allows us to eliminate the repeats.
Index tuples are currently stored like this:
IndexTupleData + Attrs (key) | IndexTupleData + Attrs (key) |
IndexTupleData + Attrs (key)
If all Attrs are equal, it seems reasonable not to repeat them. So we
can store them in the following structure:
MetaData + Attrs (key) | IndexTupleData | IndexTupleData | IndexTupleData
This is a posting list. It doesn't require significant changes to the index
page layout, because we can use an ordinary IndexTupleData for the meta
information. Each IndexTupleData has a fixed size, so it's easy to handle
the posting list as an array.
2. Prefix compression handles different keys and somehow compresses them.
I think that it will require non-trivial changes to the btree index tuple
representation. Furthermore, any compression leads to extra
computation. For now, I don't have a clear idea of how to implement this
technique.
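The posting list layout from point 1 can be sketched as follows. This is a simplified model under stated assumptions, not the patch's on-page format: keys are plain 32-bit integers, and SketchTid, SketchEntry, SketchPosting and build_postings are hypothetical names. Given entries already sorted by key, it stores each key once and records which slice of a fixed-size TID array belongs to it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a heap tuple pointer (block, offset). */
typedef struct SketchTid
{
    uint32_t block;
    uint16_t offset;
} SketchTid;

/* One ordinary index entry: the key is repeated for every row. */
typedef struct SketchEntry
{
    int32_t   key;
    SketchTid tid;
} SketchEntry;

/* One posting list: the key stored once, plus a slice of the TID array. */
typedef struct SketchPosting
{
    int32_t key;
    size_t  first;  /* index of the first TID in the TID array */
    size_t  ntids;  /* number of TIDs that share this key */
} SketchPosting;

/*
 * Collapse n entries (sorted by key) into posting lists.
 * tids_out and out must each have room for n elements.
 * Returns the number of posting lists written to out.
 */
static size_t
build_postings(const SketchEntry *entries, size_t n,
               SketchTid *tids_out, SketchPosting *out)
{
    size_t nposting = 0;

    for (size_t i = 0; i < n; i++)
    {
        /* The input must be sorted, as index tuples on a page are. */
        assert(i == 0 || entries[i - 1].key <= entries[i].key);

        if (nposting == 0 || out[nposting - 1].key != entries[i].key)
        {
            out[nposting].key = entries[i].key;
            out[nposting].first = i;
            out[nposting].ntids = 0;
            nposting++;
        }
        tids_out[i] = entries[i].tid;
        out[nposting - 1].ntids++;
    }
    return nposting;
}
```

Because every TID has the same size, the TIDs of one key form a contiguous array slice, which is the property the text relies on when it says the posting list is easy to handle as an array.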
* Currently, B-Tree must be able to store at least 3 items on each
page, for the benefit of the L&Y algorithm. You need room for 1
"highkey", plus 2 downlink IndexTuples. Obviously an internal B-Tree
page is redundant if you cannot get to any child page based on the
scanKey value differing one way or the other (so 2 downlinks are a
sensible minimum), plus a highkey is usually needed (just not on the
rightmost page). As you probably know, we enforce this by making sure
every IndexTuple is no more than 1/3 of the size that will fit.
That is the point at which a posting list that has grown too big is transformed
into a posting tree. But I think that in the first patch I'll do it another way:
just by splitting a long posting list into 2 lists of appropriate length.
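A rough sketch of how the 1/3 limit and the proposed split could interact. All constants are assumptions for a default build (8192-byte pages, 24-byte page header, 4-byte line pointers, 16-byte B-tree special area, 8-byte alignment), and the posting layout (8-byte tuple header, key, then 6-byte TIDs) is a guess for illustration only. The function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed sizes for a default build; none of these come from the patch. */
enum
{
    SK_PAGE_HEADER = 24,    /* page header without line pointers */
    SK_LINE_POINTER = 4,    /* one line pointer */
    SK_BT_SPECIAL = 16,     /* B-tree special area */
    SK_TUPLE_HEADER = 8,    /* index tuple header */
    SK_TID = 6              /* one heap TID */
};

#define SK_MAXALIGN(x) (((size_t) (x) + 7) & ~(size_t) 7)

/* Largest item such that three items (highkey + 2 others) fit on a page. */
static size_t
sketch_max_item_size(size_t page_size)
{
    size_t usable = page_size
        - SK_MAXALIGN(SK_PAGE_HEADER + 3 * SK_LINE_POINTER)
        - SK_MAXALIGN(SK_BT_SPECIAL);

    return (usable / 3) & ~(size_t) 7;  /* round down to alignment */
}

/* Aligned size of a posting tuple holding ntids TIDs for one key. */
static size_t
sketch_posting_size(size_t key_bytes, size_t ntids)
{
    return SK_MAXALIGN(SK_TUPLE_HEADER + key_bytes + ntids * SK_TID);
}

/*
 * Returns how many TIDs go into the left list if the posting tuple must
 * be split in two, or 0 if it still fits within the 1/3 limit.
 */
static size_t
sketch_split_posting(size_t key_bytes, size_t ntids, size_t page_size)
{
    assert(ntids > 0);
    if (sketch_posting_size(key_bytes, ntids) <= sketch_max_item_size(page_size))
        return 0;
    return ntids / 2;
}
```

With these assumed constants the limit comes out at 2712 bytes, so a posting list for a 4-byte key would hold up to 450 TIDs before the halving kicks in.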
* Since everything is aligned within B-Tree, it's probably worth
considering the alignment boundaries when doing prefix compression, if
you want to go that way. We can probably imagine a world where
alignment is not required for B-Tree, which would work on x86
machines, but I can't see it happening soon. It isn't worth
compressing unless it compresses enough to cross an "alignment
boundary", where we're not actually obliged to store as much data on
disk. This point may be obvious, not sure.
That is another reason why I have doubts about prefix compression, whereas
effective duplicate storage doesn't have this problem.
--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On Thu, Sep 3, 2015 at 8:35 AM, Anastasia Lubennikova
<a.lubennikova@postgrespro.ru> wrote:
* Since everything is aligned within B-Tree, it's probably worth
considering the alignment boundaries when doing prefix compression, if
you want to go that way. We can probably imagine a world where
alignment is not required for B-Tree, which would work on x86
machines, but I can't see it happening soon. It isn't worth
compressing unless it compresses enough to cross an "alignment
boundary", where we're not actually obliged to store as much data on
disk. This point may be obvious, not sure.
That is another reason, why I doubt prefix compression, whereas effective
duplicate storage hasn't this problem.
Okay. That sounds reasonable. I think duplicate handling is a good project.
A good learning tool for Postgres B-Trees -- or at least one of the
better ones -- is my amcheck tool. See:
https://github.com/petergeoghegan/postgres/tree/amcheck
This is a tool for verifying B-Tree invariants hold, which is loosely
based on pageinspect. It checks that certain conditions hold for
B-Trees. A simple example is that all items on each page be in the
correct, logical order. Some invariants checked are far more
complicated, though, and span multiple pages or multiple levels. See
the source code for exact details. This tool works well when running
the regression tests (see stress.sql -- I used it with pgbench), with
no problems reported last I checked. It often only needs light locks
on relations, and single shared locks on buffers. (Buffers are copied
to local memory for the tool to operate on, much like
contrib/pageinspect).
While I have yet to formally submit amcheck to a CF (I once asked for
input on the goals for the project on -hackers), the comments are
fairly comprehensive, and it wouldn't be too hard to adopt this to
guide your work on duplicate handling. Maybe it'll happen for 9.6.
Feedback appreciated.
The tool calls _bt_compare() for many things currently, but doesn't
care about many lower level details, which is (very roughly speaking)
the level that duplicate handling will work at. You aren't actually
proposing to change anything about the fundamental structure that
B-Tree indexes have, so the tool could be quite useful and low-effort
for debugging your code during development.
Debugging this stuff is sometimes like keyhole surgery. If you could
just see at/get to the structure that you care about, it would be 10
times easier. Hopefully this tool makes it easier to identify problems.
--
Peter Geoghegan
On Sun, Sep 27, 2015 at 4:11 PM, Peter Geoghegan <pg@heroku.com> wrote:
Debugging this stuff is sometimes like keyhole surgery. If you could
just see at/get to the structure that you care about, it would be 10
times easier. Hopefully this tool makes it easier to identify problems.
I should add that the way that the L&Y technique works, and the way
that Postgres code is generally very robust/defensive can make direct
testing a difficult thing. I have seen cases where a completely messed
up B-Tree still gave correct results most of the time, and was just
slower. That can happen, for example, because the "move right" thing
results in a degenerate linear scan of the entire index. The
comparisons in the internal pages were totally messed up, but it
"didn't matter" once a scan could get to leaf pages and could move
right and find the value that way.
I wrote amcheck because I thought it was scary how B-Tree indexes
could be *completely* messed up without it being obvious; what hope is
there of a test finding a subtle problem in their structure, then?
Testing the invariants directly seemed like the only way to have a
chance of not introducing bugs when adding new stuff to the B-Tree
code. I believe that adding optimizations to the B-Tree code will be
important in the next couple of years, and there is no other way to
approach it IMV.
--
Peter Geoghegan
31.08.2015 10:41, Anastasia Lubennikova:
Hi, hackers!
I'm going to begin work on effective storage of duplicate keys in
B-tree index.
The main idea is to implement posting lists and posting trees for
B-tree index pages as it's already done for GIN.
In a nutshell, effective storing of duplicates in GIN is organised as
follows.
Index stores single index tuple for each unique key. That index tuple
points to posting list which contains pointers to heap tuples (TIDs).
If too many rows having the same key, multiple pages are allocated for
the TIDs and these constitute so called posting tree.
You can find wonderful detailed descriptions in gin readme
<https://github.com/postgres/postgres/blob/master/src/backend/access/gin/README>
and articles <http://www.cybertec.at/gin-just-an-index-type/>.
It also makes possible to apply compression algorithm to posting
list/tree and significantly decrease index size. Read more in
presentation (part 1)
<http://www.pgcon.org/2014/schedule/attachments/329_PGCon2014-GIN.pdf>.
Now new B-tree index tuple must be inserted for each table row that we
index.
It can possibly cause page split. Because of MVCC even unique index
could contain duplicates.
Storing duplicates in posting list/tree helps to avoid superfluous splits.
I'd like to share the progress of my work. So here is a WIP patch.
It provides effective duplicate handling using posting lists, the same
way GIN does.
The layout of the tuples on the page is changed in the following way:
before:
TID (ip_blkid, ip_posid) + key, TID (ip_blkid, ip_posid) + key, TID
(ip_blkid, ip_posid) + key
with patch:
TID (N item pointers, posting list offset) + key, TID (ip_blkid,
ip_posid), TID (ip_blkid, ip_posid), TID (ip_blkid, ip_posid)
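To get a feel for why this layout is smaller, here is a back-of-the-envelope size model. It is not meant to reproduce the measurements below. It assumes an 8-byte index tuple header, 6-byte TIDs, a 4-byte line pointer per item and 8-byte alignment, and it ignores page headers, fill factor and the 1/3 item size limit. The function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed sizes for a default build; the patch may differ. */
enum
{
    SK_TUPLE_HEADER = 8,    /* index tuple header, includes one TID */
    SK_TID = 6,             /* one extra heap TID in a posting list */
    SK_LINE_POINTER = 4     /* one line pointer per item on the page */
};

#define SK_MAXALIGN(x) (((size_t) (x) + 7) & ~(size_t) 7)

/* "before": every row gets its own header, key and line pointer. */
static size_t
bytes_without_posting(size_t key_bytes, size_t nrows)
{
    return nrows * (SK_MAXALIGN(SK_TUPLE_HEADER + key_bytes) + SK_LINE_POINTER);
}

/* "with patch": one header and key, followed by nrows TIDs. */
static size_t
bytes_with_posting(size_t key_bytes, size_t nrows)
{
    assert(nrows > 0);
    return SK_MAXALIGN(SK_TUPLE_HEADER + key_bytes + nrows * SK_TID)
        + SK_LINE_POINTER;
}
```

In this model, 100 rows sharing a 4-byte key take 2000 bytes in the old layout and 620 bytes as one posting list, while a key with a single row gains nothing from the posting form.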
It seems that backward compatibility works well without any changes. But
I haven't tested it properly yet.
Here are some test results. They were obtained with the test functions
test_btbuild and test_ginbuild, which you can find in the attached sql file.
i is the number of distinct values in the index. So i=1 means that all rows
have the same key, and i=10000000 means that all keys are different.
The other columns contain the index size (MB).
i          B-tree Old (MB)   B-tree New (MB)   GIN (MB)
1          214.234375        87.7109375        10.2109375
10         214.234375        87.7109375        10.71875
100        214.234375        87.4375           15.640625
1000       214.234375        86.2578125        31.296875
10000      214.234375        78.421875         104.3046875
100000     214.234375        65.359375         49.078125
1000000    214.234375        90.140625         106.8203125
10000000   214.234375        214.234375        534.0625
Note that the last row contains the same index size for both B-tree versions,
which is quite logical: there is no compression if all the keys are
distinct.
The other cases look really nice to me.
The next thing to say is that I haven't implemented posting list compression
yet, so there is still potential to decrease the size of the compressed btree.
I'm almost sure there are still some tiny bugs and missing functions,
but on the whole the patch is ready for testing.
I'd like to get feedback from testing the patch on some real
datasets. Any bug reports and suggestions are welcome.
Here are a couple of useful queries to inspect the data inside the index
pages:
create extension pageinspect;
select * from bt_metap('idx');
select bt.* from generate_series(1,1) as n, lateral bt_page_stats('idx', n) as bt;
select n, bt.* from generate_series(1,1) as n, lateral bt_page_items('idx', n) as bt;
Finally, here is the list of items I'm going to complete in the near future:
1. Add a storage parameter 'enable_compression' for the btree access method,
which specifies whether the index handles duplicates. The default is 'off'.
2. Bring back microvacuum functionality for compressed indexes.
3. Improve insertion speed. Insertions became significantly slower with
compressed btree, which is obviously not what we want.
4. Clean up the code and comments, and add related documentation.
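For context on item 3: in the attached patch, inserting a duplicate key copies the existing TIDs into a new array, appends the new TID, and re-forms the tuple on the page. The sketch below models only the array step in standalone C. DemoTid and demo_append_tid are hypothetical names, and malloc stands in for palloc; whether this copying is the main cost of the slowdown has not been measured here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a heap TID. */
typedef struct
{
    uint32_t blkno;     /* heap block number */
    uint16_t posid;     /* offset within the block */
} DemoTid;

/*
 * Return a newly allocated array holding the nold existing TIDs followed
 * by newtid, or NULL if allocation fails. The caller frees the result.
 */
static DemoTid *
demo_append_tid(const DemoTid *old, size_t nold, DemoTid newtid)
{
    DemoTid *out = malloc((nold + 1) * sizeof(*out));

    if (out == NULL)
        return NULL;
    if (nold > 0)
        memcpy(out, old, nold * sizeof(*out));
    out[nold] = newtid;     /* new TID goes at the end, as in the patch */
    return out;
}
```

Every insert of a duplicate therefore touches all TIDs already stored for that key, so the work per insert grows with the length of the posting list.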
--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachments:
btree_compression_1.0.patch (text/x-patch)
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index 77c2fdf..3b61e8f 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -24,6 +24,7 @@
#include "storage/predicate.h"
#include "utils/tqual.h"
+#include "catalog/catalog.h"
typedef struct
{
@@ -60,7 +61,8 @@ static void _bt_findinsertloc(Relation rel,
ScanKey scankey,
IndexTuple newtup,
BTStack stack,
- Relation heapRel);
+ Relation heapRel,
+ bool *updposing);
static void _bt_insertonpg(Relation rel, Buffer buf, Buffer cbuf,
BTStack stack,
IndexTuple itup,
@@ -113,6 +115,7 @@ _bt_doinsert(Relation rel, IndexTuple itup,
BTStack stack;
Buffer buf;
OffsetNumber offset;
+ bool updposting = false;
/* we need an insertion scan key to do our search, so build one */
itup_scankey = _bt_mkscankey(rel, itup);
@@ -162,8 +165,9 @@ top:
{
TransactionId xwait;
uint32 speculativeToken;
+ bool fakeupdposting = false; /* Never update posting in unique index */
- offset = _bt_binsrch(rel, buf, natts, itup_scankey, false);
+ offset = _bt_binsrch(rel, buf, natts, itup_scankey, false, &fakeupdposting);
xwait = _bt_check_unique(rel, itup, heapRel, buf, offset, itup_scankey,
checkUnique, &is_unique, &speculativeToken);
@@ -200,8 +204,54 @@ top:
CheckForSerializableConflictIn(rel, NULL, buf);
/* do the insertion */
_bt_findinsertloc(rel, &buf, &offset, natts, itup_scankey, itup,
- stack, heapRel);
- _bt_insertonpg(rel, buf, InvalidBuffer, stack, itup, offset, false);
+ stack, heapRel, &updposting);
+
+ if (IsSystemRelation(rel))
+ updposting = false;
+
+ /*
+ * New tuple has the same key with tuple at the page.
+ * Unite them into one posting.
+ */
+ if (updposting)
+ {
+ Page page;
+ IndexTuple olditup, newitup;
+ ItemPointerData *ipd;
+ int nipd;
+
+ page = BufferGetPage(buf);
+ olditup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offset));
+
+ if (BtreeTupleIsPosting(olditup))
+ nipd = BtreeGetNPosting(olditup);
+ else
+ nipd = 1;
+
+ ipd = palloc0(sizeof(ItemPointerData)*(nipd + 1));
+ /* copy item pointers from old tuple into ipd */
+ if (BtreeTupleIsPosting(olditup))
+ memcpy(ipd, BtreeGetPosting(olditup), sizeof(ItemPointerData)*nipd);
+ else
+ memcpy(ipd, olditup, sizeof(ItemPointerData));
+
+ /* add item pointer of the new tuple into ipd */
+ memcpy(ipd+nipd, itup, sizeof(ItemPointerData));
+
+ /*
+ * Form posting tuple, then delete old tuple and insert posting tuple.
+ */
+ newitup = BtreeReformPackedTuple(itup, ipd, nipd+1);
+ PageIndexTupleDelete(page, offset);
+ _bt_insertonpg(rel, buf, InvalidBuffer, stack, newitup, offset, false);
+
+ pfree(ipd);
+ pfree(newitup);
+ }
+ else
+ {
+ _bt_insertonpg(rel, buf, InvalidBuffer, stack, itup, offset, false);
+ }
}
else
{
@@ -306,6 +356,8 @@ _bt_check_unique(Relation rel, IndexTuple itup, Relation heapRel,
/* okay, we gotta fetch the heap tuple ... */
curitup = (IndexTuple) PageGetItem(page, curitemid);
+
+ Assert (!BtreeTupleIsPosting(curitup));
htid = curitup->t_tid;
/*
@@ -535,7 +587,8 @@ _bt_findinsertloc(Relation rel,
ScanKey scankey,
IndexTuple newtup,
BTStack stack,
- Relation heapRel)
+ Relation heapRel,
+ bool *updposting)
{
Buffer buf = *bufptr;
Page page = BufferGetPage(buf);
@@ -681,7 +734,7 @@ _bt_findinsertloc(Relation rel,
else if (firstlegaloff != InvalidOffsetNumber && !vacuumed)
newitemoff = firstlegaloff;
else
- newitemoff = _bt_binsrch(rel, buf, keysz, scankey, false);
+ newitemoff = _bt_binsrch(rel, buf, keysz, scankey, false, updposting);
*bufptr = buf;
*offsetptr = newitemoff;
@@ -1042,6 +1095,9 @@ _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
itemid = PageGetItemId(origpage, P_HIKEY);
itemsz = ItemIdGetLength(itemid);
item = (IndexTuple) PageGetItem(origpage, itemid);
+
+ Assert(!BtreeTupleIsPosting(item));
+
if (PageAddItem(rightpage, (Item) item, itemsz, rightoff,
false, false) == InvalidOffsetNumber)
{
@@ -1072,13 +1128,40 @@ _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
itemsz = ItemIdGetLength(itemid);
item = (IndexTuple) PageGetItem(origpage, itemid);
}
- if (PageAddItem(leftpage, (Item) item, itemsz, leftoff,
+
+ if (BtreeTupleIsPosting(item))
+ {
+ Size hikeysize = BtreeGetPostingOffset(item);
+ IndexTuple hikey = palloc0(hikeysize);
+ /*
+ * Truncate posting before insert it as a hikey.
+ */
+ memcpy (hikey, item, hikeysize);
+ hikey->t_info &= ~INDEX_SIZE_MASK;
+ hikey->t_info |= hikeysize;
+ ItemPointerSet(&(hikey->t_tid), origpagenumber, P_HIKEY);
+
+ if (PageAddItem(leftpage, (Item) hikey, hikeysize, leftoff,
false, false) == InvalidOffsetNumber)
+ {
+ memset(rightpage, 0, BufferGetPageSize(rbuf));
+ elog(ERROR, "failed to add hikey to the left sibling"
+ " while splitting block %u of index \"%s\"",
+ origpagenumber, RelationGetRelationName(rel));
+ }
+
+ pfree(hikey);
+ }
+ else
{
- memset(rightpage, 0, BufferGetPageSize(rbuf));
- elog(ERROR, "failed to add hikey to the left sibling"
- " while splitting block %u of index \"%s\"",
- origpagenumber, RelationGetRelationName(rel));
+ if (PageAddItem(leftpage, (Item) item, itemsz, leftoff,
+ false, false) == InvalidOffsetNumber)
+ {
+ memset(rightpage, 0, BufferGetPageSize(rbuf));
+ elog(ERROR, "failed to add hikey to the left sibling"
+ " while splitting block %u of index \"%s\"",
+ origpagenumber, RelationGetRelationName(rel));
+ }
}
leftoff = OffsetNumberNext(leftoff);
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index cf4a6dc..1a3c82b 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -74,6 +74,9 @@ static void btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
static void btvacuumpage(BTVacState *vstate, BlockNumber blkno,
BlockNumber orig_blkno);
+static ItemPointer
+btreevacuumPosting(BTVacState *vstate, ItemPointerData *items,
+ int nitem, int *nremaining);
/*
* btbuild() -- build a new btree index.
@@ -948,6 +951,7 @@ restart:
OffsetNumber offnum,
minoff,
maxoff;
+ IndexTupleData *remaining;
/*
* Trade in the initial read lock for a super-exclusive write lock on
@@ -997,31 +1001,62 @@ restart:
itup = (IndexTuple) PageGetItem(page,
PageGetItemId(page, offnum));
- htup = &(itup->t_tid);
-
- /*
- * During Hot Standby we currently assume that
- * XLOG_BTREE_VACUUM records do not produce conflicts. That is
- * only true as long as the callback function depends only
- * upon whether the index tuple refers to heap tuples removed
- * in the initial heap scan. When vacuum starts it derives a
- * value of OldestXmin. Backends taking later snapshots could
- * have a RecentGlobalXmin with a later xid than the vacuum's
- * OldestXmin, so it is possible that row versions deleted
- * after OldestXmin could be marked as killed by other
- * backends. The callback function *could* look at the index
- * tuple state in isolation and decide to delete the index
- * tuple, though currently it does not. If it ever did, we
- * would need to reconsider whether XLOG_BTREE_VACUUM records
- * should cause conflicts. If they did cause conflicts they
- * would be fairly harsh conflicts, since we haven't yet
- * worked out a way to pass a useful value for
- * latestRemovedXid on the XLOG_BTREE_VACUUM records. This
- * applies to *any* type of index that marks index tuples as
- * killed.
- */
- if (callback(htup, callback_state))
- deletable[ndeletable++] = offnum;
+ if(BtreeTupleIsPosting(itup))
+ {
+ int nipd, nnewipd;
+ ItemPointer newipd;
+
+ nipd = BtreeGetNPosting(itup);
+ newipd = btreevacuumPosting(vstate, BtreeGetPosting(itup), nipd, &nnewipd);
+
+ if (newipd != NULL)
+ {
+ if (nnewipd > 0)
+ {
+ /* There are still some live tuples in the posting.
+ * 1) form new posting tuple, that contains remaining ipds
+ * 2) delete "old" posting
+ * 3) insert new posting back to the page
+ */
+ remaining = BtreeReformPackedTuple(itup, newipd, nnewipd);
+ PageIndexTupleDelete(page, offnum);
+
+ if (PageAddItem(page, (Item) remaining, IndexTupleSize(remaining), offnum, false, false) != offnum)
+ elog(ERROR, "failed to add vacuumed posting tuple to index page in \"%s\"",
+ RelationGetRelationName(info->index));
+ }
+ else
+ deletable[ndeletable++] = offnum;
+ }
+ }
+ else
+ {
+ htup = &(itup->t_tid);
+
+ /*
+ * During Hot Standby we currently assume that
+ * XLOG_BTREE_VACUUM records do not produce conflicts. That is
+ * only true as long as the callback function depends only
+ * upon whether the index tuple refers to heap tuples removed
+ * in the initial heap scan. When vacuum starts it derives a
+ * value of OldestXmin. Backends taking later snapshots could
+ * have a RecentGlobalXmin with a later xid than the vacuum's
+ * OldestXmin, so it is possible that row versions deleted
+ * after OldestXmin could be marked as killed by other
+ * backends. The callback function *could* look at the index
+ * tuple state in isolation and decide to delete the index
+ * tuple, though currently it does not. If it ever did, we
+ * would need to reconsider whether XLOG_BTREE_VACUUM records
+ * should cause conflicts. If they did cause conflicts they
+ * would be fairly harsh conflicts, since we haven't yet
+ * worked out a way to pass a useful value for
+ * latestRemovedXid on the XLOG_BTREE_VACUUM records. This
+ * applies to *any* type of index that marks index tuples as
+ * killed.
+ */
+ if (callback(htup, callback_state))
+ deletable[ndeletable++] = offnum;
+ }
}
}
@@ -1132,3 +1167,51 @@ btcanreturn(PG_FUNCTION_ARGS)
{
PG_RETURN_BOOL(true);
}
+
+
+/*
+ * Vacuums a posting list. The size of the list must be specified
+ * via number of items (nitems).
+ *
+ * If none of the items need to be removed, returns NULL. Otherwise returns
+ * a new palloc'd array with the remaining items. The number of remaining
+ * items is returned via nremaining.
+ */
+ItemPointer
+btreevacuumPosting(BTVacState *vstate, ItemPointerData *items,
+ int nitem, int *nremaining)
+{
+ int i,
+ remaining = 0;
+ ItemPointer tmpitems = NULL;
+ IndexBulkDeleteCallback callback = vstate->callback;
+ void *callback_state = vstate->callback_state;
+
+ /*
+ * Iterate over TIDs array
+ */
+ for (i = 0; i < nitem; i++)
+ {
+ if (callback(items + i, callback_state))
+ {
+ if (!tmpitems)
+ {
+ /*
+ * First TID to be deleted: allocate memory to hold the
+ * remaining items.
+ */
+ tmpitems = palloc(sizeof(ItemPointerData) * nitem);
+ memcpy(tmpitems, items, sizeof(ItemPointerData) * i);
+ }
+ }
+ else
+ {
+ if (tmpitems)
+ tmpitems[remaining] = items[i];
+ remaining++;
+ }
+ }
+
+ *nremaining = remaining;
+ return tmpitems;
+}
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index d69a057..ef220b2 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -29,6 +29,8 @@ static bool _bt_readpage(IndexScanDesc scan, ScanDirection dir,
OffsetNumber offnum);
static void _bt_saveitem(BTScanOpaque so, int itemIndex,
OffsetNumber offnum, IndexTuple itup);
+static void _bt_savePostingitem(BTScanOpaque so, int itemIndex,
+ OffsetNumber offnum, ItemPointer iptr, IndexTuple itup, int i);
static bool _bt_steppage(IndexScanDesc scan, ScanDirection dir);
static Buffer _bt_walk_left(Relation rel, Buffer buf);
static bool _bt_endpoint(IndexScanDesc scan, ScanDirection dir);
@@ -90,6 +92,7 @@ _bt_search(Relation rel, int keysz, ScanKey scankey, bool nextkey,
Buffer *bufP, int access)
{
BTStack stack_in = NULL;
+ bool fakeupdposting = false; /* fake variable for _bt_binsrch */
/* Get the root page to start with */
*bufP = _bt_getroot(rel, access);
@@ -136,7 +139,7 @@ _bt_search(Relation rel, int keysz, ScanKey scankey, bool nextkey,
* Find the appropriate item on the internal page, and get the child
* page that it points to.
*/
- offnum = _bt_binsrch(rel, *bufP, keysz, scankey, nextkey);
+ offnum = _bt_binsrch(rel, *bufP, keysz, scankey, nextkey, &fakeupdposting);
itemid = PageGetItemId(page, offnum);
itup = (IndexTuple) PageGetItem(page, itemid);
blkno = ItemPointerGetBlockNumber(&(itup->t_tid));
@@ -310,7 +313,8 @@ _bt_binsrch(Relation rel,
Buffer buf,
int keysz,
ScanKey scankey,
- bool nextkey)
+ bool nextkey,
+ bool *updposing)
{
Page page;
BTPageOpaque opaque;
@@ -373,7 +377,17 @@ _bt_binsrch(Relation rel,
* scan key), which could be the last slot + 1.
*/
if (P_ISLEAF(opaque))
+ {
+ if (low <= PageGetMaxOffsetNumber(page))
+ {
+ IndexTuple oitup = (IndexTuple) PageGetItem(page, PageGetItemId(page, low));
+ /* one excessive check of equality. for possible posting tuple update or creation */
+ if ((_bt_compare(rel, keysz, scankey, page, low) == 0)
+ && (IndexTupleSize(oitup) + sizeof(ItemPointerData) < BTMaxItemSize(page)))
+ *updposing = true;
+ }
return low;
+ }
/*
* On a non-leaf page, return the last key < scan key (resp. <= scan key).
@@ -536,6 +550,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
int i;
StrategyNumber strat_total;
BTScanPosItem *currItem;
+ bool fakeupdposing = false; /* fake variable for _bt_binsrch */
Assert(!BTScanPosIsValid(so->currPos));
@@ -1003,7 +1018,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
so->markItemIndex = -1; /* ditto */
/* position to the precise item on the page */
- offnum = _bt_binsrch(rel, buf, keysCount, scankeys, nextkey);
+ offnum = _bt_binsrch(rel, buf, keysCount, scankeys, nextkey, &fakeupdposing);
/*
* If nextkey = false, we are positioned at the first item >= scan key, or
@@ -1161,6 +1176,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
int itemIndex;
IndexTuple itup;
bool continuescan;
+ int i;
/*
* We must have the buffer pinned and locked, but the usual macro can't be
@@ -1195,6 +1211,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
/* initialize tuple workspace to empty */
so->currPos.nextTupleOffset = 0;
+ so->currPos.prevTupleOffset = 0;
/*
* Now that the current page has been made consistent, the macro should be
@@ -1215,8 +1232,19 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
if (itup != NULL)
{
/* tuple passes all scan key conditions, so remember it */
- _bt_saveitem(so, itemIndex, offnum, itup);
- itemIndex++;
+ if (BtreeTupleIsPosting(itup))
+ {
+ for (i = 0; i < BtreeGetNPosting(itup); i++)
+ {
+ _bt_savePostingitem(so, itemIndex, offnum, BtreeGetPostingN(itup, i), itup, i);
+ itemIndex++;
+ }
+ }
+ else
+ {
+ _bt_saveitem(so, itemIndex, offnum, itup);
+ itemIndex++;
+ }
}
if (!continuescan)
{
@@ -1228,7 +1256,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
offnum = OffsetNumberNext(offnum);
}
- Assert(itemIndex <= MaxIndexTuplesPerPage);
+ Assert(itemIndex <= MaxPackedIndexTuplesPerPage);
so->currPos.firstItem = 0;
so->currPos.lastItem = itemIndex - 1;
so->currPos.itemIndex = 0;
@@ -1236,7 +1264,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
else
{
/* load items[] in descending order */
- itemIndex = MaxIndexTuplesPerPage;
+ itemIndex = MaxPackedIndexTuplesPerPage;
offnum = Min(offnum, maxoff);
@@ -1246,8 +1274,20 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
if (itup != NULL)
{
/* tuple passes all scan key conditions, so remember it */
- itemIndex--;
- _bt_saveitem(so, itemIndex, offnum, itup);
+ if (BtreeTupleIsPosting(itup))
+ {
+ for (i = 0; i < BtreeGetNPosting(itup); i++)
+ {
+ itemIndex--;
+ _bt_savePostingitem(so, itemIndex, offnum, BtreeGetPostingN(itup, i), itup, i);
+ }
+ }
+ else
+ {
+ itemIndex--;
+ _bt_saveitem(so, itemIndex, offnum, itup);
+ }
+
}
if (!continuescan)
{
@@ -1261,8 +1301,8 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
Assert(itemIndex >= 0);
so->currPos.firstItem = itemIndex;
- so->currPos.lastItem = MaxIndexTuplesPerPage - 1;
- so->currPos.itemIndex = MaxIndexTuplesPerPage - 1;
+ so->currPos.lastItem = MaxPackedIndexTuplesPerPage - 1;
+ so->currPos.itemIndex = MaxPackedIndexTuplesPerPage - 1;
}
return (so->currPos.firstItem <= so->currPos.lastItem);
@@ -1275,6 +1315,8 @@ _bt_saveitem(BTScanOpaque so, int itemIndex,
{
BTScanPosItem *currItem = &so->currPos.items[itemIndex];
+ Assert (!BtreeTupleIsPosting(itup));
+
currItem->heapTid = itup->t_tid;
currItem->indexOffset = offnum;
if (so->currTuples)
@@ -1288,6 +1330,37 @@ _bt_saveitem(BTScanOpaque so, int itemIndex,
}
/*
+ * Save an index item into so->currPos.items[itemIndex]
+ * Performing index-only scan, handle the first elem separately.
+ * Save the key once, and connect it with posting tids using tupleOffset.
+ */
+static void
+_bt_savePostingitem(BTScanOpaque so, int itemIndex,
+ OffsetNumber offnum, ItemPointer iptr, IndexTuple itup, int i)
+{
+ BTScanPosItem *currItem = &so->currPos.items[itemIndex];
+
+ currItem->heapTid = *iptr;
+ currItem->indexOffset = offnum;
+
+ if (so->currTuples)
+ {
+ if (i == 0)
+ {
+ /* save key. the same for all tuples in the posting */
+ Size itupsz = BtreeGetPostingOffset(itup);
+ currItem->tupleOffset = so->currPos.nextTupleOffset;
+ memcpy(so->currTuples + so->currPos.nextTupleOffset, itup, itupsz);
+ so->currPos.nextTupleOffset += MAXALIGN(itupsz);
+ so->currPos.prevTupleOffset = currItem->tupleOffset;
+ }
+ else
+ currItem->tupleOffset = so->currPos.prevTupleOffset;
+ }
+}
+
+
+/*
* _bt_steppage() -- Step to next page containing valid data for scan
*
* On entry, if so->currPos.buf is valid the buffer is pinned but not locked;
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index f95f67a..79a737f 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -75,6 +75,7 @@
#include "utils/rel.h"
#include "utils/sortsupport.h"
#include "utils/tuplesort.h"
+#include "catalog/catalog.h"
/*
@@ -527,15 +528,120 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
Assert(last_off > P_FIRSTKEY);
ii = PageGetItemId(opage, last_off);
oitup = (IndexTuple) PageGetItem(opage, ii);
- _bt_sortaddtup(npage, ItemIdGetLength(ii), oitup, P_FIRSTKEY);
/*
- * Move 'last' into the high key position on opage
+ * If the item is PostingTuple, we can cut it.
+ * Because HIKEY is not considered as real data, and it needn't to keep any ItemPointerData at all.
+ * And of course it needn't to keep a list of ipd.
+ * But, if it had a big posting list, there will be plenty of free space on the opage.
+ * So we must split Posting tuple into 2 pieces.
*/
- hii = PageGetItemId(opage, P_HIKEY);
- *hii = *ii;
- ItemIdSetUnused(ii); /* redundant */
- ((PageHeader) opage)->pd_lower -= sizeof(ItemIdData);
+ if (BtreeTupleIsPosting(oitup))
+ {
+ int nipd, ntocut, ntoleave;
+ Size keytupsz;
+ IndexTuple keytup;
+ nipd = BtreeGetNPosting(oitup);
+ ntocut = (sizeof(ItemIdData) + BtreeGetPostingOffset(oitup))/sizeof(ItemPointerData);
+ ntocut++; /* round up to be sure that we cut enough */
+ ntoleave = nipd - ntocut;
+
+ /*
+ * 0) Form key tuple, that doesn't contain any ipd.
+ * NOTE: key tuple will have blkno & offset suitable for P_HIKEY.
+ * any function that uses keytup should handle them itself.
+ */
+ keytupsz = BtreeGetPostingOffset(oitup);
+ keytup = palloc0(keytupsz);
+ memcpy (keytup, oitup, keytupsz);
+ keytup->t_info &= ~INDEX_SIZE_MASK;
+ keytup->t_info |= keytupsz;
+ ItemPointerSet(&(keytup->t_tid), oblkno, P_HIKEY);
+
+ if (ntocut < nipd)
+ {
+ ItemPointerData *newipd;
+ IndexTuple newitup, newlasttup;
+ /*
+ * 1) Cut part of old tuple to shift to npage.
+ * And insert it as P_FIRSTKEY.
+ * This tuple is based on keytup.
+ * Blkno & offnum are reset in BtreeFormPackedTuple.
+ */
+ newipd = palloc0(sizeof(ItemPointerData)*ntocut);
+ /* Note, that we cut last 'ntocut' items */
+ memcpy(newipd, BtreeGetPosting(oitup)+ntoleave, sizeof(ItemPointerData)*ntocut);
+ newitup = BtreeFormPackedTuple(keytup, newipd, ntocut);
+
+ _bt_sortaddtup(npage, IndexTupleSize(newitup), newitup, P_FIRSTKEY);
+ pfree(newipd);
+ pfree(newitup);
+
+ /*
+ * 2) set last item to the P_HIKEY linp
+ * Move 'last' into the high key position on opage
+ * NOTE: Do this because of indextuple deletion algorithm, which
+ * doesn't allow to delete an item while we have unused one before it.
+ */
+ hii = PageGetItemId(opage, P_HIKEY);
+ *hii = *ii;
+ ItemIdSetUnused(ii); /* redundant */
+ ((PageHeader) opage)->pd_lower -= sizeof(ItemIdData);
+
+ /* 3) delete "wrong" high key */
+ PageIndexTupleDelete(opage, P_HIKEY);
+
+ /* 4)Insert keytup as P_HIKEY. */
+ _bt_sortaddtup(opage, IndexTupleSize(keytup), keytup, P_HIKEY);
+
+ /* 5) form the part of old tuple with ntoleave ipds. And insert it as last tuple. */
+ newlasttup = BtreeFormPackedTuple(keytup, BtreeGetPosting(oitup), ntoleave);
+
+ _bt_sortaddtup(opage, IndexTupleSize(newlasttup), newlasttup, PageGetMaxOffsetNumber(opage)+1);
+
+ pfree(newlasttup);
+ }
+ else
+ {
+ /* The tuple isn't big enough to split it. Handle it as a normal tuple. */
+
+ /*
+ * 1) Shift the last tuple to npage.
+ * Insert it as P_FIRSTKEY.
+ */
+ _bt_sortaddtup(npage, ItemIdGetLength(ii), oitup, P_FIRSTKEY);
+
+ /* 2) set last item to the P_HIKEY linp */
+ /* Move 'last' into the high key position on opage */
+ hii = PageGetItemId(opage, P_HIKEY);
+ *hii = *ii;
+ ItemIdSetUnused(ii); /* redundant */
+ ((PageHeader) opage)->pd_lower -= sizeof(ItemIdData);
+
+ /* 3) delete "wrong" high key */
+ PageIndexTupleDelete(opage, P_HIKEY);
+
+ /* 4)Insert keytup as P_HIKEY. */
+ _bt_sortaddtup(opage, IndexTupleSize(keytup), keytup, P_HIKEY);
+
+ }
+ pfree(keytup);
+ }
+ else
+ {
+ /*
+ * 1) Shift the last tuple to npage.
+ * Insert it as P_FIRSTKEY.
+ */
+ _bt_sortaddtup(npage, ItemIdGetLength(ii), oitup, P_FIRSTKEY);
+
+ /* 2) set last item to the P_HIKEY linp */
+ /* Move 'last' into the high key position on opage */
+ hii = PageGetItemId(opage, P_HIKEY);
+ *hii = *ii;
+ ItemIdSetUnused(ii); /* redundant */
+ ((PageHeader) opage)->pd_lower -= sizeof(ItemIdData);
+ }
/*
* Link the old page into its parent, using its minimum key. If we
@@ -547,6 +653,7 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
Assert(state->btps_minkey != NULL);
ItemPointerSet(&(state->btps_minkey->t_tid), oblkno, P_HIKEY);
+
_bt_buildadd(wstate, state->btps_next, state->btps_minkey);
pfree(state->btps_minkey);
@@ -555,7 +662,9 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
* it off the old page, not the new one, in case we are not at leaf
* level.
*/
- state->btps_minkey = CopyIndexTuple(oitup);
+ ItemId iihk = PageGetItemId(opage, P_HIKEY);
+ IndexTuple hikey = (IndexTuple) PageGetItem(opage, iihk);
+ state->btps_minkey = CopyIndexTuple(hikey);
/*
* Set the sibling links for both pages.
@@ -590,7 +699,29 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
if (last_off == P_HIKEY)
{
Assert(state->btps_minkey == NULL);
- state->btps_minkey = CopyIndexTuple(itup);
+
+ if (BtreeTupleIsPosting(itup))
+ {
+ Size keytupsz;
+ IndexTuple keytup;
+
+ /*
+ * 0) Form key tuple, that doesn't contain any ipd.
+ * NOTE: key tuple will have blkno & offset suitable for P_HIKEY.
+ * any function that uses keytup should handle them itself.
+ */
+ keytupsz = BtreeGetPostingOffset(itup);
+ keytup = palloc0(keytupsz);
+ memcpy (keytup, itup, keytupsz);
+
+ keytup->t_info &= ~INDEX_SIZE_MASK;
+ keytup->t_info |= keytupsz;
+ ItemPointerSet(&(keytup->t_tid), nblkno, P_HIKEY);
+
+ state->btps_minkey = CopyIndexTuple(keytup);
+ }
+ else
+ state->btps_minkey = CopyIndexTuple(itup);
}
/*
@@ -670,6 +801,67 @@ _bt_uppershutdown(BTWriteState *wstate, BTPageState *state)
}
/*
+ * Prepare SortSupport structure for indextuples comparison
+ */
+SortSupport
+_bt_prepare_SortSupport(BTWriteState *wstate, int keysz)
+{
+ /* Prepare SortSupport data for each column */
+ ScanKey indexScanKey = _bt_mkscankey_nodata(wstate->index);
+ SortSupport sortKeys = (SortSupport) palloc0(keysz * sizeof(SortSupportData));
+ int i;
+
+ for (i = 0; i < keysz; i++)
+ {
+ SortSupport sortKey = sortKeys + i;
+ ScanKey scanKey = indexScanKey + i;
+ int16 strategy;
+
+ sortKey->ssup_cxt = CurrentMemoryContext;
+ sortKey->ssup_collation = scanKey->sk_collation;
+ sortKey->ssup_nulls_first =
+ (scanKey->sk_flags & SK_BT_NULLS_FIRST) != 0;
+ sortKey->ssup_attno = scanKey->sk_attno;
+ /* Abbreviation is not supported here */
+ sortKey->abbreviate = false;
+
+ AssertState(sortKey->ssup_attno != 0);
+
+ strategy = (scanKey->sk_flags & SK_BT_DESC) != 0 ?
+ BTGreaterStrategyNumber : BTLessStrategyNumber;
+
+ PrepareSortSupportFromIndexRel(wstate->index, strategy, sortKey);
+ }
+
+ _bt_freeskey(indexScanKey);
+ return sortKeys;
+}
+
+/*
+ * Compare two tuples using sortKey i
+ */
+int _bt_call_comparator(SortSupport sortKeys, int i,
+ IndexTuple itup, IndexTuple itup2, TupleDesc tupdes)
+{
+ SortSupport entry;
+ Datum attrDatum1,
+ attrDatum2;
+ bool isNull1,
+ isNull2;
+ int32 compare;
+
+ entry = sortKeys + i - 1;
+ attrDatum1 = index_getattr(itup, i, tupdes, &isNull1);
+ attrDatum2 = index_getattr(itup2, i, tupdes, &isNull2);
+
+ compare = ApplySortComparator(attrDatum1, isNull1,
+ attrDatum2, isNull2,
+ entry);
+
+ return compare;
+}
+
+/*
* Read tuples in correct sort order from tuplesort, and load them into
* btree leaves.
*/
@@ -679,16 +871,20 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
BTPageState *state = NULL;
bool merge = (btspool2 != NULL);
IndexTuple itup,
- itup2 = NULL;
+ itup2 = NULL,
+ itupprev = NULL;
bool should_free,
should_free2,
load1;
TupleDesc tupdes = RelationGetDescr(wstate->index);
int i,
keysz = RelationGetNumberOfAttributes(wstate->index);
- ScanKey indexScanKey = NULL;
+ int ntuples = 0;
SortSupport sortKeys;
+ /* Prepare SortSupport data */
+ sortKeys = (SortSupport)_bt_prepare_SortSupport(wstate, keysz);
+
if (merge)
{
/*
@@ -701,34 +897,6 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
true, &should_free);
itup2 = tuplesort_getindextuple(btspool2->sortstate,
true, &should_free2);
- indexScanKey = _bt_mkscankey_nodata(wstate->index);
-
- /* Prepare SortSupport data for each column */
- sortKeys = (SortSupport) palloc0(keysz * sizeof(SortSupportData));
-
- for (i = 0; i < keysz; i++)
- {
- SortSupport sortKey = sortKeys + i;
- ScanKey scanKey = indexScanKey + i;
- int16 strategy;
-
- sortKey->ssup_cxt = CurrentMemoryContext;
- sortKey->ssup_collation = scanKey->sk_collation;
- sortKey->ssup_nulls_first =
- (scanKey->sk_flags & SK_BT_NULLS_FIRST) != 0;
- sortKey->ssup_attno = scanKey->sk_attno;
- /* Abbreviation is not supported here */
- sortKey->abbreviate = false;
-
- AssertState(sortKey->ssup_attno != 0);
-
- strategy = (scanKey->sk_flags & SK_BT_DESC) != 0 ?
- BTGreaterStrategyNumber : BTLessStrategyNumber;
-
- PrepareSortSupportFromIndexRel(wstate->index, strategy, sortKey);
- }
-
- _bt_freeskey(indexScanKey);
for (;;)
{
@@ -742,20 +910,8 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
{
for (i = 1; i <= keysz; i++)
{
- SortSupport entry;
- Datum attrDatum1,
- attrDatum2;
- bool isNull1,
- isNull2;
- int32 compare;
-
- entry = sortKeys + i - 1;
- attrDatum1 = index_getattr(itup, i, tupdes, &isNull1);
- attrDatum2 = index_getattr(itup2, i, tupdes, &isNull2);
-
- compare = ApplySortComparator(attrDatum1, isNull1,
- attrDatum2, isNull2,
- entry);
+ int32 compare = _bt_call_comparator(sortKeys, i, itup, itup2, tupdes);
+
if (compare > 0)
{
load1 = false;
@@ -794,19 +950,137 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
else
{
/* merge is unnecessary */
- while ((itup = tuplesort_getindextuple(btspool->sortstate,
+
+ Relation indexRelation = wstate->index;
+ Form_pg_index index = indexRelation->rd_index;
+
+ if (index->indisunique)
+ {
+ /* Do not use compression for unique indexes. */
+ while ((itup = tuplesort_getindextuple(btspool->sortstate,
true, &should_free)) != NULL)
+ {
+ /* When we see first tuple, create first index page */
+ if (state == NULL)
+ state = _bt_pagestate(wstate, 0);
+
+ _bt_buildadd(wstate, state, itup);
+ if (should_free)
+ pfree(itup);
+ }
+ }
+ else
{
- /* When we see first tuple, create first index page */
- if (state == NULL)
- state = _bt_pagestate(wstate, 0);
+ ItemPointerData *ipd = NULL;
+ IndexTuple postingtuple;
+ Size maxitemsize = 0,
+ maxpostingsize = 0;
+ int32 compare = 0;
- _bt_buildadd(wstate, state, itup);
- if (should_free)
- pfree(itup);
+ while ((itup = tuplesort_getindextuple(btspool->sortstate,
+ true, &should_free)) != NULL)
+ {
+ /* When we see first tuple, create first index page */
+ if (state == NULL)
+ {
+ state = _bt_pagestate(wstate, 0);
+ maxitemsize = BTMaxItemSize(state->btps_page);
+ }
+
+ /*
+ * Compare current tuple with previous one.
+ * If tuples are equal, we can unite them into a posting list.
+ */
+ if (itupprev != NULL)
+ {
+ /* compare tuples */
+ compare = 0;
+ for (i = 1; i <= keysz; i++)
+ {
+ compare = _bt_call_comparator(sortKeys, i, itup, itupprev, tupdes);
+ if (compare != 0)
+ break;
+ }
+
+ if (compare == 0)
+ {
+ /* Tuples are equal. Create or update posting */
+ if (ntuples == 0)
+ {
+ /*
+ * We haven't suitable posting list yet, so allocate
+ * it and save both itupprev and current tuple.
+ */
+
+ ipd = palloc0(maxitemsize);
+
+ memcpy(ipd, itupprev, sizeof(ItemPointerData));
+ ntuples++;
+ memcpy(ipd + ntuples, itup, sizeof(ItemPointerData));
+ ntuples++;
+ }
+ else
+ {
+ if ((ntuples+1)*sizeof(ItemPointerData) < maxpostingsize)
+ {
+ memcpy(ipd + ntuples, itup, sizeof(ItemPointerData));
+ ntuples++;
+ }
+ else
+ {
+ postingtuple = BtreeFormPackedTuple(itupprev, ipd, ntuples);
+ _bt_buildadd(wstate, state, postingtuple);
+ ntuples = 0;
+ pfree(ipd);
+ }
+ }
+
+ }
+ else
+ {
+ /* Tuples aren't equal. Insert itupprev into index. */
+ if (ntuples == 0)
+ _bt_buildadd(wstate, state, itupprev);
+ else
+ {
+ postingtuple = BtreeFormPackedTuple(itupprev, ipd, ntuples);
+ _bt_buildadd(wstate, state, postingtuple);
+ ntuples = 0;
+ pfree(ipd);
+ }
+ }
+ }
+
+ /*
+ * Copy the tuple into temp variable itupprev
+ * to compare it with the following tuple
+ * and maybe unite them into a posting tuple
+ */
+ itupprev = CopyIndexTuple(itup);
+ if (should_free)
+ pfree(itup);
+
+ /* compute max size of ipd list */
+ maxpostingsize = maxitemsize - IndexInfoFindDataOffset(itupprev->t_info) - MAXALIGN(IndexTupleSize(itupprev));
+ }
+
+ /* Handle the last item.*/
+ if (ntuples == 0)
+ {
+ if (itupprev != NULL)
+ _bt_buildadd(wstate, state, itupprev);
+ }
+ else
+ {
+ Assert(ipd!=NULL);
+ Assert(itupprev != NULL);
+ postingtuple = BtreeFormPackedTuple(itupprev, ipd, ntuples);
+ _bt_buildadd(wstate, state, postingtuple);
+ ntuples = 0;
+ pfree(ipd);
+ }
}
}
-
/* Close down final pages and write the metapage */
_bt_uppershutdown(wstate, state);
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 91331ba..ed3dff7 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -1821,7 +1821,9 @@ _bt_killitems(IndexScanDesc scan)
ItemId iid = PageGetItemId(page, offnum);
IndexTuple ituple = (IndexTuple) PageGetItem(page, iid);
- if (ItemPointerEquals(&ituple->t_tid, &kitem->heapTid))
+ /* No microvacuum for posting tuples */
+ if (!BtreeTupleIsPosting(ituple)
+ && (ItemPointerEquals(&ituple->t_tid, &kitem->heapTid)))
{
/* found the item */
ItemIdMarkDead(iid);
@@ -2070,3 +2072,71 @@ btoptions(PG_FUNCTION_ARGS)
PG_RETURN_BYTEA_P(result);
PG_RETURN_NULL();
}
+
+
+/*
+ * Already have basic index tuple that contains key datum
+ */
+IndexTuple
+BtreeFormPackedTuple(IndexTuple tuple, ItemPointerData *data, int nipd)
+{
+ int i;
+ uint32 newsize;
+ IndexTuple itup = CopyIndexTuple(tuple);
+
+ /*
+ * Determine and store offset to the posting list.
+ */
+ newsize = IndexTupleSize(itup);
+ newsize = SHORTALIGN(newsize);
+
+ /*
+ * Set meta info about the posting list.
+ */
+ BtreeSetPostingOffset(itup, newsize);
+ BtreeSetNPosting(itup, nipd);
+ /*
+ * Add space needed for posting list, if any. Then check that the tuple
+ * won't be too big to store.
+ */
+ newsize += sizeof(ItemPointerData)*nipd;
+ newsize = MAXALIGN(newsize);
+
+ /*
+ * Resize tuple if needed
+ */
+ if (newsize != IndexTupleSize(itup))
+ {
+ itup = repalloc(itup, newsize);
+
+ /*
+ * PostgreSQL 9.3 and earlier did not clear this new space, so we
+ * might find uninitialized padding when reading tuples from disk.
+ */
+ memset((char *) itup + IndexTupleSize(itup),
+ 0, newsize - IndexTupleSize(itup));
+ /* set new size in tuple header */
+ itup->t_info &= ~INDEX_SIZE_MASK;
+ itup->t_info |= newsize;
+ }
+
+ /*
+ * Copy data into the posting tuple
+ */
+ memcpy(BtreeGetPosting(itup), data, sizeof(ItemPointerData)*nipd);
+ return itup;
+}
+
+IndexTuple
+BtreeReformPackedTuple(IndexTuple tuple, ItemPointerData *data, int nipd)
+{
+ int size;
+ if (BtreeTupleIsPosting(tuple))
+ {
+ size = BtreeGetPostingOffset(tuple);
+ tuple->t_info &= ~INDEX_SIZE_MASK;
+ tuple->t_info |= size;
+ }
+
+ return BtreeFormPackedTuple(tuple, data, nipd);
+}
diff --git a/src/include/access/itup.h b/src/include/access/itup.h
index c997545..d79d5cd 100644
--- a/src/include/access/itup.h
+++ b/src/include/access/itup.h
@@ -137,7 +137,12 @@ typedef IndexAttributeBitMapData *IndexAttributeBitMap;
#define MaxIndexTuplesPerPage \
((int) ((BLCKSZ - SizeOfPageHeaderData) / \
(MAXALIGN(sizeof(IndexTupleData) + 1) + sizeof(ItemIdData))))
-
+#define MaxPackedIndexTuplesPerPage \
+ ((int) ((BLCKSZ - SizeOfPageHeaderData) / \
+ (sizeof(ItemPointerData))))
+// #define MaxIndexTuplesPerPage \
+// ((int) ((BLCKSZ - SizeOfPageHeaderData) / \
+// (sizeof(ItemPointerData))))
/* routines in indextuple.c */
extern IndexTuple index_form_tuple(TupleDesc tupleDescriptor,
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 9e48efd..8cf0edc 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -75,6 +75,7 @@ typedef BTPageOpaqueData *BTPageOpaque;
#define BTP_SPLIT_END (1 << 5) /* rightmost page of split group */
#define BTP_HAS_GARBAGE (1 << 6) /* page has LP_DEAD tuples */
#define BTP_INCOMPLETE_SPLIT (1 << 7) /* right sibling's downlink is missing */
+#define BTP_HAS_POSTING (1 << 8) /* page contains compressed duplicates (only for leaf pages) */
/*
* The max allowed value of a cycle ID is a bit less than 64K. This is
@@ -181,6 +182,8 @@ typedef struct BTMetaPageData
#define P_IGNORE(opaque) ((opaque)->btpo_flags & (BTP_DELETED|BTP_HALF_DEAD))
#define P_HAS_GARBAGE(opaque) ((opaque)->btpo_flags & BTP_HAS_GARBAGE)
#define P_INCOMPLETE_SPLIT(opaque) ((opaque)->btpo_flags & BTP_INCOMPLETE_SPLIT)
+#define P_HAS_POSTING(opaque) ((opaque)->btpo_flags & BTP_HAS_POSTING)
+
/*
* Lehman and Yao's algorithm requires a ``high key'' on every non-rightmost
@@ -536,6 +539,8 @@ typedef struct BTScanPosData
* location in the associated tuple storage workspace.
*/
int nextTupleOffset;
+ /* prevTupleOffset is for Posting list handling*/
+ int prevTupleOffset;
/*
* The items array is always ordered in index order (ie, increasing
@@ -548,7 +553,7 @@ typedef struct BTScanPosData
int lastItem; /* last valid index in items[] */
int itemIndex; /* current index in items[] */
- BTScanPosItem items[MaxIndexTuplesPerPage]; /* MUST BE LAST */
+ BTScanPosItem items[MaxPackedIndexTuplesPerPage]; /* MUST BE LAST */
} BTScanPosData;
typedef BTScanPosData *BTScanPos;
@@ -649,6 +654,28 @@ typedef BTScanOpaqueData *BTScanOpaque;
#define SK_BT_DESC (INDOPTION_DESC << SK_BT_INDOPTION_SHIFT)
#define SK_BT_NULLS_FIRST (INDOPTION_NULLS_FIRST << SK_BT_INDOPTION_SHIFT)
+
+/*
+ * We use our own ItemPointerGet(BlockNumber|OffsetNumber)
+ * to avoid Asserts, since sometimes the ip_posid isn't "valid"
+ */
+#define BtreeItemPointerGetBlockNumber(pointer) \
+ BlockIdGetBlockNumber(&(pointer)->ip_blkid)
+
+#define BtreeItemPointerGetOffsetNumber(pointer) \
+ ((pointer)->ip_posid)
+
+#define BT_POSTING (1<<31)
+#define BtreeGetNPosting(itup) BtreeItemPointerGetOffsetNumber(&(itup)->t_tid)
+#define BtreeSetNPosting(itup,n) ItemPointerSetOffsetNumber(&(itup)->t_tid,n)
+
+#define BtreeGetPostingOffset(itup) (BtreeItemPointerGetBlockNumber(&(itup)->t_tid) & (~BT_POSTING))
+#define BtreeSetPostingOffset(itup,n) ItemPointerSetBlockNumber(&(itup)->t_tid,(n)|BT_POSTING)
+#define BtreeTupleIsPosting(itup) (BtreeItemPointerGetBlockNumber(&(itup)->t_tid) & BT_POSTING)
+#define BtreeGetPosting(itup) (ItemPointerData*) ((char*)(itup) + BtreeGetPostingOffset(itup))
+#define BtreeGetPostingN(itup,n) (ItemPointerData*) (BtreeGetPosting(itup) + n)
+
+
/*
* prototypes for functions in nbtree.c (external entry points for btree)
*/
@@ -705,8 +732,8 @@ extern BTStack _bt_search(Relation rel,
extern Buffer _bt_moveright(Relation rel, Buffer buf, int keysz,
ScanKey scankey, bool nextkey, bool forupdate, BTStack stack,
int access);
-extern OffsetNumber _bt_binsrch(Relation rel, Buffer buf, int keysz,
- ScanKey scankey, bool nextkey);
+extern OffsetNumber _bt_binsrch( Relation rel, Buffer buf, int keysz,
+ ScanKey scankey, bool nextkey, bool* updposting);
extern int32 _bt_compare(Relation rel, int keysz, ScanKey scankey,
Page page, OffsetNumber offnum);
extern bool _bt_first(IndexScanDesc scan, ScanDirection dir);
@@ -736,6 +763,8 @@ extern void _bt_end_vacuum(Relation rel);
extern void _bt_end_vacuum_callback(int code, Datum arg);
extern Size BTreeShmemSize(void);
extern void BTreeShmemInit(void);
+extern IndexTuple BtreeFormPackedTuple(IndexTuple tuple, ItemPointerData *data, int nipd);
+extern IndexTuple BtreeReformPackedTuple(IndexTuple tuple, ItemPointerData *data, int nipd);
/*
* prototypes for functions in nbtsort.c
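The posting macros added to nbtree.h in the patch above reuse the tuple's t_tid field: the block-number half holds the posting list offset plus a flag in the top bit, and the offset-number half holds the number of TIDs. The following is a minimal standalone sketch of that packing, not the patch's code. DemoTid and the demo_* functions are hypothetical names, and the struct is a simplified stand-in for ItemPointerData. The sketch uses an unsigned constant for the flag, since shifting a signed 1 into bit 31 is undefined behaviour in C when int is 32 bits wide.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Simplified, hypothetical stand-in for ItemPointerData.  The real struct
 * stores the block number as two 16-bit halves; a single 32-bit field is
 * enough to show the idea.
 */
typedef struct DemoTid
{
    uint32_t blkno;   /* heap block number, or flag + posting offset */
    uint16_t posid;   /* heap offset number, or number of posting TIDs */
} DemoTid;

/* Top bit of the block number marks a posting tuple (unsigned on purpose). */
#define DEMO_POSTING_FLAG (UINT32_C(1) << 31)

/* Turn a header TID into "posting" form: flag + offset, and TID count. */
static void
demo_set_posting(DemoTid *tid, uint32_t posting_offset, uint16_t ntids)
{
    tid->blkno = posting_offset | DEMO_POSTING_FLAG;
    tid->posid = ntids;
}

/* True if the header describes a posting list rather than one heap TID. */
static bool
demo_is_posting(const DemoTid *tid)
{
    return (tid->blkno & DEMO_POSTING_FLAG) != 0;
}

/* Byte offset of the posting list from the start of the tuple. */
static uint32_t
demo_posting_offset(const DemoTid *tid)
{
    return tid->blkno & ~DEMO_POSTING_FLAG;
}

/* Number of TIDs stored in the posting list. */
static uint16_t
demo_nposting(const DemoTid *tid)
{
    return tid->posid;
}
```

One consequence of this scheme is that an ordinary heap TID whose block number has the top bit set would be misread as a posting tuple, so the approach assumes real block numbers stay below 2^31.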
On 28 January 2016 at 14:06, Anastasia Lubennikova <
a.lubennikova@postgrespro.ru> wrote:
31.08.2015 10:41, Anastasia Lubennikova:
Hi, hackers!
I'm going to begin work on effective storage of duplicate keys in B-tree
index.
The main idea is to implement posting lists and posting trees for B-tree
index pages as it's already done for GIN.
In a nutshell, effective storing of duplicates in GIN is organised as
follows.
Index stores single index tuple for each unique key. That index tuple
points to posting list which contains pointers to heap tuples (TIDs). If
too many rows having the same key, multiple pages are allocated for the
TIDs and these constitute so called posting tree.
You can find wonderful detailed descriptions in gin readme
<https://github.com/postgres/postgres/blob/master/src/backend/access/gin/README>
and articles <http://www.cybertec.at/gin-just-an-index-type/>.
It also makes possible to apply compression algorithm to posting list/tree
and significantly decrease index size. Read more in presentation (part 1)
<http://www.pgcon.org/2014/schedule/attachments/329_PGCon2014-GIN.pdf>.
Now new B-tree index tuple must be inserted for each table row that we
index.
It can possibly cause page split. Because of MVCC even unique index could
contain duplicates.
Storing duplicates in posting list/tree helps to avoid superfluous splits.
I'd like to share the progress of my work. So here is a WIP patch.
It provides effective duplicate handling using posting lists the same way
as GIN does it.
Layout of the tuples on the page is changed in the following way:
before:
TID (ip_blkid, ip_posid) + key, TID (ip_blkid, ip_posid) + key, TID
(ip_blkid, ip_posid) + key
with patch:
TID (N item pointers, posting list offset) + key, TID (ip_blkid,
ip_posid), TID (ip_blkid, ip_posid), TID (ip_blkid, ip_posid)
It seems that backward compatibility works well without any changes. But I
haven't tested it properly yet.
Here are some test results. They are obtained by test functions
test_btbuild and test_ginbuild, which you can find in attached sql file.
i - number of distinct values in the index. So i=1 means that all rows
have the same key, and i=10000000 means that all keys are different.
The other columns contain the index size (MB).
i          B-tree Old   B-tree New   GIN
1          214,234375   87,7109375   10,2109375
10         214,234375   87,7109375   10,71875
100        214,234375   87,4375      15,640625
1000       214,234375   86,2578125   31,296875
10000      214,234375   78,421875    104,3046875
100000     214,234375   65,359375    49,078125
1000000    214,234375   90,140625    106,8203125
10000000   214,234375   214,234375   534,0625
You can note that the last row contains the same index sizes for B-tree,
which is quite logical - there is no compression if all the keys are
distinct.
The other cases look really nice to me.
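As a rough sanity check on the table above, the sizes can be approximated with per-row arithmetic. This is only a sketch under assumptions the message does not state: a 4-byte integer key, 8-byte MAXALIGN, an 8-byte index tuple header, a 4-byte line pointer and a 6-byte TID. It ignores page headers, fillfactor, internal pages and the cap on posting list size, so it gives a lower bound rather than the reported figures. The function names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed sizes (typical for a 64-bit build, not taken from the message). */
enum
{
    DEMO_TUPLE_HEADER = 8,  /* index tuple header: TID + t_info */
    DEMO_LINE_POINTER = 4,  /* per-tuple line pointer in the page */
    DEMO_TID_SIZE = 6       /* one heap TID in a posting list */
};

/* Round len up to the next multiple of 8, like MAXALIGN on 64-bit builds. */
static uint64_t
demo_maxalign(uint64_t len)
{
    return (len + 7) & ~UINT64_C(7);
}

/* Old layout: every row gets its own header + key + line pointer. */
static uint64_t
est_plain_index_bytes(uint64_t nrows, uint64_t keysize)
{
    return nrows * (demo_maxalign(DEMO_TUPLE_HEADER + keysize)
                    + DEMO_LINE_POINTER);
}

/*
 * Posting layout, lower bound: one header + key + line pointer per distinct
 * key, plus one TID per row.  Only meaningful when keys repeat; it does not
 * model the case where every key is distinct and tuples stay in plain form.
 */
static uint64_t
est_posting_index_bytes(uint64_t nrows, uint64_t ndistinct, uint64_t keysize)
{
    return ndistinct * (demo_maxalign(DEMO_TUPLE_HEADER + keysize)
                        + DEMO_LINE_POINTER)
        + nrows * DEMO_TID_SIZE;
}
```

For 10,000,000 rows these formulas give about 191 MiB for the old layout and about 57 MiB when all rows share one key, which is the same order of magnitude as the 214 and 65 to 90 reported in the table.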
Next thing to say is that I haven't implemented posting list compression
yet. So there is still potential to decrease size of compressed btree.
I'm almost sure there are still some tiny bugs and missed functions, but
on the whole, the patch is ready for testing.
I'd like to get feedback on testing the patch with some real datasets.
Any bug reports and suggestions are welcome.
Here are a couple of useful queries to inspect the data inside the index
pages:
create extension pageinspect;
select * from bt_metap('idx');
select bt.* from generate_series(1,1) as n, lateral bt_page_stats('idx', n) as bt;
select n, bt.* from generate_series(1,1) as n, lateral bt_page_items('idx', n) as bt;
And at last, the list of items I'm going to complete in the near future:
1. Add storage_parameter 'enable_compression' for btree access method
which specifies whether the index handles duplicates. default is 'off'
2. Bring back microvacuum functionality for compressed indexes.
3. Improve insertion speed. Insertions became significantly slower with
compressed btree, which is obviously not what we want.
4. Clean the code and comments, add related documentation.
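The posting-list idea described in this message (one key stored once, followed by the TIDs of all rows that share it) can be illustrated by grouping sorted input into runs of equal keys. The following is a simplified sketch and not the patch's nbtsort.c code: keys and TIDs are plain integers, the cap is a count of TIDs rather than a byte size, and group_duplicates and emit_fn are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical callback: receives one key and the TIDs that share it. */
typedef void (*emit_fn)(int key, const int *tids, size_t ntids, void *arg);

/*
 * Walk keys[] (sorted ascending, parallel to tids[]) and emit one group per
 * run of equal keys.  A run longer than max_per_posting is cut into several
 * groups, mirroring a size limit on a single posting tuple.
 * Returns the number of groups emitted.
 */
static size_t
group_duplicates(const int *keys, const int *tids, size_t n,
                 size_t max_per_posting, emit_fn emit, void *arg)
{
    size_t start = 0;
    size_t emitted = 0;

    if (max_per_posting == 0)
        return 0;               /* nothing can be stored */

    while (start < n)
    {
        size_t end = start + 1;

        /* Extend the run while the key repeats and the cap allows it. */
        while (end < n &&
               keys[end] == keys[start] &&
               end - start < max_per_posting)
            end++;

        if (emit != NULL)
            emit(keys[start], tids + start, end - start, arg);
        emitted++;
        start = end;
    }
    return emitted;
}
```

A group of size one corresponds to an ordinary index tuple, and a larger group corresponds to a posting tuple. The cap is what keeps a very common key from producing a tuple that no longer fits on a page.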
This doesn't apply cleanly against current git head. Have you caught up
past commit 65c5fcd35?
Thom
28.01.2016 18:12, Thom Brown:
This doesn't apply cleanly against current git head. Have you caught
up past commit 65c5fcd35?
Thank you for the notice. New patch is attached.
--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachments:
btree_compression_1.0(rebased).patch (text/x-patch)
diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index 9673fe0..0c8e4fb 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -495,7 +495,7 @@ pgss_shmem_startup(void)
info.hash = pgss_hash_fn;
info.match = pgss_match_fn;
pgss_hash = ShmemInitHash("pg_stat_statements hash",
- pgss_max, pgss_max,
+ pgss_max,
&info,
HASH_ELEM | HASH_FUNCTION | HASH_COMPARE);
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index e3c55eb..3908cc1 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -24,6 +24,7 @@
#include "storage/predicate.h"
#include "utils/tqual.h"
+#include "catalog/catalog.h"
typedef struct
{
@@ -60,7 +61,8 @@ static void _bt_findinsertloc(Relation rel,
ScanKey scankey,
IndexTuple newtup,
BTStack stack,
- Relation heapRel);
+ Relation heapRel,
+ bool *updposting);
static void _bt_insertonpg(Relation rel, Buffer buf, Buffer cbuf,
BTStack stack,
IndexTuple itup,
@@ -113,6 +115,7 @@ _bt_doinsert(Relation rel, IndexTuple itup,
BTStack stack;
Buffer buf;
OffsetNumber offset;
+ bool updposting = false;
/* we need an insertion scan key to do our search, so build one */
itup_scankey = _bt_mkscankey(rel, itup);
@@ -162,8 +165,9 @@ top:
{
TransactionId xwait;
uint32 speculativeToken;
+ bool fakeupdposting = false; /* Never update posting in unique index */
- offset = _bt_binsrch(rel, buf, natts, itup_scankey, false);
+ offset = _bt_binsrch(rel, buf, natts, itup_scankey, false, &fakeupdposting);
xwait = _bt_check_unique(rel, itup, heapRel, buf, offset, itup_scankey,
checkUnique, &is_unique, &speculativeToken);
@@ -200,8 +204,54 @@ top:
CheckForSerializableConflictIn(rel, NULL, buf);
/* do the insertion */
_bt_findinsertloc(rel, &buf, &offset, natts, itup_scankey, itup,
- stack, heapRel);
- _bt_insertonpg(rel, buf, InvalidBuffer, stack, itup, offset, false);
+ stack, heapRel, &updposting);
+
+ if (IsSystemRelation(rel))
+ updposting = false;
+
+ /*
+ * New tuple has the same key with tuple at the page.
+ * Unite them into one posting.
+ */
+ if (updposting)
+ {
+ Page page;
+ IndexTuple olditup, newitup;
+ ItemPointerData *ipd;
+ int nipd;
+
+ page = BufferGetPage(buf);
+ olditup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offset));
+
+ if (BtreeTupleIsPosting(olditup))
+ nipd = BtreeGetNPosting(olditup);
+ else
+ nipd = 1;
+
+ ipd = palloc0(sizeof(ItemPointerData)*(nipd + 1));
+ /* copy item pointers from old tuple into ipd */
+ if (BtreeTupleIsPosting(olditup))
+ memcpy(ipd, BtreeGetPosting(olditup), sizeof(ItemPointerData)*nipd);
+ else
+ memcpy(ipd, olditup, sizeof(ItemPointerData));
+
+ /* add item pointer of the new tuple into ipd */
+ memcpy(ipd+nipd, itup, sizeof(ItemPointerData));
+
+ /*
+ * Form posting tuple, then delete old tuple and insert posting tuple.
+ */
+ newitup = BtreeReformPackedTuple(itup, ipd, nipd+1);
+ PageIndexTupleDelete(page, offset);
+ _bt_insertonpg(rel, buf, InvalidBuffer, stack, newitup, offset, false);
+
+ pfree(ipd);
+ pfree(newitup);
+ }
+ else
+ {
+ _bt_insertonpg(rel, buf, InvalidBuffer, stack, itup, offset, false);
+ }
}
else
{
@@ -306,6 +356,8 @@ _bt_check_unique(Relation rel, IndexTuple itup, Relation heapRel,
/* okay, we gotta fetch the heap tuple ... */
curitup = (IndexTuple) PageGetItem(page, curitemid);
+
+ Assert (!BtreeTupleIsPosting(curitup));
htid = curitup->t_tid;
/*
@@ -535,7 +587,8 @@ _bt_findinsertloc(Relation rel,
ScanKey scankey,
IndexTuple newtup,
BTStack stack,
- Relation heapRel)
+ Relation heapRel,
+ bool *updposting)
{
Buffer buf = *bufptr;
Page page = BufferGetPage(buf);
@@ -681,7 +734,7 @@ _bt_findinsertloc(Relation rel,
else if (firstlegaloff != InvalidOffsetNumber && !vacuumed)
newitemoff = firstlegaloff;
else
- newitemoff = _bt_binsrch(rel, buf, keysz, scankey, false);
+ newitemoff = _bt_binsrch(rel, buf, keysz, scankey, false, updposting);
*bufptr = buf;
*offsetptr = newitemoff;
@@ -1042,6 +1095,9 @@ _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
itemid = PageGetItemId(origpage, P_HIKEY);
itemsz = ItemIdGetLength(itemid);
item = (IndexTuple) PageGetItem(origpage, itemid);
+
+ Assert(!BtreeTupleIsPosting(item));
+
if (PageAddItem(rightpage, (Item) item, itemsz, rightoff,
false, false) == InvalidOffsetNumber)
{
@@ -1072,13 +1128,40 @@ _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
itemsz = ItemIdGetLength(itemid);
item = (IndexTuple) PageGetItem(origpage, itemid);
}
- if (PageAddItem(leftpage, (Item) item, itemsz, leftoff,
+
+ if (BtreeTupleIsPosting(item))
+ {
+ Size hikeysize = BtreeGetPostingOffset(item);
+ IndexTuple hikey = palloc0(hikeysize);
+ /*
+ * Truncate posting before insert it as a hikey.
+ */
+ memcpy (hikey, item, hikeysize);
+ hikey->t_info &= ~INDEX_SIZE_MASK;
+ hikey->t_info |= hikeysize;
+ ItemPointerSet(&(hikey->t_tid), origpagenumber, P_HIKEY);
+
+ if (PageAddItem(leftpage, (Item) hikey, hikeysize, leftoff,
false, false) == InvalidOffsetNumber)
+ {
+ memset(rightpage, 0, BufferGetPageSize(rbuf));
+ elog(ERROR, "failed to add hikey to the left sibling"
+ " while splitting block %u of index \"%s\"",
+ origpagenumber, RelationGetRelationName(rel));
+ }
+
+ pfree(hikey);
+ }
+ else
{
- memset(rightpage, 0, BufferGetPageSize(rbuf));
- elog(ERROR, "failed to add hikey to the left sibling"
- " while splitting block %u of index \"%s\"",
- origpagenumber, RelationGetRelationName(rel));
+ if (PageAddItem(leftpage, (Item) item, itemsz, leftoff,
+ false, false) == InvalidOffsetNumber)
+ {
+ memset(rightpage, 0, BufferGetPageSize(rbuf));
+ elog(ERROR, "failed to add hikey to the left sibling"
+ " while splitting block %u of index \"%s\"",
+ origpagenumber, RelationGetRelationName(rel));
+ }
}
leftoff = OffsetNumberNext(leftoff);
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index f2905cb..f56c90f 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -75,6 +75,9 @@ static void btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
static void btvacuumpage(BTVacState *vstate, BlockNumber blkno,
BlockNumber orig_blkno);
+static ItemPointer
+btreevacuumPosting(BTVacState *vstate, ItemPointerData *items,
+ int nitem, int *nremaining);
/*
* Btree handler function: return IndexAmRoutine with access method parameters
@@ -962,6 +965,7 @@ restart:
OffsetNumber offnum,
minoff,
maxoff;
+ IndexTupleData *remaining;
/*
* Trade in the initial read lock for a super-exclusive write lock on
@@ -1011,31 +1015,62 @@ restart:
itup = (IndexTuple) PageGetItem(page,
PageGetItemId(page, offnum));
- htup = &(itup->t_tid);
-
- /*
- * During Hot Standby we currently assume that
- * XLOG_BTREE_VACUUM records do not produce conflicts. That is
- * only true as long as the callback function depends only
- * upon whether the index tuple refers to heap tuples removed
- * in the initial heap scan. When vacuum starts it derives a
- * value of OldestXmin. Backends taking later snapshots could
- * have a RecentGlobalXmin with a later xid than the vacuum's
- * OldestXmin, so it is possible that row versions deleted
- * after OldestXmin could be marked as killed by other
- * backends. The callback function *could* look at the index
- * tuple state in isolation and decide to delete the index
- * tuple, though currently it does not. If it ever did, we
- * would need to reconsider whether XLOG_BTREE_VACUUM records
- * should cause conflicts. If they did cause conflicts they
- * would be fairly harsh conflicts, since we haven't yet
- * worked out a way to pass a useful value for
- * latestRemovedXid on the XLOG_BTREE_VACUUM records. This
- * applies to *any* type of index that marks index tuples as
- * killed.
- */
- if (callback(htup, callback_state))
- deletable[ndeletable++] = offnum;
+ if(BtreeTupleIsPosting(itup))
+ {
+ int nipd, nnewipd;
+ ItemPointer newipd;
+
+ nipd = BtreeGetNPosting(itup);
+ newipd = btreevacuumPosting(vstate, BtreeGetPosting(itup), nipd, &nnewipd);
+
+ if (newipd != NULL)
+ {
+ if (nnewipd > 0)
+ {
+ /* There are still some live tuples in the posting.
+ * 1) form new posting tuple, that contains remaining ipds
+ * 2) delete "old" posting
+ * 3) insert new posting back to the page
+ */
+ remaining = BtreeReformPackedTuple(itup, newipd, nnewipd);
+ PageIndexTupleDelete(page, offnum);
+
+ if (PageAddItem(page, (Item) remaining, IndexTupleSize(remaining), offnum, false, false) != offnum)
+ elog(ERROR, "failed to add vacuumed posting tuple to index page in \"%s\"",
+ RelationGetRelationName(info->index));
+ }
+ else
+ deletable[ndeletable++] = offnum;
+ }
+ }
+ else
+ {
+ htup = &(itup->t_tid);
+
+ /*
+ * During Hot Standby we currently assume that
+ * XLOG_BTREE_VACUUM records do not produce conflicts. That is
+ * only true as long as the callback function depends only
+ * upon whether the index tuple refers to heap tuples removed
+ * in the initial heap scan. When vacuum starts it derives a
+ * value of OldestXmin. Backends taking later snapshots could
+ * have a RecentGlobalXmin with a later xid than the vacuum's
+ * OldestXmin, so it is possible that row versions deleted
+ * after OldestXmin could be marked as killed by other
+ * backends. The callback function *could* look at the index
+ * tuple state in isolation and decide to delete the index
+ * tuple, though currently it does not. If it ever did, we
+ * would need to reconsider whether XLOG_BTREE_VACUUM records
+ * should cause conflicts. If they did cause conflicts they
+ * would be fairly harsh conflicts, since we haven't yet
+ * worked out a way to pass a useful value for
+ * latestRemovedXid on the XLOG_BTREE_VACUUM records. This
+ * applies to *any* type of index that marks index tuples as
+ * killed.
+ */
+ if (callback(htup, callback_state))
+ deletable[ndeletable++] = offnum;
+ }
}
}
@@ -1160,3 +1195,51 @@ btcanreturn(Relation index, int attno)
{
return true;
}
+
+
+/*
+ * Vacuums a posting list. The size of the list must be specified
+ * via number of items (nitems).
+ *
+ * If none of the items need to be removed, returns NULL. Otherwise returns
+ * a new palloc'd array with the remaining items. The number of remaining
+ * items is returned via nremaining.
+ */
+ItemPointer
+btreevacuumPosting(BTVacState *vstate, ItemPointerData *items,
+ int nitem, int *nremaining)
+{
+ int i,
+ remaining = 0;
+ ItemPointer tmpitems = NULL;
+ IndexBulkDeleteCallback callback = vstate->callback;
+ void *callback_state = vstate->callback_state;
+
+ /*
+ * Iterate over TIDs array
+ */
+ for (i = 0; i < nitem; i++)
+ {
+ if (callback(items + i, callback_state))
+ {
+ if (!tmpitems)
+ {
+ /*
+ * First TID to be deleted: allocate memory to hold the
+ * remaining items.
+ */
+ tmpitems = palloc(sizeof(ItemPointerData) * nitem);
+ memcpy(tmpitems, items, sizeof(ItemPointerData) * i);
+ }
+ }
+ else
+ {
+ if (tmpitems)
+ tmpitems[remaining] = items[i];
+ remaining++;
+ }
+ }
+
+ *nremaining = remaining;
+ return tmpitems;
+}
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 3db32e8..0428f04 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -29,6 +29,8 @@ static bool _bt_readpage(IndexScanDesc scan, ScanDirection dir,
OffsetNumber offnum);
static void _bt_saveitem(BTScanOpaque so, int itemIndex,
OffsetNumber offnum, IndexTuple itup);
+static void _bt_savePostingitem(BTScanOpaque so, int itemIndex,
+ OffsetNumber offnum, ItemPointer iptr, IndexTuple itup, int i);
static bool _bt_steppage(IndexScanDesc scan, ScanDirection dir);
static Buffer _bt_walk_left(Relation rel, Buffer buf);
static bool _bt_endpoint(IndexScanDesc scan, ScanDirection dir);
@@ -90,6 +92,7 @@ _bt_search(Relation rel, int keysz, ScanKey scankey, bool nextkey,
Buffer *bufP, int access)
{
BTStack stack_in = NULL;
+ bool fakeupdposting = false; /* fake variable for _bt_binsrch */
/* Get the root page to start with */
*bufP = _bt_getroot(rel, access);
@@ -136,7 +139,7 @@ _bt_search(Relation rel, int keysz, ScanKey scankey, bool nextkey,
* Find the appropriate item on the internal page, and get the child
* page that it points to.
*/
- offnum = _bt_binsrch(rel, *bufP, keysz, scankey, nextkey);
+ offnum = _bt_binsrch(rel, *bufP, keysz, scankey, nextkey, &fakeupdposting);
itemid = PageGetItemId(page, offnum);
itup = (IndexTuple) PageGetItem(page, itemid);
blkno = ItemPointerGetBlockNumber(&(itup->t_tid));
@@ -310,7 +313,8 @@ _bt_binsrch(Relation rel,
Buffer buf,
int keysz,
ScanKey scankey,
- bool nextkey)
+ bool nextkey,
+ bool *updposing)
{
Page page;
BTPageOpaque opaque;
@@ -373,7 +377,17 @@ _bt_binsrch(Relation rel,
* scan key), which could be the last slot + 1.
*/
if (P_ISLEAF(opaque))
+ {
+ if (low <= PageGetMaxOffsetNumber(page))
+ {
+ IndexTuple oitup = (IndexTuple) PageGetItem(page, PageGetItemId(page, low));
+ /* one excessive check of equality. for possible posting tuple update or creation */
+ if ((_bt_compare(rel, keysz, scankey, page, low) == 0)
+ && (IndexTupleSize(oitup) + sizeof(ItemPointerData) < BTMaxItemSize(page)))
+ *updposing = true;
+ }
return low;
+ }
/*
* On a non-leaf page, return the last key < scan key (resp. <= scan key).
@@ -536,6 +550,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
int i;
StrategyNumber strat_total;
BTScanPosItem *currItem;
+ bool fakeupdposing = false; /* fake variable for _bt_binsrch */
Assert(!BTScanPosIsValid(so->currPos));
@@ -1003,7 +1018,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
so->markItemIndex = -1; /* ditto */
/* position to the precise item on the page */
- offnum = _bt_binsrch(rel, buf, keysCount, scankeys, nextkey);
+ offnum = _bt_binsrch(rel, buf, keysCount, scankeys, nextkey, &fakeupdposing);
/*
* If nextkey = false, we are positioned at the first item >= scan key, or
@@ -1161,6 +1176,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
int itemIndex;
IndexTuple itup;
bool continuescan;
+ int i;
/*
* We must have the buffer pinned and locked, but the usual macro can't be
@@ -1195,6 +1211,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
/* initialize tuple workspace to empty */
so->currPos.nextTupleOffset = 0;
+ so->currPos.prevTupleOffset = 0;
/*
* Now that the current page has been made consistent, the macro should be
@@ -1215,8 +1232,19 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
if (itup != NULL)
{
/* tuple passes all scan key conditions, so remember it */
- _bt_saveitem(so, itemIndex, offnum, itup);
- itemIndex++;
+ if (BtreeTupleIsPosting(itup))
+ {
+ for (i = 0; i < BtreeGetNPosting(itup); i++)
+ {
+ _bt_savePostingitem(so, itemIndex, offnum, BtreeGetPostingN(itup, i), itup, i);
+ itemIndex++;
+ }
+ }
+ else
+ {
+ _bt_saveitem(so, itemIndex, offnum, itup);
+ itemIndex++;
+ }
}
if (!continuescan)
{
@@ -1228,7 +1256,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
offnum = OffsetNumberNext(offnum);
}
- Assert(itemIndex <= MaxIndexTuplesPerPage);
+ Assert(itemIndex <= MaxPackedIndexTuplesPerPage);
so->currPos.firstItem = 0;
so->currPos.lastItem = itemIndex - 1;
so->currPos.itemIndex = 0;
@@ -1236,7 +1264,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
else
{
/* load items[] in descending order */
- itemIndex = MaxIndexTuplesPerPage;
+ itemIndex = MaxPackedIndexTuplesPerPage;
offnum = Min(offnum, maxoff);
@@ -1246,8 +1274,20 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
if (itup != NULL)
{
/* tuple passes all scan key conditions, so remember it */
- itemIndex--;
- _bt_saveitem(so, itemIndex, offnum, itup);
+ if (BtreeTupleIsPosting(itup))
+ {
+ for (i = 0; i < BtreeGetNPosting(itup); i++)
+ {
+ itemIndex--;
+ _bt_savePostingitem(so, itemIndex, offnum, BtreeGetPostingN(itup, i), itup, i);
+ }
+ }
+ else
+ {
+ itemIndex--;
+ _bt_saveitem(so, itemIndex, offnum, itup);
+ }
+
}
if (!continuescan)
{
@@ -1261,8 +1301,8 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
Assert(itemIndex >= 0);
so->currPos.firstItem = itemIndex;
- so->currPos.lastItem = MaxIndexTuplesPerPage - 1;
- so->currPos.itemIndex = MaxIndexTuplesPerPage - 1;
+ so->currPos.lastItem = MaxPackedIndexTuplesPerPage - 1;
+ so->currPos.itemIndex = MaxPackedIndexTuplesPerPage - 1;
}
return (so->currPos.firstItem <= so->currPos.lastItem);
@@ -1275,6 +1315,8 @@ _bt_saveitem(BTScanOpaque so, int itemIndex,
{
BTScanPosItem *currItem = &so->currPos.items[itemIndex];
+ Assert (!BtreeTupleIsPosting(itup));
+
currItem->heapTid = itup->t_tid;
currItem->indexOffset = offnum;
if (so->currTuples)
@@ -1288,6 +1330,37 @@ _bt_saveitem(BTScanOpaque so, int itemIndex,
}
/*
+ * Save an index item into so->currPos.items[itemIndex]
+ * Performing index-only scan, handle the first elem separately.
+ * Save the key once, and connect it with posting tids using tupleOffset.
+ */
+static void
+_bt_savePostingitem(BTScanOpaque so, int itemIndex,
+ OffsetNumber offnum, ItemPointer iptr, IndexTuple itup, int i)
+{
+ BTScanPosItem *currItem = &so->currPos.items[itemIndex];
+
+ currItem->heapTid = *iptr;
+ currItem->indexOffset = offnum;
+
+ if (so->currTuples)
+ {
+ if (i == 0)
+ {
+ /* save the key; it is the same for all TIDs in the posting list */
+ Size itupsz = BtreeGetPostingOffset(itup);
+ currItem->tupleOffset = so->currPos.nextTupleOffset;
+ memcpy(so->currTuples + so->currPos.nextTupleOffset, itup, itupsz);
+ so->currPos.nextTupleOffset += MAXALIGN(itupsz);
+ so->currPos.prevTupleOffset = currItem->tupleOffset;
+ }
+ else
+ currItem->tupleOffset = so->currPos.prevTupleOffset;
+ }
+}
+
+
+/*
* _bt_steppage() -- Step to next page containing valid data for scan
*
* On entry, if so->currPos.buf is valid the buffer is pinned but not locked;
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 99a014e..e29d63f 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -75,6 +75,7 @@
#include "utils/rel.h"
#include "utils/sortsupport.h"
#include "utils/tuplesort.h"
+#include "catalog/catalog.h"
/*
@@ -527,15 +528,120 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
Assert(last_off > P_FIRSTKEY);
ii = PageGetItemId(opage, last_off);
oitup = (IndexTuple) PageGetItem(opage, ii);
- _bt_sortaddtup(npage, ItemIdGetLength(ii), oitup, P_FIRSTKEY);
/*
- * Move 'last' into the high key position on opage
+ * If the item is a posting tuple, we can truncate it.
+ * The high key is not real data, so it doesn't need to keep any ItemPointerData at all,
+ * and it certainly doesn't need to keep a posting list (ipd).
+ * But if the tuple had a big posting list, moving it whole would leave plenty of free space on opage.
+ * So in that case we must split the posting tuple into two pieces.
*/
- hii = PageGetItemId(opage, P_HIKEY);
- *hii = *ii;
- ItemIdSetUnused(ii); /* redundant */
- ((PageHeader) opage)->pd_lower -= sizeof(ItemIdData);
+ if (BtreeTupleIsPosting(oitup))
+ {
+ int nipd, ntocut, ntoleave;
+ Size keytupsz;
+ IndexTuple keytup;
+ nipd = BtreeGetNPosting(oitup);
+ ntocut = (sizeof(ItemIdData) + BtreeGetPostingOffset(oitup))/sizeof(ItemPointerData);
+ ntocut++; /* round up to be sure that we cut enough */
+ ntoleave = nipd - ntocut;
+
+ /*
+ * 0) Form a key tuple that doesn't contain any ipd.
+ * NOTE: the key tuple's blkno & offset are set for P_HIKEY;
+ * any function that uses keytup must reset them itself.
+ */
+ keytupsz = BtreeGetPostingOffset(oitup);
+ keytup = palloc0(keytupsz);
+ memcpy (keytup, oitup, keytupsz);
+ keytup->t_info &= ~INDEX_SIZE_MASK;
+ keytup->t_info |= keytupsz;
+ ItemPointerSet(&(keytup->t_tid), oblkno, P_HIKEY);
+
+ if (ntocut < nipd)
+ {
+ ItemPointerData *newipd;
+ IndexTuple newitup, newlasttup;
+ /*
+ * 1) Cut part of old tuple to shift to npage.
+ * And insert it as P_FIRSTKEY.
+ * This tuple is based on keytup.
+ * Blkno & offnum are reset in BtreeFormPackedTuple.
+ */
+ newipd = palloc0(sizeof(ItemPointerData)*ntocut);
+ /* Note that we cut the last 'ntocut' items */
+ memcpy(newipd, BtreeGetPosting(oitup)+ntoleave, sizeof(ItemPointerData)*ntocut);
+ newitup = BtreeFormPackedTuple(keytup, newipd, ntocut);
+
+ _bt_sortaddtup(npage, IndexTupleSize(newitup), newitup, P_FIRSTKEY);
+ pfree(newipd);
+ pfree(newitup);
+
+ /*
+ * 2) set last item to the P_HIKEY linp
+ * Move 'last' into the high key position on opage
+ * NOTE: this is needed because the index tuple deletion algorithm
+ * cannot delete an item while there is an unused item before it.
+ */
+ hii = PageGetItemId(opage, P_HIKEY);
+ *hii = *ii;
+ ItemIdSetUnused(ii); /* redundant */
+ ((PageHeader) opage)->pd_lower -= sizeof(ItemIdData);
+
+ /* 3) delete "wrong" high key */
+ PageIndexTupleDelete(opage, P_HIKEY);
+
+ /* 4) Insert keytup as P_HIKEY. */
+ _bt_sortaddtup(opage, IndexTupleSize(keytup), keytup, P_HIKEY);
+
+ /* 5) Form the remaining part of the old tuple, with ntoleave ipds, and insert it as the last tuple. */
+ newlasttup = BtreeFormPackedTuple(keytup, BtreeGetPosting(oitup), ntoleave);
+
+ _bt_sortaddtup(opage, IndexTupleSize(newlasttup), newlasttup, PageGetMaxOffsetNumber(opage)+1);
+
+ pfree(newlasttup);
+ }
+ else
+ {
+ /* The posting list is too small to be worth splitting. Handle the tuple as a normal one. */
+
+ /*
+ * 1) Shift the last tuple to npage.
+ * Insert it as P_FIRSTKEY.
+ */
+ _bt_sortaddtup(npage, ItemIdGetLength(ii), oitup, P_FIRSTKEY);
+
+ /* 2) set last item to the P_HIKEY linp */
+ /* Move 'last' into the high key position on opage */
+ hii = PageGetItemId(opage, P_HIKEY);
+ *hii = *ii;
+ ItemIdSetUnused(ii); /* redundant */
+ ((PageHeader) opage)->pd_lower -= sizeof(ItemIdData);
+
+ /* 3) delete "wrong" high key */
+ PageIndexTupleDelete(opage, P_HIKEY);
+
+ /* 4) Insert keytup as P_HIKEY. */
+ _bt_sortaddtup(opage, IndexTupleSize(keytup), keytup, P_HIKEY);
+
+ }
+ pfree(keytup);
+ }
+ else
+ {
+ /*
+ * 1) Shift the last tuple to npage.
+ * Insert it as P_FIRSTKEY.
+ */
+ _bt_sortaddtup(npage, ItemIdGetLength(ii), oitup, P_FIRSTKEY);
+
+ /* 2) set last item to the P_HIKEY linp */
+ /* Move 'last' into the high key position on opage */
+ hii = PageGetItemId(opage, P_HIKEY);
+ *hii = *ii;
+ ItemIdSetUnused(ii); /* redundant */
+ ((PageHeader) opage)->pd_lower -= sizeof(ItemIdData);
+ }
/*
* Link the old page into its parent, using its minimum key. If we
@@ -547,6 +653,7 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
Assert(state->btps_minkey != NULL);
ItemPointerSet(&(state->btps_minkey->t_tid), oblkno, P_HIKEY);
+
_bt_buildadd(wstate, state->btps_next, state->btps_minkey);
pfree(state->btps_minkey);
@@ -555,7 +662,9 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
* it off the old page, not the new one, in case we are not at leaf
* level.
*/
- state->btps_minkey = CopyIndexTuple(oitup);
+ ItemId iihk = PageGetItemId(opage, P_HIKEY);
+ IndexTuple hikey = (IndexTuple) PageGetItem(opage, iihk);
+ state->btps_minkey = CopyIndexTuple(hikey);
/*
* Set the sibling links for both pages.
@@ -590,7 +699,29 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
if (last_off == P_HIKEY)
{
Assert(state->btps_minkey == NULL);
- state->btps_minkey = CopyIndexTuple(itup);
+
+ if (BtreeTupleIsPosting(itup))
+ {
+ Size keytupsz;
+ IndexTuple keytup;
+
+ /*
+ * 0) Form a key tuple that doesn't contain any ipd.
+ * NOTE: the key tuple's blkno & offset are set for P_HIKEY;
+ * any function that uses keytup must reset them itself.
+ */
+ keytupsz = BtreeGetPostingOffset(itup);
+ keytup = palloc0(keytupsz);
+ memcpy (keytup, itup, keytupsz);
+
+ keytup->t_info &= ~INDEX_SIZE_MASK;
+ keytup->t_info |= keytupsz;
+ ItemPointerSet(&(keytup->t_tid), nblkno, P_HIKEY);
+
+ state->btps_minkey = CopyIndexTuple(keytup);
+ }
+ else
+ state->btps_minkey = CopyIndexTuple(itup);
}
/*
@@ -670,6 +801,67 @@ _bt_uppershutdown(BTWriteState *wstate, BTPageState *state)
}
/*
+ * Prepare SortSupport structure for indextuples comparison
+ */
+SortSupport
+_bt_prepare_SortSupport(BTWriteState *wstate, int keysz)
+{
+ /* Prepare SortSupport data for each column */
+ ScanKey indexScanKey = _bt_mkscankey_nodata(wstate->index);
+ SortSupport sortKeys = (SortSupport) palloc0(keysz * sizeof(SortSupportData));
+ int i;
+
+ for (i = 0; i < keysz; i++)
+ {
+ SortSupport sortKey = sortKeys + i;
+ ScanKey scanKey = indexScanKey + i;
+ int16 strategy;
+
+ sortKey->ssup_cxt = CurrentMemoryContext;
+ sortKey->ssup_collation = scanKey->sk_collation;
+ sortKey->ssup_nulls_first =
+ (scanKey->sk_flags & SK_BT_NULLS_FIRST) != 0;
+ sortKey->ssup_attno = scanKey->sk_attno;
+ /* Abbreviation is not supported here */
+ sortKey->abbreviate = false;
+
+ AssertState(sortKey->ssup_attno != 0);
+
+ strategy = (scanKey->sk_flags & SK_BT_DESC) != 0 ?
+ BTGreaterStrategyNumber : BTLessStrategyNumber;
+
+ PrepareSortSupportFromIndexRel(wstate->index, strategy, sortKey);
+ }
+
+ _bt_freeskey(indexScanKey);
+ return sortKeys;
+}
+
+/*
+ * Compare two tuples using sortKey i
+ */
+int _bt_call_comparator(SortSupport sortKeys, int i,
+ IndexTuple itup, IndexTuple itup2, TupleDesc tupdes)
+{
+ SortSupport entry;
+ Datum attrDatum1,
+ attrDatum2;
+ bool isNull1,
+ isNull2;
+ int32 compare;
+
+ entry = sortKeys + i - 1;
+ attrDatum1 = index_getattr(itup, i, tupdes, &isNull1);
+ attrDatum2 = index_getattr(itup2, i, tupdes, &isNull2);
+
+ compare = ApplySortComparator(attrDatum1, isNull1,
+ attrDatum2, isNull2,
+ entry);
+
+ return compare;
+}
+
+/*
* Read tuples in correct sort order from tuplesort, and load them into
* btree leaves.
*/
@@ -679,16 +871,20 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
BTPageState *state = NULL;
bool merge = (btspool2 != NULL);
IndexTuple itup,
- itup2 = NULL;
+ itup2 = NULL,
+ itupprev = NULL;
bool should_free,
should_free2,
load1;
TupleDesc tupdes = RelationGetDescr(wstate->index);
int i,
keysz = RelationGetNumberOfAttributes(wstate->index);
- ScanKey indexScanKey = NULL;
+ int ntuples = 0;
SortSupport sortKeys;
+ /* Prepare SortSupport data */
+ sortKeys = (SortSupport)_bt_prepare_SortSupport(wstate, keysz);
+
if (merge)
{
/*
@@ -701,34 +897,6 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
true, &should_free);
itup2 = tuplesort_getindextuple(btspool2->sortstate,
true, &should_free2);
- indexScanKey = _bt_mkscankey_nodata(wstate->index);
-
- /* Prepare SortSupport data for each column */
- sortKeys = (SortSupport) palloc0(keysz * sizeof(SortSupportData));
-
- for (i = 0; i < keysz; i++)
- {
- SortSupport sortKey = sortKeys + i;
- ScanKey scanKey = indexScanKey + i;
- int16 strategy;
-
- sortKey->ssup_cxt = CurrentMemoryContext;
- sortKey->ssup_collation = scanKey->sk_collation;
- sortKey->ssup_nulls_first =
- (scanKey->sk_flags & SK_BT_NULLS_FIRST) != 0;
- sortKey->ssup_attno = scanKey->sk_attno;
- /* Abbreviation is not supported here */
- sortKey->abbreviate = false;
-
- AssertState(sortKey->ssup_attno != 0);
-
- strategy = (scanKey->sk_flags & SK_BT_DESC) != 0 ?
- BTGreaterStrategyNumber : BTLessStrategyNumber;
-
- PrepareSortSupportFromIndexRel(wstate->index, strategy, sortKey);
- }
-
- _bt_freeskey(indexScanKey);
for (;;)
{
@@ -742,20 +910,8 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
{
for (i = 1; i <= keysz; i++)
{
- SortSupport entry;
- Datum attrDatum1,
- attrDatum2;
- bool isNull1,
- isNull2;
- int32 compare;
-
- entry = sortKeys + i - 1;
- attrDatum1 = index_getattr(itup, i, tupdes, &isNull1);
- attrDatum2 = index_getattr(itup2, i, tupdes, &isNull2);
-
- compare = ApplySortComparator(attrDatum1, isNull1,
- attrDatum2, isNull2,
- entry);
+ int32 compare = _bt_call_comparator(sortKeys, i, itup, itup2, tupdes);
+
if (compare > 0)
{
load1 = false;
@@ -794,19 +950,137 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
else
{
/* merge is unnecessary */
- while ((itup = tuplesort_getindextuple(btspool->sortstate,
+
+ Relation indexRelation = wstate->index;
+ Form_pg_index index = indexRelation->rd_index;
+
+ if (index->indisunique)
+ {
+ /* Do not use compression for unique indexes. */
+ while ((itup = tuplesort_getindextuple(btspool->sortstate,
true, &should_free)) != NULL)
+ {
+ /* When we see first tuple, create first index page */
+ if (state == NULL)
+ state = _bt_pagestate(wstate, 0);
+
+ _bt_buildadd(wstate, state, itup);
+ if (should_free)
+ pfree(itup);
+ }
+ }
+ else
{
- /* When we see first tuple, create first index page */
- if (state == NULL)
- state = _bt_pagestate(wstate, 0);
+ ItemPointerData *ipd = NULL;
+ IndexTuple postingtuple;
+ Size maxitemsize = 0,
+ maxpostingsize = 0;
+ int32 compare = 0;
- _bt_buildadd(wstate, state, itup);
- if (should_free)
- pfree(itup);
+ while ((itup = tuplesort_getindextuple(btspool->sortstate,
+ true, &should_free)) != NULL)
+ {
+ /* When we see first tuple, create first index page */
+ if (state == NULL)
+ {
+ state = _bt_pagestate(wstate, 0);
+ maxitemsize = BTMaxItemSize(state->btps_page);
+ }
+
+ /*
+ * Compare current tuple with previous one.
+ * If tuples are equal, we can unite them into a posting list.
+ */
+ if (itupprev != NULL)
+ {
+ /* compare tuples */
+ compare = 0;
+ for (i = 1; i <= keysz; i++)
+ {
+ compare = _bt_call_comparator(sortKeys, i, itup, itupprev, tupdes);
+ if (compare != 0)
+ break;
+ }
+
+ if (compare == 0)
+ {
+ /* Tuples are equal. Create or update posting */
+ if (ntuples == 0)
+ {
+ /*
+ * We don't have a posting list yet, so allocate one
+ * and save the TIDs of both itupprev and the current tuple.
+ */
+
+ ipd = palloc0(maxitemsize);
+
+ memcpy(ipd, itupprev, sizeof(ItemPointerData));
+ ntuples++;
+ memcpy(ipd + ntuples, itup, sizeof(ItemPointerData));
+ ntuples++;
+ }
+ else
+ {
+ if ((ntuples+1)*sizeof(ItemPointerData) < maxpostingsize)
+ {
+ memcpy(ipd + ntuples, itup, sizeof(ItemPointerData));
+ ntuples++;
+ }
+ else
+ {
+ postingtuple = BtreeFormPackedTuple(itupprev, ipd, ntuples);
+ _bt_buildadd(wstate, state, postingtuple);
+ ntuples = 0;
+ pfree(ipd);
+ }
+ }
+
+ }
+ else
+ {
+ /* Tuples aren't equal. Insert itupprev into index. */
+ if (ntuples == 0)
+ _bt_buildadd(wstate, state, itupprev);
+ else
+ {
+ postingtuple = BtreeFormPackedTuple(itupprev, ipd, ntuples);
+ _bt_buildadd(wstate, state, postingtuple);
+ ntuples = 0;
+ pfree(ipd);
+ }
+ }
+ }
+
+ /*
+ * Copy the tuple into temp variable itupprev
+ * to compare it with the following tuple
+ * and maybe unite them into a posting tuple
+ */
+ itupprev = CopyIndexTuple(itup);
+ if (should_free)
+ pfree(itup);
+
+ /* compute max size of ipd list */
+ maxpostingsize = maxitemsize - IndexInfoFindDataOffset(itupprev->t_info) - MAXALIGN(IndexTupleSize(itupprev));
+ }
+
+ /* Handle the last item. */
+ if (ntuples == 0)
+ {
+ if (itupprev != NULL)
+ _bt_buildadd(wstate, state, itupprev);
+ }
+ else
+ {
+ Assert(ipd!=NULL);
+ Assert(itupprev != NULL);
+ postingtuple = BtreeFormPackedTuple(itupprev, ipd, ntuples);
+ _bt_buildadd(wstate, state, postingtuple);
+ ntuples = 0;
+ pfree(ipd);
+ }
}
}
-
/* Close down final pages and write the metapage */
_bt_uppershutdown(wstate, state);
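
To make the intent of the non-unique branch of _bt_load() easier to review, here is a minimal standalone sketch of the grouping step: consecutive equal keys from the sorted stream are united into one posting group until a size cap is hit. SortedEntry, PostingGroup and group_duplicates are hypothetical names used only for illustration, the cap is counted in TIDs rather than bytes, and the loop is structured differently from the patch (no itupprev copy), so treat it as a sketch of the idea, not of the actual code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for sorted index tuples; not the patch's types. */
typedef struct { int key; unsigned tid; } SortedEntry;
typedef struct { int key; size_t first; size_t ntids; } PostingGroup;

/*
 * Unite consecutive equal keys of a sorted input into posting groups of at
 * most max_tids TIDs each.  'out' must have room for n groups.  Returns the
 * number of groups produced; group g covers in[first .. first + ntids - 1].
 */
static size_t
group_duplicates(const SortedEntry *in, size_t n, size_t max_tids,
                 PostingGroup *out)
{
    size_t ngroups = 0;
    size_t i;

    assert(max_tids > 0);

    for (i = 0; i < n; i++)
    {
        PostingGroup *cur = (ngroups > 0) ? &out[ngroups - 1] : NULL;

        if (cur != NULL && cur->key == in[i].key && cur->ntids < max_tids)
            cur->ntids++;       /* same key and room left: extend the group */
        else
        {
            /* new key, or the current posting list is full: start a group */
            out[ngroups].key = in[i].key;
            out[ngroups].first = i;
            out[ngroups].ntids = 1;
            ngroups++;
        }
    }
    return ngroups;
}
```

A key whose duplicates exceed the cap simply produces several groups with the same key, which corresponds to the case where the patch flushes a full posting tuple and starts a new one.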
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index c850b48..0291342 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -1821,7 +1821,9 @@ _bt_killitems(IndexScanDesc scan)
ItemId iid = PageGetItemId(page, offnum);
IndexTuple ituple = (IndexTuple) PageGetItem(page, iid);
- if (ItemPointerEquals(&ituple->t_tid, &kitem->heapTid))
+ /* No microvacuum for posting tuples */
+ if (!BtreeTupleIsPosting(ituple)
+ && (ItemPointerEquals(&ituple->t_tid, &kitem->heapTid)))
{
/* found the item */
ItemIdMarkDead(iid);
@@ -2063,3 +2065,71 @@ btoptions(Datum reloptions, bool validate)
{
return default_reloptions(reloptions, validate, RELOPT_KIND_BTREE);
}
+
+
+/*
+ * Build a posting tuple from a basic index tuple (which already contains the key datum) and an array of nipd heap TIDs.
+ */
+IndexTuple
+BtreeFormPackedTuple(IndexTuple tuple, ItemPointerData *data, int nipd)
+{
+ int i;
+ uint32 newsize;
+ IndexTuple itup = CopyIndexTuple(tuple);
+
+ /*
+ * Determine and store offset to the posting list.
+ */
+ newsize = IndexTupleSize(itup);
+ newsize = SHORTALIGN(newsize);
+
+ /*
+ * Set meta info about the posting list.
+ */
+ BtreeSetPostingOffset(itup, newsize);
+ BtreeSetNPosting(itup, nipd);
+ /*
+ * Add space needed for posting list, if any. Then check that the tuple
+ * won't be too big to store.
+ */
+ newsize += sizeof(ItemPointerData)*nipd;
+ newsize = MAXALIGN(newsize);
+
+ /*
+ * Resize tuple if needed
+ */
+ if (newsize != IndexTupleSize(itup))
+ {
+ itup = repalloc(itup, newsize);
+
+ /*
+ * Zero the newly added space, so that no uninitialized padding
+ * bytes end up in the tuple when it is written to disk.
+ */
+ memset((char *) itup + IndexTupleSize(itup),
+ 0, newsize - IndexTupleSize(itup));
+ /* set new size in tuple header */
+ itup->t_info &= ~INDEX_SIZE_MASK;
+ itup->t_info |= newsize;
+ }
+
+ /*
+ * Copy data into the posting tuple
+ */
+ memcpy(BtreeGetPosting(itup), data, sizeof(ItemPointerData)*nipd);
+ return itup;
+}
+
+IndexTuple
+BtreeReformPackedTuple(IndexTuple tuple, ItemPointerData *data, int nipd)
+{
+ int size;
+ if (BtreeTupleIsPosting(tuple))
+ {
+ size = BtreeGetPostingOffset(tuple);
+ tuple->t_info &= ~INDEX_SIZE_MASK;
+ tuple->t_info |= size;
+ }
+
+ return BtreeFormPackedTuple(tuple, data, nipd);
+}
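
For reviewers, the size arithmetic in BtreeFormPackedTuple() can be checked in isolation with the small sketch below. It assumes a typical 64-bit build where MAXALIGN rounds up to 8 bytes and sizeof(ItemPointerData) is 6; the SKETCH_* constants and the function names are hypothetical and not part of the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed values for a typical 64-bit build; not taken from the patch. */
#define SKETCH_MAXALIGN_OF  8
#define SKETCH_SHORTALIGN   2
#define SKETCH_TID_SIZE     6   /* sizeof(ItemPointerData) */

/* Round len up to the next multiple of boundary. */
static size_t
align_up(size_t len, size_t boundary)
{
    assert(boundary > 0);
    return (len + boundary - 1) / boundary * boundary;
}

/* Offset of the TID array: the key part, short-aligned. */
static size_t
posting_offset(size_t keytupsz)
{
    return align_up(keytupsz, SKETCH_SHORTALIGN);
}

/* Total size of a packed tuple holding nipd TIDs after the key part. */
static size_t
packed_tuple_size(size_t keytupsz, size_t nipd)
{
    return align_up(posting_offset(keytupsz) + SKETCH_TID_SIZE * nipd,
                    SKETCH_MAXALIGN_OF);
}
```

Under these assumptions a 16-byte key tuple with three TIDs occupies 40 bytes, versus three separate 16-byte tuples plus their line pointers in the current format.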
diff --git a/src/backend/storage/buffer/buf_table.c b/src/backend/storage/buffer/buf_table.c
index 39e8baf..dd5acb7 100644
--- a/src/backend/storage/buffer/buf_table.c
+++ b/src/backend/storage/buffer/buf_table.c
@@ -62,7 +62,7 @@ InitBufTable(int size)
info.num_partitions = NUM_BUFFER_PARTITIONS;
SharedBufHash = ShmemInitHash("Shared Buffer Lookup Table",
- size, size,
+ size,
&info,
HASH_ELEM | HASH_BLOBS | HASH_PARTITION);
}
diff --git a/src/backend/storage/ipc/shmem.c b/src/backend/storage/ipc/shmem.c
index 81506ea..4c18701 100644
--- a/src/backend/storage/ipc/shmem.c
+++ b/src/backend/storage/ipc/shmem.c
@@ -237,7 +237,7 @@ InitShmemIndex(void)
hash_flags = HASH_ELEM;
ShmemIndex = ShmemInitHash("ShmemIndex",
- SHMEM_INDEX_SIZE, SHMEM_INDEX_SIZE,
+ SHMEM_INDEX_SIZE,
&info, hash_flags);
}
@@ -255,17 +255,12 @@ InitShmemIndex(void)
* exceeded substantially (since it's used to compute directory size and
* the hash table buckets will get overfull).
*
- * init_size is the number of hashtable entries to preallocate. For a table
- * whose maximum size is certain, this should be equal to max_size; that
- * ensures that no run-time out-of-shared-memory failures can occur.
- *
* Note: before Postgres 9.0, this function returned NULL for some failure
* cases. Now, it always throws error instead, so callers need not check
* for NULL.
*/
HTAB *
ShmemInitHash(const char *name, /* table string name for shmem index */
- long init_size, /* initial table size */
long max_size, /* max size of the table */
HASHCTL *infoP, /* info about key and bucket size */
int hash_flags) /* info about infoP */
@@ -299,7 +294,7 @@ ShmemInitHash(const char *name, /* table string name for shmem index */
/* Pass location of hashtable header to hash_create */
infoP->hctl = (HASHHDR *) location;
- return hash_create(name, init_size, infoP, hash_flags);
+ return hash_create(name, max_size, infoP, hash_flags);
}
/*
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 9c2e49c..8d9b36a 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -373,8 +373,7 @@ void
InitLocks(void)
{
HASHCTL info;
- long init_table_size,
- max_table_size;
+ long max_table_size;
bool found;
/*
@@ -382,7 +381,6 @@ InitLocks(void)
* calculations must agree with LockShmemSize!
*/
max_table_size = NLOCKENTS();
- init_table_size = max_table_size / 2;
/*
* Allocate hash table for LOCK structs. This stores per-locked-object
@@ -394,14 +392,12 @@ InitLocks(void)
info.num_partitions = NUM_LOCK_PARTITIONS;
LockMethodLockHash = ShmemInitHash("LOCK hash",
- init_table_size,
max_table_size,
&info,
HASH_ELEM | HASH_BLOBS | HASH_PARTITION);
/* Assume an average of 2 holders per lock */
max_table_size *= 2;
- init_table_size *= 2;
/*
* Allocate hash table for PROCLOCK structs. This stores
@@ -413,7 +409,6 @@ InitLocks(void)
info.num_partitions = NUM_LOCK_PARTITIONS;
LockMethodProcLockHash = ShmemInitHash("PROCLOCK hash",
- init_table_size,
max_table_size,
&info,
HASH_ELEM | HASH_FUNCTION | HASH_PARTITION);
diff --git a/src/backend/storage/lmgr/predicate.c b/src/backend/storage/lmgr/predicate.c
index d9d4e22..fc72d2d 100644
--- a/src/backend/storage/lmgr/predicate.c
+++ b/src/backend/storage/lmgr/predicate.c
@@ -1116,7 +1116,6 @@ InitPredicateLocks(void)
PredicateLockTargetHash = ShmemInitHash("PREDICATELOCKTARGET hash",
max_table_size,
- max_table_size,
&info,
HASH_ELEM | HASH_BLOBS |
HASH_PARTITION | HASH_FIXED_SIZE);
@@ -1144,7 +1143,6 @@ InitPredicateLocks(void)
PredicateLockHash = ShmemInitHash("PREDICATELOCK hash",
max_table_size,
- max_table_size,
&info,
HASH_ELEM | HASH_FUNCTION |
HASH_PARTITION | HASH_FIXED_SIZE);
@@ -1225,7 +1223,6 @@ InitPredicateLocks(void)
SerializableXidHash = ShmemInitHash("SERIALIZABLEXID hash",
max_table_size,
- max_table_size,
&info,
HASH_ELEM | HASH_BLOBS |
HASH_FIXED_SIZE);
diff --git a/src/backend/utils/hash/dynahash.c b/src/backend/utils/hash/dynahash.c
index 24a53da..ce9bb9c 100644
--- a/src/backend/utils/hash/dynahash.c
+++ b/src/backend/utils/hash/dynahash.c
@@ -15,7 +15,7 @@
* to hash_create. This prevents any attempt to split buckets on-the-fly.
* Therefore, each hash bucket chain operates independently, and no fields
* of the hash header change after init except nentries and freeList.
- * A partitioned table uses a spinlock to guard changes of those two fields.
+ * A partitioned table uses spinlocks to guard changes of those fields.
* This lets any subset of the hash buckets be treated as a separately
* lockable partition. We expect callers to use the low-order bits of a
* lookup key's hash value as a partition number --- this will work because
@@ -87,6 +87,7 @@
#include "access/xact.h"
#include "storage/shmem.h"
#include "storage/spin.h"
+#include "storage/lock.h"
#include "utils/dynahash.h"
#include "utils/memutils.h"
@@ -128,12 +129,26 @@ typedef HASHBUCKET *HASHSEGMENT;
*/
struct HASHHDR
{
- /* In a partitioned table, take this lock to touch nentries or freeList */
- slock_t mutex; /* unused if not partitioned table */
-
- /* These fields change during entry addition/deletion */
- long nentries; /* number of entries in hash table */
- HASHELEMENT *freeList; /* linked list of free elements */
+ /*
+ * There are two fields declared below: nentries and freeList. nentries
+ * stores current number of entries in a hash table. freeList is a linked
+ * list of free elements.
+ *
+ * To keep these fields consistent in a partitioned table we need to
+ * synchronize access to them using a spinlock. But it turned out that a
+ * single spinlock can create a bottleneck. To prevent lock contention an
+ * array of NUM_LOCK_PARTITIONS spinlocks is used. Each spinlock
+ * corresponds to a single table partition (see PARTITION_IDX definition)
+ * and protects one element of the nentries and freeList arrays. Since
+ * callers lock partitions according to the lower bits of the hash value,
+ * this particular number of spinlocks prevents deadlocks.
+ *
+ * If hash table is not partitioned only nentries[0] and freeList[0] are
+ * used and spinlocks are not used at all.
+ */
+ slock_t mutex[NUM_LOCK_PARTITIONS]; /* array of spinlocks */
+ long nentries[NUM_LOCK_PARTITIONS]; /* number of entries */
+ HASHELEMENT *freeList[NUM_LOCK_PARTITIONS]; /* lists of free elements */
/* These fields can change, but not in a partitioned table */
/* Also, dsize can't change in a shared table, even if unpartitioned */
@@ -166,6 +181,8 @@ struct HASHHDR
#define IS_PARTITIONED(hctl) ((hctl)->num_partitions != 0)
+#define PARTITION_IDX(hctl, hashcode) (IS_PARTITIONED(hctl) ? LockHashPartition(hashcode) : 0)
+
/*
* Top control structure for a hashtable --- in a shared table, each backend
* has its own copy (OK since no fields change at runtime)
@@ -219,10 +236,10 @@ static long hash_accesses,
*/
static void *DynaHashAlloc(Size size);
static HASHSEGMENT seg_alloc(HTAB *hashp);
-static bool element_alloc(HTAB *hashp, int nelem);
+static bool element_alloc(HTAB *hashp, int nelem, int partition_idx);
static bool dir_realloc(HTAB *hashp);
static bool expand_table(HTAB *hashp);
-static HASHBUCKET get_hash_entry(HTAB *hashp);
+static HASHBUCKET get_hash_entry(HTAB *hashp, int partition_idx);
static void hdefault(HTAB *hashp);
static int choose_nelem_alloc(Size entrysize);
static bool init_htab(HTAB *hashp, long nelem);
@@ -282,6 +299,9 @@ hash_create(const char *tabname, long nelem, HASHCTL *info, int flags)
{
HTAB *hashp;
HASHHDR *hctl;
+ int i,
+ partitions_number,
+ nelem_alloc;
/*
* For shared hash tables, we have a local hash header (HTAB struct) that
@@ -482,10 +502,24 @@ hash_create(const char *tabname, long nelem, HASHCTL *info, int flags)
if ((flags & HASH_SHARED_MEM) ||
nelem < hctl->nelem_alloc)
{
- if (!element_alloc(hashp, (int) nelem))
- ereport(ERROR,
- (errcode(ERRCODE_OUT_OF_MEMORY),
- errmsg("out of memory")));
+ /*
+ * If hash table is partitioned all freeLists have equal number of
+ * elements. Otherwise only freeList[0] is used.
+ */
+ if (IS_PARTITIONED(hashp->hctl))
+ partitions_number = NUM_LOCK_PARTITIONS;
+ else
+ partitions_number = 1;
+
+ nelem_alloc = ((int) nelem) / partitions_number;
+ if (nelem_alloc == 0)
+ nelem_alloc = 1;
+
+ for (i = 0; i < partitions_number; i++)
+ if (!element_alloc(hashp, nelem_alloc, i))
+ ereport(ERROR,
+ (errcode(ERRCODE_OUT_OF_MEMORY),
+ errmsg("out of memory")));
}
if (flags & HASH_FIXED_SIZE)
@@ -503,9 +537,6 @@ hdefault(HTAB *hashp)
MemSet(hctl, 0, sizeof(HASHHDR));
- hctl->nentries = 0;
- hctl->freeList = NULL;
-
hctl->dsize = DEF_DIRSIZE;
hctl->nsegs = 0;
@@ -572,12 +603,14 @@ init_htab(HTAB *hashp, long nelem)
HASHSEGMENT *segp;
int nbuckets;
int nsegs;
+ int i;
/*
* initialize mutex if it's a partitioned table
*/
if (IS_PARTITIONED(hctl))
- SpinLockInit(&hctl->mutex);
+ for (i = 0; i < NUM_LOCK_PARTITIONS; i++)
+ SpinLockInit(&(hctl->mutex[i]));
/*
* Divide number of elements by the fill factor to determine a desired
@@ -648,7 +681,7 @@ init_htab(HTAB *hashp, long nelem)
"HIGH MASK ", hctl->high_mask,
"LOW MASK ", hctl->low_mask,
"NSEGS ", hctl->nsegs,
- "NENTRIES ", hctl->nentries);
+ "NENTRIES ", hash_get_num_entries(hctl));
#endif
return true;
}
@@ -769,7 +802,7 @@ hash_stats(const char *where, HTAB *hashp)
where, hashp->hctl->accesses, hashp->hctl->collisions);
fprintf(stderr, "hash_stats: entries %ld keysize %ld maxp %u segmentcount %ld\n",
- hashp->hctl->nentries, (long) hashp->hctl->keysize,
+ hash_get_num_entries(hashp), (long) hashp->hctl->keysize,
hashp->hctl->max_bucket, hashp->hctl->nsegs);
fprintf(stderr, "%s: total accesses %ld total collisions %ld\n",
where, hash_accesses, hash_collisions);
@@ -863,6 +896,7 @@ hash_search_with_hash_value(HTAB *hashp,
HASHBUCKET currBucket;
HASHBUCKET *prevBucketPtr;
HashCompareFunc match;
+ int partition_idx = PARTITION_IDX(hctl, hashvalue);
#if HASH_STATISTICS
hash_accesses++;
@@ -885,7 +919,7 @@ hash_search_with_hash_value(HTAB *hashp,
* order of these tests is to try to check cheaper conditions first.
*/
if (!IS_PARTITIONED(hctl) && !hashp->frozen &&
- hctl->nentries / (long) (hctl->max_bucket + 1) >= hctl->ffactor &&
+ hctl->nentries[0] / (long) (hctl->max_bucket + 1) >= hctl->ffactor &&
!has_seq_scans(hashp))
(void) expand_table(hashp);
}
@@ -943,20 +977,20 @@ hash_search_with_hash_value(HTAB *hashp,
{
/* if partitioned, must lock to touch nentries and freeList */
if (IS_PARTITIONED(hctl))
- SpinLockAcquire(&hctl->mutex);
+ SpinLockAcquire(&(hctl->mutex[partition_idx]));
- Assert(hctl->nentries > 0);
- hctl->nentries--;
+ Assert(hctl->nentries[partition_idx] > 0);
+ hctl->nentries[partition_idx]--;
/* remove record from hash bucket's chain. */
*prevBucketPtr = currBucket->link;
/* add the record to the freelist for this table. */
- currBucket->link = hctl->freeList;
- hctl->freeList = currBucket;
+ currBucket->link = hctl->freeList[partition_idx];
+ hctl->freeList[partition_idx] = currBucket;
if (IS_PARTITIONED(hctl))
- SpinLockRelease(&hctl->mutex);
+ SpinLockRelease(&hctl->mutex[partition_idx]);
/*
* better hope the caller is synchronizing access to this
@@ -982,7 +1016,7 @@ hash_search_with_hash_value(HTAB *hashp,
elog(ERROR, "cannot insert into frozen hashtable \"%s\"",
hashp->tabname);
- currBucket = get_hash_entry(hashp);
+ currBucket = get_hash_entry(hashp, partition_idx);
if (currBucket == NULL)
{
/* out of memory */
@@ -1175,41 +1209,71 @@ hash_update_hash_key(HTAB *hashp,
* create a new entry if possible
*/
static HASHBUCKET
-get_hash_entry(HTAB *hashp)
+get_hash_entry(HTAB *hashp, int partition_idx)
{
- HASHHDR *hctl = hashp->hctl;
+ HASHHDR *hctl = hashp->hctl;
HASHBUCKET newElement;
+ int i,
+ borrow_from_idx;
for (;;)
{
/* if partitioned, must lock to touch nentries and freeList */
if (IS_PARTITIONED(hctl))
- SpinLockAcquire(&hctl->mutex);
+ SpinLockAcquire(&hctl->mutex[partition_idx]);
/* try to get an entry from the freelist */
- newElement = hctl->freeList;
+ newElement = hctl->freeList[partition_idx];
+
if (newElement != NULL)
- break;
+ {
+ /* remove entry from freelist, bump nentries */
+ hctl->freeList[partition_idx] = newElement->link;
+ hctl->nentries[partition_idx]++;
+ if (IS_PARTITIONED(hctl))
+ SpinLockRelease(&hctl->mutex[partition_idx]);
+
+ return newElement;
+ }
- /* no free elements. allocate another chunk of buckets */
if (IS_PARTITIONED(hctl))
- SpinLockRelease(&hctl->mutex);
+ SpinLockRelease(&hctl->mutex[partition_idx]);
- if (!element_alloc(hashp, hctl->nelem_alloc))
+ /* no free elements. allocate another chunk of buckets */
+ if (!element_alloc(hashp, hctl->nelem_alloc, partition_idx))
{
- /* out of memory */
- return NULL;
- }
- }
+ if (!IS_PARTITIONED(hctl))
+ return NULL; /* out of memory */
- /* remove entry from freelist, bump nentries */
- hctl->freeList = newElement->link;
- hctl->nentries++;
+ /* try to borrow element from another partition */
+ borrow_from_idx = partition_idx;
+ for (;;)
+ {
+ borrow_from_idx = (borrow_from_idx + 1) % NUM_LOCK_PARTITIONS;
+ if (borrow_from_idx == partition_idx)
+ break;
- if (IS_PARTITIONED(hctl))
- SpinLockRelease(&hctl->mutex);
+ SpinLockAcquire(&(hctl->mutex[borrow_from_idx]));
+ newElement = hctl->freeList[borrow_from_idx];
+
+ if (newElement != NULL)
+ {
+ hctl->freeList[borrow_from_idx] = newElement->link;
+ SpinLockRelease(&(hctl->mutex[borrow_from_idx]));
+
+ SpinLockAcquire(&hctl->mutex[partition_idx]);
+ hctl->nentries[partition_idx]++;
+ SpinLockRelease(&hctl->mutex[partition_idx]);
+
+ break;
+ }
- return newElement;
+ SpinLockRelease(&(hctl->mutex[borrow_from_idx]));
+ }
+
+ return newElement;
+ }
+ }
}
/*
@@ -1218,11 +1282,21 @@ get_hash_entry(HTAB *hashp)
long
hash_get_num_entries(HTAB *hashp)
{
+ int i;
+ long sum = hashp->hctl->nentries[0];
+
/*
* We currently don't bother with the mutex; it's only sensible to call
* this function if you've got lock on all partitions of the table.
*/
- return hashp->hctl->nentries;
+
+ if (!IS_PARTITIONED(hashp->hctl))
+ return sum;
+
+ for (i = 1; i < NUM_LOCK_PARTITIONS; i++)
+ sum += hashp->hctl->nentries[i];
+
+ return sum;
}
/*
@@ -1530,9 +1604,9 @@ seg_alloc(HTAB *hashp)
* allocate some new elements and link them into the free list
*/
static bool
-element_alloc(HTAB *hashp, int nelem)
+element_alloc(HTAB *hashp, int nelem, int partition_idx)
{
- HASHHDR *hctl = hashp->hctl;
+ HASHHDR *hctl = hashp->hctl;
Size elementSize;
HASHELEMENT *firstElement;
HASHELEMENT *tmpElement;
@@ -1563,14 +1637,14 @@ element_alloc(HTAB *hashp, int nelem)
/* if partitioned, must lock to touch freeList */
if (IS_PARTITIONED(hctl))
- SpinLockAcquire(&hctl->mutex);
+ SpinLockAcquire(&hctl->mutex[partition_idx]);
/* freelist could be nonempty if two backends did this concurrently */
- firstElement->link = hctl->freeList;
- hctl->freeList = prevElement;
+ firstElement->link = hctl->freeList[partition_idx];
+ hctl->freeList[partition_idx] = prevElement;
if (IS_PARTITIONED(hctl))
- SpinLockRelease(&hctl->mutex);
+ SpinLockRelease(&hctl->mutex[partition_idx]);
return true;
}
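
The per-partition freelist logic in get_hash_entry() can be sketched on its own as below. This is a single-threaded illustration only: the spinlocks and the element_alloc() attempt that the patch makes before borrowing are deliberately left out, and NPARTS, Elem, FreeLists, freelist_put and freelist_get are hypothetical names, not dynahash identifiers.

```c
#include <assert.h>
#include <stddef.h>

#define NPARTS 4    /* hypothetical; the patch uses NUM_LOCK_PARTITIONS */

typedef struct Elem { struct Elem *link; } Elem;

typedef struct
{
    Elem *freeList[NPARTS];     /* one list of free elements per partition */
    long  nentries[NPARTS];     /* entries in use, counted per partition */
} FreeLists;

/* Push a free element onto the given partition's list. */
static void
freelist_put(FreeLists *fl, int part, Elem *e)
{
    e->link = fl->freeList[part];
    fl->freeList[part] = e;
}

/*
 * Take an element from the caller's own partition; if that list is empty,
 * borrow from the following partitions in round-robin order.  The entry is
 * counted against the requesting partition.  Returns NULL if every list is
 * empty.  Locking is omitted in this sketch.
 */
static Elem *
freelist_get(FreeLists *fl, int part)
{
    int i;

    assert(part >= 0 && part < NPARTS);

    for (i = 0; i < NPARTS; i++)
    {
        int   src = (part + i) % NPARTS;
        Elem *e = fl->freeList[src];

        if (e != NULL)
        {
            fl->freeList[src] = e->link;
            fl->nentries[part]++;
            return e;
        }
    }
    return NULL;
}
```

In the real code each access to freeList[src] and nentries[part] would be done under the spinlock of the corresponding partition, taking one lock at a time as the patch does.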
diff --git a/src/include/access/itup.h b/src/include/access/itup.h
index 8350fa0..eb4467a 100644
--- a/src/include/access/itup.h
+++ b/src/include/access/itup.h
@@ -137,7 +137,12 @@ typedef IndexAttributeBitMapData *IndexAttributeBitMap;
#define MaxIndexTuplesPerPage \
((int) ((BLCKSZ - SizeOfPageHeaderData) / \
(MAXALIGN(sizeof(IndexTupleData) + 1) + sizeof(ItemIdData))))
-
+#define MaxPackedIndexTuplesPerPage \
+ ((int) ((BLCKSZ - SizeOfPageHeaderData) / \
+ (sizeof(ItemPointerData))))
+// #define MaxIndexTuplesPerPage \
+// ((int) ((BLCKSZ - SizeOfPageHeaderData) / \
+// (sizeof(ItemPointerData))))
/* routines in indextuple.c */
extern IndexTuple index_form_tuple(TupleDesc tupleDescriptor,
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 06822fa..41e407d 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -75,6 +75,7 @@ typedef BTPageOpaqueData *BTPageOpaque;
#define BTP_SPLIT_END (1 << 5) /* rightmost page of split group */
#define BTP_HAS_GARBAGE (1 << 6) /* page has LP_DEAD tuples */
#define BTP_INCOMPLETE_SPLIT (1 << 7) /* right sibling's downlink is missing */
+#define BTP_HAS_POSTING (1 << 8) /* page contains compressed duplicates (only for leaf pages) */
/*
* The max allowed value of a cycle ID is a bit less than 64K. This is
@@ -181,6 +182,8 @@ typedef struct BTMetaPageData
#define P_IGNORE(opaque) ((opaque)->btpo_flags & (BTP_DELETED|BTP_HALF_DEAD))
#define P_HAS_GARBAGE(opaque) ((opaque)->btpo_flags & BTP_HAS_GARBAGE)
#define P_INCOMPLETE_SPLIT(opaque) ((opaque)->btpo_flags & BTP_INCOMPLETE_SPLIT)
+#define P_HAS_POSTING(opaque) ((opaque)->btpo_flags & BTP_HAS_POSTING)
+
/*
* Lehman and Yao's algorithm requires a ``high key'' on every non-rightmost
@@ -538,6 +541,8 @@ typedef struct BTScanPosData
* location in the associated tuple storage workspace.
*/
int nextTupleOffset;
+ /* prevTupleOffset is for Posting list handling*/
+ int prevTupleOffset;
/*
* The items array is always ordered in index order (ie, increasing
@@ -550,7 +555,7 @@ typedef struct BTScanPosData
int lastItem; /* last valid index in items[] */
int itemIndex; /* current index in items[] */
- BTScanPosItem items[MaxIndexTuplesPerPage]; /* MUST BE LAST */
+ BTScanPosItem items[MaxPackedIndexTuplesPerPage]; /* MUST BE LAST */
} BTScanPosData;
typedef BTScanPosData *BTScanPos;
@@ -651,6 +656,28 @@ typedef BTScanOpaqueData *BTScanOpaque;
#define SK_BT_DESC (INDOPTION_DESC << SK_BT_INDOPTION_SHIFT)
#define SK_BT_NULLS_FIRST (INDOPTION_NULLS_FIRST << SK_BT_INDOPTION_SHIFT)
+
+/*
+ * We use our own ItemPointerGet(BlockNumber|OffsetNumber)
+ * to avoid Asserts, since sometimes the ip_posid isn't "valid"
+ */
+#define BtreeItemPointerGetBlockNumber(pointer) \
+ BlockIdGetBlockNumber(&(pointer)->ip_blkid)
+
+#define BtreeItemPointerGetOffsetNumber(pointer) \
+ ((pointer)->ip_posid)
+
+#define BT_POSTING (1U<<31)
+#define BtreeGetNPosting(itup) BtreeItemPointerGetOffsetNumber(&(itup)->t_tid)
+#define BtreeSetNPosting(itup,n) ItemPointerSetOffsetNumber(&(itup)->t_tid,n)
+
+#define BtreeGetPostingOffset(itup) (BtreeItemPointerGetBlockNumber(&(itup)->t_tid) & (~BT_POSTING))
+#define BtreeSetPostingOffset(itup,n) ItemPointerSetBlockNumber(&(itup)->t_tid,(n)|BT_POSTING)
+#define BtreeTupleIsPosting(itup) (BtreeItemPointerGetBlockNumber(&(itup)->t_tid) & BT_POSTING)
+#define BtreeGetPosting(itup) (ItemPointerData*) ((char*)(itup) + BtreeGetPostingOffset(itup))
+#define BtreeGetPostingN(itup,n) (ItemPointerData*) (BtreeGetPosting(itup) + n)
+
+
/*
* prototypes for functions in nbtree.c (external entry points for btree)
*/
@@ -715,8 +742,8 @@ extern BTStack _bt_search(Relation rel,
extern Buffer _bt_moveright(Relation rel, Buffer buf, int keysz,
ScanKey scankey, bool nextkey, bool forupdate, BTStack stack,
int access);
-extern OffsetNumber _bt_binsrch(Relation rel, Buffer buf, int keysz,
- ScanKey scankey, bool nextkey);
+extern OffsetNumber _bt_binsrch( Relation rel, Buffer buf, int keysz,
+ ScanKey scankey, bool nextkey, bool* updposting);
extern int32 _bt_compare(Relation rel, int keysz, ScanKey scankey,
Page page, OffsetNumber offnum);
extern bool _bt_first(IndexScanDesc scan, ScanDirection dir);
@@ -747,6 +774,8 @@ extern void _bt_end_vacuum_callback(int code, Datum arg);
extern Size BTreeShmemSize(void);
extern void BTreeShmemInit(void);
extern bytea *btoptions(Datum reloptions, bool validate);
+extern IndexTuple BtreeFormPackedTuple(IndexTuple tuple, ItemPointerData *data, int nipd);
+extern IndexTuple BtreeReformPackedTuple(IndexTuple tuple, ItemPointerData *data, int nipd);
/*
* prototypes for functions in nbtvalidate.c
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index 5e8825e..177371b 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -128,13 +128,19 @@ extern char *MainLWLockNames[];
* having this file include lock.h or bufmgr.h would be backwards.
*/
-/* Number of partitions of the shared buffer mapping hashtable */
-#define NUM_BUFFER_PARTITIONS 128
-
-/* Number of partitions the shared lock tables are divided into */
-#define LOG2_NUM_LOCK_PARTITIONS 4
+/*
+ * Number of partitions the shared lock tables are divided into.
+ *
+ * This particular number of partitions significantly reduces lock contention
+ * in partitioned hash tables, almost as if partitioned tables didn't use any
+ * locking at all.
+ */
+#define LOG2_NUM_LOCK_PARTITIONS 7
#define NUM_LOCK_PARTITIONS (1 << LOG2_NUM_LOCK_PARTITIONS)
+/* Number of partitions of the shared buffer mapping hashtable */
+#define NUM_BUFFER_PARTITIONS NUM_LOCK_PARTITIONS
+
/* Number of partitions the shared predicate lock tables are divided into */
#define LOG2_NUM_PREDICATELOCK_PARTITIONS 4
#define NUM_PREDICATELOCK_PARTITIONS (1 << LOG2_NUM_PREDICATELOCK_PARTITIONS)
diff --git a/src/include/storage/shmem.h b/src/include/storage/shmem.h
index 6468e66..50cf928 100644
--- a/src/include/storage/shmem.h
+++ b/src/include/storage/shmem.h
@@ -37,7 +37,7 @@ extern void InitShmemAllocation(void);
extern void *ShmemAlloc(Size size);
extern bool ShmemAddrIsValid(const void *addr);
extern void InitShmemIndex(void);
-extern HTAB *ShmemInitHash(const char *name, long init_size, long max_size,
+extern HTAB *ShmemInitHash(const char *name, long max_size,
HASHCTL *infoP, int hash_flags);
extern void *ShmemInitStruct(const char *name, Size size, bool *foundPtr);
extern Size add_size(Size s1, Size s2);
On 28 January 2016 at 16:12, Anastasia Lubennikova <
a.lubennikova@postgrespro.ru> wrote:
28.01.2016 18:12, Thom Brown:
On 28 January 2016 at 14:06, Anastasia Lubennikova <
a.lubennikova@postgrespro.ru> wrote:
31.08.2015 10:41, Anastasia Lubennikova:
Hi, hackers!
I'm going to begin work on effective storage of duplicate keys in B-tree
index.
The main idea is to implement posting lists and posting trees for B-tree
index pages as it's already done for GIN.
In a nutshell, effective storing of duplicates in GIN is organised as
follows.
Index stores single index tuple for each unique key. That index tuple
points to posting list which contains pointers to heap tuples (TIDs). If
too many rows having the same key, multiple pages are allocated for the
TIDs and these constitute so called posting tree.
You can find wonderful detailed descriptions in gin readme
<https://github.com/postgres/postgres/blob/master/src/backend/access/gin/README>
and articles <http://www.cybertec.at/gin-just-an-index-type/>.
It also makes possible to apply compression algorithm to posting
list/tree and significantly decrease index size. Read more in presentation
(part 1)
<http://www.pgcon.org/2014/schedule/attachments/329_PGCon2014-GIN.pdf>.
Now new B-tree index tuple must be inserted for each table row that we
index.
It can possibly cause page split. Because of MVCC even unique index could
contain duplicates.
Storing duplicates in posting list/tree helps to avoid superfluous splits.
I'd like to share the progress of my work. So here is a WIP patch.
It provides effective duplicate handling using posting lists the same way
as GIN does it.
Layout of the tuples on the page is changed in the following way:
before:
TID (ip_blkid, ip_posid) + key, TID (ip_blkid, ip_posid) + key, TID
(ip_blkid, ip_posid) + key
with patch:
TID (N item pointers, posting list offset) + key, TID (ip_blkid,
ip_posid), TID (ip_blkid, ip_posid), TID (ip_blkid, ip_posid)
It seems that backward compatibility works well without any changes. But
I haven't tested it properly yet.
Here are some test results. They are obtained by test functions
test_btbuild and test_ginbuild, which you can find in attached sql file.
i - number of distinct values in the index. So i=1 means that all rows
have the same key, and i=10000000 means that all keys are different.
The other columns contain the index size (MB).
i          B-tree Old   B-tree New   GIN
1          214.234375   87.7109375   10.2109375
10         214.234375   87.7109375   10.71875
100        214.234375   87.4375      15.640625
1000       214.234375   86.2578125   31.296875
10000      214.234375   78.421875    104.3046875
100000     214.234375   65.359375    49.078125
1000000    214.234375   90.140625    106.8203125
10000000   214.234375   214.234375   534.0625
Note that the last row shows the same index size for both B-tree versions,
which is quite logical - there is no compression if all the keys are
distinct.
The other cases look really nice to me.
The next thing to say is that I haven't implemented posting list compression
yet. So there is still potential to decrease the size of the compressed btree.
I'm almost sure there are still some tiny bugs and missed functions, but
on the whole, the patch is ready for testing.
I'd like to get feedback from testing the patch on some real datasets.
Any bug reports and suggestions are welcome.
Here are a couple of useful queries to inspect the data inside the index
pages:
create extension pageinspect;
select * from bt_metap('idx');
select bt.* from generate_series(1,1) as n, lateral bt_page_stats('idx', n) as bt;
select n, bt.* from generate_series(1,1) as n, lateral bt_page_items('idx', n) as bt;
And at last, the list of items I'm going to complete in the near future:
1. Add storage_parameter 'enable_compression' for btree access method
which specifies whether the index handles duplicates. default is 'off'
2. Bring back microvacuum functionality for compressed indexes.
3. Improve insertion speed. Insertions became significantly slower with
compressed btree, which is obviously not what we want.
4. Clean the code and comments, add related documentation.
This doesn't apply cleanly against current git head. Have you caught up
past commit 65c5fcd35?
Thank you for the notice. New patch is attached.
Thanks for the quick rebase.
Okay, a quick check with pgbench:
CREATE INDEX ON pgbench_accounts(bid);
Timing
Scale: master / patch
100: 10657ms / 13555ms (rechecked and got 9745ms)
500: 56909ms / 56985ms
Size
Scale: master / patch
100: 214MB / 87MB (40.7%)
500: 1071MB / 437MB (40.8%)
No performance issues from what I can tell.
I'm surprised that efficiencies can't be realised beyond this point. Your
results show a sweet spot at around 1000 / 10000000, with it getting
slightly worse beyond that. I kind of expected a lot of efficiency where
all the values are the same, but perhaps that's due to my lack of
understanding regarding the way they're being stored.
Thom
On Thu, Jan 28, 2016 at 9:03 AM, Thom Brown <thom@linux.com> wrote:
I'm surprised that efficiencies can't be realised beyond this point. Your results show a sweet spot at around 1000 / 10000000, with it getting slightly worse beyond that. I kind of expected a lot of efficiency where all the values are the same, but perhaps that's due to my lack of understanding regarding the way they're being stored.
I think that you'd need an I/O bound workload to see significant
benefits. That seems unsurprising. I believe that random I/O from
index writes is a big problem for us.
--
Peter Geoghegan
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 28 January 2016 at 17:09, Peter Geoghegan <pg@heroku.com> wrote:
On Thu, Jan 28, 2016 at 9:03 AM, Thom Brown <thom@linux.com> wrote:
I'm surprised that efficiencies can't be realised beyond this point. Your results show a sweet spot at around 1000 / 10000000, with it getting slightly worse beyond that. I kind of expected a lot of efficiency where all the values are the same, but perhaps that's due to my lack of understanding regarding the way they're being stored.
I think that you'd need an I/O bound workload to see significant
benefits. That seems unsurprising. I believe that random I/O from
index writes is a big problem for us.
I was thinking more from the point of view of the index size. An
index containing 10 million duplicate values is around 40% of the size
of an index with 10 million unique values.
Thom
On 28 January 2016 at 17:03, Thom Brown <thom@linux.com> wrote:
Okay, now for some badness. I've restored a database containing 2 tables,
one 318MB, another 24kB. The 318MB table contains 5 million rows with a
sequential id column. I get a problem if I try to delete many rows from it:
# delete from contacts where id % 3 != 0 ;
WARNING: out of shared memory
WARNING: out of shared memory
WARNING: out of shared memory
WARNING: out of shared memory
WARNING: out of shared memory
WARNING: out of shared memory
The query completes, but I get this message a lot before it does.
This happens even if I drop the primary key and foreign key constraints, so
somehow the memory usage has massively increased with this patch.
Thom
28.01.2016 20:03, Thom Brown:
I'm surprised that efficiencies can't be realised beyond this point.
Your results show a sweet spot at around 1000 / 10000000, with it
getting slightly worse beyond that. I kind of expected a lot of
efficiency where all the values are the same, but perhaps that's due
to my lack of understanding regarding the way they're being stored.
Thank you for the prompt reply. I see what you're confused about. I'll
try to clarify it.
First of all, what is implemented in the patch is not actually
compression. It's more about index page layout changes to compact
ItemPointers (TIDs).
Instead of TID+key, TID+key, we store now META+key+List_of_TIDs (also
known as Posting list).
before:
TID (ip_blkid, ip_posid) + key, TID (ip_blkid, ip_posid) + key, TID
(ip_blkid, ip_posid) + key
with patch:
TID (N item pointers, posting list offset) + key, TID (ip_blkid,
ip_posid), TID (ip_blkid, ip_posid), TID (ip_blkid, ip_posid)
TID (N item pointers, posting list offset) - this is the meta
information. So, we have to store this meta information in addition to
useful data.
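The meta TID described above can be sketched in a few lines of C. This is only an illustration modelled on the BT_POSTING macros in the patch: the DemoTid struct and the demo_* names are hypothetical, and the block number is held in one 32-bit field here for simplicity, which is not how the real ItemPointerData is laid out.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for ItemPointerData. */
typedef struct
{
    uint32_t blkno;   /* heap block number, or flag bit + posting list offset */
    uint16_t posid;   /* heap line pointer, or number of TIDs in the posting list */
} DemoTid;

/* High bit of the block number marks "this TID is posting metadata". */
#define DEMO_POSTING_FLAG ((uint32_t) 1 << 31)

/* Build the meta TID: offset of the posting list inside the tuple + TID count. */
static DemoTid
demo_make_posting_meta(uint32_t posting_offset, uint16_t ntids)
{
    DemoTid meta;

    /* the offset must not collide with the flag bit */
    assert((posting_offset & DEMO_POSTING_FLAG) == 0);
    meta.blkno = posting_offset | DEMO_POSTING_FLAG;
    meta.posid = ntids;
    return meta;
}

/* Does this TID carry posting metadata rather than a heap pointer? */
static int
demo_is_posting(DemoTid tid)
{
    return (tid.blkno & DEMO_POSTING_FLAG) != 0;
}

/* Byte offset of the posting list, with the flag bit masked out. */
static uint32_t
demo_posting_offset(DemoTid tid)
{
    return tid.blkno & ~DEMO_POSTING_FLAG;
}

/* Number of heap TIDs stored in the posting list. */
static uint16_t
demo_nposting(DemoTid tid)
{
    return tid.posid;
}
```

An ordinary tuple keeps a real heap pointer in the same field, so a reader first checks demo_is_posting() and only then interprets the two fields as offset and count.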
Next point is the requirement of having minimum three tuples in a page.
We need at least two tuples to point the children and the highkey as well.
This requirement leads to the limitation of the max index tuple size.
/*
* Maximum size of a btree index entry, including its tuple header.
*
* We actually need to be able to fit three items on every page,
* so restrict any one item to 1/3 the per-page available space.
*/
#define BTMaxItemSize(page) \
MAXALIGN_DOWN((PageGetPageSize(page) - \
MAXALIGN(SizeOfPageHeaderData + 3*sizeof(ItemIdData)) - \
MAXALIGN(sizeof(BTPageOpaqueData))) / 3)
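To make the limit concrete, the same arithmetic can be written as a standalone function. This is a sketch, not PostgreSQL code: the demo_* names are hypothetical, and the sizes passed in are assumptions for a default build (8192-byte pages, a 24-byte page header, 4-byte line pointers, a 16-byte btree opaque area, 8-byte maximum alignment).

```c
#include <assert.h>
#include <stddef.h>

/* Round up / down to a multiple of 8 bytes (assumed maximum alignment). */
#define DEMO_MAXALIGN(len)      (((size_t) (len) + 7) & ~(size_t) 7)
#define DEMO_MAXALIGN_DOWN(len) ((size_t) (len) & ~(size_t) 7)

/*
 * Same formula as BTMaxItemSize: take the page, subtract the aligned
 * header plus three line pointers and the aligned opaque area, then
 * give each of the three items one third, rounded down to alignment.
 */
static size_t
demo_bt_max_item_size(size_t page_size, size_t page_header_size,
                      size_t item_id_size, size_t opaque_size)
{
    size_t overhead = DEMO_MAXALIGN(page_header_size + 3 * item_id_size) +
                      DEMO_MAXALIGN(opaque_size);

    assert(page_size > overhead);
    return DEMO_MAXALIGN_DOWN((page_size - overhead) / 3);
}
```

With the assumed sizes, demo_bt_max_item_size(8192, 24, 4, 16) gives (8192 - 40 - 16) / 3 = 2712 bytes, which is the ceiling any single posting tuple has to stay under.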
That said, it has just occurred to me that this size could be increased for
compressed tuples, at least on leaf pages.
That is why we have to store more meta information than meets
the eye.
For example, suppose we have 100000 duplicates of the same key. It seems
that compression should be really significant:
something like 1 Meta + 1 key instead of 100000 keys --> 6 bytes (size
of meta TID) + keysize instead of 600000.
But we have to split one huge posting list into smaller ones so that
each of them fits into an index page.
It depends on the key size, of course. As I can see from pageinspect, the
index on a single integer key has to split the tuples into pieces
of size 2704, each containing 447 TIDs in one posting list.
So we have 1 Meta + 1 key instead of 447 keys. As you can see, that is
really less impressive than expected.
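The effect of that split can be checked with a little arithmetic. The sketch below uses only the two figures quoted above (447 TIDs per posting tuple, 2704 bytes per tuple); the function names are hypothetical and the byte count is an upper bound, since the last tuple is normally only partly filled.

```c
#include <assert.h>

/*
 * Number of posting tuples needed when one tuple can hold at most
 * tids_per_tuple heap pointers (ceiling division).
 */
static long
demo_posting_tuples_needed(long nduplicates, long tids_per_tuple)
{
    assert(nduplicates >= 0 && tids_per_tuple > 0);
    return (nduplicates + tids_per_tuple - 1) / tids_per_tuple;
}

/*
 * Upper bound on the space taken by those tuples, assuming every one
 * of them occupies the full tuple_size bytes.
 */
static long
demo_posting_bytes_upper_bound(long nduplicates, long tids_per_tuple,
                               long tuple_size)
{
    return demo_posting_tuples_needed(nduplicates, tids_per_tuple) * tuple_size;
}
```

For 100000 duplicates, demo_posting_tuples_needed(100000, 447) is 224, so the key is stored 224 times rather than once, and the tuples take at most 224 * 2704 = 605696 bytes, most of which is the TIDs themselves.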
There is an idea of posting trees in GIN. Key is stored just once, and
posting list which doesn't fit into the page becomes a tree.
You can find incredible article about it here
http://www.cybertec.at/2013/03/gin-just-an-index-type/
But I think that it's not the best way for the btree am, because it's
not supposed to handle concurrent insertions.
As I mentioned before, I'm going to implement prefix compression of
the posting list, which should be efficient and quite simple, since it's
already implemented in GIN. You can find the presentation about it here
https://www.pgcon.org/2014/schedule/events/698.en.html
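As a rough idea of what posting list compression can look like, here is a sketch of delta encoding with variable-length bytes over a sorted list of TIDs. It is an assumption that the btree patch would end up with something along these lines; the code is not taken from GIN, the demo_* names are hypothetical, and each TID is assumed to be already packed into one 64-bit integer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Write val using 7 data bits per byte; the high bit means "more bytes follow". */
static size_t
demo_varbyte_encode(uint64_t val, unsigned char *out)
{
    size_t n = 0;

    while (val > 0x7F)
    {
        out[n++] = (unsigned char) ((val & 0x7F) | 0x80);
        val >>= 7;
    }
    out[n++] = (unsigned char) val;
    return n;
}

/* Read one value back; returns the number of bytes consumed. */
static size_t
demo_varbyte_decode(const unsigned char *in, uint64_t *val)
{
    uint64_t result = 0;
    int      shift = 0;
    size_t   n = 0;

    for (;;)
    {
        unsigned char b = in[n++];

        result |= (uint64_t) (b & 0x7F) << shift;
        if ((b & 0x80) == 0)
            break;
        shift += 7;
    }
    *val = result;
    return n;
}

/*
 * Encode a sorted array as differences from the previous element.
 * out must have room for 10 bytes per item in the worst case.
 */
static size_t
demo_encode_sorted(const uint64_t *tids, size_t ntids, unsigned char *out)
{
    size_t   len = 0;
    uint64_t prev = 0;

    for (size_t i = 0; i < ntids; i++)
    {
        assert(tids[i] >= prev);    /* input must be sorted */
        len += demo_varbyte_encode(tids[i] - prev, out + len);
        prev = tids[i];
    }
    return len;
}

/* Rebuild the original values by summing the decoded differences. */
static void
demo_decode_sorted(const unsigned char *in, size_t ntids, uint64_t *tids)
{
    size_t   pos = 0;
    uint64_t prev = 0;

    for (size_t i = 0; i < ntids; i++)
    {
        uint64_t delta;

        pos += demo_varbyte_decode(in + pos, &delta);
        prev += delta;
        tids[i] = prev;
    }
}
```

The gain comes from neighbouring TIDs in a posting list usually being close together, so most differences fit in one or two bytes instead of the six bytes of a full item pointer.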
--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
I tested this patch on x64 and ARM servers for a few hours today. The
only problem I could find is that INSERT works considerably slower after
applying the patch. Besides that, everything looks fine - no crashes, tests
pass, memory doesn't seem to leak, etc.
Okay, now for some badness. I've restored a database containing 2
tables, one 318MB, another 24kB. The 318MB table contains 5 million
rows with a sequential id column. I get a problem if I try to delete
many rows from it:
# delete from contacts where id % 3 != 0 ;
WARNING: out of shared memory
WARNING: out of shared memory
WARNING: out of shared memory
I didn't manage to reproduce this. Thom, could you describe exact steps
to reproduce this issue please?
On 29 January 2016 at 15:47, Aleksander Alekseev
<a.alekseev@postgrespro.ru> wrote:
I tested this patch on x64 and ARM servers for a few hours today. The
only problem I could find is that INSERT works considerably slower after
applying a patch. Beside that everything looks fine - no crashes, tests
pass, memory doesn't seem to leak, etc.
Okay, now for some badness. I've restored a database containing 2
tables, one 318MB, another 24kB. The 318MB table contains 5 million
rows with a sequential id column. I get a problem if I try to delete
many rows from it:
# delete from contacts where id % 3 != 0 ;
WARNING: out of shared memory
WARNING: out of shared memory
WARNING: out of shared memory
I didn't manage to reproduce this. Thom, could you describe exact steps
to reproduce this issue please?
Sure, I used my pg_rep_test tool to create a primary (pg_rep_test
-r0), which creates an instance with a custom config, which is as
follows:
shared_buffers = 8MB
max_connections = 7
wal_level = 'hot_standby'
cluster_name = 'primary'
max_wal_senders = 3
wal_keep_segments = 6
Then create a pgbench data set (I didn't originally use pgbench, but
you can get the same results with it):
createdb -p 5530 pgbench
pgbench -p 5530 -i -s 100 pgbench
And delete some stuff:
thom@swift:~/Development/test$ psql -p 5530 pgbench
Timing is on.
psql (9.6devel)
Type "help" for help.
➤ psql://thom@[local]:5530/pgbench
# DELETE FROM pgbench_accounts WHERE aid % 3 != 0;
WARNING: out of shared memory
WARNING: out of shared memory
WARNING: out of shared memory
WARNING: out of shared memory
WARNING: out of shared memory
WARNING: out of shared memory
WARNING: out of shared memory
...
WARNING: out of shared memory
WARNING: out of shared memory
DELETE 6666667
Time: 22218.804 ms
There were 358 lines of that warning message. I don't get these
messages without the patch.
Thom
29.01.2016 19:01, Thom Brown:
On 29 January 2016 at 15:47, Aleksander Alekseev
<a.alekseev@postgrespro.ru> wrote:
I tested this patch on x64 and ARM servers for a few hours today. The
only problem I could find is that INSERT works considerably slower after
applying a patch. Beside that everything looks fine - no crashes, tests
pass, memory doesn't seem to leak, etc.
Thank you for testing. I rechecked that, and insertions are really very
very very slow. It seems like a bug.
Thank you for this report.
I tried to reproduce it, but I couldn't. Debugging will be much easier now.
I hope I'll fix these issues within the next few days.
BTW, I found a silly mistake: the previous patch contains some unrelated
changes. I fixed it in the new version (attached).
--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachments:
btree_compression_2.0.patch (text/x-patch)
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index e3c55eb..3908cc1 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -24,6 +24,7 @@
#include "storage/predicate.h"
#include "utils/tqual.h"
+#include "catalog/catalog.h"
typedef struct
{
@@ -60,7 +61,8 @@ static void _bt_findinsertloc(Relation rel,
ScanKey scankey,
IndexTuple newtup,
BTStack stack,
- Relation heapRel);
+ Relation heapRel,
+ bool *updposting);
static void _bt_insertonpg(Relation rel, Buffer buf, Buffer cbuf,
BTStack stack,
IndexTuple itup,
@@ -113,6 +115,7 @@ _bt_doinsert(Relation rel, IndexTuple itup,
BTStack stack;
Buffer buf;
OffsetNumber offset;
+ bool updposting = false;
/* we need an insertion scan key to do our search, so build one */
itup_scankey = _bt_mkscankey(rel, itup);
@@ -162,8 +165,9 @@ top:
{
TransactionId xwait;
uint32 speculativeToken;
+ bool fakeupdposting = false; /* Never update posting in unique index */
- offset = _bt_binsrch(rel, buf, natts, itup_scankey, false);
+ offset = _bt_binsrch(rel, buf, natts, itup_scankey, false, &fakeupdposting);
xwait = _bt_check_unique(rel, itup, heapRel, buf, offset, itup_scankey,
checkUnique, &is_unique, &speculativeToken);
@@ -200,8 +204,54 @@ top:
CheckForSerializableConflictIn(rel, NULL, buf);
/* do the insertion */
_bt_findinsertloc(rel, &buf, &offset, natts, itup_scankey, itup,
- stack, heapRel);
- _bt_insertonpg(rel, buf, InvalidBuffer, stack, itup, offset, false);
+ stack, heapRel, &updposting);
+
+ if (IsSystemRelation(rel))
+ updposting = false;
+
+ /*
+ * New tuple has the same key with tuple at the page.
+ * Unite them into one posting.
+ */
+ if (updposting)
+ {
+ Page page;
+ IndexTuple olditup, newitup;
+ ItemPointerData *ipd;
+ int nipd;
+
+ page = BufferGetPage(buf);
+ olditup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offset));
+
+ if (BtreeTupleIsPosting(olditup))
+ nipd = BtreeGetNPosting(olditup);
+ else
+ nipd = 1;
+
+ ipd = palloc0(sizeof(ItemPointerData)*(nipd + 1));
+ /* copy item pointers from old tuple into ipd */
+ if (BtreeTupleIsPosting(olditup))
+ memcpy(ipd, BtreeGetPosting(olditup), sizeof(ItemPointerData)*nipd);
+ else
+ memcpy(ipd, olditup, sizeof(ItemPointerData));
+
+ /* add item pointer of the new tuple into ipd */
+ memcpy(ipd+nipd, itup, sizeof(ItemPointerData));
+
+ /*
+ * Form posting tuple, then delete old tuple and insert posting tuple.
+ */
+ newitup = BtreeReformPackedTuple(itup, ipd, nipd+1);
+ PageIndexTupleDelete(page, offset);
+ _bt_insertonpg(rel, buf, InvalidBuffer, stack, newitup, offset, false);
+
+ pfree(ipd);
+ pfree(newitup);
+ }
+ else
+ {
+ _bt_insertonpg(rel, buf, InvalidBuffer, stack, itup, offset, false);
+ }
}
else
{
@@ -306,6 +356,8 @@ _bt_check_unique(Relation rel, IndexTuple itup, Relation heapRel,
/* okay, we gotta fetch the heap tuple ... */
curitup = (IndexTuple) PageGetItem(page, curitemid);
+
+ Assert (!BtreeTupleIsPosting(curitup));
htid = curitup->t_tid;
/*
@@ -535,7 +587,8 @@ _bt_findinsertloc(Relation rel,
ScanKey scankey,
IndexTuple newtup,
BTStack stack,
- Relation heapRel)
+ Relation heapRel,
+ bool *updposting)
{
Buffer buf = *bufptr;
Page page = BufferGetPage(buf);
@@ -681,7 +734,7 @@ _bt_findinsertloc(Relation rel,
else if (firstlegaloff != InvalidOffsetNumber && !vacuumed)
newitemoff = firstlegaloff;
else
- newitemoff = _bt_binsrch(rel, buf, keysz, scankey, false);
+ newitemoff = _bt_binsrch(rel, buf, keysz, scankey, false, updposting);
*bufptr = buf;
*offsetptr = newitemoff;
@@ -1042,6 +1095,9 @@ _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
itemid = PageGetItemId(origpage, P_HIKEY);
itemsz = ItemIdGetLength(itemid);
item = (IndexTuple) PageGetItem(origpage, itemid);
+
+ Assert(!BtreeTupleIsPosting(item));
+
if (PageAddItem(rightpage, (Item) item, itemsz, rightoff,
false, false) == InvalidOffsetNumber)
{
@@ -1072,13 +1128,40 @@ _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
itemsz = ItemIdGetLength(itemid);
item = (IndexTuple) PageGetItem(origpage, itemid);
}
- if (PageAddItem(leftpage, (Item) item, itemsz, leftoff,
+
+ if (BtreeTupleIsPosting(item))
+ {
+ Size hikeysize = BtreeGetPostingOffset(item);
+ IndexTuple hikey = palloc0(hikeysize);
+ /*
+ * Truncate posting before insert it as a hikey.
+ */
+ memcpy (hikey, item, hikeysize);
+ hikey->t_info &= ~INDEX_SIZE_MASK;
+ hikey->t_info |= hikeysize;
+ ItemPointerSet(&(hikey->t_tid), origpagenumber, P_HIKEY);
+
+ if (PageAddItem(leftpage, (Item) hikey, hikeysize, leftoff,
false, false) == InvalidOffsetNumber)
+ {
+ memset(rightpage, 0, BufferGetPageSize(rbuf));
+ elog(ERROR, "failed to add hikey to the left sibling"
+ " while splitting block %u of index \"%s\"",
+ origpagenumber, RelationGetRelationName(rel));
+ }
+
+ pfree(hikey);
+ }
+ else
{
- memset(rightpage, 0, BufferGetPageSize(rbuf));
- elog(ERROR, "failed to add hikey to the left sibling"
- " while splitting block %u of index \"%s\"",
- origpagenumber, RelationGetRelationName(rel));
+ if (PageAddItem(leftpage, (Item) item, itemsz, leftoff,
+ false, false) == InvalidOffsetNumber)
+ {
+ memset(rightpage, 0, BufferGetPageSize(rbuf));
+ elog(ERROR, "failed to add hikey to the left sibling"
+ " while splitting block %u of index \"%s\"",
+ origpagenumber, RelationGetRelationName(rel));
+ }
}
leftoff = OffsetNumberNext(leftoff);
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index f2905cb..f56c90f 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -75,6 +75,9 @@ static void btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
static void btvacuumpage(BTVacState *vstate, BlockNumber blkno,
BlockNumber orig_blkno);
+static ItemPointer
+btreevacuumPosting(BTVacState *vstate, ItemPointerData *items,
+ int nitem, int *nremaining);
/*
* Btree handler function: return IndexAmRoutine with access method parameters
@@ -962,6 +965,7 @@ restart:
OffsetNumber offnum,
minoff,
maxoff;
+ IndexTupleData *remaining;
/*
* Trade in the initial read lock for a super-exclusive write lock on
@@ -1011,31 +1015,62 @@ restart:
itup = (IndexTuple) PageGetItem(page,
PageGetItemId(page, offnum));
- htup = &(itup->t_tid);
-
- /*
- * During Hot Standby we currently assume that
- * XLOG_BTREE_VACUUM records do not produce conflicts. That is
- * only true as long as the callback function depends only
- * upon whether the index tuple refers to heap tuples removed
- * in the initial heap scan. When vacuum starts it derives a
- * value of OldestXmin. Backends taking later snapshots could
- * have a RecentGlobalXmin with a later xid than the vacuum's
- * OldestXmin, so it is possible that row versions deleted
- * after OldestXmin could be marked as killed by other
- * backends. The callback function *could* look at the index
- * tuple state in isolation and decide to delete the index
- * tuple, though currently it does not. If it ever did, we
- * would need to reconsider whether XLOG_BTREE_VACUUM records
- * should cause conflicts. If they did cause conflicts they
- * would be fairly harsh conflicts, since we haven't yet
- * worked out a way to pass a useful value for
- * latestRemovedXid on the XLOG_BTREE_VACUUM records. This
- * applies to *any* type of index that marks index tuples as
- * killed.
- */
- if (callback(htup, callback_state))
- deletable[ndeletable++] = offnum;
+ if(BtreeTupleIsPosting(itup))
+ {
+ int nipd, nnewipd;
+ ItemPointer newipd;
+
+ nipd = BtreeGetNPosting(itup);
+ newipd = btreevacuumPosting(vstate, BtreeGetPosting(itup), nipd, &nnewipd);
+
+ if (newipd != NULL)
+ {
+ if (nnewipd > 0)
+ {
+ /* There are still some live tuples in the posting.
+ * 1) form new posting tuple, that contains remaining ipds
+ * 2) delete "old" posting
+ * 3) insert new posting back to the page
+ */
+ remaining = BtreeReformPackedTuple(itup, newipd, nnewipd);
+ PageIndexTupleDelete(page, offnum);
+
+ if (PageAddItem(page, (Item) remaining, IndexTupleSize(remaining), offnum, false, false) != offnum)
+ elog(ERROR, "failed to add vacuumed posting tuple to index page in \"%s\"",
+ RelationGetRelationName(info->index));
+ }
+ else
+ deletable[ndeletable++] = offnum;
+ }
+ }
+ else
+ {
+ htup = &(itup->t_tid);
+
+ /*
+ * During Hot Standby we currently assume that
+ * XLOG_BTREE_VACUUM records do not produce conflicts. That is
+ * only true as long as the callback function depends only
+ * upon whether the index tuple refers to heap tuples removed
+ * in the initial heap scan. When vacuum starts it derives a
+ * value of OldestXmin. Backends taking later snapshots could
+ * have a RecentGlobalXmin with a later xid than the vacuum's
+ * OldestXmin, so it is possible that row versions deleted
+ * after OldestXmin could be marked as killed by other
+ * backends. The callback function *could* look at the index
+ * tuple state in isolation and decide to delete the index
+ * tuple, though currently it does not. If it ever did, we
+ * would need to reconsider whether XLOG_BTREE_VACUUM records
+ * should cause conflicts. If they did cause conflicts they
+ * would be fairly harsh conflicts, since we haven't yet
+ * worked out a way to pass a useful value for
+ * latestRemovedXid on the XLOG_BTREE_VACUUM records. This
+ * applies to *any* type of index that marks index tuples as
+ * killed.
+ */
+ if (callback(htup, callback_state))
+ deletable[ndeletable++] = offnum;
+ }
}
}
@@ -1160,3 +1195,51 @@ btcanreturn(Relation index, int attno)
{
return true;
}
+
+
+/*
+ * Vacuums a posting list. The size of the list must be specified
+ * via number of items (nitems).
+ *
+ * If none of the items need to be removed, returns NULL. Otherwise returns
+ * a new palloc'd array with the remaining items. The number of remaining
+ * items is returned via nremaining.
+ */
+ItemPointer
+btreevacuumPosting(BTVacState *vstate, ItemPointerData *items,
+ int nitem, int *nremaining)
+{
+ int i,
+ remaining = 0;
+ ItemPointer tmpitems = NULL;
+ IndexBulkDeleteCallback callback = vstate->callback;
+ void *callback_state = vstate->callback_state;
+
+ /*
+ * Iterate over TIDs array
+ */
+ for (i = 0; i < nitem; i++)
+ {
+ if (callback(items + i, callback_state))
+ {
+ if (!tmpitems)
+ {
+ /*
+ * First TID to be deleted: allocate memory to hold the
+ * remaining items.
+ */
+ tmpitems = palloc(sizeof(ItemPointerData) * nitem);
+ memcpy(tmpitems, items, sizeof(ItemPointerData) * i);
+ }
+ }
+ else
+ {
+ if (tmpitems)
+ tmpitems[remaining] = items[i];
+ remaining++;
+ }
+ }
+
+ *nremaining = remaining;
+ return tmpitems;
+}
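The filtering done by btreevacuumPosting can be tried outside the backend. What follows is only a minimal standalone sketch of the same copy-on-first-delete idea, not code from the patch: DemoTid, DemoIsDead, demo_vacuum_posting and demo_odd_block_is_dead are hypothetical names, and malloc stands in for palloc.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a heap TID (block number + offset). */
typedef struct
{
	unsigned	block;
	unsigned short offset;
} DemoTid;

/* Hypothetical callback type: returns true if the TID is dead. */
typedef bool (*DemoIsDead) (const DemoTid *tid, void *state);

/*
 * Returns NULL when no item has to be removed. Otherwise returns a
 * malloc'd array holding the surviving items. *nremaining always
 * receives the number of survivors.
 */
static DemoTid *
demo_vacuum_posting(const DemoTid *items, int nitems,
					DemoIsDead is_dead, void *state, int *nremaining)
{
	DemoTid    *kept = NULL;
	int			remaining = 0;

	assert(nitems >= 0 && nremaining != NULL);

	for (int i = 0; i < nitems; i++)
	{
		if (is_dead(&items[i], state))
		{
			if (kept == NULL)
			{
				/* First dead TID: allocate, then copy the i survivors so far. */
				kept = malloc(sizeof(DemoTid) * (size_t) nitems);
				if (kept == NULL)
					abort();
				memcpy(kept, items, sizeof(DemoTid) * (size_t) i);
			}
		}
		else
		{
			/* Copy survivors only once a private array exists. */
			if (kept != NULL)
				kept[remaining] = items[i];
			remaining++;
		}
	}

	*nremaining = remaining;
	return kept;
}

/* Example callback (an assumption for the demo): odd blocks are dead. */
static bool
demo_odd_block_is_dead(const DemoTid *tid, void *state)
{
	(void) state;
	return (tid->block % 2) != 0;
}
```

A caller has to tell three outcomes apart: NULL means the list is unchanged, a non-NULL result with zero survivors means the whole tuple can be deleted, and a non-NULL result with survivors means the posting list has to be rebuilt from them.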
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 3db32e8..0428f04 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -29,6 +29,8 @@ static bool _bt_readpage(IndexScanDesc scan, ScanDirection dir,
OffsetNumber offnum);
static void _bt_saveitem(BTScanOpaque so, int itemIndex,
OffsetNumber offnum, IndexTuple itup);
+static void _bt_savePostingitem(BTScanOpaque so, int itemIndex,
+ OffsetNumber offnum, ItemPointer iptr, IndexTuple itup, int i);
static bool _bt_steppage(IndexScanDesc scan, ScanDirection dir);
static Buffer _bt_walk_left(Relation rel, Buffer buf);
static bool _bt_endpoint(IndexScanDesc scan, ScanDirection dir);
@@ -90,6 +92,7 @@ _bt_search(Relation rel, int keysz, ScanKey scankey, bool nextkey,
Buffer *bufP, int access)
{
BTStack stack_in = NULL;
+ bool fakeupdposting = false; /* fake variable for _bt_binsrch */
/* Get the root page to start with */
*bufP = _bt_getroot(rel, access);
@@ -136,7 +139,7 @@ _bt_search(Relation rel, int keysz, ScanKey scankey, bool nextkey,
* Find the appropriate item on the internal page, and get the child
* page that it points to.
*/
- offnum = _bt_binsrch(rel, *bufP, keysz, scankey, nextkey);
+ offnum = _bt_binsrch(rel, *bufP, keysz, scankey, nextkey, &fakeupdposting);
itemid = PageGetItemId(page, offnum);
itup = (IndexTuple) PageGetItem(page, itemid);
blkno = ItemPointerGetBlockNumber(&(itup->t_tid));
@@ -310,7 +313,8 @@ _bt_binsrch(Relation rel,
Buffer buf,
int keysz,
ScanKey scankey,
- bool nextkey)
+ bool nextkey,
+ bool *updposing)
{
Page page;
BTPageOpaque opaque;
@@ -373,7 +377,17 @@ _bt_binsrch(Relation rel,
* scan key), which could be the last slot + 1.
*/
if (P_ISLEAF(opaque))
+ {
+ if (low <= PageGetMaxOffsetNumber(page))
+ {
+ IndexTuple oitup = (IndexTuple) PageGetItem(page, PageGetItemId(page, low));
+ /* one excessive check of equality. for possible posting tuple update or creation */
+ if ((_bt_compare(rel, keysz, scankey, page, low) == 0)
+ && (IndexTupleSize(oitup) + sizeof(ItemPointerData) < BTMaxItemSize(page)))
+ *updposing = true;
+ }
return low;
+ }
/*
* On a non-leaf page, return the last key < scan key (resp. <= scan key).
@@ -536,6 +550,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
int i;
StrategyNumber strat_total;
BTScanPosItem *currItem;
+ bool fakeupdposing = false; /* fake variable for _bt_binsrch */
Assert(!BTScanPosIsValid(so->currPos));
@@ -1003,7 +1018,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
so->markItemIndex = -1; /* ditto */
/* position to the precise item on the page */
- offnum = _bt_binsrch(rel, buf, keysCount, scankeys, nextkey);
+ offnum = _bt_binsrch(rel, buf, keysCount, scankeys, nextkey, &fakeupdposing);
/*
* If nextkey = false, we are positioned at the first item >= scan key, or
@@ -1161,6 +1176,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
int itemIndex;
IndexTuple itup;
bool continuescan;
+ int i;
/*
* We must have the buffer pinned and locked, but the usual macro can't be
@@ -1195,6 +1211,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
/* initialize tuple workspace to empty */
so->currPos.nextTupleOffset = 0;
+ so->currPos.prevTupleOffset = 0;
/*
* Now that the current page has been made consistent, the macro should be
@@ -1215,8 +1232,19 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
if (itup != NULL)
{
/* tuple passes all scan key conditions, so remember it */
- _bt_saveitem(so, itemIndex, offnum, itup);
- itemIndex++;
+ if (BtreeTupleIsPosting(itup))
+ {
+ for (i = 0; i < BtreeGetNPosting(itup); i++)
+ {
+ _bt_savePostingitem(so, itemIndex, offnum, BtreeGetPostingN(itup, i), itup, i);
+ itemIndex++;
+ }
+ }
+ else
+ {
+ _bt_saveitem(so, itemIndex, offnum, itup);
+ itemIndex++;
+ }
}
if (!continuescan)
{
@@ -1228,7 +1256,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
offnum = OffsetNumberNext(offnum);
}
- Assert(itemIndex <= MaxIndexTuplesPerPage);
+ Assert(itemIndex <= MaxPackedIndexTuplesPerPage);
so->currPos.firstItem = 0;
so->currPos.lastItem = itemIndex - 1;
so->currPos.itemIndex = 0;
@@ -1236,7 +1264,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
else
{
/* load items[] in descending order */
- itemIndex = MaxIndexTuplesPerPage;
+ itemIndex = MaxPackedIndexTuplesPerPage;
offnum = Min(offnum, maxoff);
@@ -1246,8 +1274,20 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
if (itup != NULL)
{
/* tuple passes all scan key conditions, so remember it */
- itemIndex--;
- _bt_saveitem(so, itemIndex, offnum, itup);
+ if (BtreeTupleIsPosting(itup))
+ {
+ for (i = 0; i < BtreeGetNPosting(itup); i++)
+ {
+ itemIndex--;
+ _bt_savePostingitem(so, itemIndex, offnum, BtreeGetPostingN(itup, i), itup, i);
+ }
+ }
+ else
+ {
+ itemIndex--;
+ _bt_saveitem(so, itemIndex, offnum, itup);
+ }
+
}
if (!continuescan)
{
@@ -1261,8 +1301,8 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
Assert(itemIndex >= 0);
so->currPos.firstItem = itemIndex;
- so->currPos.lastItem = MaxIndexTuplesPerPage - 1;
- so->currPos.itemIndex = MaxIndexTuplesPerPage - 1;
+ so->currPos.lastItem = MaxPackedIndexTuplesPerPage - 1;
+ so->currPos.itemIndex = MaxPackedIndexTuplesPerPage - 1;
}
return (so->currPos.firstItem <= so->currPos.lastItem);
@@ -1275,6 +1315,8 @@ _bt_saveitem(BTScanOpaque so, int itemIndex,
{
BTScanPosItem *currItem = &so->currPos.items[itemIndex];
+ Assert (!BtreeTupleIsPosting(itup));
+
currItem->heapTid = itup->t_tid;
currItem->indexOffset = offnum;
if (so->currTuples)
@@ -1288,6 +1330,37 @@ _bt_saveitem(BTScanOpaque so, int itemIndex,
}
/*
+ * Save an index item into so->currPos.items[itemIndex]
+ * Performing index-only scan, handle the first elem separately.
+ * Save the key once, and connect it with posting tids using tupleOffset.
+ */
+static void
+_bt_savePostingitem(BTScanOpaque so, int itemIndex,
+ OffsetNumber offnum, ItemPointer iptr, IndexTuple itup, int i)
+{
+ BTScanPosItem *currItem = &so->currPos.items[itemIndex];
+
+ currItem->heapTid = *iptr;
+ currItem->indexOffset = offnum;
+
+ if (so->currTuples)
+ {
+ if (i == 0)
+ {
+ /* save key. the same for all tuples in the posting */
+ Size itupsz = BtreeGetPostingOffset(itup);
+ currItem->tupleOffset = so->currPos.nextTupleOffset;
+ memcpy(so->currTuples + so->currPos.nextTupleOffset, itup, itupsz);
+ so->currPos.nextTupleOffset += MAXALIGN(itupsz);
+ so->currPos.prevTupleOffset = currItem->tupleOffset;
+ }
+ else
+ currItem->tupleOffset = so->currPos.prevTupleOffset;
+ }
+}
+
+
+/*
* _bt_steppage() -- Step to next page containing valid data for scan
*
* On entry, if so->currPos.buf is valid the buffer is pinned but not locked;
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 99a014e..e29d63f 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -75,6 +75,7 @@
#include "utils/rel.h"
#include "utils/sortsupport.h"
#include "utils/tuplesort.h"
+#include "catalog/catalog.h"
/*
@@ -527,15 +528,120 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
Assert(last_off > P_FIRSTKEY);
ii = PageGetItemId(opage, last_off);
oitup = (IndexTuple) PageGetItem(opage, ii);
- _bt_sortaddtup(npage, ItemIdGetLength(ii), oitup, P_FIRSTKEY);
/*
- * Move 'last' into the high key position on opage
+ * If the item is PostingTuple, we can cut it.
+ * Because HIKEY is not considered as real data, and it needn't to keep any ItemPointerData at all.
+ * And of course it needn't to keep a list of ipd.
+ * But, if it had a big posting list, there will be plenty of free space on the opage.
+ * So we must split Posting tuple into 2 pieces.
*/
- hii = PageGetItemId(opage, P_HIKEY);
- *hii = *ii;
- ItemIdSetUnused(ii); /* redundant */
- ((PageHeader) opage)->pd_lower -= sizeof(ItemIdData);
+ if (BtreeTupleIsPosting(oitup))
+ {
+ int nipd, ntocut, ntoleave;
+ Size keytupsz;
+ IndexTuple keytup;
+ nipd = BtreeGetNPosting(oitup);
+ ntocut = (sizeof(ItemIdData) + BtreeGetPostingOffset(oitup))/sizeof(ItemPointerData);
+ ntocut++; /* round up to be sure that we cut enough */
+ ntoleave = nipd - ntocut;
+
+ /*
+ * 0) Form key tuple, that doesn't contain any ipd.
+ * NOTE: key tuple will have blkno & offset suitable for P_HIKEY.
+ * any function that uses keytup should handle them itself.
+ */
+ keytupsz = BtreeGetPostingOffset(oitup);
+ keytup = palloc0(keytupsz);
+ memcpy (keytup, oitup, keytupsz);
+ keytup->t_info &= ~INDEX_SIZE_MASK;
+ keytup->t_info |= keytupsz;
+ ItemPointerSet(&(keytup->t_tid), oblkno, P_HIKEY);
+
+ if (ntocut < nipd)
+ {
+ ItemPointerData *newipd;
+ IndexTuple newitup, newlasttup;
+ /*
+ * 1) Cut part of old tuple to shift to npage.
+ * And insert it as P_FIRSTKEY.
+ * This tuple is based on keytup.
+ * Blkno & offnum are reset in BtreeFormPackedTuple.
+ */
+ newipd = palloc0(sizeof(ItemPointerData)*ntocut);
+ /* Note, that we cut last 'ntocut' items */
+ memcpy(newipd, BtreeGetPosting(oitup)+ntoleave, sizeof(ItemPointerData)*ntocut);
+ newitup = BtreeFormPackedTuple(keytup, newipd, ntocut);
+
+ _bt_sortaddtup(npage, IndexTupleSize(newitup), newitup, P_FIRSTKEY);
+ pfree(newipd);
+ pfree(newitup);
+
+ /*
+ * 2) set last item to the P_HIKEY linp
+ * Move 'last' into the high key position on opage
+ * NOTE: Do this because of indextuple deletion algorithm, which
+ * doesn't allow to delete an item while we have unused one before it.
+ */
+ hii = PageGetItemId(opage, P_HIKEY);
+ *hii = *ii;
+ ItemIdSetUnused(ii); /* redundant */
+ ((PageHeader) opage)->pd_lower -= sizeof(ItemIdData);
+
+ /* 3) delete "wrong" high key */
+ PageIndexTupleDelete(opage, P_HIKEY);
+
+ /* 4)Insert keytup as P_HIKEY. */
+ _bt_sortaddtup(opage, IndexTupleSize(keytup), keytup, P_HIKEY);
+
+ /* 5) form the part of old tuple with ntoleave ipds. And insert it as last tuple. */
+ newlasttup = BtreeFormPackedTuple(keytup, BtreeGetPosting(oitup), ntoleave);
+
+ _bt_sortaddtup(opage, IndexTupleSize(newlasttup), newlasttup, PageGetMaxOffsetNumber(opage)+1);
+
+ pfree(newlasttup);
+ }
+ else
+ {
+ /* The tuple isn't big enough to split it. Handle it as a normal tuple. */
+
+ /*
+ * 1) Shift the last tuple to npage.
+ * Insert it as P_FIRSTKEY.
+ */
+ _bt_sortaddtup(npage, ItemIdGetLength(ii), oitup, P_FIRSTKEY);
+
+ /* 2) set last item to the P_HIKEY linp */
+ /* Move 'last' into the high key position on opage */
+ hii = PageGetItemId(opage, P_HIKEY);
+ *hii = *ii;
+ ItemIdSetUnused(ii); /* redundant */
+ ((PageHeader) opage)->pd_lower -= sizeof(ItemIdData);
+
+ /* 3) delete "wrong" high key */
+ PageIndexTupleDelete(opage, P_HIKEY);
+
+ /* 4)Insert keytup as P_HIKEY. */
+ _bt_sortaddtup(opage, IndexTupleSize(keytup), keytup, P_HIKEY);
+
+ }
+ pfree(keytup);
+ }
+ else
+ {
+ /*
+ * 1) Shift the last tuple to npage.
+ * Insert it as P_FIRSTKEY.
+ */
+ _bt_sortaddtup(npage, ItemIdGetLength(ii), oitup, P_FIRSTKEY);
+
+ /* 2) set last item to the P_HIKEY linp */
+ /* Move 'last' into the high key position on opage */
+ hii = PageGetItemId(opage, P_HIKEY);
+ *hii = *ii;
+ ItemIdSetUnused(ii); /* redundant */
+ ((PageHeader) opage)->pd_lower -= sizeof(ItemIdData);
+ }
/*
* Link the old page into its parent, using its minimum key. If we
@@ -547,6 +653,7 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
Assert(state->btps_minkey != NULL);
ItemPointerSet(&(state->btps_minkey->t_tid), oblkno, P_HIKEY);
+
_bt_buildadd(wstate, state->btps_next, state->btps_minkey);
pfree(state->btps_minkey);
@@ -555,7 +662,9 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
* it off the old page, not the new one, in case we are not at leaf
* level.
*/
- state->btps_minkey = CopyIndexTuple(oitup);
+ ItemId iihk = PageGetItemId(opage, P_HIKEY);
+ IndexTuple hikey = (IndexTuple) PageGetItem(opage, iihk);
+ state->btps_minkey = CopyIndexTuple(hikey);
/*
* Set the sibling links for both pages.
@@ -590,7 +699,29 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
if (last_off == P_HIKEY)
{
Assert(state->btps_minkey == NULL);
- state->btps_minkey = CopyIndexTuple(itup);
+
+ if (BtreeTupleIsPosting(itup))
+ {
+ Size keytupsz;
+ IndexTuple keytup;
+
+ /*
+ * 0) Form key tuple, that doesn't contain any ipd.
+ * NOTE: key tuple will have blkno & offset suitable for P_HIKEY.
+ * any function that uses keytup should handle them itself.
+ */
+ keytupsz = BtreeGetPostingOffset(itup);
+ keytup = palloc0(keytupsz);
+ memcpy (keytup, itup, keytupsz);
+
+ keytup->t_info &= ~INDEX_SIZE_MASK;
+ keytup->t_info |= keytupsz;
+ ItemPointerSet(&(keytup->t_tid), nblkno, P_HIKEY);
+
+ state->btps_minkey = CopyIndexTuple(keytup);
+ }
+ else
+ state->btps_minkey = CopyIndexTuple(itup);
}
/*
@@ -670,6 +801,67 @@ _bt_uppershutdown(BTWriteState *wstate, BTPageState *state)
}
/*
+ * Prepare SortSupport structure for indextuples comparison
+ */
+SortSupport
+_bt_prepare_SortSupport(BTWriteState *wstate, int keysz)
+{
+ /* Prepare SortSupport data for each column */
+ ScanKey indexScanKey = _bt_mkscankey_nodata(wstate->index);
+ SortSupport sortKeys = (SortSupport) palloc0(keysz * sizeof(SortSupportData));
+ int i;
+
+ for (i = 0; i < keysz; i++)
+ {
+ SortSupport sortKey = sortKeys + i;
+ ScanKey scanKey = indexScanKey + i;
+ int16 strategy;
+
+ sortKey->ssup_cxt = CurrentMemoryContext;
+ sortKey->ssup_collation = scanKey->sk_collation;
+ sortKey->ssup_nulls_first =
+ (scanKey->sk_flags & SK_BT_NULLS_FIRST) != 0;
+ sortKey->ssup_attno = scanKey->sk_attno;
+ /* Abbreviation is not supported here */
+ sortKey->abbreviate = false;
+
+ AssertState(sortKey->ssup_attno != 0);
+
+ strategy = (scanKey->sk_flags & SK_BT_DESC) != 0 ?
+ BTGreaterStrategyNumber : BTLessStrategyNumber;
+
+ PrepareSortSupportFromIndexRel(wstate->index, strategy, sortKey);
+ }
+
+ _bt_freeskey(indexScanKey);
+ return sortKeys;
+}
+
+/*
+ * Compare two tuples using sortKey i
+ */
+int _bt_call_comparator(SortSupport sortKeys, int i,
+ IndexTuple itup, IndexTuple itup2, TupleDesc tupdes)
+{
+ SortSupport entry;
+ Datum attrDatum1,
+ attrDatum2;
+ bool isNull1,
+ isNull2;
+ int32 compare;
+
+ entry = sortKeys + i - 1;
+ attrDatum1 = index_getattr(itup, i, tupdes, &isNull1);
+ attrDatum2 = index_getattr(itup2, i, tupdes, &isNull2);
+
+ compare = ApplySortComparator(attrDatum1, isNull1,
+ attrDatum2, isNull2,
+ entry);
+
+ return compare;
+}
+
+/*
* Read tuples in correct sort order from tuplesort, and load them into
* btree leaves.
*/
@@ -679,16 +871,20 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
BTPageState *state = NULL;
bool merge = (btspool2 != NULL);
IndexTuple itup,
- itup2 = NULL;
+ itup2 = NULL,
+ itupprev = NULL;
bool should_free,
should_free2,
load1;
TupleDesc tupdes = RelationGetDescr(wstate->index);
int i,
keysz = RelationGetNumberOfAttributes(wstate->index);
- ScanKey indexScanKey = NULL;
+ int ntuples = 0;
SortSupport sortKeys;
+ /* Prepare SortSupport data */
+ sortKeys = (SortSupport)_bt_prepare_SortSupport(wstate, keysz);
+
if (merge)
{
/*
@@ -701,34 +897,6 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
true, &should_free);
itup2 = tuplesort_getindextuple(btspool2->sortstate,
true, &should_free2);
- indexScanKey = _bt_mkscankey_nodata(wstate->index);
-
- /* Prepare SortSupport data for each column */
- sortKeys = (SortSupport) palloc0(keysz * sizeof(SortSupportData));
-
- for (i = 0; i < keysz; i++)
- {
- SortSupport sortKey = sortKeys + i;
- ScanKey scanKey = indexScanKey + i;
- int16 strategy;
-
- sortKey->ssup_cxt = CurrentMemoryContext;
- sortKey->ssup_collation = scanKey->sk_collation;
- sortKey->ssup_nulls_first =
- (scanKey->sk_flags & SK_BT_NULLS_FIRST) != 0;
- sortKey->ssup_attno = scanKey->sk_attno;
- /* Abbreviation is not supported here */
- sortKey->abbreviate = false;
-
- AssertState(sortKey->ssup_attno != 0);
-
- strategy = (scanKey->sk_flags & SK_BT_DESC) != 0 ?
- BTGreaterStrategyNumber : BTLessStrategyNumber;
-
- PrepareSortSupportFromIndexRel(wstate->index, strategy, sortKey);
- }
-
- _bt_freeskey(indexScanKey);
for (;;)
{
@@ -742,20 +910,8 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
{
for (i = 1; i <= keysz; i++)
{
- SortSupport entry;
- Datum attrDatum1,
- attrDatum2;
- bool isNull1,
- isNull2;
- int32 compare;
-
- entry = sortKeys + i - 1;
- attrDatum1 = index_getattr(itup, i, tupdes, &isNull1);
- attrDatum2 = index_getattr(itup2, i, tupdes, &isNull2);
-
- compare = ApplySortComparator(attrDatum1, isNull1,
- attrDatum2, isNull2,
- entry);
+ int32 compare = _bt_call_comparator(sortKeys, i, itup, itup2, tupdes);
+
if (compare > 0)
{
load1 = false;
@@ -794,19 +950,137 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
else
{
/* merge is unnecessary */
- while ((itup = tuplesort_getindextuple(btspool->sortstate,
+
+ Relation indexRelation = wstate->index;
+ Form_pg_index index = indexRelation->rd_index;
+
+ if (index->indisunique)
+ {
+ /* Do not use compression for unique indexes. */
+ while ((itup = tuplesort_getindextuple(btspool->sortstate,
true, &should_free)) != NULL)
+ {
+ /* When we see first tuple, create first index page */
+ if (state == NULL)
+ state = _bt_pagestate(wstate, 0);
+
+ _bt_buildadd(wstate, state, itup);
+ if (should_free)
+ pfree(itup);
+ }
+ }
+ else
{
- /* When we see first tuple, create first index page */
- if (state == NULL)
- state = _bt_pagestate(wstate, 0);
+ ItemPointerData *ipd = NULL;
+ IndexTuple postingtuple;
+ Size maxitemsize = 0,
+ maxpostingsize = 0;
+ int32 compare = 0;
- _bt_buildadd(wstate, state, itup);
- if (should_free)
- pfree(itup);
+ while ((itup = tuplesort_getindextuple(btspool->sortstate,
+ true, &should_free)) != NULL)
+ {
+ /* When we see first tuple, create first index page */
+ if (state == NULL)
+ {
+ state = _bt_pagestate(wstate, 0);
+ maxitemsize = BTMaxItemSize(state->btps_page);
+ }
+
+ /*
+ * Compare current tuple with previous one.
+ * If tuples are equal, we can unite them into a posting list.
+ */
+ if (itupprev != NULL)
+ {
+ /* compare tuples */
+ compare = 0;
+ for (i = 1; i <= keysz; i++)
+ {
+ compare = _bt_call_comparator(sortKeys, i, itup, itupprev, tupdes);
+ if (compare != 0)
+ break;
+ }
+
+ if (compare == 0)
+ {
+ /* Tuples are equal. Create or update posting */
+ if (ntuples == 0)
+ {
+ /*
+ * We haven't suitable posting list yet, so allocate
+ * it and save both itupprev and current tuple.
+ */
+
+ ipd = palloc0(maxitemsize);
+
+ memcpy(ipd, itupprev, sizeof(ItemPointerData));
+ ntuples++;
+ memcpy(ipd + ntuples, itup, sizeof(ItemPointerData));
+ ntuples++;
+ }
+ else
+ {
+ if ((ntuples+1)*sizeof(ItemPointerData) < maxpostingsize)
+ {
+ memcpy(ipd + ntuples, itup, sizeof(ItemPointerData));
+ ntuples++;
+ }
+ else
+ {
+ postingtuple = BtreeFormPackedTuple(itupprev, ipd, ntuples);
+ _bt_buildadd(wstate, state, postingtuple);
+ ntuples = 0;
+ pfree(ipd);
+ }
+ }
+
+ }
+ else
+ {
+ /* Tuples aren't equal. Insert itupprev into index. */
+ if (ntuples == 0)
+ _bt_buildadd(wstate, state, itupprev);
+ else
+ {
+ postingtuple = BtreeFormPackedTuple(itupprev, ipd, ntuples);
+ _bt_buildadd(wstate, state, postingtuple);
+ ntuples = 0;
+ pfree(ipd);
+ }
+ }
+ }
+
+ /*
+ * Copy the tuple into temp variable itupprev
+ * to compare it with the following tuple
+ * and maybe unite them into a posting tuple
+ */
+ itupprev = CopyIndexTuple(itup);
+ if (should_free)
+ pfree(itup);
+
+ /* compute max size of ipd list */
+ maxpostingsize = maxitemsize - IndexInfoFindDataOffset(itupprev->t_info) - MAXALIGN(IndexTupleSize(itupprev));
+ }
+
+ /* Handle the last item.*/
+ if (ntuples == 0)
+ {
+ if (itupprev != NULL)
+ _bt_buildadd(wstate, state, itupprev);
+ }
+ else
+ {
+ Assert(ipd!=NULL);
+ Assert(itupprev != NULL);
+ postingtuple = BtreeFormPackedTuple(itupprev, ipd, ntuples);
+ _bt_buildadd(wstate, state, postingtuple);
+ ntuples = 0;
+ pfree(ipd);
+ }
}
}
-
/* Close down final pages and write the metapage */
_bt_uppershutdown(wstate, state);
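The non-unique branch of _bt_load above merges runs of equal keys from the sorted stream into posting tuples and starts a new one when the current one is full. The sketch below shows only that grouping step in a simplified, standalone form. It assumes integer keys and a fixed cap on TIDs per group instead of the byte-size limit used in the patch; DemoEntry, DemoGroup and demo_group_duplicates are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sorted input item: a key and an opaque TID value. */
typedef struct
{
	int			key;
	unsigned	tid;
} DemoEntry;

/* One output group: entries [first, first + count) share the same key. */
typedef struct
{
	int			key;
	size_t		first;
	size_t		count;
} DemoGroup;

/*
 * Group consecutive equal keys of a sorted array, starting a new group
 * whenever the key changes or the current group already holds
 * max_per_group entries. "groups" must have room for n elements (the
 * worst case, when every key is distinct). Returns the number of groups.
 */
static size_t
demo_group_duplicates(const DemoEntry *sorted, size_t n,
					  size_t max_per_group, DemoGroup *groups)
{
	size_t		ngroups = 0;

	assert(max_per_group > 0);

	for (size_t i = 0; i < n; i++)
	{
		if (ngroups > 0 &&
			groups[ngroups - 1].key == sorted[i].key &&
			groups[ngroups - 1].count < max_per_group)
		{
			/* Same key and still room: extend the current group. */
			groups[ngroups - 1].count++;
		}
		else
		{
			/* Key changed or group is full: open a new group. */
			groups[ngroups].key = sorted[i].key;
			groups[ngroups].first = i;
			groups[ngroups].count = 1;
			ngroups++;
		}
	}

	return ngroups;
}
```

With keys 1,1,1,2,3,3 and a cap of two entries per group, this yields four groups: two for key 1 (sizes 2 and 1), one for key 2 and one for key 3.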
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index c850b48..0291342 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -1821,7 +1821,9 @@ _bt_killitems(IndexScanDesc scan)
ItemId iid = PageGetItemId(page, offnum);
IndexTuple ituple = (IndexTuple) PageGetItem(page, iid);
- if (ItemPointerEquals(&ituple->t_tid, &kitem->heapTid))
+ /* No microvacuum for posting tuples */
+ if (!BtreeTupleIsPosting(ituple)
+ && (ItemPointerEquals(&ituple->t_tid, &kitem->heapTid)))
{
/* found the item */
ItemIdMarkDead(iid);
@@ -2063,3 +2065,71 @@ btoptions(Datum reloptions, bool validate)
{
return default_reloptions(reloptions, validate, RELOPT_KIND_BTREE);
}
+
+
+/*
+ * Already have basic index tuple that contains key datum
+ */
+IndexTuple
+BtreeFormPackedTuple(IndexTuple tuple, ItemPointerData *data, int nipd)
+{
+ int i;
+ uint32 newsize;
+ IndexTuple itup = CopyIndexTuple(tuple);
+
+ /*
+ * Determine and store offset to the posting list.
+ */
+ newsize = IndexTupleSize(itup);
+ newsize = SHORTALIGN(newsize);
+
+ /*
+ * Set meta info about the posting list.
+ */
+ BtreeSetPostingOffset(itup, newsize);
+ BtreeSetNPosting(itup, nipd);
+ /*
+ * Add space needed for posting list, if any. Then check that the tuple
+ * won't be too big to store.
+ */
+ newsize += sizeof(ItemPointerData)*nipd;
+ newsize = MAXALIGN(newsize);
+
+ /*
+ * Resize tuple if needed
+ */
+ if (newsize != IndexTupleSize(itup))
+ {
+ itup = repalloc(itup, newsize);
+
+ /*
+ * PostgreSQL 9.3 and earlier did not clear this new space, so we
+ * might find uninitialized padding when reading tuples from disk.
+ */
+ memset((char *) itup + IndexTupleSize(itup),
+ 0, newsize - IndexTupleSize(itup));
+ /* set new size in tuple header */
+ itup->t_info &= ~INDEX_SIZE_MASK;
+ itup->t_info |= newsize;
+ }
+
+ /*
+ * Copy data into the posting tuple
+ */
+ memcpy(BtreeGetPosting(itup), data, sizeof(ItemPointerData)*nipd);
+ return itup;
+}
+
+IndexTuple
+BtreeReformPackedTuple(IndexTuple tuple, ItemPointerData *data, int nipd)
+{
+ int size;
+ if (BtreeTupleIsPosting(tuple))
+ {
+ size = BtreeGetPostingOffset(tuple);
+ tuple->t_info &= ~INDEX_SIZE_MASK;
+ tuple->t_info |= size;
+ }
+
+ return BtreeFormPackedTuple(tuple, data, nipd);
+}
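The size arithmetic in BtreeFormPackedTuple (SHORTALIGN the key part, append nipd item pointers, MAXALIGN the total) can be checked in isolation. The sketch below is an illustration under stated assumptions, not code from the patch: it assumes a 6-byte ItemPointerData and 8-byte maximum alignment, which are typical but platform-dependent, and all DEMO_ and demo_ names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_TID_SIZE	6		/* assumed sizeof(ItemPointerData) */
#define DEMO_SHORTALIGN	2		/* assumed alignment of short */
#define DEMO_MAXALIGN	8		/* assumed MAXIMUM_ALIGNOF */

/* Round len up to the next multiple of align (align must be a power of 2). */
static size_t
demo_align_up(size_t len, size_t align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	return (len + align - 1) & ~(align - 1);
}

/* Offset at which the posting list starts inside the tuple. */
static size_t
demo_posting_offset(size_t key_tuple_size)
{
	return demo_align_up(key_tuple_size, DEMO_SHORTALIGN);
}

/* Total size of a posting tuple holding nipd item pointers. */
static size_t
demo_posting_tuple_size(size_t key_tuple_size, int nipd)
{
	size_t		size = demo_posting_offset(key_tuple_size);

	assert(nipd >= 0);
	size += (size_t) DEMO_TID_SIZE * (size_t) nipd;
	return demo_align_up(size, DEMO_MAXALIGN);
}
```

Under these assumptions, a 16-byte key tuple with three item pointers needs 40 bytes (16 + 18, rounded up to a multiple of 8) and one line pointer, whereas three separate 16-byte tuples need 48 bytes and three line pointers.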
diff --git a/src/include/access/itup.h b/src/include/access/itup.h
index 8350fa0..eb4467a 100644
--- a/src/include/access/itup.h
+++ b/src/include/access/itup.h
@@ -137,7 +137,12 @@ typedef IndexAttributeBitMapData *IndexAttributeBitMap;
#define MaxIndexTuplesPerPage \
((int) ((BLCKSZ - SizeOfPageHeaderData) / \
(MAXALIGN(sizeof(IndexTupleData) + 1) + sizeof(ItemIdData))))
-
+#define MaxPackedIndexTuplesPerPage \
+ ((int) ((BLCKSZ - SizeOfPageHeaderData) / \
+ (sizeof(ItemPointerData))))
+// #define MaxIndexTuplesPerPage \
+// ((int) ((BLCKSZ - SizeOfPageHeaderData) / \
+// (sizeof(ItemPointerData))))
/* routines in indextuple.c */
extern IndexTuple index_form_tuple(TupleDesc tupleDescriptor,
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 06822fa..41e407d 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -75,6 +75,7 @@ typedef BTPageOpaqueData *BTPageOpaque;
#define BTP_SPLIT_END (1 << 5) /* rightmost page of split group */
#define BTP_HAS_GARBAGE (1 << 6) /* page has LP_DEAD tuples */
#define BTP_INCOMPLETE_SPLIT (1 << 7) /* right sibling's downlink is missing */
+#define BTP_HAS_POSTING (1 << 8) /* page contains compressed duplicates (only for leaf pages) */
/*
* The max allowed value of a cycle ID is a bit less than 64K. This is
@@ -181,6 +182,8 @@ typedef struct BTMetaPageData
#define P_IGNORE(opaque) ((opaque)->btpo_flags & (BTP_DELETED|BTP_HALF_DEAD))
#define P_HAS_GARBAGE(opaque) ((opaque)->btpo_flags & BTP_HAS_GARBAGE)
#define P_INCOMPLETE_SPLIT(opaque) ((opaque)->btpo_flags & BTP_INCOMPLETE_SPLIT)
+#define P_HAS_POSTING(opaque) ((opaque)->btpo_flags & BTP_HAS_POSTING)
+
/*
* Lehman and Yao's algorithm requires a ``high key'' on every non-rightmost
@@ -538,6 +541,8 @@ typedef struct BTScanPosData
* location in the associated tuple storage workspace.
*/
int nextTupleOffset;
+ /* prevTupleOffset is for Posting list handling*/
+ int prevTupleOffset;
/*
* The items array is always ordered in index order (ie, increasing
@@ -550,7 +555,7 @@ typedef struct BTScanPosData
int lastItem; /* last valid index in items[] */
int itemIndex; /* current index in items[] */
- BTScanPosItem items[MaxIndexTuplesPerPage]; /* MUST BE LAST */
+ BTScanPosItem items[MaxPackedIndexTuplesPerPage]; /* MUST BE LAST */
} BTScanPosData;
typedef BTScanPosData *BTScanPos;
@@ -651,6 +656,28 @@ typedef BTScanOpaqueData *BTScanOpaque;
#define SK_BT_DESC (INDOPTION_DESC << SK_BT_INDOPTION_SHIFT)
#define SK_BT_NULLS_FIRST (INDOPTION_NULLS_FIRST << SK_BT_INDOPTION_SHIFT)
+
+/*
+ * We use our own ItemPointerGet(BlockNumber|OffsetNumber)
+ * to avoid Asserts, since sometimes the ip_posid isn't "valid"
+ */
+#define BtreeItemPointerGetBlockNumber(pointer) \
+ BlockIdGetBlockNumber(&(pointer)->ip_blkid)
+
+#define BtreeItemPointerGetOffsetNumber(pointer) \
+ ((pointer)->ip_posid)
+
+#define BT_POSTING (1<<31)
+#define BtreeGetNPosting(itup) BtreeItemPointerGetOffsetNumber(&(itup)->t_tid)
+#define BtreeSetNPosting(itup,n) ItemPointerSetOffsetNumber(&(itup)->t_tid,n)
+
+#define BtreeGetPostingOffset(itup) (BtreeItemPointerGetBlockNumber(&(itup)->t_tid) & (~BT_POSTING))
+#define BtreeSetPostingOffset(itup,n) ItemPointerSetBlockNumber(&(itup)->t_tid,(n)|BT_POSTING)
+#define BtreeTupleIsPosting(itup) (BtreeItemPointerGetBlockNumber(&(itup)->t_tid) & BT_POSTING)
+#define BtreeGetPosting(itup) (ItemPointerData*) ((char*)(itup) + BtreeGetPostingOffset(itup))
+#define BtreeGetPostingN(itup,n) (ItemPointerData*) (BtreeGetPosting(itup) + n)
+
+
/*
* prototypes for functions in nbtree.c (external entry points for btree)
*/
@@ -715,8 +742,8 @@ extern BTStack _bt_search(Relation rel,
extern Buffer _bt_moveright(Relation rel, Buffer buf, int keysz,
ScanKey scankey, bool nextkey, bool forupdate, BTStack stack,
int access);
-extern OffsetNumber _bt_binsrch(Relation rel, Buffer buf, int keysz,
- ScanKey scankey, bool nextkey);
+extern OffsetNumber _bt_binsrch( Relation rel, Buffer buf, int keysz,
+ ScanKey scankey, bool nextkey, bool* updposting);
extern int32 _bt_compare(Relation rel, int keysz, ScanKey scankey,
Page page, OffsetNumber offnum);
extern bool _bt_first(IndexScanDesc scan, ScanDirection dir);
@@ -747,6 +774,8 @@ extern void _bt_end_vacuum_callback(int code, Datum arg);
extern Size BTreeShmemSize(void);
extern void BTreeShmemInit(void);
extern bytea *btoptions(Datum reloptions, bool validate);
+extern IndexTuple BtreeFormPackedTuple(IndexTuple tuple, ItemPointerData *data, int nipd);
+extern IndexTuple BtreeReformPackedTuple(IndexTuple tuple, ItemPointerData *data, int nipd);
/*
* prototypes for functions in nbtvalidate.c
On 29 January 2016 at 16:50, Anastasia Lubennikova
<a.lubennikova@postgrespro.ru> wrote:
29.01.2016 19:01, Thom Brown:
On 29 January 2016 at 15:47, Aleksander Alekseev
<a.alekseev@postgrespro.ru> wrote:
I tested this patch on x64 and ARM servers for a few hours today. The
only problem I could find is that INSERT works considerably slower after
applying a patch. Beside that everything looks fine - no crashes, tests
pass, memory doesn't seem to leak, etc.
Thank you for testing. I rechecked that, and insertions are really very very
very slow. It seems like a bug.
Okay, now for some badness. I've restored a database containing 2
tables, one 318MB, another 24kB. The 318MB table contains 5 million
rows with a sequential id column. I get a problem if I try to delete
many rows from it:
# delete from contacts where id % 3 != 0 ;
WARNING: out of shared memory
WARNING: out of shared memory
WARNING: out of shared memory
I didn't manage to reproduce this. Thom, could you describe exact steps
to reproduce this issue please?
Sure, I used my pg_rep_test tool to create a primary (pg_rep_test
-r0), which creates an instance with a custom config, which is as
follows:
shared_buffers = 8MB
max_connections = 7
wal_level = 'hot_standby'
cluster_name = 'primary'
max_wal_senders = 3
wal_keep_segments = 6
Then create a pgbench data set (I didn't originally use pgbench, but
you can get the same results with it):
createdb -p 5530 pgbench
pgbench -p 5530 -i -s 100 pgbench
And delete some stuff:
thom@swift:~/Development/test$ psql -p 5530 pgbench
Timing is on.
psql (9.6devel)
Type "help" for help.
➤ psql://thom@[local]:5530/pgbench
# DELETE FROM pgbench_accounts WHERE aid % 3 != 0;
WARNING: out of shared memory
WARNING: out of shared memory
WARNING: out of shared memory
WARNING: out of shared memory
WARNING: out of shared memory
WARNING: out of shared memory
WARNING: out of shared memory
...
WARNING: out of shared memory
WARNING: out of shared memory
DELETE 6666667
Time: 22218.804 ms
There were 358 lines of that warning message. I don't get these
messages without the patch.
Thom
Thank you for this report.
I tried to reproduce it, but I couldn't. Debug will be much easier now.
I hope I'll fix these issues within the next few days.
BTW, I found a dummy mistake, the previous patch contains some unrelated
changes. I fixed it in the new version (attached).
Thanks. Well I've tested this latest patch, and the warnings are no
longer generated. However, the index sizes show that the patch
doesn't seem to be doing its job, so I'm wondering if you removed too
much from it.
Thom
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
29.01.2016 20:43, Thom Brown:
Thanks. Well I've tested this latest patch, and the warnings are no
longer generated. However, the index sizes show that the patch
doesn't seem to be doing its job, so I'm wondering if you removed too
much from it.
Huh, this patch seems to be enchanted) It works fine for me. Did you
perform "make distclean"?
Anyway, I'll send a new version soon.
I'm just writing to say that I haven't disappeared and I do remember
the issue.
I have almost fixed the insert speed problem, but I'm very, very busy
this week.
I'll send an updated patch next week as soon as possible.
Thank you for your attention to this work.
--
Anastasia Lubennikova
Postgres Professional:http://www.postgrespro.com
The Russian Postgres Company
On 2 February 2016 at 11:47, Anastasia Lubennikova
<a.lubennikova@postgrespro.ru> wrote:
Huh, this patch seems to be enchanted) It works fine for me. Did you perform
"make distclean"?
Yes. Just tried it again:
git clean -fd
git stash
make distclean
patch -p1 < ~/Downloads/btree_compression_2.0.patch
../dopg.sh (script I've always used to build with)
pg_ctl start
createdb pgbench
pgbench -i -s 100 pgbench
$ psql pgbench
Timing is on.
psql (9.6devel)
Type "help" for help.
➤ psql://thom@[local]:5488/pgbench
# \di+
                                   List of relations
 Schema |         Name          | Type  | Owner |      Table       |  Size  | Description
--------+-----------------------+-------+-------+------------------+--------+-------------
 public | pgbench_accounts_pkey | index | thom  | pgbench_accounts | 214 MB |
 public | pgbench_branches_pkey | index | thom  | pgbench_branches | 24 kB  |
 public | pgbench_tellers_pkey  | index | thom  | pgbench_tellers  | 48 kB  |
(3 rows)
Previously, this would show an index size of 87MB for pgbench_accounts_pkey.
Anyway, I'll send a new version soon.
I just write here to say that I do not disappear and I do remember about the
issue.
I even almost fixed the insert speed problem. But I'm very very busy this
week.
I'll send an updated patch next week as soon as possible.
Thanks.
Thank you for attention to this work.
Thanks for your awesome patches.
Thom
On Tue, Feb 2, 2016 at 3:59 AM, Thom Brown <thom@linux.com> wrote:
public | pgbench_accounts_pkey | index | thom | pgbench_accounts | 214 MB |
public | pgbench_branches_pkey | index | thom | pgbench_branches | 24 kB |
public | pgbench_tellers_pkey | index | thom | pgbench_tellers | 48 kB |
I see the same.
I use my regular SQL query to see the breakdown of leaf/internal/root pages:
postgres=# with tots as (
SELECT count(*) c,
avg(live_items) avg_live_items,
avg(dead_items) avg_dead_items,
u.type,
r.oid
from (select c.oid,
c.relpages,
generate_series(1, c.relpages - 1) i
from pg_index i
join pg_opclass op on i.indclass[0] = op.oid
join pg_am am on op.opcmethod = am.oid
join pg_class c on i.indexrelid = c.oid
where am.amname = 'btree') r,
lateral (select * from bt_page_stats(r.oid::regclass::text, i)) u
group by r.oid, type)
select ct.relname table_name,
tots.oid::regclass::text index_name,
(select relpages - 1 from pg_class c where c.oid = tots.oid) non_meta_pages,
upper(type) page_type,
c npages,
to_char(avg_live_items, '990.999'),
to_char(avg_dead_items, '990.999'),
to_char(c/sum(c) over(partition by tots.oid) * 100, '990.999') || ' %' as prop_of_index
from tots
join pg_index i on i.indexrelid = tots.oid
join pg_class ct on ct.oid = i.indrelid
where tots.oid = 'pgbench_accounts_pkey'::regclass
order by ct.relnamespace, table_name, index_name, npages, type;
    table_name    │      index_name       │ non_meta_pages │ page_type │ npages │ to_char  │ to_char  │ prop_of_index
──────────────────┼───────────────────────┼────────────────┼───────────┼────────┼──────────┼──────────┼───────────────
 pgbench_accounts │ pgbench_accounts_pkey │         27,421 │ R         │      1 │   97.000 │    0.000 │    0.004 %
 pgbench_accounts │ pgbench_accounts_pkey │         27,421 │ I         │     97 │  282.670 │    0.000 │    0.354 %
 pgbench_accounts │ pgbench_accounts_pkey │         27,421 │ L         │ 27,323 │  366.992 │    0.000 │   99.643 %
(3 rows)
But this looks healthy -- I see the same with master. And since the
accounts table is listed as 1281 MB, this looks like a plausible ratio
in the size of the table to its primary index (which I would not say
is true of an 87MB primary key index).
Are you sure you have the details right, Thom?
--
Peter Geoghegan
On 4 February 2016 at 15:07, Peter Geoghegan <pg@heroku.com> wrote:
Are you sure you have the details right, Thom?
*facepalm*
No, I'm not. I've just realised that all I've been checking is the
primary key expecting it to change in size, which is, of course,
nonsense. I should have been creating an index on the bid field of
pgbench_accounts and reviewing the size of that.
Now I've checked it with the latest patch, and can see it working
fine. Apologies for the confusion.
Thom
On Thu, Feb 4, 2016 at 8:25 AM, Thom Brown <thom@linux.com> wrote:
No, I'm not. I've just realised that all I've been checking is the
primary key expecting it to change in size, which is, of course,
nonsense. I should have been creating an index on the bid field of
pgbench_accounts and reviewing the size of that.
Right. Because, apart from everything else, unique indexes are not
currently supported.
--
Peter Geoghegan
On Fri, Jan 29, 2016 at 8:50 AM, Anastasia Lubennikova
<a.lubennikova@postgrespro.ru> wrote:
I fixed it in the new version (attached).
Some quick remarks on your V2.0:
* Seems unnecessary that _bt_binsrch() is passed a real pointer by all
callers. Maybe the one current posting list caller
_bt_findinsertloc(), or its caller, _bt_doinsert(), should do this
work itself:
@@ -373,7 +377,17 @@ _bt_binsrch(Relation rel,
* scan key), which could be the last slot + 1.
*/
if (P_ISLEAF(opaque))
+ {
+ if (low <= PageGetMaxOffsetNumber(page))
+ {
+ IndexTuple oitup = (IndexTuple) PageGetItem(page, PageGetItemId(page, low));
+ /* one excessive check of equality. for possible posting tuple update or creation */
+ if ((_bt_compare(rel, keysz, scankey, page, low) == 0)
+ && (IndexTupleSize(oitup) + sizeof(ItemPointerData) < BTMaxItemSize(page)))
+ *updposing = true;
+ }
return low;
+ }
* ISTM that you should not use _bt_compare() above, in any case. Consider this:
postgres=# select 5.0 = 5.000;
?column?
──────────
t
(1 row)
B-Tree operator class indicates equality here. And yet, users will
expect to see the original value in an index-only scan, including the
trailing zeroes as they were originally input. So this should be a bit
closer to HeapSatisfiesHOTandKeyUpdate() (actually,
heap_tuple_attr_equals()), which looks for strict binary equality for
similar reasons.
* Is this correct?:
@@ -555,7 +662,9 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
* it off the old page, not the new one, in case we are not at leaf
* level.
*/
- state->btps_minkey = CopyIndexTuple(oitup);
+ ItemId iihk = PageGetItemId(opage, P_HIKEY);
+ IndexTuple hikey = (IndexTuple) PageGetItem(opage, iihk);
+ state->btps_minkey = CopyIndexTuple(hikey);
How this code has changed from the master branch is not clear to me.
I understand that this code is incomplete/draft:
+#define MaxPackedIndexTuplesPerPage \
+ ((int) ((BLCKSZ - SizeOfPageHeaderData) / \
+ (sizeof(ItemPointerData))))
But why is it different to the old (actually unchanged)
MaxIndexTuplesPerPage? I would like to see comments explaining your
understanding, even if they are quite rough. Why did GIN never require
this change to a generic header (itup.h)? Should such a change live in
that generic header file, and not another one more localized to
nbtree?
* More explanation of the design would be nice. I suggest modifying
the nbtree README file, so it's easy to tell what the current design
is. It's hard to follow this from the thread. When I reviewed Heikki's
B-Tree patches from a couple of years ago, we spent ~75% of the time
on design, and only ~25% on code.
* I have a paranoid feeling that the deletion locking protocol
(VACUUMing index tuples concurrently and safely) may need special
consideration here. Basically, with the B-Tree code, there are several
complicated locking protocols, like for page splits, page deletion,
and interlocking with vacuum ("super exclusive lock" stuff). These are
why the B-Tree code is complicated in general, and it's very important
to pin down exactly how we deal with each. Ideally, you'd have an
explanation for why your code was correct in each of these existing
cases (especially deletion). With very complicated and important code
like this, it's often wise to be very clear about when we are talking
about your design, and when we are talking about your code. It's
generally too hard to review both at the same time.
Ideally, when you talk about your design, you'll be able to say things
like "it's clear that this existing thing is correct; at least we have
no complaints from the field. Therefore, it must be true that my new
technique is also correct, because it makes that general situation no
worse". Obviously that kind of rigor is just something we aspire to,
and still fall short of at times. Still, it would be nice to
specifically see a reason why the new code isn't special from the
point of view of the super-exclusive lock thing (which is what I mean
by deletion locking protocol + special consideration). Or why it is
special, but that's okay, or whatever. This style of review is normal
when writing B-Tree code. Some other things don't need this rigor, or
have no invariants that need to be respected/used. Maybe this is
obvious to you already, but it isn't obvious to me.
It's okay if you don't know why, but knowing that you don't have a
strong opinion about something is itself useful information.
* I see you disabled the LP_DEAD thing; why? Just because that made
bugs go away?
* Have you done much stress testing? Using pgbench with many
concurrent VACUUM FREEZE operations would be a good idea, if you
haven't already, because that is insistent about getting super
exclusive locks, unlike regular VACUUM.
* Are you keeping the restriction of 1/3 of a buffer page, but that
just includes the posting list now? That's the kind of detail I'd like
to see in the README now.
* Why not support unique indexes? The obvious answer is that it isn't
worth it, but why? How useful would that be (a bit, just not enough)?
What's the trade-off?
Anyway, this is really cool work; I have often thought that we don't
have nearly enough people thinking about how to optimize B-Tree
indexing. It is hard, but so is anything worthwhile.
That's all I have for now. Just a quick review focused on code and
correctness (and not on the benefits). I want to do more on this,
especially the benefits, because it deserves more attention.
--
Peter Geoghegan
04.02.2016 20:16, Peter Geoghegan:
On Fri, Jan 29, 2016 at 8:50 AM, Anastasia Lubennikova
<a.lubennikova@postgrespro.ru> wrote:
I fixed it in the new version (attached).
Thank you for the review.
At last, there is a new patch version 3.0. After some refactoring it
looks much better.
I described all details of the compression in this document
https://goo.gl/50O8Q0 (the same
text without pictures is attached in btc_readme_1.0.txt).
Consider it a rough draft of the README. It contains some notes about
tricky parts of the implementation and questions about future work.
Please don't hesitate to comment on it.
Some quick remarks on your V2.0:
* Seems unnecessary that _bt_binsrch() is passed a real pointer by all
callers. Maybe the one current posting list caller
_bt_findinsertloc(), or its caller, _bt_doinsert(), should do this
work itself:
@@ -373,7 +377,17 @@ _bt_binsrch(Relation rel,
* scan key), which could be the last slot + 1.
*/
if (P_ISLEAF(opaque))
+ {
+ if (low <= PageGetMaxOffsetNumber(page))
+ {
+ IndexTuple oitup = (IndexTuple) PageGetItem(page, PageGetItemId(page, low));
+ /* one excessive check of equality. for possible posting tuple update or creation */
+ if ((_bt_compare(rel, keysz, scankey, page, low) == 0)
+ && (IndexTupleSize(oitup) + sizeof(ItemPointerData) < BTMaxItemSize(page)))
+ *updposing = true;
+ }
return low;
+ }
* ISTM that you should not use _bt_compare() above, in any case. Consider this:
postgres=# select 5.0 = 5.000;
?column?
──────────
t
(1 row)
B-Tree operator class indicates equality here. And yet, users will
expect to see the original value in an index-only scan, including the
trailing zeroes as they were originally input. So this should be a bit
closer to HeapSatisfiesHOTandKeyUpdate() (actually,
heap_tuple_attr_equals()), which looks for strict binary equality for
similar reasons.
Thank you for the notice. Fixed.
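
The distinction raised in the quoted remark can be sketched in a few lines of C. This is not PostgreSQL code: both helper functions are hypothetical, and numeric values are modelled as plain strings so that `5.0` and `5.000` keep their original spelling. The point is only that equality by value (what an operator class reports) and equality of the stored bytes (what is needed before two keys can safely share one stored copy) can disagree.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for operator-class equality: compare by value. */
static bool opclass_equal(const char *a, const char *b)
{
    return strtod(a, NULL) == strtod(b, NULL);
}

/* Hypothetical stand-in for binary equality: same length, same bytes. */
static bool binary_equal(const char *a, const char *b)
{
    size_t la = strlen(a);
    size_t lb = strlen(b);

    return la == lb && memcmp(a, b, la) == 0;
}

/* Keys may share one stored copy only if their bytes are identical. */
static bool can_share_posting_tuple(const char *a, const char *b)
{
    return opclass_equal(a, b) && binary_equal(a, b);
}
```

With these helpers, `5.0` and `5.000` are equal by value but must not be merged, because an index-only scan would then return the wrong spelling for one of them.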
* Is this correct?:
@@ -555,7 +662,9 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
* it off the old page, not the new one, in case we are not at leaf
* level.
*/
- state->btps_minkey = CopyIndexTuple(oitup);
+ ItemId iihk = PageGetItemId(opage, P_HIKEY);
+ IndexTuple hikey = (IndexTuple) PageGetItem(opage, iihk);
+ state->btps_minkey = CopyIndexTuple(hikey);
How this code has changed from the master branch is not clear to me.
Yes, it is. I completed the comment above.
I understand that this code is incomplete/draft:
+#define MaxPackedIndexTuplesPerPage \
+ ((int) ((BLCKSZ - SizeOfPageHeaderData) / \
+ (sizeof(ItemPointerData))))
But why is it different to the old (actually unchanged)
MaxIndexTuplesPerPage? I would like to see comments explaining your
understanding, even if they are quite rough. Why did GIN never require
this change to a generic header (itup.h)? Should such a change live in
that generic header file, and not another one more localized to
nbtree?
I agree.
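
For reference, the two limits under discussion can be compared with some quick arithmetic. The sketch below assumes the default 8 kB block size and the usual sizes on a 64-bit build (24-byte page header, 8-byte index tuple header, 4-byte line pointer, 6-byte ItemPointerData, 8-byte maximum alignment). The constant and function names are stand-ins for illustration, not the real headers.

```c
#include <assert.h>

/* Assumed sizes for a default 64-bit build; not taken from the headers. */
enum
{
    BLOCK_SIZE = 8192,
    PAGE_HEADER_SIZE = 24,
    INDEX_TUPLE_HEADER = 8,
    LINE_POINTER_SIZE = 4,
    TID_SIZE = 6,
    MAX_ALIGNMENT = 8
};

/* Round len up to the next multiple of MAX_ALIGNMENT. */
static int max_align(int len)
{
    return (len + MAX_ALIGNMENT - 1) & ~(MAX_ALIGNMENT - 1);
}

/* Mirrors MaxIndexTuplesPerPage: smallest tuple plus its line pointer. */
static int max_index_tuples_per_page(void)
{
    return (BLOCK_SIZE - PAGE_HEADER_SIZE) /
        (max_align(INDEX_TUPLE_HEADER + 1) + LINE_POINTER_SIZE);
}

/* Mirrors MaxPackedIndexTuplesPerPage: a page filled with bare TIDs. */
static int max_packed_tids_per_page(void)
{
    return (BLOCK_SIZE - PAGE_HEADER_SIZE) / TID_SIZE;
}
```

Under these assumptions the first limit is 408 and the second is 1361. The larger bound presumably matters because the items[] array in BTScanPosData holds one entry per heap TID, and a page of posting tuples can carry more TIDs than it has index tuples.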
--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachments:
btc_readme_1.0.patch (text/x-patch)
Compression. Strictly speaking, this is not compression but a more compact layout of ItemPointers on an index page.
compressed tuple = IndexTuple (key, with metadata stored in the TID field) + PostingList
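
As a rough illustration of "metadata in the TID field", here is a minimal standalone sketch modelled on the BT_POSTING, BtreeSetPostingOffset and BtreeSetNPosting macros from the patch. The struct and function names are hypothetical simplifications: the real t_tid is an ItemPointerData, not this struct. The flag is written with an unsigned constant, since shifting a signed 1 by 31 bits is not portable C.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* High bit of the block number marks the tuple as a posting tuple. */
#define POSTING_FLAG (UINT32_C(1) << 31)

/* Simplified, hypothetical stand-in for the t_tid field. */
typedef struct
{
    uint32_t block;   /* heap block number, or flag | posting list offset */
    uint16_t offset;  /* heap offset number, or number of TIDs in the list */
} TidField;

/* Reuse the TID field to describe the posting list that follows the key. */
static void set_posting_meta(TidField *tid, uint32_t list_offset, uint16_t ntids)
{
    tid->block = list_offset | POSTING_FLAG;
    tid->offset = ntids;
}

static bool is_posting(const TidField *tid)
{
    return (tid->block & POSTING_FLAG) != 0;
}

static uint32_t posting_offset(const TidField *tid)
{
    return tid->block & ~POSTING_FLAG;
}

static uint16_t posting_count(const TidField *tid)
{
    return tid->offset;
}
```

A regular tuple keeps a real heap TID in this field, so it costs nothing extra; only tuples that carry a posting list give up the field to store the list offset and the TID count.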
1. The GIN index works extremely well for really large sets of repeating keys, but on the other hand, it handles unique keys very poorly. For btree it is essential to have good performance and concurrency in every corner case, with any number of duplicates. That’s why we can’t just copy the GIN implementation of item pointer compression. The first difference is that the btree algorithm performs compression (or, in other words, changes the index tuple layout) only if there is more than one tuple with the same key. This lets us avoid the overhead of storing useless metadata for mostly distinct keys (see picture below). It seems that compression could be useful for unique indexes under heavy write/update load (because of MVCC copies), but I’m not sure whether this use case really exists. Those tuples should be deleted by microvacuum as soon as possible. Anyway, I think it’s worth adding a storage parameter for btree that enables/disables compression for each particular index, and setting compression of unique indexes to off by default. System indexes do not support compression, for several reasons. First of all, because of the WIP state of the patch (debugging the system catalog isn’t much fun). The next reason is that I know of many places in the code where hardcoded values or some non-obvious syscache routines are used, and I do not feel brave enough to change this code. And last but not least, I don’t see good reasons to do that.
2. If the index key is very small (smaller than the metadata) and the number of duplicates is small, compression could lead to index bloat instead of a decrease in index size (see picture below). I’m not sure whether it’s worth handling this case separately, because it’s really rare and I consider it the DBA’s job to disable compression on such indexes. But if you see any clear way to do this, it would be great.
3. For GIN indexes, if a posting list is too large, a posting tree is created. This design relies on the following assumptions:
  * Indexed keys are never deleted. This makes all tree algorithms much easier.
  * There are always many duplicates. Otherwise, GIN becomes really inefficient.
  * The rate of concurrent insertions is low. In order to add a new entry into a posting tree, we hold a lock on its root, so only one backend at a time can perform an insertion.
In btree we can’t afford these assumptions, so we should handle big posting lists in another way. If there are too many ItemPointers to fit into a single posting list, we just create another one. The overhead of this approach is that we have to store a duplicate of the key and metadata. This leads to the problem of big keys: if the key size is close to BTMaxItemSize, compression will give us a really small benefit, if any at all (see picture below).
4. The more item pointers fit into a single posting list, the more rarely we have to split it and repeat the key. Therefore, the bigger BTMaxItemSize is, the better. The comment in nbtree.h says: “We actually need to be able to fit three items on every page, so restrict any one item to 1/3 the per-page available space.” That is quite right for regular items, but a compressed index tuple already contains more than one item. Taking this into account, we can assert that BTMaxItemSize ~ ⅓ pagesize for regular items, and ~ ½ pagesize for compressed items. Are there any objections? I wonder if we can increase BTMaxItemSize under some other assumption? The problem I see here is that a varlena high key could be as big as the compressed tuple.
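
To put rough numbers on the ⅓ versus ½ question, the sketch below models the item size limit on the BTMaxItemSize formula and counts how many 6-byte TIDs fit next to a 16-byte base tuple. All sizes are assumptions for a default 8 kB page on a 64-bit build, the helper names are hypothetical, and the ½ case simply changes the divisor without revisiting the line pointer reservation.

```c
#include <assert.h>

/* Assumed sizes for a default 64-bit build; not taken from the headers. */
enum
{
    BLOCK_SIZE = 8192,
    PAGE_HEADER_SIZE = 24,
    LINE_POINTER_SIZE = 4,
    BT_OPAQUE_SIZE = 16,
    TID_SIZE = 6,
    MAX_ALIGNMENT = 8
};

/* Round len up to the next multiple of MAX_ALIGNMENT. */
static int max_align(int len)
{
    return (len + MAX_ALIGNMENT - 1) & ~(MAX_ALIGNMENT - 1);
}

/* Round len down to a multiple of MAX_ALIGNMENT. */
static int max_align_down(int len)
{
    return len & ~(MAX_ALIGNMENT - 1);
}

/* Largest item if items_per_page items must always fit on one page. */
static int max_item_size(int items_per_page)
{
    int usable = BLOCK_SIZE
        - max_align(PAGE_HEADER_SIZE + 3 * LINE_POINTER_SIZE)
        - max_align(BT_OPAQUE_SIZE);

    return max_align_down(usable / items_per_page);
}

/* TIDs that fit in one posting tuple after the key-bearing base tuple. */
static int tids_per_posting_tuple(int item_limit, int base_tuple_size)
{
    return (item_limit - base_tuple_size) / TID_SIZE;
}
```

Under these assumptions the ⅓ limit is 2712 bytes, which leaves room for 449 TIDs per posting tuple, while a ½ limit of 4064 bytes leaves room for 674, so the key would be repeated about a third less often.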
5. CREATE INDEX. _bt_load. The btree build algorithm is as follows: do the heap scan, add tuples into the spool, sort the data, insert the ordered data from the spool into leaf index pages (_bt_load), then build inner pages and the root. The main changes are in the _bt_load function. While loading tuples, we do not insert them one by one; instead, we compare each tuple with the previous one, and if they are equal we put them into a posting list. If the posting list has reached the maximum size that fits into an index tuple (the maximum posting size is computed as BTMaxItemSize minus the size of a regular index tuple), or if the following tuple is not equal to the previous one, we create a packed tuple from the posting list (if any) using BtreeFormPackedTuple and insert it into a page. We do the same when there are no more elements in the spool.
6. The high key is not real data, but just an upper bound on the keys that are allowed on the page, so there’s no need to compress it. While copying a posting tuple into a high key, we should get rid of the posting list: the posting tuple should be truncated to the length of a regular tuple, and the metadata in its TID field should be set to appropriate values. It’s worth mentioning here a very specific point in _bt_buildadd(). If the current page is full (there is no room for a new tuple), we copy the last item on the page into the new page, and then rearrange the old page so that the 'last item' becomes its high key rather than a true data item. If the last tuple was compressed, we can truncate it before setting it as the high key. But if it had a big posting list, there will be plenty of free space on the original page, so we must split the posting tuple into 2 pieces. See the picture below and the comments in the code. I’m not sure about the correctness of locking here, but I assume that there are no possible concurrent operations while building an index. Is that right?
7. Another difference between GIN and btree is that item pointers in a GIN posting list/tree are always ordered, while btree doesn’t strictly require this. If there are many duplicates in btree, we don’t bother to find the ideal place to keep TIDs ordered. The insertion has a choice whether or not to move right. Currently, we just try to find a page where there is room for the new key. The next TODO item is to keep item pointers in the posting list ordered. The advantage is that the best compression of a posting list is reached on sorted TIDs. What do you think about it?
8. Insertion. After we have found a suitable place for insertion, we check whether the previous item has the same key. If so, and if there is enough room on the page to add a pointer, we can add it to that item. There are two possible cases. If the old item is a regular tuple, we should form a new compressed tuple. Note that this case requires enough space for two TIDs (metadata and the new TID). Otherwise, we just add the pointer to the existing posting list. Then we delete the old tuple and insert the new one.
9. Search. Fortunately, it’s quite easy to change the search algorithm. If a compressed tuple is found, just go over all its TIDs and return them. If an index-only scan is being processed, just return the same tuple N times in a row. To avoid storing duplicates in the currTuples array, save the key once and then connect it with the posting TIDs using tupleOffset. Clearly, if compression is applied, a page can contain more tuples than it could with only uncompressed tuples. That is why MaxPackedIndexTuplesPerPage appears. The items array in BTScanPos (the one that works together with currTuples and tupleOffset) is preallocated with length = MaxPackedIndexTuplesPerPage, because we must be sure that all items will fit into the array.
10. Split. The only change in this section is posting list truncation before inserting the tuple as a high key.
11. Vacuum. Check all TIDs in a posting list. If there are no live items in the compressed tuple, delete the tuple. Otherwise, do the following: form a new posting tuple that contains the remaining item pointers; delete the "old" posting tuple; insert the new posting tuple back into the page. Microvacuum of compressed tuples is not implemented yet. It’s possible to use the high bit of the offset field of an item pointer to flag killed items, but that requires additional performance testing.
12. Locking. Compressed index tuples use the same insertion and deletion functions as regular index tuples. Most of the operations are performed inside standard functions and don’t need any specific locks, although this issue definitely requires more thorough testing and review. All operations where a posting tuple is updated in place (deleted and then inserted again with a new set of item pointers in the posting list) are performed with the special function _bt_pgupdtup(). The same function is used where we want to replace one tuple with another, e.g. in btvacuumpage() and _bt_buildadd() (see the issue related to the high key).
13. Xlog. TODO.
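
To make the grouping step from item 5 easier to discuss, here is a minimal standalone sketch of it. It is not the patch code: Tid, Entry, PackedTuple, MAX_POSTING and pack_sorted_entries are hypothetical names invented for illustration, the key is a plain int compared with ==, and the cap on posting size is a fixed constant, whereas the patch compares keys by binary equality and derives the cap from BTMaxItemSize.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for ItemPointerData (a heap TID). */
typedef struct { uint32_t block; uint16_t offset; } Tid;

/* One sorted spool entry: a key plus the heap TID it points to. */
typedef struct { int key; Tid tid; } Entry;

/* Assumed cap on TIDs per posting list; the patch computes the real
 * limit from BTMaxItemSize minus the size of a regular index tuple. */
#define MAX_POSTING 4

/* Hypothetical stand-in for a packed (posting) index tuple. */
typedef struct {
    int    key;
    size_t ntids;
    Tid    tids[MAX_POSTING];
} PackedTuple;

/*
 * Walk entries that are already sorted by key and pack runs of equal
 * keys into posting lists. A new packed tuple is started when the key
 * changes or when the current posting list is full, so a long run of
 * duplicates repeats the key once per MAX_POSTING TIDs.
 *
 * 'out' must have room for n tuples (the worst case: no duplicates).
 * Returns the number of packed tuples written.
 */
size_t
pack_sorted_entries(const Entry *in, size_t n, PackedTuple *out)
{
    size_t nout = 0;

    assert(n == 0 || (in != NULL && out != NULL));

    for (size_t i = 0; i < n; i++)
    {
        int start_new = (nout == 0)
            || out[nout - 1].key != in[i].key
            || out[nout - 1].ntids == MAX_POSTING;

        if (start_new)
        {
            out[nout].key = in[i].key;
            out[nout].ntids = 0;
            nout++;
        }

        /* Append this TID to the posting list being built. */
        out[nout - 1].tids[out[nout - 1].ntids++] = in[i].tid;
    }

    return nout;
}
```

With this cap, five duplicates of one key end up in two packed tuples (four TIDs, then one), which is the "just create another posting list" behaviour described in item 3 and the reason a larger BTMaxItemSize helps in item 4.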
Attachment: btree_compression_3.0.patch (text/x-patch)
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index e3c55eb..d6922d5 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -24,6 +24,8 @@
#include "storage/predicate.h"
#include "utils/tqual.h"
+#include "catalog/catalog.h"
+#include "utils/datum.h"
typedef struct
{
@@ -82,6 +84,7 @@ static bool _bt_pgaddtup(Page page, Size itemsize, IndexTuple itup,
OffsetNumber itup_off);
static bool _bt_isequal(TupleDesc itupdesc, Page page, OffsetNumber offnum,
int keysz, ScanKey scankey);
+
static void _bt_vacuum_one_page(Relation rel, Buffer buffer, Relation heapRel);
@@ -113,6 +116,11 @@ _bt_doinsert(Relation rel, IndexTuple itup,
BTStack stack;
Buffer buf;
OffsetNumber offset;
+ Page page;
+ TupleDesc itupdesc;
+ int nipd;
+ IndexTuple olditup;
+ Size sizetoadd;
/* we need an insertion scan key to do our search, so build one */
itup_scankey = _bt_mkscankey(rel, itup);
@@ -190,6 +198,7 @@ top:
if (checkUnique != UNIQUE_CHECK_EXISTING)
{
+ bool updposting = false;
/*
* The only conflict predicate locking cares about for indexes is when
* an index tuple insert conflicts with an existing lock. Since the
@@ -201,7 +210,45 @@ top:
/* do the insertion */
_bt_findinsertloc(rel, &buf, &offset, natts, itup_scankey, itup,
stack, heapRel);
- _bt_insertonpg(rel, buf, InvalidBuffer, stack, itup, offset, false);
+
+ /*
+ * Decide, whether we can apply compression
+ */
+ page = BufferGetPage(buf);
+
+ if(!IsSystemRelation(rel)
+ && !rel->rd_index->indisunique
+ && offset != InvalidOffsetNumber
+ && offset <= PageGetMaxOffsetNumber(page))
+ {
+ itupdesc = RelationGetDescr(rel);
+ sizetoadd = sizeof(ItemPointerData);
+ olditup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offset));
+
+ if(_bt_isbinaryequal(itupdesc, olditup,
+ rel->rd_index->indnatts, itup))
+ {
+ if (!BtreeTupleIsPosting(olditup))
+ {
+ nipd = 1;
+ sizetoadd = sizetoadd*2;
+ }
+ else
+ nipd = BtreeGetNPosting(olditup);
+
+ if ((IndexTupleSize(olditup) + sizetoadd) <= BTMaxItemSize(page)
+ && PageGetFreeSpace(page) > sizetoadd)
+ updposting = true;
+ }
+ }
+
+ if (updposting)
+ {
+ _bt_pgupdtup(rel, page, offset, itup, true, olditup, nipd);
+ _bt_relbuf(rel, buf);
+ }
+ else
+ _bt_insertonpg(rel, buf, InvalidBuffer, stack, itup, offset, false);
}
else
{
@@ -1042,6 +1089,7 @@ _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
itemid = PageGetItemId(origpage, P_HIKEY);
itemsz = ItemIdGetLength(itemid);
item = (IndexTuple) PageGetItem(origpage, itemid);
+
if (PageAddItem(rightpage, (Item) item, itemsz, rightoff,
false, false) == InvalidOffsetNumber)
{
@@ -1072,13 +1120,39 @@ _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
itemsz = ItemIdGetLength(itemid);
item = (IndexTuple) PageGetItem(origpage, itemid);
}
- if (PageAddItem(leftpage, (Item) item, itemsz, leftoff,
+
+ if (BtreeTupleIsPosting(item))
+ {
+ Size hikeysize = BtreeGetPostingOffset(item);
+ IndexTuple hikey = palloc0(hikeysize);
+
+ /* Truncate posting before insert it as a hikey. */
+ memcpy (hikey, item, hikeysize);
+ hikey->t_info &= ~INDEX_SIZE_MASK;
+ hikey->t_info |= hikeysize;
+ ItemPointerSet(&(hikey->t_tid), origpagenumber, P_HIKEY);
+
+ if (PageAddItem(leftpage, (Item) hikey, hikeysize, leftoff,
false, false) == InvalidOffsetNumber)
+ {
+ memset(rightpage, 0, BufferGetPageSize(rbuf));
+ elog(ERROR, "failed to add hikey to the left sibling"
+ " while splitting block %u of index \"%s\"",
+ origpagenumber, RelationGetRelationName(rel));
+ }
+
+ pfree(hikey);
+ }
+ else
{
- memset(rightpage, 0, BufferGetPageSize(rbuf));
- elog(ERROR, "failed to add hikey to the left sibling"
- " while splitting block %u of index \"%s\"",
- origpagenumber, RelationGetRelationName(rel));
+ if (PageAddItem(leftpage, (Item) item, itemsz, leftoff,
+ false, false) == InvalidOffsetNumber)
+ {
+ memset(rightpage, 0, BufferGetPageSize(rbuf));
+ elog(ERROR, "failed to add hikey to the left sibling"
+ " while splitting block %u of index \"%s\"",
+ origpagenumber, RelationGetRelationName(rel));
+ }
}
leftoff = OffsetNumberNext(leftoff);
@@ -2103,6 +2177,76 @@ _bt_pgaddtup(Page page,
}
/*
+ * _bt_pgupdtup() -- update a tuple in place.
+ * This function is used for purposes of deduplication of item pointers.
+ * If new tuple to insert is equal to the tuple that already exists on the page,
+ * we can avoid key insertion and just add new item pointer.
+ *
+ * offset is the position of olditup on the page.
+ * itup is the new tuple to insert
+ * concat - this flag shows, whether we should add new item to existing one
+ * or just replace old tuple with the new value. If concat is false, the
+ * following fields are senseless.
+ * nipd is the number of item pointers in old tuple.
+ * The caller is responsible for checking of free space on the page.
+ */
+void
+_bt_pgupdtup(Relation rel, Page page, OffsetNumber offset, IndexTuple itup,
+ bool concat, IndexTuple olditup, int nipd)
+{
+ ItemPointerData *ipd;
+ IndexTuple newitup;
+ Size newitupsz;
+
+ if (concat)
+ {
+ ipd = palloc0(sizeof(ItemPointerData)*(nipd + 1));
+
+ /* copy item pointers from old tuple into ipd */
+ if (BtreeTupleIsPosting(olditup))
+ memcpy(ipd, BtreeGetPosting(olditup), sizeof(ItemPointerData)*nipd);
+ else
+ memcpy(ipd, olditup, sizeof(ItemPointerData));
+
+ /* add item pointer of the new tuple into ipd */
+ memcpy(ipd+nipd, itup, sizeof(ItemPointerData));
+
+ newitup = BtreeReformPackedTuple(itup, ipd, nipd+1);
+
+ /*
+ * Update the tuple in place. We have already checked that the
+ * new tuple would fit into this page, so it's safe to delete
+ * old tuple and insert the new one without any side effects.
+ */
+ newitupsz = IndexTupleDSize(*newitup);
+ newitupsz = MAXALIGN(newitupsz);
+ }
+ else
+ {
+ newitup = itup;
+ newitupsz = IndexTupleSize(itup);
+ }
+
+ START_CRIT_SECTION();
+
+ PageIndexTupleDelete(page, offset);
+
+ if (!_bt_pgaddtup(page, newitupsz, newitup, offset))
+ elog(ERROR, "failed to insert compressed item in index \"%s\"",
+ RelationGetRelationName(rel));
+
+ //TODO add Xlog stuff
+
+ END_CRIT_SECTION();
+
+ if (concat)
+ {
+ pfree(ipd);
+ pfree(newitup);
+ }
+}
+
+/*
* _bt_isequal - used in _bt_doinsert in check for duplicates.
*
* This is very similar to _bt_compare, except for NULL handling.
@@ -2151,6 +2295,63 @@ _bt_isequal(TupleDesc itupdesc, Page page, OffsetNumber offnum,
}
/*
+ * _bt_isbinaryequal - used in _bt_doinsert and _bt_load
+ * in check for duplicates. This is very similar to heap_tuple_attr_equals
+ * subroutine. And this function differs from _bt_isequal
+ * because here we require strict binary equality of tuples.
+ */
+bool
+_bt_isbinaryequal(TupleDesc itupdesc, IndexTuple itup,
+ int nindatts, IndexTuple ituptoinsert)
+{
+ AttrNumber attno;
+
+ for (attno = 1; attno <= nindatts; attno++)
+ {
+ Datum datum1,
+ datum2;
+ bool isnull1,
+ isnull2;
+ Form_pg_attribute att;
+
+ datum1 = index_getattr(itup, attno, itupdesc, &isnull1);
+ datum2 = index_getattr(ituptoinsert, attno, itupdesc, &isnull2);
+
+ /*
+ * If one value is NULL and other is not, then they are certainly not
+ * equal
+ */
+ if (isnull1 != isnull2)
+ return false;
+ /*
+ * We do simple binary comparison of the two datums. This may be overly
+ * strict because there can be multiple binary representations for the
+ * same logical value. But we should be OK as long as there are no false
+ * positives. Using a type-specific equality operator is messy because
+ * there could be multiple notions of equality in different operator
+ * classes; furthermore, we cannot safely invoke user-defined functions
+ * while holding exclusive buffer lock.
+ */
+ if (attno <= 0)
+ {
+ /* The only allowed system columns are OIDs, so do this */
+ if (DatumGetObjectId(datum1) != DatumGetObjectId(datum2))
+ return false;
+ }
+ else
+ {
+ Assert(attno <= itupdesc->natts);
+ att = itupdesc->attrs[attno - 1];
+ if(!datumIsEqual(datum1, datum2, att->attbyval, att->attlen))
+ return false;
+ }
+ }
+
+ /* if we get here, the keys are equal */
+ return true;
+}
+
+/*
* _bt_vacuum_one_page - vacuum just one index page.
*
* Try to remove LP_DEAD items from the given page. The passed buffer
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index f2905cb..a08c500 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -74,7 +74,8 @@ static void btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
BTCycleId cycleid);
static void btvacuumpage(BTVacState *vstate, BlockNumber blkno,
BlockNumber orig_blkno);
-
+static ItemPointer btreevacuumPosting(BTVacState *vstate,
+ ItemPointerData *items,int nitem, int *nremaining);
/*
* Btree handler function: return IndexAmRoutine with access method parameters
@@ -962,6 +963,7 @@ restart:
OffsetNumber offnum,
minoff,
maxoff;
+ IndexTuple remaining;
/*
* Trade in the initial read lock for a super-exclusive write lock on
@@ -1011,31 +1013,58 @@ restart:
itup = (IndexTuple) PageGetItem(page,
PageGetItemId(page, offnum));
- htup = &(itup->t_tid);
-
- /*
- * During Hot Standby we currently assume that
- * XLOG_BTREE_VACUUM records do not produce conflicts. That is
- * only true as long as the callback function depends only
- * upon whether the index tuple refers to heap tuples removed
- * in the initial heap scan. When vacuum starts it derives a
- * value of OldestXmin. Backends taking later snapshots could
- * have a RecentGlobalXmin with a later xid than the vacuum's
- * OldestXmin, so it is possible that row versions deleted
- * after OldestXmin could be marked as killed by other
- * backends. The callback function *could* look at the index
- * tuple state in isolation and decide to delete the index
- * tuple, though currently it does not. If it ever did, we
- * would need to reconsider whether XLOG_BTREE_VACUUM records
- * should cause conflicts. If they did cause conflicts they
- * would be fairly harsh conflicts, since we haven't yet
- * worked out a way to pass a useful value for
- * latestRemovedXid on the XLOG_BTREE_VACUUM records. This
- * applies to *any* type of index that marks index tuples as
- * killed.
- */
- if (callback(htup, callback_state))
- deletable[ndeletable++] = offnum;
+ if(BtreeTupleIsPosting(itup))
+ {
+ ItemPointer newipd;
+ int nipd,
+ nnewipd;
+
+ nipd = BtreeGetNPosting(itup);
+ newipd = btreevacuumPosting(vstate, BtreeGetPosting(itup), nipd, &nnewipd);
+
+ if (newipd != NULL)
+ {
+ if (nnewipd > 0)
+ {
+ /* There are still some live tuples in the posting.
+ * 1) form new posting tuple, that contains remaining ipds
+ * 2) delete "old" posting and insert new posting back to the page
+ */
+ remaining = BtreeReformPackedTuple(itup, newipd, nnewipd);
+ _bt_pgupdtup(info->index, page, offnum, remaining, false, NULL, 0);
+ }
+ else
+ deletable[ndeletable++] = offnum;
+ }
+ }
+ else
+ {
+ htup = &(itup->t_tid);
+
+ /*
+ * During Hot Standby we currently assume that
+ * XLOG_BTREE_VACUUM records do not produce conflicts. That is
+ * only true as long as the callback function depends only
+ * upon whether the index tuple refers to heap tuples removed
+ * in the initial heap scan. When vacuum starts it derives a
+ * value of OldestXmin. Backends taking later snapshots could
+ * have a RecentGlobalXmin with a later xid than the vacuum's
+ * OldestXmin, so it is possible that row versions deleted
+ * after OldestXmin could be marked as killed by other
+ * backends. The callback function *could* look at the index
+ * tuple state in isolation and decide to delete the index
+ * tuple, though currently it does not. If it ever did, we
+ * would need to reconsider whether XLOG_BTREE_VACUUM records
+ * should cause conflicts. If they did cause conflicts they
+ * would be fairly harsh conflicts, since we haven't yet
+ * worked out a way to pass a useful value for
+ * latestRemovedXid on the XLOG_BTREE_VACUUM records. This
+ * applies to *any* type of index that marks index tuples as
+ * killed.
+ */
+ if (callback(htup, callback_state))
+ deletable[ndeletable++] = offnum;
+ }
}
}
@@ -1160,3 +1189,50 @@ btcanreturn(Relation index, int attno)
{
return true;
}
+
+/*
+ * btreevacuumPosting() -- vacuums a posting list.
+ * The size of the list must be specified via number of items (nitems).
+ *
+ * If none of the items need to be removed, returns NULL. Otherwise returns
+ * a new palloc'd array with the remaining items. The number of remaining
+ * items is returned via nremaining.
+ */
+ItemPointer
+btreevacuumPosting(BTVacState *vstate, ItemPointerData *items,
+ int nitem, int *nremaining)
+{
+ int i,
+ remaining = 0;
+ ItemPointer tmpitems = NULL;
+ IndexBulkDeleteCallback callback = vstate->callback;
+ void *callback_state = vstate->callback_state;
+
+ /*
+ * Iterate over TIDs array
+ */
+ for (i = 0; i < nitem; i++)
+ {
+ if (callback(items + i, callback_state))
+ {
+ if (!tmpitems)
+ {
+ /*
+ * First TID to be deleted: allocate memory to hold the
+ * remaining items.
+ */
+ tmpitems = palloc(sizeof(ItemPointerData) * nitem);
+ memcpy(tmpitems, items, sizeof(ItemPointerData) * i);
+ }
+ }
+ else
+ {
+ if (tmpitems)
+ tmpitems[remaining] = items[i];
+ remaining++;
+ }
+ }
+
+ *nremaining = remaining;
+ return tmpitems;
+}
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 3db32e8..301c019 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -29,6 +29,8 @@ static bool _bt_readpage(IndexScanDesc scan, ScanDirection dir,
OffsetNumber offnum);
static void _bt_saveitem(BTScanOpaque so, int itemIndex,
OffsetNumber offnum, IndexTuple itup);
+static void _bt_savePostingitem(BTScanOpaque so, int itemIndex,
+ OffsetNumber offnum, ItemPointer iptr, IndexTuple itup, int i);
static bool _bt_steppage(IndexScanDesc scan, ScanDirection dir);
static Buffer _bt_walk_left(Relation rel, Buffer buf);
static bool _bt_endpoint(IndexScanDesc scan, ScanDirection dir);
@@ -1161,6 +1163,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
int itemIndex;
IndexTuple itup;
bool continuescan;
+ int i;
/*
* We must have the buffer pinned and locked, but the usual macro can't be
@@ -1195,6 +1198,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
/* initialize tuple workspace to empty */
so->currPos.nextTupleOffset = 0;
+ so->currPos.prevTupleOffset = 0;
/*
* Now that the current page has been made consistent, the macro should be
@@ -1215,8 +1219,19 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
if (itup != NULL)
{
/* tuple passes all scan key conditions, so remember it */
- _bt_saveitem(so, itemIndex, offnum, itup);
- itemIndex++;
+ if (BtreeTupleIsPosting(itup))
+ {
+ for (i = 0; i < BtreeGetNPosting(itup); i++)
+ {
+ _bt_savePostingitem(so, itemIndex, offnum, BtreeGetPostingN(itup, i), itup, i);
+ itemIndex++;
+ }
+ }
+ else
+ {
+ _bt_saveitem(so, itemIndex, offnum, itup);
+ itemIndex++;
+ }
}
if (!continuescan)
{
@@ -1228,7 +1243,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
offnum = OffsetNumberNext(offnum);
}
- Assert(itemIndex <= MaxIndexTuplesPerPage);
+ Assert(itemIndex <= MaxPackedIndexTuplesPerPage);
so->currPos.firstItem = 0;
so->currPos.lastItem = itemIndex - 1;
so->currPos.itemIndex = 0;
@@ -1236,7 +1251,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
else
{
/* load items[] in descending order */
- itemIndex = MaxIndexTuplesPerPage;
+ itemIndex = MaxPackedIndexTuplesPerPage;
offnum = Min(offnum, maxoff);
@@ -1246,8 +1261,20 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
if (itup != NULL)
{
/* tuple passes all scan key conditions, so remember it */
- itemIndex--;
- _bt_saveitem(so, itemIndex, offnum, itup);
+ if (BtreeTupleIsPosting(itup))
+ {
+ for (i = 0; i < BtreeGetNPosting(itup); i++)
+ {
+ itemIndex--;
+ _bt_savePostingitem(so, itemIndex, offnum, BtreeGetPostingN(itup, i), itup, i);
+ }
+ }
+ else
+ {
+ itemIndex--;
+ _bt_saveitem(so, itemIndex, offnum, itup);
+ }
+
}
if (!continuescan)
{
@@ -1261,8 +1288,8 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
Assert(itemIndex >= 0);
so->currPos.firstItem = itemIndex;
- so->currPos.lastItem = MaxIndexTuplesPerPage - 1;
- so->currPos.itemIndex = MaxIndexTuplesPerPage - 1;
+ so->currPos.lastItem = MaxPackedIndexTuplesPerPage - 1;
+ so->currPos.itemIndex = MaxPackedIndexTuplesPerPage - 1;
}
return (so->currPos.firstItem <= so->currPos.lastItem);
@@ -1288,6 +1315,37 @@ _bt_saveitem(BTScanOpaque so, int itemIndex,
}
/*
+ * Save an index item into so->currPos.items[itemIndex]
+ * Performing index-only scan, handle the first elem separately.
+ * Save the key once, and connect it with posting tids using tupleOffset.
+ */
+static void
+_bt_savePostingitem(BTScanOpaque so, int itemIndex,
+ OffsetNumber offnum, ItemPointer iptr, IndexTuple itup, int i)
+{
+ BTScanPosItem *currItem = &so->currPos.items[itemIndex];
+
+ currItem->heapTid = *iptr;
+ currItem->indexOffset = offnum;
+
+ if (so->currTuples)
+ {
+ if (i == 0)
+ {
+ /* save key. the same for all tuples in the posting */
+ Size itupsz = BtreeGetPostingOffset(itup);
+ currItem->tupleOffset = so->currPos.nextTupleOffset;
+ memcpy(so->currTuples + so->currPos.nextTupleOffset, itup, itupsz);
+ so->currPos.nextTupleOffset += MAXALIGN(itupsz);
+ so->currPos.prevTupleOffset = currItem->tupleOffset;
+ }
+ else
+ currItem->tupleOffset = so->currPos.prevTupleOffset;
+ }
+}
+
+
+/*
* _bt_steppage() -- Step to next page containing valid data for scan
*
* On entry, if so->currPos.buf is valid the buffer is pinned but not locked;
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 99a014e..e46930b 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -75,7 +75,7 @@
#include "utils/rel.h"
#include "utils/sortsupport.h"
#include "utils/tuplesort.h"
-
+#include "catalog/catalog.h"
/*
* Status record for spooling/sorting phase. (Note we may have two of
@@ -136,6 +136,9 @@ static void _bt_sortaddtup(Page page, Size itemsize,
static void _bt_buildadd(BTWriteState *wstate, BTPageState *state,
IndexTuple itup);
static void _bt_uppershutdown(BTWriteState *wstate, BTPageState *state);
+static SortSupport _bt_prepare_SortSupport(BTWriteState *wstate, int keysz);
+static int _bt_call_comparator(SortSupport sortKeys, int i,
+ IndexTuple itup, IndexTuple itup2, TupleDesc tupdes);
static void _bt_load(BTWriteState *wstate,
BTSpool *btspool, BTSpool *btspool2);
@@ -527,15 +530,120 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
Assert(last_off > P_FIRSTKEY);
ii = PageGetItemId(opage, last_off);
oitup = (IndexTuple) PageGetItem(opage, ii);
- _bt_sortaddtup(npage, ItemIdGetLength(ii), oitup, P_FIRSTKEY);
/*
- * Move 'last' into the high key position on opage
+ * If the item is PostingTuple, we can cut it, because HIKEY
+ * is not considered as real data, and it need not to keep any
+ * ItemPointerData at all. And of course it need not to keep
+ * a list of ipd.
+ * But, if it had a big posting list, there will be plenty of
+ * free space on the opage. In that case we must split posting
+ * tuple into 2 pieces.
*/
- hii = PageGetItemId(opage, P_HIKEY);
- *hii = *ii;
- ItemIdSetUnused(ii); /* redundant */
- ((PageHeader) opage)->pd_lower -= sizeof(ItemIdData);
+ if (BtreeTupleIsPosting(oitup))
+ {
+ IndexTuple keytup;
+ Size keytupsz;
+ int nipd,
+ ntocut,
+ ntoleave;
+
+ nipd = BtreeGetNPosting(oitup);
+ ntocut = (sizeof(ItemIdData) + BtreeGetPostingOffset(oitup))/sizeof(ItemPointerData);
+ ntocut++; /* round up to be sure that we cut enough */
+ ntoleave = nipd - ntocut;
+
+ /*
+ * 0) Form key tuple, that doesn't contain any ipd.
+ * NOTE: key tuple will have blkno & offset suitable for P_HIKEY.
+ * any function that uses keytup should handle them itself.
+ */
+ keytupsz = BtreeGetPostingOffset(oitup);
+ keytup = palloc0(keytupsz);
+ memcpy (keytup, oitup, keytupsz);
+ keytup->t_info &= ~INDEX_SIZE_MASK;
+ keytup->t_info |= keytupsz;
+ ItemPointerSet(&(keytup->t_tid), oblkno, P_HIKEY);
+
+ if (ntocut < nipd)
+ {
+ ItemPointerData *newipd;
+ IndexTuple newitup,
+ newlasttup;
+ /*
+ * 1) Cut part of old tuple to shift to npage.
+ * And insert it as P_FIRSTKEY.
+ * This tuple is based on keytup.
+ * Blkno & offnum are reset in BtreeFormPackedTuple.
+ */
+ newipd = palloc0(sizeof(ItemPointerData)*ntocut);
+ /* Note, that we cut last 'ntocut' items */
+ memcpy(newipd, BtreeGetPosting(oitup)+ntoleave, sizeof(ItemPointerData)*ntocut);
+ newitup = BtreeFormPackedTuple(keytup, newipd, ntocut);
+
+ _bt_sortaddtup(npage, IndexTupleSize(newitup), newitup, P_FIRSTKEY);
+ pfree(newipd);
+ pfree(newitup);
+
+ /*
+ * 2) set last item to the P_HIKEY linp
+ * Move 'last' into the high key position on opage
+ * NOTE: Do this because of indextuple deletion algorithm, which
+ * doesn't allow to delete an item while we have unused one before it.
+ */
+ hii = PageGetItemId(opage, P_HIKEY);
+ *hii = *ii;
+ ItemIdSetUnused(ii); /* redundant */
+ ((PageHeader) opage)->pd_lower -= sizeof(ItemIdData);
+
+ /* 3) delete "wrong" high key, insert keytup as P_HIKEY. */
+ _bt_pgupdtup(wstate->index, opage, P_HIKEY, keytup, false, NULL, 0);
+
+ /* 4) form the part of old tuple with ntoleave ipds. And insert it as last tuple. */
+ newlasttup = BtreeFormPackedTuple(keytup, BtreeGetPosting(oitup), ntoleave);
+
+ _bt_sortaddtup(opage, IndexTupleSize(newlasttup), newlasttup, PageGetMaxOffsetNumber(opage)+1);
+
+ pfree(newlasttup);
+ }
+ else
+ {
+ /* The tuple isn't big enough to split it. Handle it as a regular tuple. */
+
+ /*
+ * 1) Shift the last tuple to npage.
+ * Insert it as P_FIRSTKEY.
+ */
+ _bt_sortaddtup(npage, ItemIdGetLength(ii), oitup, P_FIRSTKEY);
+
+ /* 2) set last item to the P_HIKEY linp */
+ /* Move 'last' into the high key position on opage */
+ hii = PageGetItemId(opage, P_HIKEY);
+ *hii = *ii;
+ ItemIdSetUnused(ii); /* redundant */
+ ((PageHeader) opage)->pd_lower -= sizeof(ItemIdData);
+
+ /* 3) delete "wrong" high key, insert keytup as P_HIKEY. */
+ _bt_pgupdtup(wstate->index, opage, P_HIKEY, keytup, false, NULL, 0);
+
+ }
+ pfree(keytup);
+ }
+ else
+ {
+ /*
+ * 1) Shift the last tuple to npage.
+ * Insert it as P_FIRSTKEY.
+ */
+ _bt_sortaddtup(npage, ItemIdGetLength(ii), oitup, P_FIRSTKEY);
+
+ /* 2) set last item to the P_HIKEY linp */
+ /* Move 'last' into the high key position on opage */
+ hii = PageGetItemId(opage, P_HIKEY);
+ *hii = *ii;
+ ItemIdSetUnused(ii); /* redundant */
+ ((PageHeader) opage)->pd_lower -= sizeof(ItemIdData);
+ }
/*
* Link the old page into its parent, using its minimum key. If we
@@ -547,6 +655,7 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
Assert(state->btps_minkey != NULL);
ItemPointerSet(&(state->btps_minkey->t_tid), oblkno, P_HIKEY);
+
_bt_buildadd(wstate, state->btps_next, state->btps_minkey);
pfree(state->btps_minkey);
@@ -554,8 +663,12 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
* Save a copy of the minimum key for the new page. We have to copy
* it off the old page, not the new one, in case we are not at leaf
* level.
+ * We can not just copy oitup, because it could be posting tuple
+ * and it's more safe just to get new inserted hikey.
*/
- state->btps_minkey = CopyIndexTuple(oitup);
+ ItemId iihk = PageGetItemId(opage, P_HIKEY);
+ IndexTuple hikey = (IndexTuple) PageGetItem(opage, iihk);
+ state->btps_minkey = CopyIndexTuple(hikey);
/*
* Set the sibling links for both pages.
@@ -590,7 +703,29 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
if (last_off == P_HIKEY)
{
Assert(state->btps_minkey == NULL);
- state->btps_minkey = CopyIndexTuple(itup);
+
+ if (BtreeTupleIsPosting(itup))
+ {
+ Size keytupsz;
+ IndexTuple keytup;
+
+ /*
+ * 0) Form key tuple, that doesn't contain any ipd.
+ * NOTE: key tuple will have blkno & offset suitable for P_HIKEY.
+ * any function that uses keytup should handle them itself.
+ */
+ keytupsz = BtreeGetPostingOffset(itup);
+ keytup = palloc0(keytupsz);
+ memcpy (keytup, itup, keytupsz);
+
+ keytup->t_info &= ~INDEX_SIZE_MASK;
+ keytup->t_info |= keytupsz;
+ ItemPointerSet(&(keytup->t_tid), nblkno, P_HIKEY);
+
+ state->btps_minkey = CopyIndexTuple(keytup);
+ }
+ else
+ state->btps_minkey = CopyIndexTuple(itup);
}
/*
@@ -670,6 +805,71 @@ _bt_uppershutdown(BTWriteState *wstate, BTPageState *state)
}
/*
+ * Prepare SortSupport structure for indextuples comparison
+ */
+static SortSupport
+_bt_prepare_SortSupport(BTWriteState *wstate, int keysz)
+{
+ ScanKey indexScanKey;
+ SortSupport sortKeys;
+ int i;
+
+ /* Prepare SortSupport data for each column */
+ indexScanKey = _bt_mkscankey_nodata(wstate->index);
+ sortKeys = (SortSupport) palloc0(keysz * sizeof(SortSupportData));
+
+ for (i = 0; i < keysz; i++)
+ {
+ SortSupport sortKey = sortKeys + i;
+ ScanKey scanKey = indexScanKey + i;
+ int16 strategy;
+
+ sortKey->ssup_cxt = CurrentMemoryContext;
+ sortKey->ssup_collation = scanKey->sk_collation;
+ sortKey->ssup_nulls_first =
+ (scanKey->sk_flags & SK_BT_NULLS_FIRST) != 0;
+ sortKey->ssup_attno = scanKey->sk_attno;
+ /* Abbreviation is not supported here */
+ sortKey->abbreviate = false;
+
+ AssertState(sortKey->ssup_attno != 0);
+
+ strategy = (scanKey->sk_flags & SK_BT_DESC) != 0 ?
+ BTGreaterStrategyNumber : BTLessStrategyNumber;
+
+ PrepareSortSupportFromIndexRel(wstate->index, strategy, sortKey);
+ }
+
+ _bt_freeskey(indexScanKey);
+ return sortKeys;
+}
+
+/*
+ * Compare two tuples using sortKey on attribute i
+ */
+static int
+_bt_call_comparator(SortSupport sortKeys, int i,
+ IndexTuple itup, IndexTuple itup2, TupleDesc tupdes)
+{
+ SortSupport entry;
+ Datum attrDatum1,
+ attrDatum2;
+ bool isNull1,
+ isNull2;
+ int32 compare;
+
+ entry = sortKeys + i - 1;
+ attrDatum1 = index_getattr(itup, i, tupdes, &isNull1);
+ attrDatum2 = index_getattr(itup2, i, tupdes, &isNull2);
+
+ compare = ApplySortComparator(attrDatum1, isNull1,
+ attrDatum2, isNull2,
+ entry);
+
+ return compare;
+}
+
+/*
* Read tuples in correct sort order from tuplesort, and load them into
* btree leaves.
*/
@@ -679,16 +879,20 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
BTPageState *state = NULL;
bool merge = (btspool2 != NULL);
IndexTuple itup,
- itup2 = NULL;
+ itup2 = NULL,
+ itupprev = NULL;
bool should_free,
should_free2,
load1;
TupleDesc tupdes = RelationGetDescr(wstate->index);
int i,
keysz = RelationGetNumberOfAttributes(wstate->index);
- ScanKey indexScanKey = NULL;
+ int ntuples = 0;
SortSupport sortKeys;
+ /* Prepare SortSupport structure for indextuples comparison */
+ sortKeys = (SortSupport)_bt_prepare_SortSupport(wstate, keysz);
+
if (merge)
{
/*
@@ -701,34 +905,6 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
true, &should_free);
itup2 = tuplesort_getindextuple(btspool2->sortstate,
true, &should_free2);
- indexScanKey = _bt_mkscankey_nodata(wstate->index);
-
- /* Prepare SortSupport data for each column */
- sortKeys = (SortSupport) palloc0(keysz * sizeof(SortSupportData));
-
- for (i = 0; i < keysz; i++)
- {
- SortSupport sortKey = sortKeys + i;
- ScanKey scanKey = indexScanKey + i;
- int16 strategy;
-
- sortKey->ssup_cxt = CurrentMemoryContext;
- sortKey->ssup_collation = scanKey->sk_collation;
- sortKey->ssup_nulls_first =
- (scanKey->sk_flags & SK_BT_NULLS_FIRST) != 0;
- sortKey->ssup_attno = scanKey->sk_attno;
- /* Abbreviation is not supported here */
- sortKey->abbreviate = false;
-
- AssertState(sortKey->ssup_attno != 0);
-
- strategy = (scanKey->sk_flags & SK_BT_DESC) != 0 ?
- BTGreaterStrategyNumber : BTLessStrategyNumber;
-
- PrepareSortSupportFromIndexRel(wstate->index, strategy, sortKey);
- }
-
- _bt_freeskey(indexScanKey);
for (;;)
{
@@ -742,20 +918,8 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
{
for (i = 1; i <= keysz; i++)
{
- SortSupport entry;
- Datum attrDatum1,
- attrDatum2;
- bool isNull1,
- isNull2;
- int32 compare;
-
- entry = sortKeys + i - 1;
- attrDatum1 = index_getattr(itup, i, tupdes, &isNull1);
- attrDatum2 = index_getattr(itup2, i, tupdes, &isNull2);
-
- compare = ApplySortComparator(attrDatum1, isNull1,
- attrDatum2, isNull2,
- entry);
+ int32 compare = _bt_call_comparator(sortKeys, i, itup, itup2, tupdes);
+
if (compare > 0)
{
load1 = false;
@@ -794,16 +958,123 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
else
{
/* merge is unnecessary */
- while ((itup = tuplesort_getindextuple(btspool->sortstate,
+ Relation indexRelation = wstate->index;
+ Form_pg_index index = indexRelation->rd_index;
+
+ if (IsSystemRelation(indexRelation) || index->indisunique)
+ {
+ /* Do not use compression. */
+ while ((itup = tuplesort_getindextuple(btspool->sortstate,
true, &should_free)) != NULL)
+ {
+ /* When we see first tuple, create first index page */
+ if (state == NULL)
+ state = _bt_pagestate(wstate, 0);
+
+ _bt_buildadd(wstate, state, itup);
+ if (should_free)
+ pfree(itup);
+ }
+ }
+ else
{
- /* When we see first tuple, create first index page */
- if (state == NULL)
- state = _bt_pagestate(wstate, 0);
+ ItemPointerData *ipd = NULL;
+ IndexTuple postingtuple;
+ Size maxitemsize = 0,
+ maxpostingsize = 0;
- _bt_buildadd(wstate, state, itup);
- if (should_free)
- pfree(itup);
+ while ((itup = tuplesort_getindextuple(btspool->sortstate,
+ true, &should_free)) != NULL)
+ {
+ /* When we see first tuple, create first index page */
+ if (state == NULL)
+ {
+ state = _bt_pagestate(wstate, 0);
+ maxitemsize = BTMaxItemSize(state->btps_page);
+ }
+
+ /*
+ * Compare current tuple with previous one.
+ * If tuples are equal, we can unite them into a posting list.
+ */
+ if (itupprev != NULL)
+ {
+ if (_bt_isbinaryequal(tupdes, itupprev, index->indnatts, itup))
+ {
+ /* Tuples are equal. Create or update posting */
+ if (ntuples == 0)
+ {
+ /*
+ * We haven't suitable posting list yet, so allocate
+ * it and save both itupprev and current tuple.
+ */
+ ipd = palloc0(maxitemsize);
+
+ memcpy(ipd, itupprev, sizeof(ItemPointerData));
+ ntuples++;
+ memcpy(ipd + ntuples, itup, sizeof(ItemPointerData));
+ ntuples++;
+ }
+ else
+ {
+ if ((ntuples+1)*sizeof(ItemPointerData) < maxpostingsize)
+ {
+ memcpy(ipd + ntuples, itup, sizeof(ItemPointerData));
+ ntuples++;
+ }
+ else
+ {
+ postingtuple = BtreeFormPackedTuple(itupprev, ipd, ntuples);
+ _bt_buildadd(wstate, state, postingtuple);
+ ntuples = 0;
+ pfree(ipd);
+ }
+ }
+
+ }
+ else
+ {
+ /* Tuples are not equal. Insert itupprev into index. */
+ if (ntuples == 0)
+ _bt_buildadd(wstate, state, itupprev);
+ else
+ {
+ postingtuple = BtreeFormPackedTuple(itupprev, ipd, ntuples);
+ _bt_buildadd(wstate, state, postingtuple);
+ ntuples = 0;
+ pfree(ipd);
+ }
+ }
+ }
+
+ /*
+ * Copy the tuple into temp variable itupprev
+ * to compare it with the following tuple
+ * and maybe unite them into a posting tuple
+ */
+ itupprev = CopyIndexTuple(itup);
+ if (should_free)
+ pfree(itup);
+
+ /* compute max size of ipd list */
+ maxpostingsize = maxitemsize - IndexInfoFindDataOffset(itupprev->t_info) - MAXALIGN(IndexTupleSize(itupprev));
+ }
+
+ /* Handle the last item.*/
+ if (ntuples == 0)
+ {
+ if (itupprev != NULL)
+ _bt_buildadd(wstate, state, itupprev);
+ }
+ else
+ {
+ Assert(ipd!=NULL);
+ Assert(itupprev != NULL);
+ postingtuple = BtreeFormPackedTuple(itupprev, ipd, ntuples);
+ _bt_buildadd(wstate, state, postingtuple);
+ ntuples = 0;
+ pfree(ipd);
+ }
}
}
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index c850b48..8c9dda1 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -1821,7 +1821,9 @@ _bt_killitems(IndexScanDesc scan)
ItemId iid = PageGetItemId(page, offnum);
IndexTuple ituple = (IndexTuple) PageGetItem(page, iid);
- if (ItemPointerEquals(&ituple->t_tid, &kitem->heapTid))
+ /* No microvacuum for posting tuples */
+ if (!BtreeTupleIsPosting(ituple)
+ && (ItemPointerEquals(&ituple->t_tid, &kitem->heapTid)))
{
/* found the item */
ItemIdMarkDead(iid);
@@ -2063,3 +2065,69 @@ btoptions(Datum reloptions, bool validate)
{
return default_reloptions(reloptions, validate, RELOPT_KIND_BTREE);
}
+
+/*
+ * Already have basic index tuple that contains key datum
+ */
+IndexTuple
+BtreeFormPackedTuple(IndexTuple tuple, ItemPointerData *data, int nipd)
+{
+ uint32 newsize;
+ IndexTuple itup = CopyIndexTuple(tuple);
+
+ /*
+ * Determine and store offset to the posting list.
+ */
+ newsize = IndexTupleSize(itup);
+ newsize = SHORTALIGN(newsize);
+
+ /*
+ * Set meta info about the posting list.
+ */
+ BtreeSetPostingOffset(itup, newsize);
+ BtreeSetNPosting(itup, nipd);
+ /*
+ * Add space needed for posting list, if any. Then check that the tuple
+ * won't be too big to store.
+ */
+ newsize += sizeof(ItemPointerData)*nipd;
+ newsize = MAXALIGN(newsize);
+
+ /*
+ * Resize tuple if needed
+ */
+ if (newsize != IndexTupleSize(itup))
+ {
+ itup = repalloc(itup, newsize);
+
+ /*
+ * PostgreSQL 9.3 and earlier did not clear this new space, so we
+ * might find uninitialized padding when reading tuples from disk.
+ */
+ memset((char *) itup + IndexTupleSize(itup),
+ 0, newsize - IndexTupleSize(itup));
+ /* set new size in tuple header */
+ itup->t_info &= ~INDEX_SIZE_MASK;
+ itup->t_info |= newsize;
+ }
+
+ /*
+ * Copy data into the posting tuple
+ */
+ memcpy(BtreeGetPosting(itup), data, sizeof(ItemPointerData)*nipd);
+ return itup;
+}
+
+IndexTuple
+BtreeReformPackedTuple(IndexTuple tuple, ItemPointerData *data, int nipd)
+{
+ int size;
+ if (BtreeTupleIsPosting(tuple))
+ {
+ size = BtreeGetPostingOffset(tuple);
+ tuple->t_info &= ~INDEX_SIZE_MASK;
+ tuple->t_info |= size;
+ }
+
+ return BtreeFormPackedTuple(tuple, data, nipd);
+}
diff --git a/src/include/access/itup.h b/src/include/access/itup.h
index 8350fa0..3dd19c0 100644
--- a/src/include/access/itup.h
+++ b/src/include/access/itup.h
@@ -138,7 +138,6 @@ typedef IndexAttributeBitMapData *IndexAttributeBitMap;
((int) ((BLCKSZ - SizeOfPageHeaderData) / \
(MAXALIGN(sizeof(IndexTupleData) + 1) + sizeof(ItemIdData))))
-
/* routines in indextuple.c */
extern IndexTuple index_form_tuple(TupleDesc tupleDescriptor,
Datum *values, bool *isnull);
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 06822fa..16a23b2 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -538,6 +538,8 @@ typedef struct BTScanPosData
* location in the associated tuple storage workspace.
*/
int nextTupleOffset;
+ /* prevTupleOffset is for Posting list handling*/
+ int prevTupleOffset;
/*
* The items array is always ordered in index order (ie, increasing
@@ -550,7 +552,7 @@ typedef struct BTScanPosData
int lastItem; /* last valid index in items[] */
int itemIndex; /* current index in items[] */
- BTScanPosItem items[MaxIndexTuplesPerPage]; /* MUST BE LAST */
+ BTScanPosItem items[MaxPackedIndexTuplesPerPage]; /* MUST BE LAST */
} BTScanPosData;
typedef BTScanPosData *BTScanPos;
@@ -651,6 +653,36 @@ typedef BTScanOpaqueData *BTScanOpaque;
#define SK_BT_DESC (INDOPTION_DESC << SK_BT_INDOPTION_SHIFT)
#define SK_BT_NULLS_FIRST (INDOPTION_NULLS_FIRST << SK_BT_INDOPTION_SHIFT)
+
+/*
+ * We use our own ItemPointerGet(BlockNumber|OffsetNumber)
+ * to avoid Asserts, since sometimes the ip_posid isn't "valid"
+ */
+#define BtreeItemPointerGetBlockNumber(pointer) \
+ BlockIdGetBlockNumber(&(pointer)->ip_blkid)
+
+#define BtreeItemPointerGetOffsetNumber(pointer) \
+ ((pointer)->ip_posid)
+
+#define BT_POSTING (1<<31)
+#define BtreeGetNPosting(itup) BtreeItemPointerGetOffsetNumber(&(itup)->t_tid)
+#define BtreeSetNPosting(itup,n) ItemPointerSetOffsetNumber(&(itup)->t_tid,n)
+
+#define BtreeGetPostingOffset(itup) (BtreeItemPointerGetBlockNumber(&(itup)->t_tid) & (~BT_POSTING))
+#define BtreeSetPostingOffset(itup,n) ItemPointerSetBlockNumber(&(itup)->t_tid,(n)|BT_POSTING)
+#define BtreeTupleIsPosting(itup) (BtreeItemPointerGetBlockNumber(&(itup)->t_tid) & BT_POSTING)
+#define BtreeGetPosting(itup) (ItemPointerData*) ((char*)(itup) + BtreeGetPostingOffset(itup))
+#define BtreeGetPostingN(itup,n) (ItemPointerData*) (BtreeGetPosting(itup) + n)
+
+/*
+ * If compression is applied, the page could contain more tuples
+ * than if it has only uncompressed tuples, so we need new max value.
+ * Note that it is a rough upper estimate.
+ */
+#define MaxPackedIndexTuplesPerPage \
+ ((int) ((BLCKSZ - SizeOfPageHeaderData) / \
+ (sizeof(ItemPointerData))))
+
/*
* prototypes for functions in nbtree.c (external entry points for btree)
*/
@@ -684,6 +716,9 @@ extern bool _bt_doinsert(Relation rel, IndexTuple itup,
IndexUniqueCheck checkUnique, Relation heapRel);
extern Buffer _bt_getstackbuf(Relation rel, BTStack stack, int access);
extern void _bt_finish_split(Relation rel, Buffer bbuf, BTStack stack);
+extern void _bt_pgupdtup(Relation rel, Page page, OffsetNumber offset, IndexTuple itup,
+ bool concat, IndexTuple olditup, int nipd);
+extern bool _bt_isbinaryequal(TupleDesc itupdesc, IndexTuple itup, int nindatts, IndexTuple ituptoinsert);
/*
* prototypes for functions in nbtpage.c
@@ -715,8 +750,8 @@ extern BTStack _bt_search(Relation rel,
extern Buffer _bt_moveright(Relation rel, Buffer buf, int keysz,
ScanKey scankey, bool nextkey, bool forupdate, BTStack stack,
int access);
-extern OffsetNumber _bt_binsrch(Relation rel, Buffer buf, int keysz,
- ScanKey scankey, bool nextkey);
+extern OffsetNumber _bt_binsrch( Relation rel, Buffer buf, int keysz,
+ ScanKey scankey, bool nextkey);
extern int32 _bt_compare(Relation rel, int keysz, ScanKey scankey,
Page page, OffsetNumber offnum);
extern bool _bt_first(IndexScanDesc scan, ScanDirection dir);
@@ -747,6 +782,8 @@ extern void _bt_end_vacuum_callback(int code, Datum arg);
extern Size BTreeShmemSize(void);
extern void BTreeShmemInit(void);
extern bytea *btoptions(Datum reloptions, bool validate);
+extern IndexTuple BtreeFormPackedTuple(IndexTuple tuple, ItemPointerData *data, int nipd);
+extern IndexTuple BtreeReformPackedTuple(IndexTuple tuple, ItemPointerData *data, int nipd);
/*
* prototypes for functions in nbtvalidate.c
18.02.2016 20:18, Anastasia Lubennikova:
04.02.2016 20:16, Peter Geoghegan:
On Fri, Jan 29, 2016 at 8:50 AM, Anastasia Lubennikova
<a.lubennikova@postgrespro.ru> wrote:
I fixed it in the new version (attached).
Thank you for the review.
At last, there is a new patch version 3.0. After some refactoring it
looks much better.
I described all the details of the compression in this document
https://goo.gl/50O8Q0 (the same text without pictures is attached in
btc_readme_1.0.txt).
Consider it a rough draft of the README. It contains some notes about
tricky points of the implementation and questions about future work.
Please don't hesitate to comment on it.
Sorry, the previous patch was dirty. A hotfix is attached.
--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachments:
btc_readme_1.0.patch (text/x-patch)
Compression. Strictly speaking, it’s not actually compression, just a more efficient layout of ItemPointers on an index page.
compressed tuple = IndexTuple (with metadata in TID field + key) + PostingList
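To make the "metadata in TID field" idea concrete, here is a minimal standalone sketch of the encoding used by the BtreeSetPostingOffset / BtreeSetNPosting macros in the attached patch: the high bit of the block number flags a posting tuple, the remaining bits hold the byte offset of the posting list, and the offset-number field holds the number of TIDs. The DemoTid type and the demo_* helpers are hypothetical stand-ins written for illustration; they are not PostgreSQL definitions.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for ItemPointerData: block number + offset number. */
typedef struct DemoTid
{
    uint32_t block;   /* posting tuple: flag bit | byte offset of posting list */
    uint16_t offset;  /* posting tuple: number of TIDs in the posting list */
} DemoTid;

/* High bit of the block number marks a posting tuple (unsigned, to avoid
 * shifting into the sign bit of an int). */
#define DEMO_POSTING_FLAG (UINT32_C(1) << 31)

/* Store posting-list metadata in the TID field of a tuple header. */
static void
demo_set_posting(DemoTid *tid, uint32_t posting_offset, uint16_t nposting)
{
    /* The offset must not collide with the flag bit. */
    assert((posting_offset & DEMO_POSTING_FLAG) == 0);
    tid->block = posting_offset | DEMO_POSTING_FLAG;
    tid->offset = nposting;
}

/* Nonzero if the TID field carries posting metadata rather than a heap TID. */
static int
demo_is_posting(const DemoTid *tid)
{
    return (tid->block & DEMO_POSTING_FLAG) != 0;
}

/* Byte offset of the posting list from the start of the tuple. */
static uint32_t
demo_posting_offset(const DemoTid *tid)
{
    return tid->block & ~DEMO_POSTING_FLAG;
}

/* Number of TIDs stored in the posting list. */
static uint16_t
demo_nposting(const DemoTid *tid)
{
    return tid->offset;
}
```

A regular tuple keeps a real heap TID in this field, so its flag bit is clear as long as heap block numbers stay below 2^31; that limit is an assumption of this encoding, not something the sketch enforces.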
1. The GIN index works extremely well for really large sets of repeating keys, but on the other hand, it completely fails to handle unique keys. For btree it is essential to have good performance and concurrency in all corner cases, with any number of duplicates. That’s why we can’t just copy the GIN implementation of item pointer compression. The first difference is that the btree algorithm performs compression (or, in other words, changes the index tuple layout) only if there is more than one tuple with the same key. This lets us avoid the overhead of storing useless metadata for mostly distinct keys (see picture below). It seems that compression could be useful for unique indexes under heavy write/update load (because of MVCC copies), but I’m not sure whether this use case really exists. Those tuples should be deleted by microvacuum as soon as possible. Anyway, I think it’s worth adding a storage parameter for btree that enables or disables compression for each particular index, and setting compression of unique indexes to off by default. System indexes do not support compression for several reasons. First of all, because of the WIP state of the patch (debugging the system catalog isn’t much fun). The next reason is that I know of many places in the code where hardcoded values or non-obvious syscache routines are used, and I do not feel brave enough to change this code. And last but not least, I don’t see good reasons to do that.
2. If the index key is very small (smaller than the metadata) and the number of duplicates is small, compression could lead to index bloat instead of a decrease in index size (see picture below). I’m not sure whether it’s worth handling this case separately, because it’s really rare and I consider it the DBA’s job to disable compression on such indexes. But if you see any clear way to do this, it would be great.
3. For GIN indexes, if a posting list is too large, a posting tree is created. This design rests on the following assumptions:
Indexed keys are never deleted. This makes all tree algorithms much easier.
There are always many duplicates. Otherwise, GIN becomes really inefficient.
The rate of concurrent insertions is low. In order to add a new entry into a posting tree, we hold a lock on its root, so only one backend at a time can perform an insertion.
In btree we can’t afford these assumptions, so we should handle big posting lists in another way. If there are too many ItemPointers to fit into a single posting list, we just create another one. The overhead of this approach is that we have to store a duplicate of the key and the metadata. This leads to the problem of big keys: if the key size is close to BTMaxItemSize, compression will give us a really small benefit, if any at all (see picture below).
4. The more item pointers fit into a single posting list, the more rarely we have to split it and repeat the key. Therefore, the bigger BTMaxItemSize is, the better. The comment in nbtree.h says: “We actually need to be able to fit three items on every page, so restrict any one item to 1/3 the per-page available space.” That is quite right for regular items, but a compressed index tuple already contains more than one item. Taking this into account, we can assert that BTMaxItemSize ~ ⅓ pagesize for regular items, and ~ ½ pagesize for compressed items. Are there any objections? I wonder if we can increase BTMaxItemSize under some other assumption? The problem I see here is that a varlena high key could be as big as the compressed tuple.
5. CREATE INDEX. _bt_load. The btree build algorithm is as follows: do the heap scan, add tuples into the spool, sort the data, insert the ordered data from the spool into leaf index pages (_bt_load), then build inner pages and the root. The main changes are applied to the _bt_load function. While loading tuples, we do not insert them one by one. Instead, we compare each tuple with the previous one, and if they are equal we put them into a posting list. Once the posting list has reached the maximum size that fits into an index tuple (the maximum posting size is computed as BTMaxItemSize - size of the regular index tuple), or if the following tuple is not equal to the previous one, we create a packed tuple from the posting list (if any) using BtreeFormPackedTuple and insert it into a page. We do the same if there are no more elements in the spool.
6. The high key is not real data, but just an upper bound of the keys that are allowed on the page, so there’s no need to compress it. While copying a posting tuple into a high key, we should get rid of the posting list. A posting tuple should be truncated to the length of a regular tuple, and the metadata in its TID field should be set to appropriate values. It’s worth mentioning a very specific point in _bt_buildadd(). If the current page is full (there is no room for a new tuple), we copy the last item on the page into the new page, and then rearrange the old page so that the 'last item' becomes its high key rather than a true data item. If the last tuple was compressed, we can truncate it before setting it as a high key. But if it had a big posting list, there will be plenty of free space on the original page, so we must split the posting tuple into two pieces. See the picture below and the comments in the code. I’m not sure about the correctness of locking here, but I assume that there are no possible concurrent operations while building an index. Is that right?
7. Another difference between GIN and btree is that item pointers in a GIN posting list/tree are always ordered, while btree doesn’t strictly require this. If there are many duplicates in btree, we don’t bother to find the ideal place to keep TIDs ordered. The insertion has a choice whether or not to move right. Currently, we just try to find a page where there is room for the new key. The next TODO item is to keep item pointers in the posting list ordered. The advantage is that the best compression of a posting list can be achieved on sorted TIDs. What do you think about it?
8. Insertion. After we have found a suitable place for insertion, check whether the previous item has the same key. If so, and if there is enough room on the page to add a pointer, we can add it to that item. There are two possible cases. If the old item is a regular tuple, we should form a new compressed tuple. Note that this case requires enough space for two TIDs (metadata and the new TID). Otherwise, we just add the pointer to the existing posting list. Then we delete the old tuple and insert the new one.
9. Search. Fortunately, it’s quite easy to change the search algorithm. If a compressed tuple is found, just go over all its TIDs and return them. If an index-only scan is being processed, just return the same tuple N times in a row. To avoid storing duplicates in the currTuples array, save the key once and then connect it with the posting TIDs using tupleOffset. It’s clear that if compression is applied, a page could contain more tuples than if it had only uncompressed tuples. That is why MaxPackedIndexTuplesPerPage appears. The items array in BTScanPos (which is what actually refers to currTuples through tupleOffset) is preallocated with length = MaxPackedIndexTuplesPerPage, because we must be sure that all items will fit into the array.
10. Split. The only change in this section is truncation of the posting list before inserting the tuple as a high key.
11. Vacuum. Check all TIDs in a posting list. If there are no live items in the compressed tuple, delete the tuple. Otherwise do the following: form a new posting tuple that contains the remaining item pointers; delete the "old" posting tuple; insert the new posting tuple back into the page. Microvacuum of compressed tuples is not implemented yet. It’s possible to use the high bit of the offset field of an item pointer to flag killed items, but that requires additional performance testing.
12. Locking. Compressed index tuples use the same insertion and deletion functions as regular index tuples. Most of the operations are performed inside standard functions and don’t need any specific locks, although this issue definitely requires more thorough testing and review. All the operations where a posting tuple is updated in place (deleted and then inserted again with a new set of item pointers in the posting list) are performed with a special function, _bt_pgupdtup(). The same function is used for operations where we want to replace one tuple with another, e.g. in btvacuumpage() and _bt_buildadd() (see the issue related to the high key).
13. Xlog. TODO.
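
The grouping step described in item 5 (and the "just create another posting list" rule from item 3) can be sketched in isolation. This is a simplified illustration under stated assumptions, not the patch's code: keys are plain ints instead of index tuples, equality is ==, and the posting-list capacity is passed in as max_posting rather than derived from BTMaxItemSize. DemoGroup and demo_group_sorted are hypothetical names. The patch itself buffers the previous tuple and flushes lazily; the sketch extends the last group eagerly, which should produce the same grouping for sorted input.

```c
#include <stddef.h>

/* One output tuple: a key plus the run of input items packed under it. */
typedef struct DemoGroup
{
    int    key;    /* the shared key value */
    size_t first;  /* index of the first input item in this group */
    size_t count;  /* number of items (TIDs) packed into this group */
} DemoGroup;

/*
 * Walk keys[], which must already be sorted, and emit one DemoGroup per
 * output tuple. An item joins the previous group only if the key matches
 * and the group still has room; otherwise a new group is started, which
 * repeats the key when a posting list is full.
 *
 * out[] must have room for n entries. Returns the number of groups written.
 */
static size_t
demo_group_sorted(const int *keys, size_t n, size_t max_posting,
                  DemoGroup *out)
{
    size_t ngroups = 0;
    size_t i;

    for (i = 0; i < n; i++)
    {
        if (ngroups > 0 &&
            out[ngroups - 1].key == keys[i] &&
            out[ngroups - 1].count < max_posting)
        {
            /* Same key and room left: extend the current posting list. */
            out[ngroups - 1].count++;
        }
        else
        {
            /* New key, or the posting list is full: start a new tuple. */
            out[ngroups].key = keys[i];
            out[ngroups].first = i;
            out[ngroups].count = 1;
            ngroups++;
        }
    }

    return ngroups;
}
```

With keys {1, 1, 1, 2, 3, 3} and max_posting = 2, the run of three 1s is split into a full group of two and a second group of one, so the key 1 is stored twice. This is the duplicate-key overhead that item 3 mentions.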
btree_compression_3.1.patch (text/x-patch)
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index e3c55eb..d6922d5 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -24,6 +24,8 @@
#include "storage/predicate.h"
#include "utils/tqual.h"
+#include "catalog/catalog.h"
+#include "utils/datum.h"
typedef struct
{
@@ -82,6 +84,7 @@ static bool _bt_pgaddtup(Page page, Size itemsize, IndexTuple itup,
OffsetNumber itup_off);
static bool _bt_isequal(TupleDesc itupdesc, Page page, OffsetNumber offnum,
int keysz, ScanKey scankey);
+
static void _bt_vacuum_one_page(Relation rel, Buffer buffer, Relation heapRel);
@@ -113,6 +116,11 @@ _bt_doinsert(Relation rel, IndexTuple itup,
BTStack stack;
Buffer buf;
OffsetNumber offset;
+ Page page;
+ TupleDesc itupdesc;
+ int nipd;
+ IndexTuple olditup;
+ Size sizetoadd;
/* we need an insertion scan key to do our search, so build one */
itup_scankey = _bt_mkscankey(rel, itup);
@@ -190,6 +198,7 @@ top:
if (checkUnique != UNIQUE_CHECK_EXISTING)
{
+ bool updposting = false;
/*
* The only conflict predicate locking cares about for indexes is when
* an index tuple insert conflicts with an existing lock. Since the
@@ -201,7 +210,45 @@ top:
/* do the insertion */
_bt_findinsertloc(rel, &buf, &offset, natts, itup_scankey, itup,
stack, heapRel);
- _bt_insertonpg(rel, buf, InvalidBuffer, stack, itup, offset, false);
+
+ /*
+ * Decide, whether we can apply compression
+ */
+ page = BufferGetPage(buf);
+
+ if(!IsSystemRelation(rel)
+ && !rel->rd_index->indisunique
+ && offset != InvalidOffsetNumber
+ && offset <= PageGetMaxOffsetNumber(page))
+ {
+ itupdesc = RelationGetDescr(rel);
+ sizetoadd = sizeof(ItemPointerData);
+ olditup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offset));
+
+ if(_bt_isbinaryequal(itupdesc, olditup,
+ rel->rd_index->indnatts, itup))
+ {
+ if (!BtreeTupleIsPosting(olditup))
+ {
+ nipd = 1;
+ sizetoadd = sizetoadd*2;
+ }
+ else
+ nipd = BtreeGetNPosting(olditup);
+
+ if ((IndexTupleSize(olditup) + sizetoadd) <= BTMaxItemSize(page)
+ && PageGetFreeSpace(page) > sizetoadd)
+ updposting = true;
+ }
+ }
+
+ if (updposting)
+ {
+ _bt_pgupdtup(rel, page, offset, itup, true, olditup, nipd);
+ _bt_relbuf(rel, buf);
+ }
+ else
+ _bt_insertonpg(rel, buf, InvalidBuffer, stack, itup, offset, false);
}
else
{
@@ -1042,6 +1089,7 @@ _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
itemid = PageGetItemId(origpage, P_HIKEY);
itemsz = ItemIdGetLength(itemid);
item = (IndexTuple) PageGetItem(origpage, itemid);
+
if (PageAddItem(rightpage, (Item) item, itemsz, rightoff,
false, false) == InvalidOffsetNumber)
{
@@ -1072,13 +1120,39 @@ _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
itemsz = ItemIdGetLength(itemid);
item = (IndexTuple) PageGetItem(origpage, itemid);
}
- if (PageAddItem(leftpage, (Item) item, itemsz, leftoff,
+
+ if (BtreeTupleIsPosting(item))
+ {
+ Size hikeysize = BtreeGetPostingOffset(item);
+ IndexTuple hikey = palloc0(hikeysize);
+
+ /* Truncate posting before insert it as a hikey. */
+ memcpy (hikey, item, hikeysize);
+ hikey->t_info &= ~INDEX_SIZE_MASK;
+ hikey->t_info |= hikeysize;
+ ItemPointerSet(&(hikey->t_tid), origpagenumber, P_HIKEY);
+
+ if (PageAddItem(leftpage, (Item) hikey, hikeysize, leftoff,
false, false) == InvalidOffsetNumber)
+ {
+ memset(rightpage, 0, BufferGetPageSize(rbuf));
+ elog(ERROR, "failed to add hikey to the left sibling"
+ " while splitting block %u of index \"%s\"",
+ origpagenumber, RelationGetRelationName(rel));
+ }
+
+ pfree(hikey);
+ }
+ else
{
- memset(rightpage, 0, BufferGetPageSize(rbuf));
- elog(ERROR, "failed to add hikey to the left sibling"
- " while splitting block %u of index \"%s\"",
- origpagenumber, RelationGetRelationName(rel));
+ if (PageAddItem(leftpage, (Item) item, itemsz, leftoff,
+ false, false) == InvalidOffsetNumber)
+ {
+ memset(rightpage, 0, BufferGetPageSize(rbuf));
+ elog(ERROR, "failed to add hikey to the left sibling"
+ " while splitting block %u of index \"%s\"",
+ origpagenumber, RelationGetRelationName(rel));
+ }
}
leftoff = OffsetNumberNext(leftoff);
@@ -2103,6 +2177,76 @@ _bt_pgaddtup(Page page,
}
/*
+ * _bt_pgupdtup() -- update a tuple in place.
+ * This function is used for purposes of deduplication of item pointers.
+ * If new tuple to insert is equal to the tuple that already exists on the page,
+ * we can avoid key insertion and just add new item pointer.
+ *
+ * offset is the position of olditup on the page.
+ * itup is the new tuple to insert
+ * concat - this flag shows, whether we should add new item to existing one
+ * or just replace old tuple with the new value. If concat is false, the
+ * following fields are senseless.
+ * nipd is the number of item pointers in old tuple.
+ * The caller is responsible for checking of free space on the page.
+ */
+void
+_bt_pgupdtup(Relation rel, Page page, OffsetNumber offset, IndexTuple itup,
+ bool concat, IndexTuple olditup, int nipd)
+{
+ ItemPointerData *ipd;
+ IndexTuple newitup;
+ Size newitupsz;
+
+ if (concat)
+ {
+ ipd = palloc0(sizeof(ItemPointerData)*(nipd + 1));
+
+ /* copy item pointers from old tuple into ipd */
+ if (BtreeTupleIsPosting(olditup))
+ memcpy(ipd, BtreeGetPosting(olditup), sizeof(ItemPointerData)*nipd);
+ else
+ memcpy(ipd, olditup, sizeof(ItemPointerData));
+
+ /* add item pointer of the new tuple into ipd */
+ memcpy(ipd+nipd, itup, sizeof(ItemPointerData));
+
+ newitup = BtreeReformPackedTuple(itup, ipd, nipd+1);
+
+ /*
+ * Update the tuple in place. We have already checked that the
+ * new tuple would fit into this page, so it's safe to delete
+ * old tuple and insert the new one without any side effects.
+ */
+ newitupsz = IndexTupleDSize(*newitup);
+ newitupsz = MAXALIGN(newitupsz);
+ }
+ else
+ {
+ newitup = itup;
+ newitupsz = IndexTupleSize(itup);
+ }
+
+ START_CRIT_SECTION();
+
+ PageIndexTupleDelete(page, offset);
+
+ if (!_bt_pgaddtup(page, newitupsz, newitup, offset))
+ elog(ERROR, "failed to insert compressed item in index \"%s\"",
+ RelationGetRelationName(rel));
+
+ //TODO add Xlog stuff
+
+ END_CRIT_SECTION();
+
+ if (concat)
+ {
+ pfree(ipd);
+ pfree(newitup);
+ }
+}
+
+/*
* _bt_isequal - used in _bt_doinsert in check for duplicates.
*
* This is very similar to _bt_compare, except for NULL handling.
@@ -2151,6 +2295,63 @@ _bt_isequal(TupleDesc itupdesc, Page page, OffsetNumber offnum,
}
/*
+ * _bt_isbinaryequal - used in _bt_doinsert and _bt_load
+ * in check for duplicates. This is very similar to heap_tuple_attr_equals
+ * subroutine. And this function differs from _bt_isequal
+ * because here we require strict binary equality of tuples.
+ */
+bool
+_bt_isbinaryequal(TupleDesc itupdesc, IndexTuple itup,
+ int nindatts, IndexTuple ituptoinsert)
+{
+ AttrNumber attno;
+
+ for (attno = 1; attno <= nindatts; attno++)
+ {
+ Datum datum1,
+ datum2;
+ bool isnull1,
+ isnull2;
+ Form_pg_attribute att;
+
+ datum1 = index_getattr(itup, attno, itupdesc, &isnull1);
+ datum2 = index_getattr(ituptoinsert, attno, itupdesc, &isnull2);
+
+ /*
+ * If one value is NULL and other is not, then they are certainly not
+ * equal
+ */
+ if (isnull1 != isnull2)
+ return false;
+ /*
+ * We do simple binary comparison of the two datums. This may be overly
+ * strict because there can be multiple binary representations for the
+ * same logical value. But we should be OK as long as there are no false
+ * positives. Using a type-specific equality operator is messy because
+ * there could be multiple notions of equality in different operator
+ * classes; furthermore, we cannot safely invoke user-defined functions
+ * while holding exclusive buffer lock.
+ */
+ if (attno <= 0)
+ {
+ /* The only allowed system columns are OIDs, so do this */
+ if (DatumGetObjectId(datum1) != DatumGetObjectId(datum2))
+ return false;
+ }
+ else
+ {
+ Assert(attno <= itupdesc->natts);
+ att = itupdesc->attrs[attno - 1];
+ if(!datumIsEqual(datum1, datum2, att->attbyval, att->attlen))
+ return false;
+ }
+ }
+
+ /* if we get here, the keys are equal */
+ return true;
+}
+
+/*
* _bt_vacuum_one_page - vacuum just one index page.
*
* Try to remove LP_DEAD items from the given page. The passed buffer
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index f2905cb..a08c500 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -74,7 +74,8 @@ static void btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
BTCycleId cycleid);
static void btvacuumpage(BTVacState *vstate, BlockNumber blkno,
BlockNumber orig_blkno);
-
+static ItemPointer btreevacuumPosting(BTVacState *vstate,
+ ItemPointerData *items,int nitem, int *nremaining);
/*
* Btree handler function: return IndexAmRoutine with access method parameters
@@ -962,6 +963,7 @@ restart:
OffsetNumber offnum,
minoff,
maxoff;
+ IndexTuple remaining;
/*
* Trade in the initial read lock for a super-exclusive write lock on
@@ -1011,31 +1013,58 @@ restart:
itup = (IndexTuple) PageGetItem(page,
PageGetItemId(page, offnum));
- htup = &(itup->t_tid);
-
- /*
- * During Hot Standby we currently assume that
- * XLOG_BTREE_VACUUM records do not produce conflicts. That is
- * only true as long as the callback function depends only
- * upon whether the index tuple refers to heap tuples removed
- * in the initial heap scan. When vacuum starts it derives a
- * value of OldestXmin. Backends taking later snapshots could
- * have a RecentGlobalXmin with a later xid than the vacuum's
- * OldestXmin, so it is possible that row versions deleted
- * after OldestXmin could be marked as killed by other
- * backends. The callback function *could* look at the index
- * tuple state in isolation and decide to delete the index
- * tuple, though currently it does not. If it ever did, we
- * would need to reconsider whether XLOG_BTREE_VACUUM records
- * should cause conflicts. If they did cause conflicts they
- * would be fairly harsh conflicts, since we haven't yet
- * worked out a way to pass a useful value for
- * latestRemovedXid on the XLOG_BTREE_VACUUM records. This
- * applies to *any* type of index that marks index tuples as
- * killed.
- */
- if (callback(htup, callback_state))
- deletable[ndeletable++] = offnum;
+ if(BtreeTupleIsPosting(itup))
+ {
+ ItemPointer newipd;
+ int nipd,
+ nnewipd;
+
+ nipd = BtreeGetNPosting(itup);
+ newipd = btreevacuumPosting(vstate, BtreeGetPosting(itup), nipd, &nnewipd);
+
+ if (newipd != NULL)
+ {
+ if (nnewipd > 0)
+ {
+ /* There are still some live tuples in the posting.
+ * 1) form new posting tuple, that contains remaining ipds
+ * 2) delete "old" posting and insert new posting back to the page
+ */
+ remaining = BtreeReformPackedTuple(itup, newipd, nnewipd);
+ _bt_pgupdtup(info->index, page, offnum, remaining, false, NULL, 0);
+ }
+ else
+ deletable[ndeletable++] = offnum;
+ }
+ }
+ else
+ {
+ htup = &(itup->t_tid);
+
+ /*
+ * During Hot Standby we currently assume that
+ * XLOG_BTREE_VACUUM records do not produce conflicts. That is
+ * only true as long as the callback function depends only
+ * upon whether the index tuple refers to heap tuples removed
+ * in the initial heap scan. When vacuum starts it derives a
+ * value of OldestXmin. Backends taking later snapshots could
+ * have a RecentGlobalXmin with a later xid than the vacuum's
+ * OldestXmin, so it is possible that row versions deleted
+ * after OldestXmin could be marked as killed by other
+ * backends. The callback function *could* look at the index
+ * tuple state in isolation and decide to delete the index
+ * tuple, though currently it does not. If it ever did, we
+ * would need to reconsider whether XLOG_BTREE_VACUUM records
+ * should cause conflicts. If they did cause conflicts they
+ * would be fairly harsh conflicts, since we haven't yet
+ * worked out a way to pass a useful value for
+ * latestRemovedXid on the XLOG_BTREE_VACUUM records. This
+ * applies to *any* type of index that marks index tuples as
+ * killed.
+ */
+ if (callback(htup, callback_state))
+ deletable[ndeletable++] = offnum;
+ }
}
}
@@ -1160,3 +1189,50 @@ btcanreturn(Relation index, int attno)
{
return true;
}
+
+/*
+ * btreevacuumPosting() -- vacuums a posting list.
+ * The size of the list must be specified via number of items (nitem).
+ *
+ * If none of the items need to be removed, returns NULL. Otherwise returns
+ * a new palloc'd array with the remaining items. The number of remaining
+ * items is returned via nremaining.
+ */
+ItemPointer
+btreevacuumPosting(BTVacState *vstate, ItemPointerData *items,
+ int nitem, int *nremaining)
+{
+ int i,
+ remaining = 0;
+ ItemPointer tmpitems = NULL;
+ IndexBulkDeleteCallback callback = vstate->callback;
+ void *callback_state = vstate->callback_state;
+
+ /*
+ * Iterate over TIDs array
+ */
+ for (i = 0; i < nitem; i++)
+ {
+ if (callback(items + i, callback_state))
+ {
+ if (!tmpitems)
+ {
+ /*
+ * First TID to be deleted: allocate memory to hold the
+ * remaining items.
+ */
+ tmpitems = palloc(sizeof(ItemPointerData) * nitem);
+ memcpy(tmpitems, items, sizeof(ItemPointerData) * i);
+ }
+ }
+ else
+ {
+ if (tmpitems)
+ tmpitems[remaining] = items[i];
+ remaining++;
+ }
+ }
+
+ *nremaining = remaining;
+ return tmpitems;
+}
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 3db32e8..301c019 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -29,6 +29,8 @@ static bool _bt_readpage(IndexScanDesc scan, ScanDirection dir,
OffsetNumber offnum);
static void _bt_saveitem(BTScanOpaque so, int itemIndex,
OffsetNumber offnum, IndexTuple itup);
+static void _bt_savePostingitem(BTScanOpaque so, int itemIndex,
+ OffsetNumber offnum, ItemPointer iptr, IndexTuple itup, int i);
static bool _bt_steppage(IndexScanDesc scan, ScanDirection dir);
static Buffer _bt_walk_left(Relation rel, Buffer buf);
static bool _bt_endpoint(IndexScanDesc scan, ScanDirection dir);
@@ -1161,6 +1163,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
int itemIndex;
IndexTuple itup;
bool continuescan;
+ int i;
/*
* We must have the buffer pinned and locked, but the usual macro can't be
@@ -1195,6 +1198,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
/* initialize tuple workspace to empty */
so->currPos.nextTupleOffset = 0;
+ so->currPos.prevTupleOffset = 0;
/*
* Now that the current page has been made consistent, the macro should be
@@ -1215,8 +1219,19 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
if (itup != NULL)
{
/* tuple passes all scan key conditions, so remember it */
- _bt_saveitem(so, itemIndex, offnum, itup);
- itemIndex++;
+ if (BtreeTupleIsPosting(itup))
+ {
+ for (i = 0; i < BtreeGetNPosting(itup); i++)
+ {
+ _bt_savePostingitem(so, itemIndex, offnum, BtreeGetPostingN(itup, i), itup, i);
+ itemIndex++;
+ }
+ }
+ else
+ {
+ _bt_saveitem(so, itemIndex, offnum, itup);
+ itemIndex++;
+ }
}
if (!continuescan)
{
@@ -1228,7 +1243,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
offnum = OffsetNumberNext(offnum);
}
- Assert(itemIndex <= MaxIndexTuplesPerPage);
+ Assert(itemIndex <= MaxPackedIndexTuplesPerPage);
so->currPos.firstItem = 0;
so->currPos.lastItem = itemIndex - 1;
so->currPos.itemIndex = 0;
@@ -1236,7 +1251,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
else
{
/* load items[] in descending order */
- itemIndex = MaxIndexTuplesPerPage;
+ itemIndex = MaxPackedIndexTuplesPerPage;
offnum = Min(offnum, maxoff);
@@ -1246,8 +1261,20 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
if (itup != NULL)
{
/* tuple passes all scan key conditions, so remember it */
- itemIndex--;
- _bt_saveitem(so, itemIndex, offnum, itup);
+ if (BtreeTupleIsPosting(itup))
+ {
+ for (i = 0; i < BtreeGetNPosting(itup); i++)
+ {
+ itemIndex--;
+ _bt_savePostingitem(so, itemIndex, offnum, BtreeGetPostingN(itup, i), itup, i);
+ }
+ }
+ else
+ {
+ itemIndex--;
+ _bt_saveitem(so, itemIndex, offnum, itup);
+ }
+
}
if (!continuescan)
{
@@ -1261,8 +1288,8 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
Assert(itemIndex >= 0);
so->currPos.firstItem = itemIndex;
- so->currPos.lastItem = MaxIndexTuplesPerPage - 1;
- so->currPos.itemIndex = MaxIndexTuplesPerPage - 1;
+ so->currPos.lastItem = MaxPackedIndexTuplesPerPage - 1;
+ so->currPos.itemIndex = MaxPackedIndexTuplesPerPage - 1;
}
return (so->currPos.firstItem <= so->currPos.lastItem);
@@ -1288,6 +1315,37 @@ _bt_saveitem(BTScanOpaque so, int itemIndex,
}
/*
+ * Save an index item into so->currPos.items[itemIndex]
+ * For an index-only scan, the first TID of a posting list is handled separately:
+ * the key is saved once, and the other posting TIDs refer to it via tupleOffset.
+ */
+static void
+_bt_savePostingitem(BTScanOpaque so, int itemIndex,
+ OffsetNumber offnum, ItemPointer iptr, IndexTuple itup, int i)
+{
+ BTScanPosItem *currItem = &so->currPos.items[itemIndex];
+
+ currItem->heapTid = *iptr;
+ currItem->indexOffset = offnum;
+
+ if (so->currTuples)
+ {
+ if (i == 0)
+ {
+ /* save key. the same for all tuples in the posting */
+ Size itupsz = BtreeGetPostingOffset(itup);
+ currItem->tupleOffset = so->currPos.nextTupleOffset;
+ memcpy(so->currTuples + so->currPos.nextTupleOffset, itup, itupsz);
+ so->currPos.nextTupleOffset += MAXALIGN(itupsz);
+ so->currPos.prevTupleOffset = currItem->tupleOffset;
+ }
+ else
+ currItem->tupleOffset = so->currPos.prevTupleOffset;
+ }
+}
+
+
+/*
* _bt_steppage() -- Step to next page containing valid data for scan
*
* On entry, if so->currPos.buf is valid the buffer is pinned but not locked;
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 99a014e..e46930b 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -75,7 +75,7 @@
#include "utils/rel.h"
#include "utils/sortsupport.h"
#include "utils/tuplesort.h"
-
+#include "catalog/catalog.h"
/*
* Status record for spooling/sorting phase. (Note we may have two of
@@ -136,6 +136,9 @@ static void _bt_sortaddtup(Page page, Size itemsize,
static void _bt_buildadd(BTWriteState *wstate, BTPageState *state,
IndexTuple itup);
static void _bt_uppershutdown(BTWriteState *wstate, BTPageState *state);
+static SortSupport _bt_prepare_SortSupport(BTWriteState *wstate, int keysz);
+static int _bt_call_comparator(SortSupport sortKeys, int i,
+ IndexTuple itup, IndexTuple itup2, TupleDesc tupdes);
static void _bt_load(BTWriteState *wstate,
BTSpool *btspool, BTSpool *btspool2);
@@ -527,15 +530,120 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
Assert(last_off > P_FIRSTKEY);
ii = PageGetItemId(opage, last_off);
oitup = (IndexTuple) PageGetItem(opage, ii);
- _bt_sortaddtup(npage, ItemIdGetLength(ii), oitup, P_FIRSTKEY);
/*
- * Move 'last' into the high key position on opage
+ * If the item is a posting tuple, we can truncate it: the high key
+ * is not real data, so it does not need to keep any
+ * ItemPointerData at all, let alone a whole list of
+ * TIDs.
+ * However, if the tuple had a big posting list, truncating it leaves
+ * plenty of free space on opage. In that case we must split the posting
+ * tuple into two pieces.
*/
- hii = PageGetItemId(opage, P_HIKEY);
- *hii = *ii;
- ItemIdSetUnused(ii); /* redundant */
- ((PageHeader) opage)->pd_lower -= sizeof(ItemIdData);
+ if (BtreeTupleIsPosting(oitup))
+ {
+ IndexTuple keytup;
+ Size keytupsz;
+ int nipd,
+ ntocut,
+ ntoleave;
+
+ nipd = BtreeGetNPosting(oitup);
+ ntocut = (sizeof(ItemIdData) + BtreeGetPostingOffset(oitup))/sizeof(ItemPointerData);
+ ntocut++; /* round up to be sure that we cut enough */
+ ntoleave = nipd - ntocut;
+
+ /*
+ * 0) Form key tuple, that doesn't contain any ipd.
+ * NOTE: key tuple will have blkno & offset suitable for P_HIKEY.
+ * any function that uses keytup should handle them itself.
+ */
+ keytupsz = BtreeGetPostingOffset(oitup);
+ keytup = palloc0(keytupsz);
+ memcpy (keytup, oitup, keytupsz);
+ keytup->t_info &= ~INDEX_SIZE_MASK;
+ keytup->t_info |= keytupsz;
+ ItemPointerSet(&(keytup->t_tid), oblkno, P_HIKEY);
+
+ if (ntocut < nipd)
+ {
+ ItemPointerData *newipd;
+ IndexTuple newitup,
+ newlasttup;
+ /*
+ * 1) Cut part of old tuple to shift to npage.
+ * And insert it as P_FIRSTKEY.
+ * This tuple is based on keytup.
+ * Blkno & offnum are reset in BtreeFormPackedTuple.
+ */
+ newipd = palloc0(sizeof(ItemPointerData)*ntocut);
+ /* Note, that we cut last 'ntocut' items */
+ memcpy(newipd, BtreeGetPosting(oitup)+ntoleave, sizeof(ItemPointerData)*ntocut);
+ newitup = BtreeFormPackedTuple(keytup, newipd, ntocut);
+
+ _bt_sortaddtup(npage, IndexTupleSize(newitup), newitup, P_FIRSTKEY);
+ pfree(newipd);
+ pfree(newitup);
+
+ /*
+ * 2) set last item to the P_HIKEY linp
+ * Move 'last' into the high key position on opage
+ * NOTE: Do this because of indextuple deletion algorithm, which
+ * doesn't allow to delete an item while we have unused one before it.
+ */
+ hii = PageGetItemId(opage, P_HIKEY);
+ *hii = *ii;
+ ItemIdSetUnused(ii); /* redundant */
+ ((PageHeader) opage)->pd_lower -= sizeof(ItemIdData);
+
+ /* 3) delete "wrong" high key, insert keytup as P_HIKEY. */
+ _bt_pgupdtup(wstate->index, opage, P_HIKEY, keytup, false, NULL, 0);
+
+ /* 4) form the part of old tuple with ntoleave ipds. And insert it as last tuple. */
+ newlasttup = BtreeFormPackedTuple(keytup, BtreeGetPosting(oitup), ntoleave);
+
+ _bt_sortaddtup(opage, IndexTupleSize(newlasttup), newlasttup, PageGetMaxOffsetNumber(opage)+1);
+
+ pfree(newlasttup);
+ }
+ else
+ {
+ /* The tuple isn't big enough to split it. Handle it as a regular tuple. */
+
+ /*
+ * 1) Shift the last tuple to npage.
+ * Insert it as P_FIRSTKEY.
+ */
+ _bt_sortaddtup(npage, ItemIdGetLength(ii), oitup, P_FIRSTKEY);
+
+ /* 2) set last item to the P_HIKEY linp */
+ /* Move 'last' into the high key position on opage */
+ hii = PageGetItemId(opage, P_HIKEY);
+ *hii = *ii;
+ ItemIdSetUnused(ii); /* redundant */
+ ((PageHeader) opage)->pd_lower -= sizeof(ItemIdData);
+
+ /* 3) delete "wrong" high key, insert keytup as P_HIKEY. */
+ _bt_pgupdtup(wstate->index, opage, P_HIKEY, keytup, false, NULL, 0);
+
+ }
+ pfree(keytup);
+ }
+ else
+ {
+ /*
+ * 1) Shift the last tuple to npage.
+ * Insert it as P_FIRSTKEY.
+ */
+ _bt_sortaddtup(npage, ItemIdGetLength(ii), oitup, P_FIRSTKEY);
+
+ /* 2) set last item to the P_HIKEY linp */
+ /* Move 'last' into the high key position on opage */
+ hii = PageGetItemId(opage, P_HIKEY);
+ *hii = *ii;
+ ItemIdSetUnused(ii); /* redundant */
+ ((PageHeader) opage)->pd_lower -= sizeof(ItemIdData);
+ }
/*
* Link the old page into its parent, using its minimum key. If we
@@ -547,6 +655,7 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
Assert(state->btps_minkey != NULL);
ItemPointerSet(&(state->btps_minkey->t_tid), oblkno, P_HIKEY);
+
_bt_buildadd(wstate, state->btps_next, state->btps_minkey);
pfree(state->btps_minkey);
@@ -554,8 +663,12 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
* Save a copy of the minimum key for the new page. We have to copy
* it off the old page, not the new one, in case we are not at leaf
* level.
+ * We cannot simply copy oitup, because it could be a posting tuple;
+ * it is safer to copy the newly inserted high key instead.
*/
- state->btps_minkey = CopyIndexTuple(oitup);
+ ItemId iihk = PageGetItemId(opage, P_HIKEY);
+ IndexTuple hikey = (IndexTuple) PageGetItem(opage, iihk);
+ state->btps_minkey = CopyIndexTuple(hikey);
/*
* Set the sibling links for both pages.
@@ -590,7 +703,29 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
if (last_off == P_HIKEY)
{
Assert(state->btps_minkey == NULL);
- state->btps_minkey = CopyIndexTuple(itup);
+
+ if (BtreeTupleIsPosting(itup))
+ {
+ Size keytupsz;
+ IndexTuple keytup;
+
+ /*
+ * 0) Form key tuple, that doesn't contain any ipd.
+ * NOTE: key tuple will have blkno & offset suitable for P_HIKEY.
+ * any function that uses keytup should handle them itself.
+ */
+ keytupsz = BtreeGetPostingOffset(itup);
+ keytup = palloc0(keytupsz);
+ memcpy (keytup, itup, keytupsz);
+
+ keytup->t_info &= ~INDEX_SIZE_MASK;
+ keytup->t_info |= keytupsz;
+ ItemPointerSet(&(keytup->t_tid), nblkno, P_HIKEY);
+
+ state->btps_minkey = CopyIndexTuple(keytup);
+ }
+ else
+ state->btps_minkey = CopyIndexTuple(itup);
}
/*
@@ -670,6 +805,71 @@ _bt_uppershutdown(BTWriteState *wstate, BTPageState *state)
}
/*
+ * Prepare SortSupport structure for indextuples comparison
+ */
+static SortSupport
+_bt_prepare_SortSupport(BTWriteState *wstate, int keysz)
+{
+ ScanKey indexScanKey;
+ SortSupport sortKeys;
+ int i;
+
+ /* Prepare SortSupport data for each column */
+ indexScanKey = _bt_mkscankey_nodata(wstate->index);
+ sortKeys = (SortSupport) palloc0(keysz * sizeof(SortSupportData));
+
+ for (i = 0; i < keysz; i++)
+ {
+ SortSupport sortKey = sortKeys + i;
+ ScanKey scanKey = indexScanKey + i;
+ int16 strategy;
+
+ sortKey->ssup_cxt = CurrentMemoryContext;
+ sortKey->ssup_collation = scanKey->sk_collation;
+ sortKey->ssup_nulls_first =
+ (scanKey->sk_flags & SK_BT_NULLS_FIRST) != 0;
+ sortKey->ssup_attno = scanKey->sk_attno;
+ /* Abbreviation is not supported here */
+ sortKey->abbreviate = false;
+
+ AssertState(sortKey->ssup_attno != 0);
+
+ strategy = (scanKey->sk_flags & SK_BT_DESC) != 0 ?
+ BTGreaterStrategyNumber : BTLessStrategyNumber;
+
+ PrepareSortSupportFromIndexRel(wstate->index, strategy, sortKey);
+ }
+
+ _bt_freeskey(indexScanKey);
+ return sortKeys;
+}
+
+/*
+ * Compare two tuples using sortKey on attribute i
+ */
+static int
+_bt_call_comparator(SortSupport sortKeys, int i,
+ IndexTuple itup, IndexTuple itup2, TupleDesc tupdes)
+{
+ SortSupport entry;
+ Datum attrDatum1,
+ attrDatum2;
+ bool isNull1,
+ isNull2;
+ int32 compare;
+
+ entry = sortKeys + i - 1;
+ attrDatum1 = index_getattr(itup, i, tupdes, &isNull1);
+ attrDatum2 = index_getattr(itup2, i, tupdes, &isNull2);
+
+ compare = ApplySortComparator(attrDatum1, isNull1,
+ attrDatum2, isNull2,
+ entry);
+
+ return compare;
+}
+
+/*
* Read tuples in correct sort order from tuplesort, and load them into
* btree leaves.
*/
@@ -679,16 +879,20 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
BTPageState *state = NULL;
bool merge = (btspool2 != NULL);
IndexTuple itup,
- itup2 = NULL;
+ itup2 = NULL,
+ itupprev = NULL;
bool should_free,
should_free2,
load1;
TupleDesc tupdes = RelationGetDescr(wstate->index);
int i,
keysz = RelationGetNumberOfAttributes(wstate->index);
- ScanKey indexScanKey = NULL;
+ int ntuples = 0;
SortSupport sortKeys;
+ /* Prepare SortSupport structure for indextuples comparison */
+ sortKeys = (SortSupport)_bt_prepare_SortSupport(wstate, keysz);
+
if (merge)
{
/*
@@ -701,34 +905,6 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
true, &should_free);
itup2 = tuplesort_getindextuple(btspool2->sortstate,
true, &should_free2);
- indexScanKey = _bt_mkscankey_nodata(wstate->index);
-
- /* Prepare SortSupport data for each column */
- sortKeys = (SortSupport) palloc0(keysz * sizeof(SortSupportData));
-
- for (i = 0; i < keysz; i++)
- {
- SortSupport sortKey = sortKeys + i;
- ScanKey scanKey = indexScanKey + i;
- int16 strategy;
-
- sortKey->ssup_cxt = CurrentMemoryContext;
- sortKey->ssup_collation = scanKey->sk_collation;
- sortKey->ssup_nulls_first =
- (scanKey->sk_flags & SK_BT_NULLS_FIRST) != 0;
- sortKey->ssup_attno = scanKey->sk_attno;
- /* Abbreviation is not supported here */
- sortKey->abbreviate = false;
-
- AssertState(sortKey->ssup_attno != 0);
-
- strategy = (scanKey->sk_flags & SK_BT_DESC) != 0 ?
- BTGreaterStrategyNumber : BTLessStrategyNumber;
-
- PrepareSortSupportFromIndexRel(wstate->index, strategy, sortKey);
- }
-
- _bt_freeskey(indexScanKey);
for (;;)
{
@@ -742,20 +918,8 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
{
for (i = 1; i <= keysz; i++)
{
- SortSupport entry;
- Datum attrDatum1,
- attrDatum2;
- bool isNull1,
- isNull2;
- int32 compare;
-
- entry = sortKeys + i - 1;
- attrDatum1 = index_getattr(itup, i, tupdes, &isNull1);
- attrDatum2 = index_getattr(itup2, i, tupdes, &isNull2);
-
- compare = ApplySortComparator(attrDatum1, isNull1,
- attrDatum2, isNull2,
- entry);
+ int32 compare = _bt_call_comparator(sortKeys, i, itup, itup2, tupdes);
+
if (compare > 0)
{
load1 = false;
@@ -794,16 +958,123 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
else
{
/* merge is unnecessary */
- while ((itup = tuplesort_getindextuple(btspool->sortstate,
+ Relation indexRelation = wstate->index;
+ Form_pg_index index = indexRelation->rd_index;
+
+ if (IsSystemRelation(indexRelation) || index->indisunique)
+ {
+ /* Do not use compression. */
+ while ((itup = tuplesort_getindextuple(btspool->sortstate,
true, &should_free)) != NULL)
+ {
+ /* When we see first tuple, create first index page */
+ if (state == NULL)
+ state = _bt_pagestate(wstate, 0);
+
+ _bt_buildadd(wstate, state, itup);
+ if (should_free)
+ pfree(itup);
+ }
+ }
+ else
{
- /* When we see first tuple, create first index page */
- if (state == NULL)
- state = _bt_pagestate(wstate, 0);
+ ItemPointerData *ipd = NULL;
+ IndexTuple postingtuple;
+ Size maxitemsize = 0,
+ maxpostingsize = 0;
- _bt_buildadd(wstate, state, itup);
- if (should_free)
- pfree(itup);
+ while ((itup = tuplesort_getindextuple(btspool->sortstate,
+ true, &should_free)) != NULL)
+ {
+ /* When we see first tuple, create first index page */
+ if (state == NULL)
+ {
+ state = _bt_pagestate(wstate, 0);
+ maxitemsize = BTMaxItemSize(state->btps_page);
+ }
+
+ /*
+ * Compare current tuple with previous one.
+ * If tuples are equal, we can unite them into a posting list.
+ */
+ if (itupprev != NULL)
+ {
+ if (_bt_isbinaryequal(tupdes, itupprev, index->indnatts, itup))
+ {
+ /* Tuples are equal. Create or update posting */
+ if (ntuples == 0)
+ {
+ /*
+ * We don't have a posting list yet, so allocate
+ * one and save both itupprev and the current tuple.
+ */
+ ipd = palloc0(maxitemsize);
+
+ memcpy(ipd, itupprev, sizeof(ItemPointerData));
+ ntuples++;
+ memcpy(ipd + ntuples, itup, sizeof(ItemPointerData));
+ ntuples++;
+ }
+ else
+ {
+ if ((ntuples+1)*sizeof(ItemPointerData) < maxpostingsize)
+ {
+ memcpy(ipd + ntuples, itup, sizeof(ItemPointerData));
+ ntuples++;
+ }
+ else
+ {
+ postingtuple = BtreeFormPackedTuple(itupprev, ipd, ntuples);
+ _bt_buildadd(wstate, state, postingtuple);
+ ntuples = 0;
+ pfree(ipd);
+ }
+ }
+
+ }
+ else
+ {
+ /* Tuples are not equal. Insert itupprev into index. */
+ if (ntuples == 0)
+ _bt_buildadd(wstate, state, itupprev);
+ else
+ {
+ postingtuple = BtreeFormPackedTuple(itupprev, ipd, ntuples);
+ _bt_buildadd(wstate, state, postingtuple);
+ ntuples = 0;
+ pfree(ipd);
+ }
+ }
+ }
+
+ /*
+ * Copy the tuple into temp variable itupprev
+ * to compare it with the following tuple
+ * and maybe unite them into a posting tuple
+ */
+ itupprev = CopyIndexTuple(itup);
+ if (should_free)
+ pfree(itup);
+
+ /* compute max size of ipd list */
+ maxpostingsize = maxitemsize - IndexInfoFindDataOffset(itupprev->t_info) - MAXALIGN(IndexTupleSize(itupprev));
+ }
+
+ /* Handle the last item.*/
+ if (ntuples == 0)
+ {
+ if (itupprev != NULL)
+ _bt_buildadd(wstate, state, itupprev);
+ }
+ else
+ {
+ Assert(ipd!=NULL);
+ Assert(itupprev != NULL);
+ postingtuple = BtreeFormPackedTuple(itupprev, ipd, ntuples);
+ _bt_buildadd(wstate, state, postingtuple);
+ ntuples = 0;
+ pfree(ipd);
+ }
}
}
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index c850b48..8c9dda1 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -1821,7 +1821,9 @@ _bt_killitems(IndexScanDesc scan)
ItemId iid = PageGetItemId(page, offnum);
IndexTuple ituple = (IndexTuple) PageGetItem(page, iid);
- if (ItemPointerEquals(&ituple->t_tid, &kitem->heapTid))
+ /* No microvacuum for posting tuples */
+ if (!BtreeTupleIsPosting(ituple)
+ && (ItemPointerEquals(&ituple->t_tid, &kitem->heapTid)))
{
/* found the item */
ItemIdMarkDead(iid);
@@ -2063,3 +2065,69 @@ btoptions(Datum reloptions, bool validate)
{
return default_reloptions(reloptions, validate, RELOPT_KIND_BTREE);
}
+
+/*
+ * Form a posting tuple from a basic index tuple (holding the key datum) plus nipd TIDs.
+ */
+IndexTuple
+BtreeFormPackedTuple(IndexTuple tuple, ItemPointerData *data, int nipd)
+{
+ uint32 newsize;
+ IndexTuple itup = CopyIndexTuple(tuple);
+
+ /*
+ * Determine and store offset to the posting list.
+ */
+ newsize = IndexTupleSize(itup);
+ newsize = SHORTALIGN(newsize);
+
+ /*
+ * Set meta info about the posting list.
+ */
+ BtreeSetPostingOffset(itup, newsize);
+ BtreeSetNPosting(itup, nipd);
+ /*
+ * Add space needed for posting list, if any. Then check that the tuple
+ * won't be too big to store.
+ */
+ newsize += sizeof(ItemPointerData)*nipd;
+ newsize = MAXALIGN(newsize);
+
+ /*
+ * Resize tuple if needed
+ */
+ if (newsize != IndexTupleSize(itup))
+ {
+ itup = repalloc(itup, newsize);
+
+ /*
+ * PostgreSQL 9.3 and earlier did not clear this new space, so we
+ * might find uninitialized padding when reading tuples from disk.
+ */
+ memset((char *) itup + IndexTupleSize(itup),
+ 0, newsize - IndexTupleSize(itup));
+ /* set new size in tuple header */
+ itup->t_info &= ~INDEX_SIZE_MASK;
+ itup->t_info |= newsize;
+ }
+
+ /*
+ * Copy data into the posting tuple
+ */
+ memcpy(BtreeGetPosting(itup), data, sizeof(ItemPointerData)*nipd);
+ return itup;
+}
+
+IndexTuple
+BtreeReformPackedTuple(IndexTuple tuple, ItemPointerData *data, int nipd)
+{
+ int size;
+ if (BtreeTupleIsPosting(tuple))
+ {
+ size = BtreeGetPostingOffset(tuple);
+ tuple->t_info &= ~INDEX_SIZE_MASK;
+ tuple->t_info |= size;
+ }
+
+ return BtreeFormPackedTuple(tuple, data, nipd);
+}
diff --git a/src/include/access/itup.h b/src/include/access/itup.h
index 8350fa0..3dd19c0 100644
--- a/src/include/access/itup.h
+++ b/src/include/access/itup.h
@@ -138,7 +138,6 @@ typedef IndexAttributeBitMapData *IndexAttributeBitMap;
((int) ((BLCKSZ - SizeOfPageHeaderData) / \
(MAXALIGN(sizeof(IndexTupleData) + 1) + sizeof(ItemIdData))))
-
/* routines in indextuple.c */
extern IndexTuple index_form_tuple(TupleDesc tupleDescriptor,
Datum *values, bool *isnull);
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 06822fa..dc82ce7 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -122,6 +122,15 @@ typedef struct BTMetaPageData
MAXALIGN(sizeof(BTPageOpaqueData))) / 3)
/*
+ * If compression is applied, the page could contain more tuples
+ * than if it has only uncompressed tuples, so we need new max value.
+ * Note that it is a rough upper estimate.
+ */
+#define MaxPackedIndexTuplesPerPage \
+ ((int) ((BLCKSZ - SizeOfPageHeaderData) / \
+ (sizeof(ItemPointerData))))
+
+/*
* The leaf-page fillfactor defaults to 90% but is user-adjustable.
* For pages above the leaf level, we use a fixed 70% fillfactor.
* The fillfactor is applied during index build and when splitting
@@ -538,6 +547,8 @@ typedef struct BTScanPosData
* location in the associated tuple storage workspace.
*/
int nextTupleOffset;
+ /* prevTupleOffset is for Posting list handling*/
+ int prevTupleOffset;
/*
* The items array is always ordered in index order (ie, increasing
@@ -550,7 +561,7 @@ typedef struct BTScanPosData
int lastItem; /* last valid index in items[] */
int itemIndex; /* current index in items[] */
- BTScanPosItem items[MaxIndexTuplesPerPage]; /* MUST BE LAST */
+ BTScanPosItem items[MaxPackedIndexTuplesPerPage]; /* MUST BE LAST */
} BTScanPosData;
typedef BTScanPosData *BTScanPos;
@@ -651,6 +662,27 @@ typedef BTScanOpaqueData *BTScanOpaque;
#define SK_BT_DESC (INDOPTION_DESC << SK_BT_INDOPTION_SHIFT)
#define SK_BT_NULLS_FIRST (INDOPTION_NULLS_FIRST << SK_BT_INDOPTION_SHIFT)
+
+/*
+ * We use our own ItemPointerGet(BlockNumber|OffsetNumber)
+ * to avoid Asserts, since sometimes the ip_posid isn't "valid"
+ */
+#define BtreeItemPointerGetBlockNumber(pointer) \
+ BlockIdGetBlockNumber(&(pointer)->ip_blkid)
+
+#define BtreeItemPointerGetOffsetNumber(pointer) \
+ ((pointer)->ip_posid)
+
+#define BT_POSTING (1U<<31)
+#define BtreeGetNPosting(itup) BtreeItemPointerGetOffsetNumber(&(itup)->t_tid)
+#define BtreeSetNPosting(itup,n) ItemPointerSetOffsetNumber(&(itup)->t_tid,n)
+
+#define BtreeGetPostingOffset(itup) (BtreeItemPointerGetBlockNumber(&(itup)->t_tid) & (~BT_POSTING))
+#define BtreeSetPostingOffset(itup,n) ItemPointerSetBlockNumber(&(itup)->t_tid,(n)|BT_POSTING)
+#define BtreeTupleIsPosting(itup) (BtreeItemPointerGetBlockNumber(&(itup)->t_tid) & BT_POSTING)
+#define BtreeGetPosting(itup) (ItemPointerData*) ((char*)(itup) + BtreeGetPostingOffset(itup))
+#define BtreeGetPostingN(itup,n) (ItemPointerData*) (BtreeGetPosting(itup) + n)
+
/*
* prototypes for functions in nbtree.c (external entry points for btree)
*/
@@ -684,6 +716,9 @@ extern bool _bt_doinsert(Relation rel, IndexTuple itup,
IndexUniqueCheck checkUnique, Relation heapRel);
extern Buffer _bt_getstackbuf(Relation rel, BTStack stack, int access);
extern void _bt_finish_split(Relation rel, Buffer bbuf, BTStack stack);
+extern void _bt_pgupdtup(Relation rel, Page page, OffsetNumber offset, IndexTuple itup,
+ bool concat, IndexTuple olditup, int nipd);
+extern bool _bt_isbinaryequal(TupleDesc itupdesc, IndexTuple itup, int nindatts, IndexTuple ituptoinsert);
/*
* prototypes for functions in nbtpage.c
@@ -715,8 +750,8 @@ extern BTStack _bt_search(Relation rel,
extern Buffer _bt_moveright(Relation rel, Buffer buf, int keysz,
ScanKey scankey, bool nextkey, bool forupdate, BTStack stack,
int access);
-extern OffsetNumber _bt_binsrch(Relation rel, Buffer buf, int keysz,
- ScanKey scankey, bool nextkey);
+extern OffsetNumber _bt_binsrch( Relation rel, Buffer buf, int keysz,
+ ScanKey scankey, bool nextkey);
extern int32 _bt_compare(Relation rel, int keysz, ScanKey scankey,
Page page, OffsetNumber offnum);
extern bool _bt_first(IndexScanDesc scan, ScanDirection dir);
@@ -747,6 +782,8 @@ extern void _bt_end_vacuum_callback(int code, Datum arg);
extern Size BTreeShmemSize(void);
extern void BTreeShmemInit(void);
extern bytea *btoptions(Datum reloptions, bool validate);
+extern IndexTuple BtreeFormPackedTuple(IndexTuple tuple, ItemPointerData *data, int nipd);
+extern IndexTuple BtreeReformPackedTuple(IndexTuple tuple, ItemPointerData *data, int nipd);
/*
* prototypes for functions in nbtvalidate.c
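A note for reviewers reading the patch above: the posting-list vacuum step (btreevacuumPosting) can be tried outside the server. The following is only a minimal standalone sketch of the same lazy-copy idea, assuming plain int items and malloc in place of ItemPointerData and palloc. The names sketch_vacuum_list, sketch_dead_fn and sketch_is_even are hypothetical and do not exist in PostgreSQL.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical predicate type: returns nonzero if the item must be removed. */
typedef int (*sketch_dead_fn) (int item, void *state);

/*
 * Returns NULL if nothing was removed (the caller keeps using items[]).
 * Otherwise returns a malloc'd array holding the surviving items.
 * *nremaining is always set to the number of surviving items.
 */
static int *
sketch_vacuum_list(const int *items, int nitems,
                   sketch_dead_fn is_dead, void *state, int *nremaining)
{
    int        *kept = NULL;
    int         remaining = 0;

    for (int i = 0; i < nitems; i++)
    {
        if (is_dead(items[i], state))
        {
            if (kept == NULL)
            {
                /* First removal: allocate, then copy the i survivors seen so far. */
                kept = malloc(sizeof(int) * (size_t) nitems);
                if (kept == NULL)
                    abort();
                memcpy(kept, items, sizeof(int) * (size_t) i);
            }
        }
        else
        {
            /* Survivor: copy it only if a removal has already happened. */
            if (kept != NULL)
                kept[remaining] = items[i];
            remaining++;
        }
    }

    *nremaining = remaining;
    return kept;
}

/* Example predicate (an assumption for illustration): even values are dead. */
static int
sketch_is_even(int item, void *state)
{
    (void) state;
    return item % 2 == 0;
}
```

As in the patch, a non-NULL result with *nremaining equal to zero means every item was removed, which is the case where the caller deletes the whole index tuple instead of rewriting it.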
Hi Anastasia,
On 2/18/16 12:29 PM, Anastasia Lubennikova wrote:
18.02.2016 20:18, Anastasia Lubennikova:
04.02.2016 20:16, Peter Geoghegan:
On Fri, Jan 29, 2016 at 8:50 AM, Anastasia Lubennikova
<a.lubennikova@postgrespro.ru> wrote:
I fixed it in the new version (attached).
Thank you for the review.
At last, there is a new patch version 3.0. After some refactoring it
looks much better.
I described all details of the compression in this document
https://goo.gl/50O8Q0 (the same text without pictures is attached in
btc_readme_1.0.txt).
Consider it as a rough copy of readme. It contains some notes about
tricky moments of implementation and questions about future work.
Please don't hesitate to comment on it.
Sorry, the previous patch was dirty. A hotfix is attached.
This looks like an extremely valuable optimization for btree indexes but
unfortunately it is not getting a lot of attention. It still applies
cleanly for anyone interested in reviewing.
It's not clear to me that you answered all of Peter's questions in [1].
I understand that you've provided a README but it may not be clear if
the answers are in there (and where).
Also, at the end of the README it says:
13. Xlog. TODO.
Does that mean the patch is not yet complete?
Thanks,
--
-David
david@pgmasters.net
[1]: /messages/by-id/CAM3SWZQ3_PLQCH4w7uQ8q_f2t4HEseKTr2n0rQ5pxA18OeRTJw@mail.gmail.com
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
14.03.2016 16:02, David Steele:
Hi Anastasia,
This looks like an extremely valuable optimization for btree indexes
but unfortunately it is not getting a lot of attention. It still
applies cleanly for anyone interested in reviewing.
Thank you for your attention.
I would be grateful to any reviewers who can try this patch on
real data and workloads (WAL excluded for now).
B-tree changes need a lot of testing.
It's not clear to me that you answered all of Peter's questions in
[1]. I understand that you've provided a README but it may not be
clear if the answers are in there (and where).
The README covers all the points Peter asked about,
but I see that it would be better to answer them directly.
Thanks for the reminder; I'll do it tomorrow.
Also, at the end of the README it says:
13. Xlog. TODO.
Does that mean the patch is not yet complete?
Yes, you're right.
Frankly speaking, I had hoped that someone would help me with that part,
but I have now almost completed it myself. I'll send an updated patch in the next message.
I still have doubts about some details of the patch. I marked them in the README
(in bold type),
but they are mostly about future improvements.
--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Please find the new version of the patch attached. It now has WAL
functionality.
A detailed description of the feature is in the README draft:
https://goo.gl/50O8Q0
This patch is pretty complicated, so I ask everyone who is interested in
this feature
to help with reviewing and testing it. I will be grateful for any feedback.
But please don't complain about code style; it is still a work in progress.
Next things I'm going to do:
1. More debugging and testing. I'm going to attach a couple of SQL scripts
for testing in the next message.
2. Fix NULLs processing.
3. Add a flag to pg_index that allows enabling or disabling compression
for each particular index.
4. Recheck locking considerations. I tried to make the code as
non-invasive as possible, but we need to make sure that the algorithm is still
correct.
5. Change BTMaxItemSize.
6. Bring back microvacuum functionality.
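For anyone who wants to poke at the posting-tuple layout before reading the whole patch, here is a minimal standalone sketch of how the BT_POSTING macros in nbtree.h reuse the t_tid field: the high bit of the block number flags a posting tuple, the remaining bits hold the byte offset of the TID array inside the tuple, and the offset number holds the TID count. This is an illustration only. TidSketch and the sketch_* functions are hypothetical stand-ins, not PostgreSQL definitions, and the sketch assumes a 32-bit block number and a 16-bit offset number.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for ItemPointerData. */
typedef struct
{
    uint32_t    blkno;      /* reused: posting flag + byte offset of TID array */
    uint16_t    posid;      /* reused: number of TIDs in the posting list */
} TidSketch;

/* Unsigned constant, so that shifting into bit 31 is well defined. */
#define SKETCH_POSTING_FLAG ((uint32_t) 1 << 31)

/* Mark the tuple as a posting tuple and record where its TID array starts. */
static void
sketch_set_posting(TidSketch *tid, uint32_t posting_offset, uint16_t ntids)
{
    tid->blkno = posting_offset | SKETCH_POSTING_FLAG;
    tid->posid = ntids;
}

/* Nonzero if the high bit of the block number is set. */
static int
sketch_is_posting(const TidSketch *tid)
{
    return (tid->blkno & SKETCH_POSTING_FLAG) != 0;
}

/* Byte offset of the TID array, with the flag bit masked out. */
static uint32_t
sketch_posting_offset(const TidSketch *tid)
{
    return tid->blkno & ~SKETCH_POSTING_FLAG;
}

/* Number of TIDs stored in the posting list. */
static uint16_t
sketch_nposting(const TidSketch *tid)
{
    return tid->posid;
}
```

One consequence of this layout is that an ordinary tuple whose heap block number has the high bit set would be misread as a posting tuple, so the scheme relies on real block numbers staying below 2^31.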
--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachments:
btree_compression_4.0.patch (text/x-patch)
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index e3c55eb..72acc0f 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -24,6 +24,8 @@
#include "storage/predicate.h"
#include "utils/tqual.h"
+#include "catalog/catalog.h"
+#include "utils/datum.h"
typedef struct
{
@@ -82,6 +84,7 @@ static bool _bt_pgaddtup(Page page, Size itemsize, IndexTuple itup,
OffsetNumber itup_off);
static bool _bt_isequal(TupleDesc itupdesc, Page page, OffsetNumber offnum,
int keysz, ScanKey scankey);
+
static void _bt_vacuum_one_page(Relation rel, Buffer buffer, Relation heapRel);
@@ -113,6 +116,11 @@ _bt_doinsert(Relation rel, IndexTuple itup,
BTStack stack;
Buffer buf;
OffsetNumber offset;
+ Page page;
+ TupleDesc itupdesc;
+ int nipd;
+ IndexTuple olditup;
+ Size sizetoadd;
/* we need an insertion scan key to do our search, so build one */
itup_scankey = _bt_mkscankey(rel, itup);
@@ -190,6 +198,7 @@ top:
if (checkUnique != UNIQUE_CHECK_EXISTING)
{
+ bool updposting = false;
/*
* The only conflict predicate locking cares about for indexes is when
* an index tuple insert conflicts with an existing lock. Since the
@@ -201,7 +210,42 @@ top:
/* do the insertion */
_bt_findinsertloc(rel, &buf, &offset, natts, itup_scankey, itup,
stack, heapRel);
- _bt_insertonpg(rel, buf, InvalidBuffer, stack, itup, offset, false);
+
+ /*
+ * Decide whether we can apply compression.
+ */
+ page = BufferGetPage(buf);
+
+ if(!IsSystemRelation(rel)
+ && !rel->rd_index->indisunique
+ && offset != InvalidOffsetNumber
+ && offset <= PageGetMaxOffsetNumber(page))
+ {
+ itupdesc = RelationGetDescr(rel);
+ sizetoadd = sizeof(ItemPointerData);
+ olditup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offset));
+
+ if(_bt_isbinaryequal(itupdesc, olditup,
+ rel->rd_index->indnatts, itup))
+ {
+ if (!BtreeTupleIsPosting(olditup))
+ {
+ nipd = 1;
+ sizetoadd = sizetoadd*2;
+ }
+ else
+ nipd = BtreeGetNPosting(olditup);
+
+ if ((IndexTupleSize(olditup) + sizetoadd) <= BTMaxItemSize(page)
+ && PageGetFreeSpace(page) > sizetoadd)
+ updposting = true;
+ }
+ }
+
+ if (updposting)
+ _bt_pgupdtup(rel, buf, offset, itup, olditup, nipd);
+ else
+ _bt_insertonpg(rel, buf, InvalidBuffer, stack, itup, offset, false);
}
else
{
@@ -1042,6 +1086,7 @@ _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
itemid = PageGetItemId(origpage, P_HIKEY);
itemsz = ItemIdGetLength(itemid);
item = (IndexTuple) PageGetItem(origpage, itemid);
+
if (PageAddItem(rightpage, (Item) item, itemsz, rightoff,
false, false) == InvalidOffsetNumber)
{
@@ -1072,13 +1117,39 @@ _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
itemsz = ItemIdGetLength(itemid);
item = (IndexTuple) PageGetItem(origpage, itemid);
}
- if (PageAddItem(leftpage, (Item) item, itemsz, leftoff,
+
+ if (BtreeTupleIsPosting(item))
+ {
+ Size hikeysize = BtreeGetPostingOffset(item);
+ IndexTuple hikey = palloc0(hikeysize);
+
+ /* Truncate the posting list before inserting the tuple as a hikey. */
+ memcpy (hikey, item, hikeysize);
+ hikey->t_info &= ~INDEX_SIZE_MASK;
+ hikey->t_info |= hikeysize;
+ ItemPointerSet(&(hikey->t_tid), origpagenumber, P_HIKEY);
+
+ if (PageAddItem(leftpage, (Item) hikey, hikeysize, leftoff,
false, false) == InvalidOffsetNumber)
+ {
+ memset(rightpage, 0, BufferGetPageSize(rbuf));
+ elog(ERROR, "failed to add hikey to the left sibling"
+ " while splitting block %u of index \"%s\"",
+ origpagenumber, RelationGetRelationName(rel));
+ }
+
+ pfree(hikey);
+ }
+ else
{
- memset(rightpage, 0, BufferGetPageSize(rbuf));
- elog(ERROR, "failed to add hikey to the left sibling"
- " while splitting block %u of index \"%s\"",
- origpagenumber, RelationGetRelationName(rel));
+ if (PageAddItem(leftpage, (Item) item, itemsz, leftoff,
+ false, false) == InvalidOffsetNumber)
+ {
+ memset(rightpage, 0, BufferGetPageSize(rbuf));
+ elog(ERROR, "failed to add hikey to the left sibling"
+ " while splitting block %u of index \"%s\"",
+ origpagenumber, RelationGetRelationName(rel));
+ }
}
leftoff = OffsetNumberNext(leftoff);
@@ -2103,6 +2174,120 @@ _bt_pgaddtup(Page page,
}
/*
+ * _bt_pgupdtup() -- update a tuple in place.
+ * This function is used for purposes of deduplication of item pointers.
+ *
+ * If new tuple to insert is equal to the tuple that already exists
+ * on the page, we can avoid key insertion and just add new item pointer.
+ *
+ * offset is the position of olditup on the page.
+ * itup is the new tuple to insert.
+ * olditup is the old tuple itself.
+ * nipd is the number of item pointers in old tuple.
+ * The caller is responsible for checking of free space on the page.
+ */
+void
+_bt_pgupdtup(Relation rel, Buffer buf, OffsetNumber offset, IndexTuple itup,
+ IndexTuple olditup, int nipd)
+{
+ ItemPointerData *ipd;
+ IndexTuple newitup;
+ Size newitupsz;
+ Page page;
+
+ page = BufferGetPage(buf);
+
+ ipd = palloc0(sizeof(ItemPointerData)*(nipd + 1));
+
+ /* copy item pointers from old tuple into ipd */
+ if (BtreeTupleIsPosting(olditup))
+ memcpy(ipd, BtreeGetPosting(olditup), sizeof(ItemPointerData)*nipd);
+ else
+ memcpy(ipd, olditup, sizeof(ItemPointerData));
+
+ /* add item pointer of the new tuple into ipd */
+ memcpy(ipd+nipd, itup, sizeof(ItemPointerData));
+
+ newitup = BtreeReformPackedTuple(itup, ipd, nipd+1);
+
+ /*
+ * Update the tuple in place. We have already checked that the
+ * new tuple would fit into this page, so it's safe to delete
+ * old tuple and insert the new one without any side effects.
+ */
+ newitupsz = IndexTupleDSize(*newitup);
+ newitupsz = MAXALIGN(newitupsz);
+
+
+ START_CRIT_SECTION();
+
+ PageIndexTupleDelete(page, offset);
+
+ if (!_bt_pgaddtup(page, newitupsz, newitup, offset))
+ elog(ERROR, "failed to insert compressed item in index \"%s\"",
+ RelationGetRelationName(rel));
+
+ MarkBufferDirty(buf);
+
+ /* Xlog stuff */
+ if (RelationNeedsWAL(rel))
+ {
+ xl_btree_insert xlrec;
+ uint8 xlinfo;
+ XLogRecPtr recptr;
+ BTPageOpaque pageop = (BTPageOpaque) PageGetSpecialPointer(page);
+
+ xlrec.offnum = offset;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, SizeOfBtreeInsert);
+
+ /* TODO: add some Xlog stuff for inner pages?
+ * Not sure if we really need it. */
+ Assert(P_ISLEAF(pageop));
+ xlinfo = XLOG_BTREE_UPDATE_TUPLE;
+
+ /* Read comments in _bt_pgaddtup */
+ XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
+
+ XLogRegisterBufData(0, (char *) itup, IndexTupleDSize(*itup));
+
+ recptr = XLogInsert(RM_BTREE_ID, xlinfo);
+
+ PageSetLSN(page, recptr);
+ }
+
+ END_CRIT_SECTION();
+
+ pfree(ipd);
+ pfree(newitup);
+ _bt_relbuf(rel, buf);
+}
+
+/*
+ * _bt_pgrewritetup() -- update a tuple in place.
+ * This function is used for handling compressed tuples.
+ * It is used to update a compressed tuple after vacuuming
+ * and to rewrite the hikey while building the index.
+ * offset is the position of olditup on the page.
+ * itup is the new tuple to insert
+ * The caller is responsible for checking of free space on the page.
+ */
+void
+_bt_pgrewritetup(Relation rel, Buffer buf, Page page, OffsetNumber offset, IndexTuple itup)
+{
+ START_CRIT_SECTION();
+
+ PageIndexTupleDelete(page, offset);
+
+ if (!_bt_pgaddtup(page, IndexTupleSize(itup), itup, offset))
+ elog(ERROR, "failed to rewrite compressed item in index \"%s\"",
+ RelationGetRelationName(rel));
+
+ END_CRIT_SECTION();
+}
+
+/*
* _bt_isequal - used in _bt_doinsert in check for duplicates.
*
* This is very similar to _bt_compare, except for NULL handling.
@@ -2151,6 +2336,63 @@ _bt_isequal(TupleDesc itupdesc, Page page, OffsetNumber offnum,
}
/*
+ * _bt_isbinaryequal - used in _bt_doinsert and _bt_load
+ * in check for duplicates. This is very similar to heap_tuple_attr_equals
+ * subroutine. And this function differs from _bt_isequal
+ * because here we require strict binary equality of tuples.
+ */
+bool
+_bt_isbinaryequal(TupleDesc itupdesc, IndexTuple itup,
+ int nindatts, IndexTuple ituptoinsert)
+{
+ AttrNumber attno;
+
+ for (attno = 1; attno <= nindatts; attno++)
+ {
+ Datum datum1,
+ datum2;
+ bool isnull1,
+ isnull2;
+ Form_pg_attribute att;
+
+ datum1 = index_getattr(itup, attno, itupdesc, &isnull1);
+ datum2 = index_getattr(ituptoinsert, attno, itupdesc, &isnull2);
+
+ /*
+ * If one value is NULL and other is not, then they are certainly not
+ * equal
+ */
+ if (isnull1 != isnull2)
+ return false;
+ /*
+ * We do simple binary comparison of the two datums. This may be overly
+ * strict because there can be multiple binary representations for the
+ * same logical value. But we should be OK as long as there are no false
+ * positives. Using a type-specific equality operator is messy because
+ * there could be multiple notions of equality in different operator
+ * classes; furthermore, we cannot safely invoke user-defined functions
+ * while holding exclusive buffer lock.
+ */
+ if (attno <= 0)
+ {
+ /* The only allowed system columns are OIDs, so do this */
+ if (DatumGetObjectId(datum1) != DatumGetObjectId(datum2))
+ return false;
+ }
+ else
+ {
+ Assert(attno <= itupdesc->natts);
+ att = itupdesc->attrs[attno - 1];
+ if(!datumIsEqual(datum1, datum2, att->attbyval, att->attlen))
+ return false;
+ }
+ }
+
+ /* if we get here, the keys are equal */
+ return true;
+}
+
+/*
* _bt_vacuum_one_page - vacuum just one index page.
*
* Try to remove LP_DEAD items from the given page. The passed buffer
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 67755d7..53c30d2 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -787,15 +787,36 @@ _bt_page_recyclable(Page page)
void
_bt_delitems_vacuum(Relation rel, Buffer buf,
OffsetNumber *itemnos, int nitems,
+ OffsetNumber *remainingoffset, IndexTuple *remaining, int nremaining,
BlockNumber lastBlockVacuumed)
{
Page page = BufferGetPage(buf);
BTPageOpaque opaque;
+ int i;
+ Size itemsz;
/* No ereport(ERROR) until changes are logged */
START_CRIT_SECTION();
- /* Fix the page */
+ /* Handle compressed tuples here. */
+ for (i = 0; i < nremaining; i++)
+ {
+ /* At first, delete the old tuple.*/
+ PageIndexTupleDelete(page, remainingoffset[i]);
+
+ itemsz = IndexTupleSize(remaining[i]);
+ itemsz = MAXALIGN(itemsz);
+
+ /* Add tuple with remaining ItemPointers to the page.*/
+ if (PageAddItem(page, (Item) remaining[i], itemsz, remainingoffset[i],
+ false, false) == InvalidOffsetNumber)
+ elog(PANIC, "failed to rewrite compressed item in index while doing vacuum");
+ }
+
+ /* Fix the page.
+ * After dealing with posting tuples,
+ * just delete all tuples to be deleted.
+ */
if (nitems > 0)
PageIndexMultiDelete(page, itemnos, nitems);
@@ -824,12 +845,28 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
xl_btree_vacuum xlrec_vacuum;
xlrec_vacuum.lastBlockVacuumed = lastBlockVacuumed;
+ xlrec_vacuum.nremaining = nremaining;
+ xlrec_vacuum.ndeleted = nitems;
XLogBeginInsert();
XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
XLogRegisterData((char *) &xlrec_vacuum, SizeOfBtreeVacuum);
/*
+ * Here we should save offnums and remaining tuples themselves.
+ * It's important to restore them in correct order.
+ * At first, we must handle remaining tuples and only after that
+ * other deleted items.
+ */
+ if (nremaining > 0)
+ {
+ int i;
+ XLogRegisterBufData(0, (char *) remainingoffset, nremaining * sizeof(OffsetNumber));
+ for (i = 0; i < nremaining; i++)
+ XLogRegisterBufData(0, (char *) remaining[i], IndexTupleSize(remaining[i]));
+ }
+
+ /*
* The target-offsets array is not in the buffer, but pretend that it
* is. When XLogInsert stores the whole buffer, the offsets array
* need not be stored too.
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index f2905cb..39e125f 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -74,7 +74,8 @@ static void btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
BTCycleId cycleid);
static void btvacuumpage(BTVacState *vstate, BlockNumber blkno,
BlockNumber orig_blkno);
-
+static ItemPointer btreevacuumPosting(BTVacState *vstate,
+ ItemPointerData *items,int nitem, int *nremaining);
/*
* Btree handler function: return IndexAmRoutine with access method parameters
@@ -861,7 +862,7 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
RBM_NORMAL, info->strategy);
LockBufferForCleanup(buf);
_bt_checkpage(rel, buf);
- _bt_delitems_vacuum(rel, buf, NULL, 0, vstate.lastBlockVacuumed);
+ _bt_delitems_vacuum(rel, buf, NULL, 0, NULL, NULL, 0, vstate.lastBlockVacuumed);
_bt_relbuf(rel, buf);
}
@@ -962,6 +963,9 @@ restart:
OffsetNumber offnum,
minoff,
maxoff;
+ IndexTuple remaining[MaxOffsetNumber];
+ OffsetNumber remainingoffset[MaxOffsetNumber];
+ int nremaining;
/*
* Trade in the initial read lock for a super-exclusive write lock on
@@ -998,6 +1002,7 @@ restart:
* callback function.
*/
ndeletable = 0;
+ nremaining = 0;
minoff = P_FIRSTDATAKEY(opaque);
maxoff = PageGetMaxOffsetNumber(page);
if (callback)
@@ -1011,31 +1016,75 @@ restart:
itup = (IndexTuple) PageGetItem(page,
PageGetItemId(page, offnum));
- htup = &(itup->t_tid);
-
- /*
- * During Hot Standby we currently assume that
- * XLOG_BTREE_VACUUM records do not produce conflicts. That is
- * only true as long as the callback function depends only
- * upon whether the index tuple refers to heap tuples removed
- * in the initial heap scan. When vacuum starts it derives a
- * value of OldestXmin. Backends taking later snapshots could
- * have a RecentGlobalXmin with a later xid than the vacuum's
- * OldestXmin, so it is possible that row versions deleted
- * after OldestXmin could be marked as killed by other
- * backends. The callback function *could* look at the index
- * tuple state in isolation and decide to delete the index
- * tuple, though currently it does not. If it ever did, we
- * would need to reconsider whether XLOG_BTREE_VACUUM records
- * should cause conflicts. If they did cause conflicts they
- * would be fairly harsh conflicts, since we haven't yet
- * worked out a way to pass a useful value for
- * latestRemovedXid on the XLOG_BTREE_VACUUM records. This
- * applies to *any* type of index that marks index tuples as
- * killed.
- */
- if (callback(htup, callback_state))
- deletable[ndeletable++] = offnum;
+ if(BtreeTupleIsPosting(itup))
+ {
+ ItemPointer newipd;
+ int nipd,
+ nnewipd;
+
+ nipd = BtreeGetNPosting(itup);
+
+ /*
+ * Delete from the posting list all ItemPointers
+ * which are no longer valid. newipd contains the list of remaining
+ * ItemPointers or NULL if none of the items need to be removed.
+ */
+ newipd = btreevacuumPosting(vstate, BtreeGetPosting(itup), nipd, &nnewipd);
+
+ if (newipd != NULL)
+ {
+ if (nnewipd > 0)
+ {
+ /*
+ * There are still some live tuples in the posting.
+ * We should update this tuple in place. It'll be done later
+ * in _bt_delitems_vacuum(). To do that we need to save
+ * information about the tuple. remainingoffset - offset of the
+ * old tuple to be deleted. And new tuple to insert on the same
+ * position, which contains remaining ItemPointers.
+ */
+ remainingoffset[nremaining] = offnum;
+ remaining[nremaining] = BtreeReformPackedTuple(itup, newipd, nnewipd);
+ nremaining++;
+ }
+ else
+ {
+ /*
+ * If all ItemPointers should be deleted,
+ * we can delete this tuple in a regular way.
+ */
+ deletable[ndeletable++] = offnum;
+ }
+ }
+ }
+ else
+ {
+ htup = &(itup->t_tid);
+
+ /*
+ * During Hot Standby we currently assume that
+ * XLOG_BTREE_VACUUM records do not produce conflicts. That is
+ * only true as long as the callback function depends only
+ * upon whether the index tuple refers to heap tuples removed
+ * in the initial heap scan. When vacuum starts it derives a
+ * value of OldestXmin. Backends taking later snapshots could
+ * have a RecentGlobalXmin with a later xid than the vacuum's
+ * OldestXmin, so it is possible that row versions deleted
+ * after OldestXmin could be marked as killed by other
+ * backends. The callback function *could* look at the index
+ * tuple state in isolation and decide to delete the index
+ * tuple, though currently it does not. If it ever did, we
+ * would need to reconsider whether XLOG_BTREE_VACUUM records
+ * should cause conflicts. If they did cause conflicts they
+ * would be fairly harsh conflicts, since we haven't yet
+ * worked out a way to pass a useful value for
+ * latestRemovedXid on the XLOG_BTREE_VACUUM records. This
+ * applies to *any* type of index that marks index tuples as
+ * killed.
+ */
+ if (callback(htup, callback_state))
+ deletable[ndeletable++] = offnum;
+ }
}
}
@@ -1043,7 +1092,7 @@ restart:
* Apply any needed deletes. We issue just one _bt_delitems_vacuum()
* call per page, so as to minimize WAL traffic.
*/
- if (ndeletable > 0)
+ if (ndeletable > 0 || nremaining > 0)
{
BlockNumber lastBlockVacuumed = InvalidBlockNumber;
@@ -1070,7 +1119,7 @@ restart:
* doesn't seem worth the amount of bookkeeping it'd take to avoid
* that.
*/
- _bt_delitems_vacuum(rel, buf, deletable, ndeletable,
+ _bt_delitems_vacuum(rel, buf, deletable, ndeletable, remainingoffset, remaining, nremaining,
lastBlockVacuumed);
/*
@@ -1160,3 +1209,50 @@ btcanreturn(Relation index, int attno)
{
return true;
}
+
+/*
+ * btreevacuumPosting() -- vacuums a posting list.
+ * The size of the list must be specified via the number of items (nitem).
+ *
+ * If none of the items need to be removed, returns NULL. Otherwise returns
+ * a new palloc'd array with the remaining items. The number of remaining
+ * items is returned via nremaining.
+ */
+ItemPointer
+btreevacuumPosting(BTVacState *vstate, ItemPointerData *items,
+ int nitem, int *nremaining)
+{
+ int i,
+ remaining = 0;
+ ItemPointer tmpitems = NULL;
+ IndexBulkDeleteCallback callback = vstate->callback;
+ void *callback_state = vstate->callback_state;
+
+ /*
+ * Iterate over TIDs array
+ */
+ for (i = 0; i < nitem; i++)
+ {
+ if (callback(items + i, callback_state))
+ {
+ if (!tmpitems)
+ {
+ /*
+ * First TID to be deleted: allocate memory to hold the
+ * remaining items.
+ */
+ tmpitems = palloc(sizeof(ItemPointerData) * nitem);
+ memcpy(tmpitems, items, sizeof(ItemPointerData) * i);
+ }
+ }
+ else
+ {
+ if (tmpitems)
+ tmpitems[remaining] = items[i];
+ remaining++;
+ }
+ }
+
+ *nremaining = remaining;
+ return tmpitems;
+}
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 14dffe0..2cb1769 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -29,6 +29,8 @@ static bool _bt_readpage(IndexScanDesc scan, ScanDirection dir,
OffsetNumber offnum);
static void _bt_saveitem(BTScanOpaque so, int itemIndex,
OffsetNumber offnum, IndexTuple itup);
+static void _bt_savePostingitem(BTScanOpaque so, int itemIndex,
+ OffsetNumber offnum, ItemPointer iptr, IndexTuple itup, int i);
static bool _bt_steppage(IndexScanDesc scan, ScanDirection dir);
static Buffer _bt_walk_left(Relation rel, Buffer buf);
static bool _bt_endpoint(IndexScanDesc scan, ScanDirection dir);
@@ -1134,6 +1136,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
int itemIndex;
IndexTuple itup;
bool continuescan;
+ int i;
/*
* We must have the buffer pinned and locked, but the usual macro can't be
@@ -1168,6 +1171,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
/* initialize tuple workspace to empty */
so->currPos.nextTupleOffset = 0;
+ so->currPos.prevTupleOffset = 0;
/*
* Now that the current page has been made consistent, the macro should be
@@ -1188,8 +1192,19 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
if (itup != NULL)
{
/* tuple passes all scan key conditions, so remember it */
- _bt_saveitem(so, itemIndex, offnum, itup);
- itemIndex++;
+ if (BtreeTupleIsPosting(itup))
+ {
+ for (i = 0; i < BtreeGetNPosting(itup); i++)
+ {
+ _bt_savePostingitem(so, itemIndex, offnum, BtreeGetPostingN(itup, i), itup, i);
+ itemIndex++;
+ }
+ }
+ else
+ {
+ _bt_saveitem(so, itemIndex, offnum, itup);
+ itemIndex++;
+ }
}
if (!continuescan)
{
@@ -1201,7 +1216,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
offnum = OffsetNumberNext(offnum);
}
- Assert(itemIndex <= MaxIndexTuplesPerPage);
+ Assert(itemIndex <= MaxPackedIndexTuplesPerPage);
so->currPos.firstItem = 0;
so->currPos.lastItem = itemIndex - 1;
so->currPos.itemIndex = 0;
@@ -1209,7 +1224,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
else
{
/* load items[] in descending order */
- itemIndex = MaxIndexTuplesPerPage;
+ itemIndex = MaxPackedIndexTuplesPerPage;
offnum = Min(offnum, maxoff);
@@ -1219,8 +1234,20 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
if (itup != NULL)
{
/* tuple passes all scan key conditions, so remember it */
- itemIndex--;
- _bt_saveitem(so, itemIndex, offnum, itup);
+ if (BtreeTupleIsPosting(itup))
+ {
+ for (i = 0; i < BtreeGetNPosting(itup); i++)
+ {
+ itemIndex--;
+ _bt_savePostingitem(so, itemIndex, offnum, BtreeGetPostingN(itup, i), itup, i);
+ }
+ }
+ else
+ {
+ itemIndex--;
+ _bt_saveitem(so, itemIndex, offnum, itup);
+ }
+
}
if (!continuescan)
{
@@ -1234,8 +1261,8 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
Assert(itemIndex >= 0);
so->currPos.firstItem = itemIndex;
- so->currPos.lastItem = MaxIndexTuplesPerPage - 1;
- so->currPos.itemIndex = MaxIndexTuplesPerPage - 1;
+ so->currPos.lastItem = MaxPackedIndexTuplesPerPage - 1;
+ so->currPos.itemIndex = MaxPackedIndexTuplesPerPage - 1;
}
return (so->currPos.firstItem <= so->currPos.lastItem);
@@ -1261,6 +1288,37 @@ _bt_saveitem(BTScanOpaque so, int itemIndex,
}
/*
+ * Save an index item into so->currPos.items[itemIndex]
+ * When performing an index-only scan, handle the first element separately.
+ * Save the key once, and connect it with posting tids using tupleOffset.
+ */
+static void
+_bt_savePostingitem(BTScanOpaque so, int itemIndex,
+ OffsetNumber offnum, ItemPointer iptr, IndexTuple itup, int i)
+{
+ BTScanPosItem *currItem = &so->currPos.items[itemIndex];
+
+ currItem->heapTid = *iptr;
+ currItem->indexOffset = offnum;
+
+ if (so->currTuples)
+ {
+ if (i == 0)
+ {
+ /* save key. the same for all tuples in the posting */
+ Size itupsz = BtreeGetPostingOffset(itup);
+ currItem->tupleOffset = so->currPos.nextTupleOffset;
+ memcpy(so->currTuples + so->currPos.nextTupleOffset, itup, itupsz);
+ so->currPos.nextTupleOffset += MAXALIGN(itupsz);
+ so->currPos.prevTupleOffset = currItem->tupleOffset;
+ }
+ else
+ currItem->tupleOffset = so->currPos.prevTupleOffset;
+ }
+}
+
+
+/*
* _bt_steppage() -- Step to next page containing valid data for scan
*
* On entry, if so->currPos.buf is valid the buffer is pinned but not locked;
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 99a014e..906b9df 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -75,7 +75,7 @@
#include "utils/rel.h"
#include "utils/sortsupport.h"
#include "utils/tuplesort.h"
-
+#include "catalog/catalog.h"
/*
* Status record for spooling/sorting phase. (Note we may have two of
@@ -136,6 +136,9 @@ static void _bt_sortaddtup(Page page, Size itemsize,
static void _bt_buildadd(BTWriteState *wstate, BTPageState *state,
IndexTuple itup);
static void _bt_uppershutdown(BTWriteState *wstate, BTPageState *state);
+static SortSupport _bt_prepare_SortSupport(BTWriteState *wstate, int keysz);
+static int _bt_call_comparator(SortSupport sortKeys, int i,
+ IndexTuple itup, IndexTuple itup2, TupleDesc tupdes);
static void _bt_load(BTWriteState *wstate,
BTSpool *btspool, BTSpool *btspool2);
@@ -527,15 +530,120 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
Assert(last_off > P_FIRSTKEY);
ii = PageGetItemId(opage, last_off);
oitup = (IndexTuple) PageGetItem(opage, ii);
- _bt_sortaddtup(npage, ItemIdGetLength(ii), oitup, P_FIRSTKEY);
/*
- * Move 'last' into the high key position on opage
+ * If the item is a posting tuple, we can truncate it, because HIKEY
+ * is not considered real data and does not need to keep any
+ * ItemPointerData at all. And of course it does not need to keep
+ * a list of ipd.
+ * But if it had a big posting list, there will be plenty of
+ * free space on the opage. In that case we must split the posting
+ * tuple into 2 pieces.
*/
- hii = PageGetItemId(opage, P_HIKEY);
- *hii = *ii;
- ItemIdSetUnused(ii); /* redundant */
- ((PageHeader) opage)->pd_lower -= sizeof(ItemIdData);
+ if (BtreeTupleIsPosting(oitup))
+ {
+ IndexTuple keytup;
+ Size keytupsz;
+ int nipd,
+ ntocut,
+ ntoleave;
+
+ nipd = BtreeGetNPosting(oitup);
+ ntocut = (sizeof(ItemIdData) + BtreeGetPostingOffset(oitup))/sizeof(ItemPointerData);
+ ntocut++; /* round up to be sure that we cut enough */
+ ntoleave = nipd - ntocut;
+
+ /*
+ * 0) Form key tuple, that doesn't contain any ipd.
+ * NOTE: key tuple will have blkno & offset suitable for P_HIKEY.
+ * any function that uses keytup should handle them itself.
+ */
+ keytupsz = BtreeGetPostingOffset(oitup);
+ keytup = palloc0(keytupsz);
+ memcpy (keytup, oitup, keytupsz);
+ keytup->t_info &= ~INDEX_SIZE_MASK;
+ keytup->t_info |= keytupsz;
+ ItemPointerSet(&(keytup->t_tid), oblkno, P_HIKEY);
+
+ if (ntocut < nipd)
+ {
+ ItemPointerData *newipd;
+ IndexTuple newitup,
+ newlasttup;
+ /*
+ * 1) Cut part of old tuple to shift to npage.
+ * And insert it as P_FIRSTKEY.
+ * This tuple is based on keytup.
+ * Blkno & offnum are reset in BtreeFormPackedTuple.
+ */
+ newipd = palloc0(sizeof(ItemPointerData)*ntocut);
+ /* Note, that we cut last 'ntocut' items */
+ memcpy(newipd, BtreeGetPosting(oitup)+ntoleave, sizeof(ItemPointerData)*ntocut);
+ newitup = BtreeFormPackedTuple(keytup, newipd, ntocut);
+
+ _bt_sortaddtup(npage, IndexTupleSize(newitup), newitup, P_FIRSTKEY);
+ pfree(newipd);
+ pfree(newitup);
+
+ /*
+ * 2) set last item to the P_HIKEY linp
+ * Move 'last' into the high key position on opage
+ * NOTE: Do this because of indextuple deletion algorithm, which
+ * doesn't allow to delete an item while we have unused one before it.
+ */
+ hii = PageGetItemId(opage, P_HIKEY);
+ *hii = *ii;
+ ItemIdSetUnused(ii); /* redundant */
+ ((PageHeader) opage)->pd_lower -= sizeof(ItemIdData);
+
+ /* 3) delete "wrong" high key, insert keytup as P_HIKEY. */
+ _bt_pgrewritetup(wstate->index, InvalidBuffer, opage, P_HIKEY, keytup);
+
+ /* 4) form the part of old tuple with ntoleave ipds. And insert it as last tuple. */
+ newlasttup = BtreeFormPackedTuple(keytup, BtreeGetPosting(oitup), ntoleave);
+
+ _bt_sortaddtup(opage, IndexTupleSize(newlasttup), newlasttup, PageGetMaxOffsetNumber(opage)+1);
+
+ pfree(newlasttup);
+ }
+ else
+ {
+ /* The tuple isn't big enough to split it. Handle it as a regular tuple. */
+
+ /*
+ * 1) Shift the last tuple to npage.
+ * Insert it as P_FIRSTKEY.
+ */
+ _bt_sortaddtup(npage, ItemIdGetLength(ii), oitup, P_FIRSTKEY);
+
+ /* 2) set last item to the P_HIKEY linp */
+ /* Move 'last' into the high key position on opage */
+ hii = PageGetItemId(opage, P_HIKEY);
+ *hii = *ii;
+ ItemIdSetUnused(ii); /* redundant */
+ ((PageHeader) opage)->pd_lower -= sizeof(ItemIdData);
+
+ /* 3) delete "wrong" high key, insert keytup as P_HIKEY. */
+ _bt_pgrewritetup(wstate->index, InvalidBuffer, opage, P_HIKEY, keytup);
+
+ }
+ pfree(keytup);
+ }
+ else
+ {
+ /*
+ * 1) Shift the last tuple to npage.
+ * Insert it as P_FIRSTKEY.
+ */
+ _bt_sortaddtup(npage, ItemIdGetLength(ii), oitup, P_FIRSTKEY);
+
+ /* 2) set last item to the P_HIKEY linp */
+ /* Move 'last' into the high key position on opage */
+ hii = PageGetItemId(opage, P_HIKEY);
+ *hii = *ii;
+ ItemIdSetUnused(ii); /* redundant */
+ ((PageHeader) opage)->pd_lower -= sizeof(ItemIdData);
+ }
/*
* Link the old page into its parent, using its minimum key. If we
@@ -547,6 +655,7 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
Assert(state->btps_minkey != NULL);
ItemPointerSet(&(state->btps_minkey->t_tid), oblkno, P_HIKEY);
+
_bt_buildadd(wstate, state->btps_next, state->btps_minkey);
pfree(state->btps_minkey);
@@ -554,8 +663,12 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
* Save a copy of the minimum key for the new page. We have to copy
* it off the old page, not the new one, in case we are not at leaf
* level.
+ * We cannot just copy oitup, because it could be a posting tuple,
+ * so it is safer to fetch the newly inserted hikey instead.
*/
- state->btps_minkey = CopyIndexTuple(oitup);
+ ItemId iihk = PageGetItemId(opage, P_HIKEY);
+ IndexTuple hikey = (IndexTuple) PageGetItem(opage, iihk);
+ state->btps_minkey = CopyIndexTuple(hikey);
/*
* Set the sibling links for both pages.
@@ -590,7 +703,29 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
if (last_off == P_HIKEY)
{
Assert(state->btps_minkey == NULL);
- state->btps_minkey = CopyIndexTuple(itup);
+
+ if (BtreeTupleIsPosting(itup))
+ {
+ Size keytupsz;
+ IndexTuple keytup;
+
+ /*
+ * 0) Form key tuple, that doesn't contain any ipd.
+ * NOTE: key tuple will have blkno & offset suitable for P_HIKEY.
+ * any function that uses keytup should handle them itself.
+ */
+ keytupsz = BtreeGetPostingOffset(itup);
+ keytup = palloc0(keytupsz);
+ memcpy (keytup, itup, keytupsz);
+
+ keytup->t_info &= ~INDEX_SIZE_MASK;
+ keytup->t_info |= keytupsz;
+ ItemPointerSet(&(keytup->t_tid), nblkno, P_HIKEY);
+
+ state->btps_minkey = CopyIndexTuple(keytup);
+ }
+ else
+ state->btps_minkey = CopyIndexTuple(itup);
}
/*
@@ -670,6 +805,71 @@ _bt_uppershutdown(BTWriteState *wstate, BTPageState *state)
}
/*
+ * Prepare SortSupport structure for indextuples comparison
+ */
+static SortSupport
+_bt_prepare_SortSupport(BTWriteState *wstate, int keysz)
+{
+ ScanKey indexScanKey;
+ SortSupport sortKeys;
+ int i;
+
+ /* Prepare SortSupport data for each column */
+ indexScanKey = _bt_mkscankey_nodata(wstate->index);
+ sortKeys = (SortSupport) palloc0(keysz * sizeof(SortSupportData));
+
+ for (i = 0; i < keysz; i++)
+ {
+ SortSupport sortKey = sortKeys + i;
+ ScanKey scanKey = indexScanKey + i;
+ int16 strategy;
+
+ sortKey->ssup_cxt = CurrentMemoryContext;
+ sortKey->ssup_collation = scanKey->sk_collation;
+ sortKey->ssup_nulls_first =
+ (scanKey->sk_flags & SK_BT_NULLS_FIRST) != 0;
+ sortKey->ssup_attno = scanKey->sk_attno;
+ /* Abbreviation is not supported here */
+ sortKey->abbreviate = false;
+
+ AssertState(sortKey->ssup_attno != 0);
+
+ strategy = (scanKey->sk_flags & SK_BT_DESC) != 0 ?
+ BTGreaterStrategyNumber : BTLessStrategyNumber;
+
+ PrepareSortSupportFromIndexRel(wstate->index, strategy, sortKey);
+ }
+
+ _bt_freeskey(indexScanKey);
+ return sortKeys;
+}
+
+/*
+ * Compare two tuples using sortKey on attribute i
+ */
+static int
+_bt_call_comparator(SortSupport sortKeys, int i,
+ IndexTuple itup, IndexTuple itup2, TupleDesc tupdes)
+{
+ SortSupport entry;
+ Datum attrDatum1,
+ attrDatum2;
+ bool isNull1,
+ isNull2;
+ int32 compare;
+
+ entry = sortKeys + i - 1;
+ attrDatum1 = index_getattr(itup, i, tupdes, &isNull1);
+ attrDatum2 = index_getattr(itup2, i, tupdes, &isNull2);
+
+ compare = ApplySortComparator(attrDatum1, isNull1,
+ attrDatum2, isNull2,
+ entry);
+
+ return compare;
+}
+
+/*
* Read tuples in correct sort order from tuplesort, and load them into
* btree leaves.
*/
@@ -679,16 +879,20 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
BTPageState *state = NULL;
bool merge = (btspool2 != NULL);
IndexTuple itup,
- itup2 = NULL;
+ itup2 = NULL,
+ itupprev = NULL;
bool should_free,
should_free2,
load1;
TupleDesc tupdes = RelationGetDescr(wstate->index);
int i,
keysz = RelationGetNumberOfAttributes(wstate->index);
- ScanKey indexScanKey = NULL;
+ int ntuples = 0;
SortSupport sortKeys;
+ /* Prepare SortSupport structure for indextuples comparison */
+ sortKeys = (SortSupport)_bt_prepare_SortSupport(wstate, keysz);
+
if (merge)
{
/*
@@ -701,34 +905,6 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
true, &should_free);
itup2 = tuplesort_getindextuple(btspool2->sortstate,
true, &should_free2);
- indexScanKey = _bt_mkscankey_nodata(wstate->index);
-
- /* Prepare SortSupport data for each column */
- sortKeys = (SortSupport) palloc0(keysz * sizeof(SortSupportData));
-
- for (i = 0; i < keysz; i++)
- {
- SortSupport sortKey = sortKeys + i;
- ScanKey scanKey = indexScanKey + i;
- int16 strategy;
-
- sortKey->ssup_cxt = CurrentMemoryContext;
- sortKey->ssup_collation = scanKey->sk_collation;
- sortKey->ssup_nulls_first =
- (scanKey->sk_flags & SK_BT_NULLS_FIRST) != 0;
- sortKey->ssup_attno = scanKey->sk_attno;
- /* Abbreviation is not supported here */
- sortKey->abbreviate = false;
-
- AssertState(sortKey->ssup_attno != 0);
-
- strategy = (scanKey->sk_flags & SK_BT_DESC) != 0 ?
- BTGreaterStrategyNumber : BTLessStrategyNumber;
-
- PrepareSortSupportFromIndexRel(wstate->index, strategy, sortKey);
- }
-
- _bt_freeskey(indexScanKey);
for (;;)
{
@@ -742,20 +918,8 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
{
for (i = 1; i <= keysz; i++)
{
- SortSupport entry;
- Datum attrDatum1,
- attrDatum2;
- bool isNull1,
- isNull2;
- int32 compare;
-
- entry = sortKeys + i - 1;
- attrDatum1 = index_getattr(itup, i, tupdes, &isNull1);
- attrDatum2 = index_getattr(itup2, i, tupdes, &isNull2);
-
- compare = ApplySortComparator(attrDatum1, isNull1,
- attrDatum2, isNull2,
- entry);
+ int32 compare = _bt_call_comparator(sortKeys, i, itup, itup2, tupdes);
+
if (compare > 0)
{
load1 = false;
@@ -794,16 +958,123 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
else
{
/* merge is unnecessary */
- while ((itup = tuplesort_getindextuple(btspool->sortstate,
+ Relation indexRelation = wstate->index;
+ Form_pg_index index = indexRelation->rd_index;
+
+ if (IsSystemRelation(indexRelation) || index->indisunique)
+ {
+ /* Do not use compression. */
+ while ((itup = tuplesort_getindextuple(btspool->sortstate,
true, &should_free)) != NULL)
+ {
+ /* When we see first tuple, create first index page */
+ if (state == NULL)
+ state = _bt_pagestate(wstate, 0);
+
+ _bt_buildadd(wstate, state, itup);
+ if (should_free)
+ pfree(itup);
+ }
+ }
+ else
{
- /* When we see first tuple, create first index page */
- if (state == NULL)
- state = _bt_pagestate(wstate, 0);
+ ItemPointerData *ipd = NULL;
+ IndexTuple postingtuple;
+ Size maxitemsize = 0,
+ maxpostingsize = 0;
- _bt_buildadd(wstate, state, itup);
- if (should_free)
- pfree(itup);
+ while ((itup = tuplesort_getindextuple(btspool->sortstate,
+ true, &should_free)) != NULL)
+ {
+ /* When we see first tuple, create first index page */
+ if (state == NULL)
+ {
+ state = _bt_pagestate(wstate, 0);
+ maxitemsize = BTMaxItemSize(state->btps_page);
+ }
+
+ /*
+ * Compare current tuple with previous one.
+ * If tuples are equal, we can unite them into a posting list.
+ */
+ if (itupprev != NULL)
+ {
+ if (_bt_isbinaryequal(tupdes, itupprev, index->indnatts, itup))
+ {
+ /* Tuples are equal. Create or update posting */
+ if (ntuples == 0)
+ {
+ /*
+ * We don't have a suitable posting list yet, so allocate
+ * it and save both itupprev and current tuple.
+ */
+ ipd = palloc0(maxitemsize);
+
+ memcpy(ipd, itupprev, sizeof(ItemPointerData));
+ ntuples++;
+ memcpy(ipd + ntuples, itup, sizeof(ItemPointerData));
+ ntuples++;
+ }
+ else
+ {
+ if ((ntuples+1)*sizeof(ItemPointerData) < maxpostingsize)
+ {
+ memcpy(ipd + ntuples, itup, sizeof(ItemPointerData));
+ ntuples++;
+ }
+ else
+ {
+ postingtuple = BtreeFormPackedTuple(itupprev, ipd, ntuples);
+ _bt_buildadd(wstate, state, postingtuple);
+ ntuples = 0;
+ pfree(ipd);
+ }
+ }
+
+ }
+ else
+ {
+ /* Tuples are not equal. Insert itupprev into index. */
+ if (ntuples == 0)
+ _bt_buildadd(wstate, state, itupprev);
+ else
+ {
+ postingtuple = BtreeFormPackedTuple(itupprev, ipd, ntuples);
+ _bt_buildadd(wstate, state, postingtuple);
+ ntuples = 0;
+ pfree(ipd);
+ }
+ }
+ }
+
+ /*
+ * Copy the tuple into temp variable itupprev
+ * to compare it with the following tuple
+ * and maybe unite them into a posting tuple
+ */
+ itupprev = CopyIndexTuple(itup);
+ if (should_free)
+ pfree(itup);
+
+ /* compute max size of ipd list */
+ maxpostingsize = maxitemsize - IndexInfoFindDataOffset(itupprev->t_info) - MAXALIGN(IndexTupleSize(itupprev));
+ }
+
+ /* Handle the last item.*/
+ if (ntuples == 0)
+ {
+ if (itupprev != NULL)
+ _bt_buildadd(wstate, state, itupprev);
+ }
+ else
+ {
+ Assert(ipd!=NULL);
+ Assert(itupprev != NULL);
+ postingtuple = BtreeFormPackedTuple(itupprev, ipd, ntuples);
+ _bt_buildadd(wstate, state, postingtuple);
+ ntuples = 0;
+ pfree(ipd);
+ }
}
}
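For readers skimming the hunk above: the non-unique branch of _bt_load makes one pass over the sorted tuples, remembers the previous one, and folds a run of equal keys into a single posting tuple, flushing when the key changes or the list is full. Below is a minimal, simplified sketch of that control flow, not the patch's code. Keys and TIDs are plain ints, and Entry, MAX_POSTING, emit and build_postings are hypothetical names introduced only for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins: one sorted (key, tid) pair and a cap on list length. */
typedef struct { int key; int tid; } Entry;
#define MAX_POSTING 4
#define MAX_EMITTED 16

static int emitted_keys[MAX_EMITTED];
static int emitted_counts[MAX_EMITTED];
static int nemitted = 0;

/* Record one output tuple: a key plus the number of TIDs stored with it. */
static void emit(int key, const int *tids, int ntids)
{
    (void) tids;                    /* a real build would copy the TIDs out */
    assert(nemitted < MAX_EMITTED);
    emitted_keys[nemitted] = key;
    emitted_counts[nemitted] = ntids;
    nemitted++;
}

/* Fold runs of equal keys in already-sorted input into posting lists. */
static void build_postings(const Entry *in, size_t n)
{
    int tids[MAX_POSTING];
    int ntids = 0;
    int curkey = 0;

    for (size_t i = 0; i < n; i++)
    {
        /* Flush when the key changes or the pending list is full. */
        if (ntids > 0 && (in[i].key != curkey || ntids == MAX_POSTING))
        {
            emit(curkey, tids, ntids);
            ntids = 0;
        }
        curkey = in[i].key;
        tids[ntids++] = in[i].tid;
    }

    /* Handle the last pending run, if any. */
    if (ntids > 0)
        emit(curkey, tids, ntids);
}
```

Feeding it the keys 1, 1, 2, 3, 3, 3, 3, 3 with MAX_POSTING set to 4 should give four output tuples holding 2, 1, 4 and 1 TIDs respectively.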
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index b714b2c..53fcbcc 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -1814,7 +1814,9 @@ _bt_killitems(IndexScanDesc scan)
ItemId iid = PageGetItemId(page, offnum);
IndexTuple ituple = (IndexTuple) PageGetItem(page, iid);
- if (ItemPointerEquals(&ituple->t_tid, &kitem->heapTid))
+ /* No microvacuum for posting tuples */
+ if (!BtreeTupleIsPosting(ituple)
+ && (ItemPointerEquals(&ituple->t_tid, &kitem->heapTid)))
{
/* found the item */
ItemIdMarkDead(iid);
@@ -2056,3 +2058,69 @@ btoptions(Datum reloptions, bool validate)
{
return default_reloptions(reloptions, validate, RELOPT_KIND_BTREE);
}
+
+/*
+ * Already have basic index tuple that contains key datum
+ */
+IndexTuple
+BtreeFormPackedTuple(IndexTuple tuple, ItemPointerData *data, int nipd)
+{
+ uint32 newsize;
+ IndexTuple itup = CopyIndexTuple(tuple);
+
+ /*
+ * Determine and store offset to the posting list.
+ */
+ newsize = IndexTupleSize(itup);
+ newsize = SHORTALIGN(newsize);
+
+ /*
+ * Set meta info about the posting list.
+ */
+ BtreeSetPostingOffset(itup, newsize);
+ BtreeSetNPosting(itup, nipd);
+ /*
+ * Add space needed for posting list, if any. Then check that the tuple
+ * won't be too big to store.
+ */
+ newsize += sizeof(ItemPointerData)*nipd;
+ newsize = MAXALIGN(newsize);
+
+ /*
+ * Resize tuple if needed
+ */
+ if (newsize != IndexTupleSize(itup))
+ {
+ itup = repalloc(itup, newsize);
+
+ /*
+ * PostgreSQL 9.3 and earlier did not clear this new space, so we
+ * might find uninitialized padding when reading tuples from disk.
+ */
+ memset((char *) itup + IndexTupleSize(itup),
+ 0, newsize - IndexTupleSize(itup));
+ /* set new size in tuple header */
+ itup->t_info &= ~INDEX_SIZE_MASK;
+ itup->t_info |= newsize;
+ }
+
+ /*
+ * Copy data into the posting tuple
+ */
+ memcpy(BtreeGetPosting(itup), data, sizeof(ItemPointerData)*nipd);
+ return itup;
+}
+
+IndexTuple
+BtreeReformPackedTuple(IndexTuple tuple, ItemPointerData *data, int nipd)
+{
+ int size;
+ if (BtreeTupleIsPosting(tuple))
+ {
+ size = BtreeGetPostingOffset(tuple);
+ tuple->t_info &= ~INDEX_SIZE_MASK;
+ tuple->t_info |= size;
+ }
+
+ return BtreeFormPackedTuple(tuple, data, nipd);
+}
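The size arithmetic in BtreeFormPackedTuple above can be checked by hand: the posting list starts at the SHORTALIGN'ed end of the key tuple, and the whole tuple is then MAXALIGN'ed. A small sketch of that calculation follows. The 6-byte TID size matches sizeof(ItemPointerData); the alignment values (2 and 8) are assumptions for a typical 64-bit build, and align_up, posting_offset and packed_tuple_size are hypothetical helper names, not PostgreSQL functions.

```c
#include <assert.h>
#include <stddef.h>

#define TID_SIZE        6u  /* sizeof(ItemPointerData) */
#define SHORT_ALIGNMENT 2u  /* assumed alignment behind SHORTALIGN */
#define MAX_ALIGNMENT   8u  /* assumed MAXIMUM_ALIGNOF on a 64-bit build */

/* Round len up to the next multiple of a power-of-two alignment. */
static size_t align_up(size_t len, size_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (len + alignment - 1) & ~(alignment - 1);
}

/* Byte offset of the posting list inside the packed tuple. */
static size_t posting_offset(size_t key_tuple_size)
{
    return align_up(key_tuple_size, SHORT_ALIGNMENT);
}

/* Total size of a packed tuple that carries nipd heap pointers. */
static size_t packed_tuple_size(size_t key_tuple_size, size_t nipd)
{
    return align_up(posting_offset(key_tuple_size) + TID_SIZE * nipd,
                    MAX_ALIGNMENT);
}
```

Under these assumptions, a 16-byte key tuple with three duplicates packs into 40 bytes, against 48 bytes for three separate tuples (line pointers not counted).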
diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index 0d094ca..6ced76c 100644
--- a/src/backend/access/nbtree/nbtxlog.c
+++ b/src/backend/access/nbtree/nbtxlog.c
@@ -475,14 +475,40 @@ btree_xlog_vacuum(XLogReaderState *record)
if (len > 0)
{
- OffsetNumber *unused;
- OffsetNumber *unend;
+ OffsetNumber *offset;
+ IndexTuple remaining;
+ int i;
+ Size itemsz;
- unused = (OffsetNumber *) ptr;
- unend = (OffsetNumber *) ((char *) ptr + len);
+ offset = (OffsetNumber *) ptr;
+ remaining = (IndexTuple)(ptr + xlrec->nremaining*sizeof(OffsetNumber));
+
+ /* Handle posting tuples */
+ for(i = 0; i < xlrec->nremaining; i++)
+ {
+ PageIndexTupleDelete(page, offset[i]);
+
+ itemsz = IndexTupleSize(remaining);
+ itemsz = MAXALIGN(itemsz);
+
+ if (PageAddItem(page, (Item) remaining, itemsz, offset[i],
+ false, false) == InvalidOffsetNumber)
+ elog(PANIC, "btree_xlog_vacuum: failed to add remaining item");
- if ((unend - unused) > 0)
- PageIndexMultiDelete(page, unused, unend - unused);
+ remaining = (IndexTuple)((char*) remaining + itemsz);
+ }
+
+ if (xlrec->ndeleted > 0)
+ {
+ OffsetNumber *unused;
+ OffsetNumber *unend;
+
+ unused = (OffsetNumber *) ((char *)remaining);
+ unend = (OffsetNumber *) ((char *) ptr + len);
+
+ if ((unend - unused) > 0)
+ PageIndexMultiDelete(page, unused, unend - unused);
+ }
}
/*
@@ -713,6 +739,75 @@ btree_xlog_delete(XLogReaderState *record)
UnlockReleaseBuffer(buffer);
}
+/*
+ * Applies changes performed by _bt_pgupdtup().
+ * TODO Add some stuff for inner pages. Not sure if we really need it.
+ * See comment in _bt_pgupdtup().
+ */
+static void
+btree_xlog_update(bool isleaf, XLogReaderState *record)
+{
+ XLogRecPtr lsn = record->EndRecPtr;
+ xl_btree_insert *xlrec = (xl_btree_insert *) XLogRecGetData(record);
+ Buffer buffer;
+ Page page;
+
+ if (XLogReadBufferForRedo(record, 0, &buffer) == BLK_NEEDS_REDO)
+ {
+ Size datalen;
+ char *datapos = XLogRecGetBlockData(record, 0, &datalen);
+ ItemPointerData *ipd;
+ IndexTuple olditup,
+ newitup;
+ Size newitupsz;
+ int nipd;
+
+ /*TODO Following code needs some refactoring. Maybe one more function.*/
+ page = BufferGetPage(buffer);
+
+ olditup = (IndexTuple) PageGetItem(page, PageGetItemId(page, xlrec->offnum));
+
+ if (!BtreeTupleIsPosting(olditup))
+ nipd = 1;
+ else
+ nipd = BtreeGetNPosting(olditup);
+
+ ipd = palloc0(sizeof(ItemPointerData)*(nipd + 1));
+
+ /* copy item pointers from old tuple into ipd */
+ if (BtreeTupleIsPosting(olditup))
+ memcpy(ipd, BtreeGetPosting(olditup), sizeof(ItemPointerData)*nipd);
+ else
+ memcpy(ipd, olditup, sizeof(ItemPointerData));
+
+ /* add item pointer of the new tuple into ipd */
+ memcpy(ipd+nipd, (Item) datapos, sizeof(ItemPointerData));
+
+ newitup = BtreeReformPackedTuple((Item) datapos, ipd, nipd+1);
+
+ /*
+ * Update the tuple in place. We have already checked that the
+ * new tuple would fit into this page, so it's safe to delete
+ * old tuple and insert the new one without any side effects.
+ */
+ newitupsz = IndexTupleDSize(*newitup);
+ newitupsz = MAXALIGN(newitupsz);
+
+ PageIndexTupleDelete(page, xlrec->offnum);
+
+ if (PageAddItem(page, (Item) newitup, newitupsz, xlrec->offnum,
+ false, false) == InvalidOffsetNumber)
+ elog(PANIC, "failed to update compressed tuple while doing recovery");
+
+
+ PageSetLSN(page, lsn);
+ MarkBufferDirty(buffer);
+ }
+
+ if (BufferIsValid(buffer))
+ UnlockReleaseBuffer(buffer);
+}
+
static void
btree_xlog_mark_page_halfdead(uint8 info, XLogReaderState *record)
{
@@ -988,6 +1083,9 @@ btree_redo(XLogReaderState *record)
case XLOG_BTREE_INSERT_META:
btree_xlog_insert(false, true, record);
break;
+ case XLOG_BTREE_UPDATE_TUPLE:
+ btree_xlog_update(true, record);
+ break;
case XLOG_BTREE_SPLIT_L:
btree_xlog_split(true, false, record);
break;
diff --git a/src/include/access/itup.h b/src/include/access/itup.h
index 8350fa0..3dd19c0 100644
--- a/src/include/access/itup.h
+++ b/src/include/access/itup.h
@@ -138,7 +138,6 @@ typedef IndexAttributeBitMapData *IndexAttributeBitMap;
((int) ((BLCKSZ - SizeOfPageHeaderData) / \
(MAXALIGN(sizeof(IndexTupleData) + 1) + sizeof(ItemIdData))))
-
/* routines in indextuple.c */
extern IndexTuple index_form_tuple(TupleDesc tupleDescriptor,
Datum *values, bool *isnull);
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 9046b16..5496e94 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -122,6 +122,15 @@ typedef struct BTMetaPageData
MAXALIGN(sizeof(BTPageOpaqueData))) / 3)
/*
+ * If compression is applied, the page could contain more tuples
+ * than if it has only uncompressed tuples, so we need new max value.
+ * Note that it is a rough upper estimate.
+ */
+#define MaxPackedIndexTuplesPerPage \
+ ((int) ((BLCKSZ - SizeOfPageHeaderData) / \
+ (sizeof(ItemPointerData))))
+
+/*
* The leaf-page fillfactor defaults to 90% but is user-adjustable.
* For pages above the leaf level, we use a fixed 70% fillfactor.
* The fillfactor is applied during index build and when splitting
@@ -226,6 +235,7 @@ typedef struct BTMetaPageData
* vacuum */
#define XLOG_BTREE_REUSE_PAGE 0xD0 /* old page is about to be reused from
* FSM */
+#define XLOG_BTREE_UPDATE_TUPLE 0xE0 /* update index tuple in place */
/*
* All that we need to regenerate the meta-data page
@@ -348,15 +358,31 @@ typedef struct xl_btree_reuse_page
*
* Note that the *last* WAL record in any vacuum of an index is allowed to
* have a zero length array of offsets. Earlier records must have at least one.
+ * TODO: update this comment
*/
typedef struct xl_btree_vacuum
{
BlockNumber lastBlockVacuumed;
- /* TARGET OFFSET NUMBERS FOLLOW */
+ /*
+ * This field helps us to find the beginning of the remaining tuples,
+ * which follow the array of offset numbers.
+ */
+ int nremaining;
+
+ /*
+ * TODO: Not sure if we need the following variable;
+ * maybe just a flag would be enough to determine
+ * if there is some data about deleted tuples
+ */
+ int ndeleted;
+
+ /* REMAINING OFFSET NUMBERS FOLLOW (nremaining values) */
+ /* REMAINING TUPLES TO INSERT FOLLOW (if nremaining > 0) */
+ /* TARGET OFFSET NUMBERS FOLLOW (if any) */
} xl_btree_vacuum;
-#define SizeOfBtreeVacuum (offsetof(xl_btree_vacuum, lastBlockVacuumed) + sizeof(BlockNumber))
+#define SizeOfBtreeVacuum (offsetof(xl_btree_vacuum, lastBlockVacuumed) + sizeof(BlockNumber) + 2*sizeof(int))
/*
* This is what we need to know about marking an empty branch for deletion.
@@ -538,6 +564,8 @@ typedef struct BTScanPosData
* location in the associated tuple storage workspace.
*/
int nextTupleOffset;
+ /* prevTupleOffset is for Posting list handling*/
+ int prevTupleOffset;
/*
* The items array is always ordered in index order (ie, increasing
@@ -550,7 +578,7 @@ typedef struct BTScanPosData
int lastItem; /* last valid index in items[] */
int itemIndex; /* current index in items[] */
- BTScanPosItem items[MaxIndexTuplesPerPage]; /* MUST BE LAST */
+ BTScanPosItem items[MaxPackedIndexTuplesPerPage]; /* MUST BE LAST */
} BTScanPosData;
typedef BTScanPosData *BTScanPos;
@@ -650,6 +678,27 @@ typedef BTScanOpaqueData *BTScanOpaque;
#define SK_BT_DESC (INDOPTION_DESC << SK_BT_INDOPTION_SHIFT)
#define SK_BT_NULLS_FIRST (INDOPTION_NULLS_FIRST << SK_BT_INDOPTION_SHIFT)
+
+/*
+ * We use our own ItemPointerGet(BlockNumber|OffsetNumber)
+ * to avoid Asserts, since sometimes the ip_posid isn't "valid"
+ */
+#define BtreeItemPointerGetBlockNumber(pointer) \
+ BlockIdGetBlockNumber(&(pointer)->ip_blkid)
+
+#define BtreeItemPointerGetOffsetNumber(pointer) \
+ ((pointer)->ip_posid)
+
+#define BT_POSTING (1<<31)
+#define BtreeGetNPosting(itup) BtreeItemPointerGetOffsetNumber(&(itup)->t_tid)
+#define BtreeSetNPosting(itup,n) ItemPointerSetOffsetNumber(&(itup)->t_tid,n)
+
+#define BtreeGetPostingOffset(itup) (BtreeItemPointerGetBlockNumber(&(itup)->t_tid) & (~BT_POSTING))
+#define BtreeSetPostingOffset(itup,n) ItemPointerSetBlockNumber(&(itup)->t_tid,(n)|BT_POSTING)
+#define BtreeTupleIsPosting(itup) (BtreeItemPointerGetBlockNumber(&(itup)->t_tid) & BT_POSTING)
+#define BtreeGetPosting(itup) (ItemPointerData*) ((char*)(itup) + BtreeGetPostingOffset(itup))
+#define BtreeGetPostingN(itup,n) (ItemPointerData*) (BtreeGetPosting(itup) + n)
+
/*
* prototypes for functions in nbtree.c (external entry points for btree)
*/
@@ -683,6 +732,10 @@ extern bool _bt_doinsert(Relation rel, IndexTuple itup,
IndexUniqueCheck checkUnique, Relation heapRel);
extern Buffer _bt_getstackbuf(Relation rel, BTStack stack, int access);
extern void _bt_finish_split(Relation rel, Buffer bbuf, BTStack stack);
+extern void _bt_pgupdtup(Relation rel, Buffer buf, OffsetNumber offset, IndexTuple itup,
+ IndexTuple olditup, int nipd);
+extern void _bt_pgrewritetup(Relation rel, Buffer buf, Page page, OffsetNumber offset, IndexTuple itup);
+extern bool _bt_isbinaryequal(TupleDesc itupdesc, IndexTuple itup, int nindatts, IndexTuple ituptoinsert);
/*
* prototypes for functions in nbtpage.c
@@ -702,6 +755,8 @@ extern void _bt_delitems_delete(Relation rel, Buffer buf,
OffsetNumber *itemnos, int nitems, Relation heapRel);
extern void _bt_delitems_vacuum(Relation rel, Buffer buf,
OffsetNumber *itemnos, int nitems,
+ OffsetNumber *remainingoffset,
+ IndexTuple *remaining, int nremaining,
BlockNumber lastBlockVacuumed);
extern int _bt_pagedel(Relation rel, Buffer buf);
@@ -714,8 +769,8 @@ extern BTStack _bt_search(Relation rel,
extern Buffer _bt_moveright(Relation rel, Buffer buf, int keysz,
ScanKey scankey, bool nextkey, bool forupdate, BTStack stack,
int access);
-extern OffsetNumber _bt_binsrch(Relation rel, Buffer buf, int keysz,
- ScanKey scankey, bool nextkey);
+extern OffsetNumber _bt_binsrch( Relation rel, Buffer buf, int keysz,
+ ScanKey scankey, bool nextkey);
extern int32 _bt_compare(Relation rel, int keysz, ScanKey scankey,
Page page, OffsetNumber offnum);
extern bool _bt_first(IndexScanDesc scan, ScanDirection dir);
@@ -746,6 +801,8 @@ extern void _bt_end_vacuum_callback(int code, Datum arg);
extern Size BTreeShmemSize(void);
extern void BTreeShmemInit(void);
extern bytea *btoptions(Datum reloptions, bool validate);
+extern IndexTuple BtreeFormPackedTuple(IndexTuple tuple, ItemPointerData *data, int nipd);
+extern IndexTuple BtreeReformPackedTuple(IndexTuple tuple, ItemPointerData *data, int nipd);
/*
* prototypes for functions in nbtvalidate.c
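The BT_POSTING macros in nbtree.h above reuse the tuple's t_tid: the block-number half holds the byte offset of the posting list with the high bit set as a flag, and the offset-number half holds the number of TIDs. A standalone sketch of that encoding follows. FakeTid and the helper functions are hypothetical stand-ins, not PostgreSQL types. The sketch builds the flag from an unsigned constant, since shifting a signed 1 into bit 31 (as the patch's `(1<<31)` does) is not portable C.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the two halves of t_tid. */
typedef struct
{
    uint32_t block;     /* normally a heap block number */
    uint16_t offset;    /* normally a heap offset number */
} FakeTid;

#define POSTING_FLAG (UINT32_C(1) << 31)

/* Mark the tuple as a posting tuple and store its metadata in the TID. */
static void set_posting(FakeTid *tid, uint32_t posting_offset, uint16_t nposting)
{
    assert((posting_offset & POSTING_FLAG) == 0);   /* flag bit must be free */
    tid->block = posting_offset | POSTING_FLAG;
    tid->offset = nposting;
}

static int is_posting(const FakeTid *tid)
{
    return (tid->block & POSTING_FLAG) != 0;
}

static uint32_t get_posting_offset(const FakeTid *tid)
{
    return tid->block & ~POSTING_FLAG;
}

static uint16_t get_nposting(const FakeTid *tid)
{
    return tid->offset;
}
```

One consequence of this layout is that the flag bit is taken out of the block-number range, so an ordinary tuple pointing at a heap block with bit 31 set would look like a posting tuple.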
On 18.03.2016 20:19, Anastasia Lubennikova wrote:
Please, find the new version of the patch attached. Now it has WAL
functionality.
Detailed description of the feature you can find in README draft
https://goo.gl/50O8Q0
This patch is pretty complicated, so I ask everyone who is interested in
this feature
to help with reviewing and testing it. I will be grateful for any
feedback.
But please, don't complain about code style, it is still work in
progress.
Next things I'm going to do:
1. More debugging and testing. I'm going to attach in next message
couple of sql scripts for testing.
2. Fix NULLs processing
3. Add a flag into pg_index, that allows to enable/disable compression
for each particular index.
4. Recheck locking considerations. I tried to write the code to be as
non-invasive as possible, but we need to make sure that algorithm is still
correct.
5. Change BTMaxItemSize
6. Bring back microvacuum functionality.
Hi, hackers.
This is my first review, so please don't be too strict with me.
I have tested this patch on the following table:
create table message
(
id serial,
usr_id integer,
text text
);
CREATE INDEX message_usr_id ON message (usr_id);
The table has 10000000 records.
I found the following:
The fewer unique keys there are, the smaller the index.
The next two tables demonstrate it.
New B-tree
Count of unique keys (usr_id) | Index size | Time of creation
10000000                      | 214 MB     | 00:00:34.193441
3333333                       | 214 MB     | 00:00:45.731173
2000000                       | 129 MB     | 00:00:41.445876
1000000                       | 129 MB     | 00:00:38.455616
100000                        | 86 MB      | 00:00:40.887626
10000                         | 79 MB      | 00:00:47.199774
Old B-tree
Count of unique keys (usr_id) | Index size | Time of creation
10000000                      | 214 MB     | 00:00:35.043677
3333333                       | 286 MB     | 00:00:40.922845
2000000                       | 300 MB     | 00:00:46.454846
1000000                       | 278 MB     | 00:00:42.323525
100000                        | 287 MB     | 00:00:47.438132
10000                         | 280 MB     | 00:01:00.307873
I inserted data both randomly and sequentially; it did not influence the
index's size.
The time to select, insert, and update random rows did not change. That is
great, but it certainly needs more detailed study.
Alexander Popov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
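To put the index sizes reported above in relative terms, the saving can be computed as (old - new) / old. The helper below is only a sketch of that arithmetic; size_reduction_percent is a hypothetical name, and the inputs are the MB figures from the two tables.

```c
#include <assert.h>

/* Percentage by which the new index is smaller than the old one. */
static double size_reduction_percent(double old_mb, double new_mb)
{
    assert(old_mb > 0.0);
    return 100.0 * (old_mb - new_mb) / old_mb;
}
```

With the reported figures, 10000 unique keys (280 MB old, 79 MB new) gives a reduction of roughly 72%, while 10000000 unique keys (214 MB in both cases) gives no reduction.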
On Fri, Mar 18, 2016 at 1:19 PM, Anastasia Lubennikova
<a.lubennikova@postgrespro.ru> wrote:
Please, find the new version of the patch attached. Now it has WAL
functionality.
Detailed description of the feature you can find in README draft
https://goo.gl/50O8Q0
This patch is pretty complicated, so I ask everyone who is interested in this
feature
to help with reviewing and testing it. I will be grateful for any feedback.
But please, don't complain about code style, it is still work in progress.
Next things I'm going to do:
1. More debugging and testing. I'm going to attach in next message couple of
sql scripts for testing.
2. Fix NULLs processing
3. Add a flag into pg_index, that allows to enable/disable compression for
each particular index.
4. Recheck locking considerations. I tried to write the code to be as
non-invasive as possible, but we need to make sure that algorithm is still
correct.
5. Change BTMaxItemSize
6. Bring back microvacuum functionality.
I really like this idea, and the performance results seem impressive,
but I think we should push this out to 9.7. A btree patch that didn't
have WAL support until two and a half weeks into the final CommitFest
just doesn't seem to me like a good candidate. First, as a general
matter, if a patch isn't code-complete at the start of a CommitFest,
it's reasonable to say that it should be reviewed but not necessarily
committed in that CommitFest. This patch has had some review, but I'm
not sure how deep that review is, and I think it's had no code review
at all of the WAL logging changes, which were submitted only a week
ago, well after the CF deadline. Second, the btree AM is a
particularly poor place to introduce possibly destabilizing changes.
Everybody depends on it, all the time, for everything. And despite
new tools like amcheck, it's not a particularly easy thing to debug.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Thu, Mar 24, 2016 at 5:17 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Fri, Mar 18, 2016 at 1:19 PM, Anastasia Lubennikova
<a.lubennikova@postgrespro.ru> wrote:
Please, find the new version of the patch attached. Now it has WAL
functionality.
Detailed description of the feature you can find in README draft
https://goo.gl/50O8Q0
This patch is pretty complicated, so I ask everyone who is interested in
this feature
to help with reviewing and testing it. I will be grateful for any feedback.
But please, don't complain about code style, it is still work in
progress.
Next things I'm going to do:
1. More debugging and testing. I'm going to attach in next message couple of
sql scripts for testing.
2. Fix NULLs processing
3. Add a flag into pg_index, that allows to enable/disable compression for
each particular index.
4. Recheck locking considerations. I tried to write the code to be as
non-invasive as possible, but we need to make sure that algorithm is still
correct.
5. Change BTMaxItemSize
6. Bring back microvacuum functionality.
I really like this idea, and the performance results seem impressive,
but I think we should push this out to 9.7. A btree patch that didn't
have WAL support until two and a half weeks into the final CommitFest
just doesn't seem to me like a good candidate. First, as a general
matter, if a patch isn't code-complete at the start of a CommitFest,
it's reasonable to say that it should be reviewed but not necessarily
committed in that CommitFest. This patch has had some review, but I'm
not sure how deep that review is, and I think it's had no code review
at all of the WAL logging changes, which were submitted only a week
ago, well after the CF deadline. Second, the btree AM is a
particularly poor place to introduce possibly destabilizing changes.
Everybody depends on it, all the time, for everything. And despite
new tools like amcheck, it's not a particularly easy thing to debug.
It's all true. But:
1) It's a great feature many users dream about.
2) Patch is not very big.
3) Patch doesn't introduce significant infrastructural changes. It just
changes some well-isolated places.
Let's give it a chance. I've signed up as an additional reviewer and I'll do my
best to spot all possible issues in this patch.
------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On Thu, Mar 24, 2016 at 7:17 AM, Robert Haas <robertmhaas@gmail.com> wrote:
I really like this idea, and the performance results seem impressive,
but I think we should push this out to 9.7. A btree patch that didn't
have WAL support until two and a half weeks into the final CommitFest
just doesn't seem to me like a good candidate. First, as a general
matter, if a patch isn't code-complete at the start of a CommitFest,
it's reasonable to say that it should be reviewed but not necessarily
committed in that CommitFest. This patch has had some review, but I'm
not sure how deep that review is, and I think it's had no code review
at all of the WAL logging changes, which were submitted only a week
ago, well after the CF deadline. Second, the btree AM is a
particularly poor place to introduce possibly destabilizing changes.
Everybody depends on it, all the time, for everything. And despite
new tools like amcheck, it's not a particularly easy thing to debug.
Regrettably, I must agree. I don't see a plausible path to commit for
this patch in the ongoing CF.
I think that Anastasia did an excellent job here, and I wish I could
have been of greater help sooner. Nevertheless, it would be unwise to
commit this given the maturity of the code. There have been very few
instances of performance improvements to the B-Tree code for as long
as I've been interested, because it's so hard, and the standard is so
high. The only example I can think of from the last few years is
Kevin's commit 2ed5b87f96 and Tom's commit 1a77f8b63d both of which
were far less invasive, and Simon's commit c7111d11b1, which we just
outright reverted from 9.5 due to subtle bugs (and even that was
significantly less invasive than this patch). Improving nbtree is
something that requires several rounds of expert review, and that's
something that's in short supply for the B-Tree code in particular. I
think that a new testing strategy is needed to make this easier, and I
hope to get that going with amcheck. I need help with formalizing a
"testing first" approach for improving the B-Tree code, because I
think it's the only way that we can move forward with projects like
this. It's *incredibly* hard to push forward patches like this given
our current, limited testing strategy.
--
Peter Geoghegan
On 3/24/16 10:21 AM, Alexander Korotkov wrote:
1) It's a great feature many users dream about.
Doesn't matter if it starts eating their data...
2) Patch is not very big.
3) Patch doesn't introduce significant infrastructural changes. It just
changes some well-isolated places.
It doesn't really matter how big the patch is, it's a question of "What
did the patch fail to consider?". With something as complicated as the
btree code, there are ample opportunities for missing things. (And FWIW,
I'd argue that a 51kB patch is certainly not small, and a patch that is
doing things in critical sections isn't terribly isolated).
I do think this will be a great addition, but it's just too late to be
adding this to 9.6.
(BTW, I'm getting bounces from a.lebedev@postgrespro.ru, as well as
postmaster@. I emailed info@postgrespro.ru about this but never heard back.)
--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
25.03.2016 01:12, Peter Geoghegan:
On Thu, Mar 24, 2016 at 7:17 AM, Robert Haas <robertmhaas@gmail.com> wrote:
I really like this idea, and the performance results seem impressive,
but I think we should push this out to 9.7. A btree patch that didn't
have WAL support until two and a half weeks into the final CommitFest
just doesn't seem to me like a good candidate. First, as a general
matter, if a patch isn't code-complete at the start of a CommitFest,
it's reasonable to say that it should be reviewed but not necessarily
committed in that CommitFest.
You're right.
Frankly, I thought that someone would help me with the patch, but I had to
finish it myself.
*off-topic*
I wonder if we can add a new flag to the commitfest, something like "Needs
assistance",
which would be used to mark big and complicated patches in progress,
while "Needs review" would mean that the patch is almost ready and only
requires the final review.
This patch has had some review, but I'm
not sure how deep that review is, and I think it's had no code review
at all of the WAL logging changes, which were submitted only a week
ago, well after the CF deadline. Second, the btree AM is a
particularly poor place to introduce possibly destabilizing changes.
Everybody depends on it, all the time, for everything. And despite
new tools like amcheck, it's not a particularly easy thing to debug.
Regrettably, I must agree. I don't see a plausible path to commit for
this patch in the ongoing CF.
I think that Anastasia did an excellent job here, and I wish I could
have been of greater help sooner. Nevertheless, it would be unwise to
commit this given the maturity of the code. There have been very few
instances of performance improvements to the B-Tree code for as long
as I've been interested, because it's so hard, and the standard is so
high. The only example I can think of from the last few years is
Kevin's commit 2ed5b87f96 and Tom's commit 1a77f8b63d both of which
were far less invasive, and Simon's commit c7111d11b1, which we just
outright reverted from 9.5 due to subtle bugs (and even that was
significantly less invasive than this patch). Improving nbtree is
something that requires several rounds of expert review, and that's
something that's in short supply for the B-Tree code in particular. I
think that a new testing strategy is needed to make this easier, and I
hope to get that going with amcheck. I need help with formalizing a
"testing first" approach for improving the B-Tree code, because I
think it's the only way that we can move forward with projects like
this. It's *incredibly* hard to push forward patches like this given
our current, limited testing strategy.
Unfortunately, I must agree. This patch seems too far from a final
version to be ready before the feature freeze.
I'll move it to a future commitfest.
Anyway, it means that we now have more time to improve the patch.
If you have any ideas related to this patch, like prefix/suffix
compression, I'll be glad to discuss them.
The same goes for any other ideas for B-tree optimization.
--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On Thu, Mar 24, 2016 at 7:12 PM, Peter Geoghegan <pg@heroku.com> wrote:
On Thu, Mar 24, 2016 at 7:17 AM, Robert Haas <robertmhaas@gmail.com> wrote:
I really like this idea, and the performance results seem impressive,
but I think we should push this out to 9.7. A btree patch that didn't
have WAL support until two and a half weeks into the final CommitFest
just doesn't seem to me like a good candidate. First, as a general
matter, if a patch isn't code-complete at the start of a CommitFest,
it's reasonable to say that it should be reviewed but not necessarily
committed in that CommitFest. This patch has had some review, but I'm
not sure how deep that review is, and I think it's had no code review
at all of the WAL logging changes, which were submitted only a week
ago, well after the CF deadline. Second, the btree AM is a
particularly poor place to introduce possibly destabilizing changes.
Everybody depends on it, all the time, for everything. And despite
new tools like amcheck, it's not a particularly easy thing to debug.
Regrettably, I must agree. I don't see a plausible path to commit for
this patch in the ongoing CF.
I think that Anastasia did an excellent job here, and I wish I could
have been of greater help sooner. Nevertheless, it would be unwise to
commit this given the maturity of the code. There have been very few
instances of performance improvements to the B-Tree code for as long
as I've been interested, because it's so hard, and the standard is so
high. The only examples I can think of from the last few years are
Kevin's commit 2ed5b87f96 and Tom's commit 1a77f8b63d, both of which
were far less invasive, and Simon's commit c7111d11b1, which we just
outright reverted from 9.5 due to subtle bugs (and even that was
significantly less invasive than this patch). Improving nbtree is
something that requires several rounds of expert review, and that's
something that's in short supply for the B-Tree code in particular. I
think that a new testing strategy is needed to make this easier, and I
hope to get that going with amcheck. I need help with formalizing a
"testing first" approach for improving the B-Tree code, because I
think it's the only way that we can move forward with projects like
this. It's *incredibly* hard to push forward patches like this given
our current, limited testing strategy.
I've been toying (having gotten nowhere concrete really) with prefix
compression myself, and I agree that messing with btree code is quite a
bit harder than it ought to be.
Perhaps trying experimental format changes in a separate experimental
am wouldn't be all that bad (say, nxbtree?). People could opt-in to
those, by creating the indexes with nxbtree instead of plain btree
(say in development environments) and get some testing going without
risking much.
Normally the same effect should be achievable with mere flags, but
since format changes to btree tend to be rather invasive, ensuring the
patch doesn't change behavior with the flag off is hard as well, hence
the wholly separate am idea.
On 18/03/16 19:19, Anastasia Lubennikova wrote:
Please find the new version of the patch attached. It now has WAL
functionality.
A detailed description of the feature can be found in the README draft:
https://goo.gl/50O8Q0
This patch is pretty complicated, so I ask everyone who is interested in
this feature
to help with reviewing and testing it. I will be grateful for any feedback.
But please don't complain about code style; it is still a work in progress.
Next things I'm going to do:
1. More debugging and testing. In my next message I'm going to attach
a couple of SQL scripts for testing.
2. Fix NULL processing.
3. Add a flag into pg_index that allows enabling/disabling compression
for each particular index.
4. Recheck locking considerations. I tried to make the code as
non-invasive as possible, but we need to make sure that the algorithm is
still correct.
5. Change BTMaxItemSize
6. Bring back microvacuum functionality.
I think we should pack the TIDs more tightly, like GIN does with the
varbyte encoding. It's tempting to commit this without it for now, and
add the compression later, but I'd like to avoid having to deal with
multiple binary-format upgrades, so let's figure out the final on-disk
format that we want, right from the beginning.
It would be nice to reuse the varbyte encoding code from GIN, but we
might not want to use that exact scheme for B-tree. Firstly, an
important criterion when we designed GIN's encoding scheme was to avoid
expanding on-disk size for any data set, which meant that a TID had to
always be encoded in 6 bytes or less. We don't have that limitation with
B-tree, because in B-tree, each item is currently stored as a separate
IndexTuple, which is much larger. So we are free to choose an encoding
scheme that's better at packing some values, at the expense of using
more bytes for other values, if we want to. Some analysis on what we
want would be nice. (It's still important that removing a TID from the
list never makes the list larger, for VACUUM.)
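To make the trade-off concrete, here is a minimal sketch of varbyte delta encoding for a sorted TID list, loosely in the spirit of what GIN does. It is not GIN's actual code or on-disk layout: the `Tid` struct, the 16-bit shift used to pack block and offset into one integer, and all function names are assumptions made for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for ItemPointerData: heap block number plus line offset. */
typedef struct
{
    uint32_t block;
    uint16_t offset;
} Tid;

/* Pack a TID into one integer so that TID order matches integer order. */
static uint64_t
tid_to_u64(Tid t)
{
    return ((uint64_t) t.block << 16) | t.offset;
}

static Tid
u64_to_tid(uint64_t v)
{
    Tid t;

    t.block = (uint32_t) (v >> 16);
    t.offset = (uint16_t) (v & 0xFFFF);
    return t;
}

/* Write v in 7-bit groups, low bits first; a set high bit means "more follows". */
static size_t
varbyte_put(uint64_t v, unsigned char *out)
{
    size_t n = 0;

    while (v >= 0x80)
    {
        out[n++] = (unsigned char) ((v & 0x7F) | 0x80);
        v >>= 7;
    }
    out[n++] = (unsigned char) v;
    return n;
}

/* Read one varbyte value from in; returns the number of bytes consumed. */
static size_t
varbyte_get(const unsigned char *in, uint64_t *v)
{
    size_t n = 0;
    int shift = 0;

    *v = 0;
    for (;;)
    {
        unsigned char b = in[n++];

        *v |= (uint64_t) (b & 0x7F) << shift;
        if ((b & 0x80) == 0)
            return n;
        shift += 7;
    }
}

/* Encode strictly ascending TIDs as deltas; out must hold 7 bytes per TID. */
size_t
posting_encode(const Tid *tids, size_t ntids, unsigned char *out)
{
    uint64_t prev = 0;
    size_t len = 0;

    for (size_t i = 0; i < ntids; i++)
    {
        uint64_t cur = tid_to_u64(tids[i]);

        assert(i == 0 || cur > prev);
        len += varbyte_put(cur - prev, out + len);
        prev = cur;
    }
    return len;
}

/* Decode len bytes back into TIDs; returns how many TIDs were produced. */
size_t
posting_decode(const unsigned char *in, size_t len, Tid *tids)
{
    uint64_t prev = 0;
    size_t pos = 0;
    size_t n = 0;

    while (pos < len)
    {
        uint64_t delta;

        pos += varbyte_get(in + pos, &delta);
        prev += delta;
        tids[n++] = u64_to_tid(prev);
    }
    return n;
}
```

For the TIDs (0,1), (0,2), (0,3), (5,1) this sketch emits 6 bytes, where four raw 6-byte TIDs would take 24. In this particular scheme, removing a TID replaces two adjacent deltas with their sum, and the sum never needs more 7-bit groups than the two deltas did together, so the list cannot grow on removal.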
Secondly, to be able to just always enable this feature, without a GUC
or reloption, we might need something that's faster for random access
than GIN's posting lists. Or we can just add the setting, but it would
be nice to have some more analysis on the worst-case performance before
we decide on that.
I find the macros in nbtree.h in the patch quite confusing. They're
similar to what we did in GIN, but again we might want to choose
differently here. So some discussion on the desired IndexTuple layout is
in order. (One clear bug is that using the high bit of BlockNumber for
the BT_POSTING flag will fail for a table larger than 2^31 blocks.)
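The collision can be shown with a few lines of C. This is only an illustration of the problem, not the patch's macros: `POSTING_FLAG` and the helper names are hypothetical, and the field is modelled as a plain 32-bit integer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag: steal the top bit of a 32-bit block number field. */
#define POSTING_FLAG UINT32_C(0x80000000)

/* Mark a field as describing a posting tuple. */
uint32_t
mark_posting(uint32_t field)
{
    return field | POSTING_FLAG;
}

/* Test the flag; cannot tell a flagged value from a genuinely large block number. */
bool
is_posting(uint32_t field)
{
    return (field & POSTING_FLAG) != 0;
}

/* Recover the payload; anything at or above 2^31 loses its top bit. */
uint32_t
strip_flag(uint32_t field)
{
    return field & ~POSTING_FLAG;
}
```

An ordinary block number such as 0x80000005, which is valid for a sufficiently large table, is reported as a posting tuple by `is_posting()`, and it does not survive a `mark_posting()` / `strip_flag()` round trip.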
- Heikki
On Mon, Jul 4, 2016 at 2:30 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
I think we should pack the TIDs more tightly, like GIN does with the varbyte
encoding. It's tempting to commit this without it for now, and add the
compression later, but I'd like to avoid having to deal with multiple
binary-format upgrades, so let's figure out the final on-disk format that we
want, right from the beginning.
While the idea of duplicate storage is pretty obviously compelling,
there could be other, non-obvious benefits. I think that it could
bring further benefits if we could use duplicate storage to change
this property of nbtree (this is from the README):
"""
Lehman and Yao assume that the key range for a subtree S is described
by Ki < v <= Ki+1 where Ki and Ki+1 are the adjacent keys in the parent
page. This does not work for nonunique keys (for example, if we have
enough equal keys to spread across several leaf pages, there *must* be
some equal bounding keys in the first level up). Therefore we assume
Ki <= v <= Ki+1 instead. A search that finds exact equality to a
bounding key in an upper tree level must descend to the left of that
key to ensure it finds any equal keys in the preceding page. An
insertion that sees the high key of its target page is equal to the key
to be inserted has a choice whether or not to move right, since the new
key could go on either page. (Currently, we try to find a page where
there is room for the new key without a split.)
"""
If we could *guarantee* that all keys in the index are unique, then we
could maintain the keyspace as L&Y originally described.
The practical benefits to this would be:
* We wouldn't need to take the extra step described above -- finding a
bounding key/separator key that's fully equal to our scankey would no
longer necessitate a probably-useless descent to the left of that key.
(BTW, I wonder if we could get away with not inserting a downlink into
parent when a leaf page split finds an identical IndexTuple in parent,
*without* changing the keyspace invariant I mention -- if we're always
going to go to the left of an equal-to-scankey key in an internal
page, why even have more than one?)
* This would make suffix truncation of internal index tuples easier,
and that's important.
The traditional reason why suffix truncation is important is that it
can keep the tree a lot shorter than it would otherwise be. These
days, that might not seem that important, because even if you have
twice as many internal pages as strictly necessary, that still
isn't that many relative to typical main memory size (and even CPU
cache sizes, perhaps).
The reason I think it's important these days is that not having suffix
truncation makes our "separator keys" overly prescriptive about what
part of the keyspace is owned by each internal page. With a pristine
index (following REINDEX), this doesn't matter much. But, I think that
we get much bigger problems with index bloat due to the poor fan-out
that we sometimes see due to not having suffix truncation, *combined*
with the page deletion algorithm's restriction on deleting internal
pages (it can only be done for internal pages with *no* children).
Adding another level or two to the B-Tree makes it so that your
workload's "sparse deletion patterns" really don't need to be that
sparse in order to bloat the B-Tree badly, necessitating a REINDEX to
get back to acceptable performance (VACUUM won't do it). To avoid
this, we should make the internal pages represent the key space in the
least restrictive way possible, by applying suffix truncation so that
it's much more likely that things will *stay* balanced as churn
occurs. This is probably a really bad problem with things like
composite indexes over text columns, or indexes with many NULL values.
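As a rough illustration of what suffix truncation buys, the sketch below computes the shortest prefix of the first key on the right-hand page that still separates it from the last key on the left-hand page. It works byte-wise on a single C string, which is an assumption of this example; a real nbtree implementation might well truncate at the granularity of whole attributes instead, and `separator_len` is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Return the length of the shortest prefix of right that still sorts
 * strictly after left. Both are NUL-terminated and compared byte-wise;
 * the caller must guarantee left < right.
 */
size_t
separator_len(const char *left, const char *right)
{
    size_t i = 0;

    assert(strcmp(left, right) < 0);

    /* Skip the bytes the two keys have in common. */
    while (left[i] != '\0' && left[i] == right[i])
        i++;

    /* One more byte of right is enough to tell the two keys apart. */
    return i + 1;
}
```

With left = "apple" and right = "apricot" the separator is "apr" (3 bytes), so the internal page only has to remember "apr". Any later key between "apple" and "apr" can then land on the left page without the separator having to change.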
--
Peter Geoghegan
The new version of the patch is attached.
This version is even simpler than the previous one,
thanks to the recent btree design changes and all the feedback I received.
I consider it ready for review and testing.
[feature overview]
This patch implements the deduplication of btree non-pivot tuples on
leaf pages
in a manner similar to GIN index "posting lists".
A non-pivot posting tuple has the following format:
t_tid | t_info | key values | posting_list[]
where the t_tid and t_info fields are used to store meta info
about the tuple's posting list.
The posting list is an array of ItemPointerData.
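A back-of-the-envelope size comparison for this layout can be sketched as follows. The constants assume a typical 64-bit build (8-byte MAXALIGN, 8-byte IndexTupleData header, 6-byte ItemPointerData, 4-byte line pointer), and the function names are hypothetical. It only estimates space for one run of equal keys and is not a model of the benchmark.

```c
#include <assert.h>
#include <stddef.h>

#define ALIGN8(len)   (((len) + 7) & ~(size_t) 7)  /* MAXALIGN, assuming 8 bytes */
#define TUPLE_HEADER  8   /* assumed sizeof(IndexTupleData): t_tid + t_info */
#define TID_SIZE      6   /* assumed sizeof(ItemPointerData) */
#define LINE_POINTER  4   /* assumed sizeof(ItemIdData) */

/* Bytes used by n separate tuples, each carrying its own copy of the key. */
size_t
plain_size(size_t keylen, size_t n)
{
    assert(n > 0);
    return n * (ALIGN8(TUPLE_HEADER + keylen) + LINE_POINTER);
}

/* Bytes used by one posting tuple: the key once, followed by n TIDs. */
size_t
posting_size(size_t keylen, size_t n)
{
    assert(n > 0);
    return ALIGN8(ALIGN8(TUPLE_HEADER + keylen) + n * TID_SIZE) + LINE_POINTER;
}
```

Under these assumptions, 100 duplicates of a 4-byte key take 2000 bytes as plain tuples and 620 bytes as a single posting tuple, while a posting tuple holding just one TID (28 bytes) is larger than the plain tuple it would replace (20 bytes).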
Currently, compression is applied to all indexes except system indexes,
unique
indexes, and indexes with included columns.
On insertion, compression is applied not to each tuple individually but to
the whole page, just before a split: if the target page is full, we try to
compress it.
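The page-level pass can be pictured with the simplified sketch below: it walks the items of a leaf page in key order and folds runs of equal keys into posting tuples, starting a new tuple once a size limit is reached. The types, the integer key, and `dedup_page` are hypothetical stand-ins for illustration; the patch itself works on real IndexTuples and checks sizes in bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical plain leaf item: one key, one heap TID. */
typedef struct
{
    int      key;
    uint64_t tid;
} LeafItem;

/* Hypothetical posting tuple: a key plus a slice of the input array. */
typedef struct
{
    int    key;
    size_t first;   /* index of the first TID in the input array */
    size_t ntids;   /* number of consecutive TIDs covered */
} PostingTuple;

/*
 * Fold runs of equal keys into posting tuples holding at most max_tids
 * TIDs each. items must be sorted by key; out must have room for nitems
 * entries. Returns the number of tuples written.
 */
size_t
dedup_page(const LeafItem *items, size_t nitems, size_t max_tids,
           PostingTuple *out)
{
    size_t nout = 0;

    assert(max_tids > 0);
    for (size_t i = 0; i < nitems; i++)
    {
        if (nout > 0 &&
            out[nout - 1].key == items[i].key &&
            out[nout - 1].ntids < max_tids)
        {
            /* Same key and still room: extend the current posting list. */
            out[nout - 1].ntids++;
        }
        else
        {
            /* New key, or the posting list is full: start a new tuple. */
            assert(nout == 0 || out[nout - 1].key <= items[i].key);
            out[nout].key = items[i].key;
            out[nout].first = i;
            out[nout].ntids = 1;
            nout++;
        }
    }
    return nout;
}
```

For the keys 1, 1, 1, 2, 3, 3 this yields three tuples when the limit is generous and four when each posting list may hold at most two TIDs, since the run of three 1s then has to be split.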
[benchmark results]
idx ON tbl(c1);
index contains 10000000 integer values
i - number of distinct values in the index.
So i=1 means that all rows have the same key,
and i=10000000 means that all keys are different.
i          old size (MB)   new size (MB)
1          215             88
1000       215             90
100000     215             71
10000000   214             214
For more, see the attached diagram with test results.
[future work]
Many things can be improved in this feature.
Personally, I'd prefer to keep this patch as small as possible
and work on other improvements after a basic part is committed.
Though, I understand that some of these can be considered essential
for this patch to be approved.
1. Implement a split of the posting tuples on a page split.
2. Implement microvacuum of posting tuples.
3. Add a flag into pg_index, which allows enabling/disabling compression
for a particular index.
4. Implement posting list compression.
--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachments:
btree_compression_pg12_v1.patch (text/x-patch)
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index 602f884..fce499b 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -20,6 +20,7 @@
#include "access/tableam.h"
#include "access/transam.h"
#include "access/xloginsert.h"
+#include "catalog/catalog.h"
#include "miscadmin.h"
#include "storage/lmgr.h"
#include "storage/predicate.h"
@@ -56,6 +57,8 @@ static void _bt_insert_parent(Relation rel, Buffer buf, Buffer rbuf,
static bool _bt_pgaddtup(Page page, Size itemsize, IndexTuple itup,
OffsetNumber itup_off);
static void _bt_vacuum_one_page(Relation rel, Buffer buffer, Relation heapRel);
+static bool insert_itupprev_to_page(Page page, BTCompressState *compressState);
+static void _bt_compress_one_page(Relation rel, Buffer buffer, Relation heapRel);
/*
* _bt_doinsert() -- Handle insertion of a single index tuple in the tree.
@@ -759,6 +762,12 @@ _bt_findinsertloc(Relation rel,
_bt_vacuum_one_page(rel, insertstate->buf, heapRel);
insertstate->bounds_valid = false;
}
+
+ /*
+ * If the target page is full, try to compress the page
+ */
+ if (PageGetFreeSpace(page) < insertstate->itemsz)
+ _bt_compress_one_page(rel, insertstate->buf, heapRel);
}
else
{
@@ -806,6 +815,11 @@ _bt_findinsertloc(Relation rel,
}
/*
+ * Before considering moving right, try to compress the page
+ */
+ _bt_compress_one_page(rel, insertstate->buf, heapRel);
+
+ /*
* Nope, so check conditions (b) and (c) enumerated above
*
* The earlier _bt_check_unique() call may well have established a
@@ -2286,3 +2300,232 @@ _bt_vacuum_one_page(Relation rel, Buffer buffer, Relation heapRel)
* the page.
*/
}
+
+/*
+ * Add new item (compressed or not) to the page, while compressing it.
+ * If insertion failed, return false.
+ * Caller should consider this as compression failure and
+ * leave page uncompressed.
+ */
+static bool
+insert_itupprev_to_page(Page page, BTCompressState *compressState)
+{
+ IndexTuple to_insert;
+ OffsetNumber offnum = PageGetMaxOffsetNumber(page);
+
+ if (compressState->ntuples == 0)
+ to_insert = compressState->itupprev;
+ else
+ {
+ IndexTuple postingtuple;
+ /* form a tuple with a posting list */
+ postingtuple = BTreeFormPostingTuple(compressState->itupprev,
+ compressState->ipd,
+ compressState->ntuples);
+ to_insert = postingtuple;
+ pfree(compressState->ipd);
+ }
+
+ /* Add the new item into the page */
+ offnum = OffsetNumberNext(offnum);
+
+ elog(DEBUG4, "insert_itupprev_to_page. compressState->ntuples %d IndexTupleSize %zu free %zu",
+ compressState->ntuples, IndexTupleSize(to_insert), PageGetFreeSpace(page));
+
+ if (PageAddItem(page, (Item) to_insert, IndexTupleSize(to_insert),
+ offnum, false, false) == InvalidOffsetNumber)
+ {
+ elog(DEBUG4, "insert_itupprev_to_page. failed");
+ /*
+ * this may happen if tuple is bigger than freespace
+ * fallback to uncompressed page case
+ */
+ if (compressState->ntuples > 0)
+ pfree(to_insert);
+ return false;
+ }
+
+ if (compressState->ntuples > 0)
+ pfree(to_insert);
+ compressState->ntuples = 0;
+ return true;
+}
+
+/*
+ * Before splitting the page, try to compress items to free some space.
+ * If compression didn't succeed, buffer will contain old state of the page.
+ * This function should be called after lp_dead items
+ * were removed by _bt_vacuum_one_page().
+ */
+static void
+_bt_compress_one_page(Relation rel, Buffer buffer, Relation heapRel)
+{
+ OffsetNumber offnum,
+ minoff,
+ maxoff;
+ Page page = BufferGetPage(buffer);
+ Page newpage;
+ BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+ bool use_compression = false;
+ BTCompressState *compressState = NULL;
+ int n_posting_on_page = 0;
+ int natts = IndexRelationGetNumberOfAttributes(rel);
+
+ /*
+ * Don't use compression for indexes with INCLUDEd columns,
+ * system indexes and unique indexes.
+ */
+ use_compression = ((IndexRelationGetNumberOfKeyAttributes(rel) ==
+ IndexRelationGetNumberOfAttributes(rel))
+ && (!IsSystemRelation(rel))
+ && (!rel->rd_index->indisunique));
+ if (!use_compression)
+ return;
+
+ /* init compress state needed to build posting tuples */
+ compressState = (BTCompressState *) palloc0(sizeof(BTCompressState));
+ compressState->ipd = NULL;
+ compressState->ntuples = 0;
+ compressState->itupprev = NULL;
+ compressState->maxitemsize = BTMaxItemSize(page);
+ compressState->maxpostingsize = 0;
+
+ /*
+ * Scan over all items to see which ones can be compressed
+ */
+ minoff = P_FIRSTDATAKEY(opaque);
+ maxoff = PageGetMaxOffsetNumber(page);
+
+ /*
+ * Heuristic to avoid trying to compress a page
+ * that already contains mostly compressed items
+ */
+ for (offnum = minoff;
+ offnum <= maxoff;
+ offnum = OffsetNumberNext(offnum))
+ {
+ ItemId itemid = PageGetItemId(page, offnum);
+ IndexTuple item = (IndexTuple) PageGetItem(page, itemid);
+
+ if (BTreeTupleIsPosting(item))
+ n_posting_on_page++;
+ }
+ /*
+ * If there are fewer than 10 uncompressed items on the full page,
+ * it probably isn't worth compressing them.
+ */
+ if (maxoff - n_posting_on_page < 10)
+ return;
+
+ newpage = PageGetTempPageCopySpecial(page);
+ elog(DEBUG4, "_bt_compress_one_page rel: %s,blkno: %u",
+ RelationGetRelationName(rel), BufferGetBlockNumber(buffer));
+
+ /* Copy High Key if any */
+ if (!P_RIGHTMOST(opaque))
+ {
+ ItemId itemid = PageGetItemId(page, P_HIKEY);
+ Size itemsz = ItemIdGetLength(itemid);
+ IndexTuple item = (IndexTuple) PageGetItem(page, itemid);
+
+ if (PageAddItem(newpage, (Item) item, itemsz, P_HIKEY,
+ false, false) == InvalidOffsetNumber)
+ {
+ /*
+ * Should never happen. Anyway, fallback gently to scenario of
+ * incompressible page and just return from function.
+ */
+ elog(DEBUG4, "_bt_compress_one_page. failed to insert highkey to newpage");
+ return;
+ }
+ }
+
+ /* Iterate over tuples on the page, try to compress them into posting lists
+ * and insert into new page.
+ */
+ for (offnum = minoff;
+ offnum <= maxoff;
+ offnum = OffsetNumberNext(offnum))
+ {
+ ItemId itemId = PageGetItemId(page, offnum);
+ IndexTuple itup = (IndexTuple) PageGetItem(page, itemId);
+
+ /*
+ * We do not expect to meet any DEAD items, since this
+ * function is called right after _bt_vacuum_one_page().
+ * If for some reason we found dead item, don't compress it,
+ * to allow upcoming microvacuum or vacuum clean it up.
+ */
+ if(ItemIdIsDead(itemId))
+ continue;
+
+ if (compressState->itupprev != NULL)
+ {
+ int n_equal_atts = _bt_keep_natts_fast(rel,
+ compressState->itupprev, itup);
+ int itup_ntuples = BTreeTupleIsPosting(itup)?BTreeTupleGetNPosting(itup):1;
+
+ if (n_equal_atts > natts)
+ {
+ /* Tuples are equal. Create or update posting. */
+ if (compressState->maxitemsize >
+ MAXALIGN(((IndexTupleSize(compressState->itupprev)
+ + (compressState->ntuples + itup_ntuples+1)*sizeof(ItemPointerData)))))
+ add_item_to_posting(compressState, itup);
+ else
+ /* If posting is too big, insert it on page and continue.*/
+ if (!insert_itupprev_to_page(newpage, compressState))
+ {
+ elog(DEBUG4, "_bt_compress_one_page. failed to insert posting");
+ return;
+ }
+ }
+ else
+ {
+ /*
+ * Tuples are not equal. Insert itupprev into index.
+ * Save current tuple for the next iteration.
+ */
+ if (!insert_itupprev_to_page(newpage, compressState))
+ {
+ elog(DEBUG4, "_bt_compress_one_page. failed to insert posting");
+ return;
+ }
+ }
+ }
+
+ /*
+ * Copy the tuple into temp variable itupprev
+ * to compare it with the following tuple
+ * and maybe unite them into a posting tuple
+ */
+ if (compressState->itupprev)
+ pfree(compressState->itupprev);
+ compressState->itupprev = CopyIndexTuple(itup);
+
+ Assert(IndexTupleSize(compressState->itupprev) <= compressState->maxitemsize);
+ }
+
+ /* Handle the last item.*/
+ if (!insert_itupprev_to_page(newpage, compressState))
+ {
+ elog(DEBUG4, "_bt_compress_one_page. failed to insert posting for last item");
+ return;
+ }
+
+ START_CRIT_SECTION();
+ PageRestoreTempPage(newpage, page);
+ MarkBufferDirty(buffer);
+
+ /* Log full page write */
+ if (RelationNeedsWAL(rel))
+ {
+ XLogRecPtr recptr;
+ recptr = log_newpage_buffer(buffer, true);
+ PageSetLSN(page, recptr);
+ }
+ END_CRIT_SECTION();
+
+ elog(DEBUG4, "_bt_compress_one_page. success");
+ return;
+}
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index de4d4ef..681077f 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -1024,14 +1024,54 @@ _bt_page_recyclable(Page page)
void
_bt_delitems_vacuum(Relation rel, Buffer buf,
OffsetNumber *itemnos, int nitems,
+ OffsetNumber *remainingoffset,
+ IndexTuple *remaining, int nremaining,
BlockNumber lastBlockVacuumed)
{
Page page = BufferGetPage(buf);
BTPageOpaque opaque;
+ int i;
+ Size itemsz;
+ Size remaining_sz = 0;
+ char *remaining_buf = NULL;
+
+ /* XLOG stuff, buffer for remainings */
+ if (nremaining && RelationNeedsWAL(rel))
+ {
+ Size offset = 0;
+
+ for (i = 0; i < nremaining; i++)
+ remaining_sz += MAXALIGN(IndexTupleSize(remaining[i]));
+
+ remaining_buf = palloc0(remaining_sz);
+ for (i = 0; i < nremaining; i++)
+ {
+ itemsz = IndexTupleSize(remaining[i]);
+ memcpy(remaining_buf + offset, (char *) remaining[i], itemsz);
+ offset += MAXALIGN(itemsz);
+ }
+ Assert(offset == remaining_sz);
+ }
/* No ereport(ERROR) until changes are logged */
START_CRIT_SECTION();
+ /* Handle posting tuples here */
+ for (i = 0; i < nremaining; i++)
+ {
+ /* At first, delete the old tuple.*/
+ PageIndexTupleDelete(page, remainingoffset[i]);
+
+ itemsz = IndexTupleSize(remaining[i]);
+ itemsz = MAXALIGN(itemsz);
+
+ /* Add tuple with remaining ItemPointers to the page.*/
+ if (PageAddItem(page, (Item) remaining[i], itemsz, remainingoffset[i],
+ false, false) == InvalidOffsetNumber)
+ elog(ERROR, "failed to rewrite compressed item in index while doing vacuum");
+ }
+
+
/* Fix the page */
if (nitems > 0)
PageIndexMultiDelete(page, itemnos, nitems);
@@ -1061,6 +1101,9 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
xl_btree_vacuum xlrec_vacuum;
xlrec_vacuum.lastBlockVacuumed = lastBlockVacuumed;
+ xlrec_vacuum.nremaining = nremaining;
+ xlrec_vacuum.ndeleted = nitems;
+
XLogBeginInsert();
XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
@@ -1074,6 +1117,20 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
if (nitems > 0)
XLogRegisterBufData(0, (char *) itemnos, nitems * sizeof(OffsetNumber));
+ /*
+ * Here we should save offnums and remaining tuples themselves.
+ * It's important to restore them in correct order.
+ * At first, we must handle remaining tuples and only after that
+ * other deleted items.
+ */
+ if (nremaining > 0)
+ {
+ Assert(remaining_buf != NULL);
+ XLogRegisterBufData(0, (char *) remainingoffset,
+ nremaining * sizeof(OffsetNumber));
+ XLogRegisterBufData(0, remaining_buf, remaining_sz);
+ }
+
recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_VACUUM);
PageSetLSN(page, recptr);
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 85e54ac..5a7d7bd 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -97,8 +97,8 @@ static void btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
BTCycleId cycleid, TransactionId *oldestBtpoXact);
static void btvacuumpage(BTVacState *vstate, BlockNumber blkno,
BlockNumber orig_blkno);
-
-
+static ItemPointer btreevacuumPosting(BTVacState *vstate,
+ IndexTuple itup, int *nremaining);
/*
* Btree handler function: return IndexAmRoutine with access method parameters
* and callbacks.
@@ -1069,7 +1069,7 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
RBM_NORMAL, info->strategy);
LockBufferForCleanup(buf);
_bt_checkpage(rel, buf);
- _bt_delitems_vacuum(rel, buf, NULL, 0, vstate.lastBlockVacuumed);
+ _bt_delitems_vacuum(rel, buf, NULL, 0, NULL, NULL, 0, vstate.lastBlockVacuumed);
_bt_relbuf(rel, buf);
}
@@ -1193,6 +1193,9 @@ restart:
OffsetNumber offnum,
minoff,
maxoff;
+ IndexTuple remaining[MaxOffsetNumber];
+ OffsetNumber remainingoffset[MaxOffsetNumber];
+ int nremaining;
/*
* Trade in the initial read lock for a super-exclusive write lock on
@@ -1229,6 +1232,7 @@ restart:
* callback function.
*/
ndeletable = 0;
+ nremaining = 0;
minoff = P_FIRSTDATAKEY(opaque);
maxoff = PageGetMaxOffsetNumber(page);
if (callback)
@@ -1242,31 +1246,77 @@ restart:
itup = (IndexTuple) PageGetItem(page,
PageGetItemId(page, offnum));
- htup = &(itup->t_tid);
- /*
- * During Hot Standby we currently assume that
- * XLOG_BTREE_VACUUM records do not produce conflicts. That is
- * only true as long as the callback function depends only
- * upon whether the index tuple refers to heap tuples removed
- * in the initial heap scan. When vacuum starts it derives a
- * value of OldestXmin. Backends taking later snapshots could
- * have a RecentGlobalXmin with a later xid than the vacuum's
- * OldestXmin, so it is possible that row versions deleted
- * after OldestXmin could be marked as killed by other
- * backends. The callback function *could* look at the index
- * tuple state in isolation and decide to delete the index
- * tuple, though currently it does not. If it ever did, we
- * would need to reconsider whether XLOG_BTREE_VACUUM records
- * should cause conflicts. If they did cause conflicts they
- * would be fairly harsh conflicts, since we haven't yet
- * worked out a way to pass a useful value for
- * latestRemovedXid on the XLOG_BTREE_VACUUM records. This
- * applies to *any* type of index that marks index tuples as
- * killed.
- */
- if (callback(htup, callback_state))
- deletable[ndeletable++] = offnum;
+ if (BTreeTupleIsPosting(itup))
+ {
+ int nnewipd = 0;
+ ItemPointer newipd = NULL;
+
+ newipd = btreevacuumPosting(vstate, itup, &nnewipd);
+
+ if (nnewipd == 0)
+ {
+ /*
+ * All TIDs from posting list must be deleted,
+ * we can delete whole tuple in a regular way.
+ */
+ deletable[ndeletable++] = offnum;
+ }
+ else if (nnewipd == BTreeTupleGetNPosting(itup))
+ {
+ /*
+ * All TIDs from posting tuple must remain.
+ * Do nothing, just cleanup.
+ */
+ pfree(newipd);
+ }
+ else if (nnewipd < BTreeTupleGetNPosting(itup))
+ {
+ /* Some TIDs from posting tuple must remain. */
+ Assert(nnewipd > 0);
+ Assert(newipd != NULL);
+
+ /*
+ * Form new tuple that contains only remaining TIDs.
+ * Remember this tuple and the offset of the old tuple
+ * to update it in place.
+ */
+ remainingoffset[nremaining] = offnum;
+ remaining[nremaining] = BTreeFormPostingTuple(itup, newipd, nnewipd);
+ nremaining++;
+ pfree(newipd);
+
+ Assert(IndexTupleSize(itup) <= BTMaxItemSize(page));
+ }
+ }
+ else
+ {
+ htup = &(itup->t_tid);
+
+ /*
+ * During Hot Standby we currently assume that
+ * XLOG_BTREE_VACUUM records do not produce conflicts. That is
+ * only true as long as the callback function depends only
+ * upon whether the index tuple refers to heap tuples removed
+ * in the initial heap scan. When vacuum starts it derives a
+ * value of OldestXmin. Backends taking later snapshots could
+ * have a RecentGlobalXmin with a later xid than the vacuum's
+ * OldestXmin, so it is possible that row versions deleted
+ * after OldestXmin could be marked as killed by other
+ * backends. The callback function *could* look at the index
+ * tuple state in isolation and decide to delete the index
+ * tuple, though currently it does not. If it ever did, we
+ * would need to reconsider whether XLOG_BTREE_VACUUM records
+ * should cause conflicts. If they did cause conflicts they
+ * would be fairly harsh conflicts, since we haven't yet
+ * worked out a way to pass a useful value for
+ * latestRemovedXid on the XLOG_BTREE_VACUUM records. This
+ * applies to *any* type of index that marks index tuples as
+ * killed.
+ */
+ if (callback(htup, callback_state))
+ deletable[ndeletable++] = offnum;
+ }
}
}
@@ -1274,7 +1324,7 @@ restart:
* Apply any needed deletes. We issue just one _bt_delitems_vacuum()
* call per page, so as to minimize WAL traffic.
*/
- if (ndeletable > 0)
+ if (ndeletable > 0 || nremaining > 0)
{
/*
* Notice that the issued XLOG_BTREE_VACUUM WAL record includes
@@ -1291,6 +1341,7 @@ restart:
* that.
*/
_bt_delitems_vacuum(rel, buf, deletable, ndeletable,
+ remainingoffset, remaining, nremaining,
vstate->lastBlockVacuumed);
/*
@@ -1376,6 +1427,43 @@ restart:
}
/*
+ * btreevacuumPosting() -- vacuums a posting tuple.
+ *
+ * Returns new palloc'd posting list with remaining items.
+ * Posting list size is returned via nremaining.
+ *
+ * If all items are dead,
+ * nremaining is 0 and resulting posting list is NULL.
+ */
+static ItemPointer
+btreevacuumPosting(BTVacState *vstate, IndexTuple itup, int *nremaining)
+{
+ int i,
+ remaining = 0;
+ int nitem = BTreeTupleGetNPosting(itup);
+ ItemPointer tmpitems = NULL,
+ items = BTreeTupleGetPosting(itup);
+
+ /*
+ * Check each tuple in the posting list,
+ * save alive tuples into tmpitems
+ */
+ for (i = 0; i < nitem; i++)
+ {
+ if (vstate->callback(items + i, vstate->callback_state))
+ continue;
+
+ if (tmpitems == NULL)
+ tmpitems = palloc(sizeof(ItemPointerData) * nitem);
+
+ tmpitems[remaining++] = items[i];
+ }
+
+ *nremaining = remaining;
+ return tmpitems;
+}
+
+/*
* btcanreturn() -- Check whether btree indexes support index-only scans.
*
* btrees always do, so this is trivial.
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index c655dad..594936d 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -30,6 +30,8 @@ static bool _bt_readpage(IndexScanDesc scan, ScanDirection dir,
OffsetNumber offnum);
static void _bt_saveitem(BTScanOpaque so, int itemIndex,
OffsetNumber offnum, IndexTuple itup);
+static void _bt_savePostingitem(BTScanOpaque so, int itemIndex,
+ OffsetNumber offnum, ItemPointer iptr, IndexTuple itup, int i);
static bool _bt_steppage(IndexScanDesc scan, ScanDirection dir);
static bool _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir);
static bool _bt_parallel_readpage(IndexScanDesc scan, BlockNumber blkno,
@@ -1410,6 +1412,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
int itemIndex;
bool continuescan;
int indnatts;
+ int i;
/*
* We must have the buffer pinned and locked, but the usual macro can't be
@@ -1456,6 +1459,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
/* initialize tuple workspace to empty */
so->currPos.nextTupleOffset = 0;
+ so->currPos.prevTupleOffset = 0;
/*
* Now that the current page has been made consistent, the macro should be
@@ -1490,8 +1494,22 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
if (_bt_checkkeys(scan, itup, indnatts, dir, &continuescan))
{
/* tuple passes all scan key conditions, so remember it */
- _bt_saveitem(so, itemIndex, offnum, itup);
- itemIndex++;
+ if (BTreeTupleIsPosting(itup))
+ {
+ for (i = 0; i < BTreeTupleGetNPosting(itup); i++)
+ {
+ _bt_savePostingitem(so, itemIndex, offnum,
+ BTreeTupleGetPostingN(itup, i),
+ itup, i);
+ itemIndex++;
+ }
+ }
+ else
+ {
+ _bt_saveitem(so, itemIndex, offnum, itup);
+ itemIndex++;
+ }
+
}
/* When !continuescan, there can't be any more matches, so stop */
if (!continuescan)
@@ -1524,7 +1542,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
if (!continuescan)
so->currPos.moreRight = false;
- Assert(itemIndex <= MaxIndexTuplesPerPage);
+ Assert(itemIndex <= MaxPostingIndexTuplesPerPage);
so->currPos.firstItem = 0;
so->currPos.lastItem = itemIndex - 1;
so->currPos.itemIndex = 0;
@@ -1532,7 +1550,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
else
{
/* load items[] in descending order */
- itemIndex = MaxIndexTuplesPerPage;
+ itemIndex = MaxPostingIndexTuplesPerPage;
offnum = Min(offnum, maxoff);
@@ -1574,8 +1592,22 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
if (passes_quals && tuple_alive)
{
/* tuple passes all scan key conditions, so remember it */
- itemIndex--;
- _bt_saveitem(so, itemIndex, offnum, itup);
+ if (BTreeTupleIsPosting(itup))
+ {
+ for (i = 0; i < BTreeTupleGetNPosting(itup); i++)
+ {
+ itemIndex--;
+ _bt_savePostingitem(so, itemIndex, offnum,
+ BTreeTupleGetPostingN(itup, i),
+ itup, i);
+ }
+ }
+ else
+ {
+ itemIndex--;
+ _bt_saveitem(so, itemIndex, offnum, itup);
+ }
+
}
if (!continuescan)
{
@@ -1589,8 +1621,8 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
Assert(itemIndex >= 0);
so->currPos.firstItem = itemIndex;
- so->currPos.lastItem = MaxIndexTuplesPerPage - 1;
- so->currPos.itemIndex = MaxIndexTuplesPerPage - 1;
+ so->currPos.lastItem = MaxPostingIndexTuplesPerPage - 1;
+ so->currPos.itemIndex = MaxPostingIndexTuplesPerPage - 1;
}
return (so->currPos.firstItem <= so->currPos.lastItem);
@@ -1603,6 +1635,8 @@ _bt_saveitem(BTScanOpaque so, int itemIndex,
{
BTScanPosItem *currItem = &so->currPos.items[itemIndex];
+ Assert(!BTreeTupleIsPosting(itup));
+
currItem->heapTid = itup->t_tid;
currItem->indexOffset = offnum;
if (so->currTuples)
@@ -1615,6 +1649,34 @@ _bt_saveitem(BTScanOpaque so, int itemIndex,
}
}
+/* Save an index item into so->currPos.items[itemIndex] for posting tuples. */
+static void
+_bt_savePostingitem(BTScanOpaque so, int itemIndex,
+ OffsetNumber offnum, ItemPointer iptr, IndexTuple itup, int i)
+{
+ BTScanPosItem *currItem = &so->currPos.items[itemIndex];
+
+ currItem->heapTid = *iptr;
+ currItem->indexOffset = offnum;
+
+ if (so->currTuples)
+ {
+ if (i == 0)
+ {
+ /* save key. the same for all tuples in the posting */
+ Size itupsz = BTreeTupleGetPostingOffset(itup);
+
+ currItem->tupleOffset = so->currPos.nextTupleOffset;
+ memcpy(so->currTuples + so->currPos.nextTupleOffset, itup, itupsz);
+ so->currPos.nextTupleOffset += MAXALIGN(itupsz);
+ so->currPos.prevTupleOffset = currItem->tupleOffset;
+ }
+ else
+ currItem->tupleOffset = so->currPos.prevTupleOffset;
+ }
+}
+
+
/*
* _bt_steppage() -- Step to next page containing valid data for scan
*
@@ -2221,6 +2283,7 @@ _bt_endpoint(IndexScanDesc scan, ScanDirection dir)
/* OK, itemIndex says what to return */
currItem = &so->currPos.items[so->currPos.itemIndex];
+
scan->xs_heaptid = currItem->heapTid;
if (scan->xs_want_itup)
scan->xs_itup = (IndexTuple) (so->currTuples + currItem->tupleOffset);
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index d0b9013..59f702b 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -65,6 +65,7 @@
#include "access/xact.h"
#include "access/xlog.h"
#include "access/xloginsert.h"
+#include "catalog/catalog.h"
#include "catalog/index.h"
#include "commands/progress.h"
#include "miscadmin.h"
@@ -76,6 +77,7 @@
#include "utils/tuplesort.h"
+
/* Magic numbers for parallel state sharing */
#define PARALLEL_KEY_BTREE_SHARED UINT64CONST(0xA000000000000001)
#define PARALLEL_KEY_TUPLESORT UINT64CONST(0xA000000000000002)
@@ -288,6 +290,8 @@ static void _bt_sortaddtup(Page page, Size itemsize,
static void _bt_buildadd(BTWriteState *wstate, BTPageState *state,
IndexTuple itup);
static void _bt_uppershutdown(BTWriteState *wstate, BTPageState *state);
+static void insert_itupprev_to_page_buildadd(BTWriteState *wstate,
+ BTPageState *state, BTCompressState *compressState);
static void _bt_load(BTWriteState *wstate,
BTSpool *btspool, BTSpool *btspool2);
static void _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent,
@@ -972,6 +976,11 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
* only shift the line pointer array back and forth, and overwrite
* the tuple space previously occupied by oitup. This is fairly
* cheap.
+ *
+ * If lastleft tuple was a posting tuple,
+ * we'll truncate its posting list in _bt_truncate as well.
+ * Note that it is also applicable only to leaf pages,
+ * since internal pages never contain posting tuples.
*/
ii = PageGetItemId(opage, OffsetNumberPrev(last_off));
lastleft = (IndexTuple) PageGetItem(opage, ii);
@@ -1011,6 +1020,7 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
* the minimum key for the new page.
*/
state->btps_minkey = CopyIndexTuple(oitup);
+ Assert(!BTreeTupleIsPosting(state->btps_minkey));
/*
* Set the sibling links for both pages.
@@ -1050,8 +1060,35 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
if (last_off == P_HIKEY)
{
Assert(state->btps_minkey == NULL);
- state->btps_minkey = CopyIndexTuple(itup);
- /* _bt_sortaddtup() will perform full truncation later */
+
+ /*
+ * Stashed copy must be a non-posting tuple,
+ * with truncated posting list and correct t_tid
+ * since we're going to use it to build downlink.
+ */
+ if (BTreeTupleIsPosting(itup))
+ {
+ Size keytupsz;
+ IndexTuple keytup;
+
+ /*
+ * Form key tuple, that doesn't contain any ipd.
+ * NOTE: since we'll need TID later, set t_tid to
+ * the first t_tid from posting list.
+ */
+ keytupsz = BTreeTupleGetPostingOffset(itup);
+ keytup = palloc0(keytupsz);
+ memcpy(keytup, itup, keytupsz);
+
+ keytup->t_info &= ~INDEX_SIZE_MASK;
+ keytup->t_info |= keytupsz;
+ ItemPointerCopy(BTreeTupleGetPosting(itup), &keytup->t_tid);
+ state->btps_minkey = CopyIndexTuple(keytup);
+ pfree(keytup);
+ }
+ else
+ state->btps_minkey = CopyIndexTuple(itup); /* _bt_sortaddtup() will perform full truncation later */
+
BTreeTupleSetNAtts(state->btps_minkey, 0);
}
@@ -1137,6 +1174,87 @@ _bt_uppershutdown(BTWriteState *wstate, BTPageState *state)
}
/*
+ * Add new tuple (posting or non-posting) to the page, while building index.
+ */
+void
+insert_itupprev_to_page_buildadd(BTWriteState *wstate, BTPageState *state,
+ BTCompressState *compressState)
+{
+ IndexTuple to_insert;
+
+ /* Return, if there is no tuple to insert */
+ if (state == NULL)
+ return;
+
+ if (compressState->ntuples == 0)
+ to_insert = compressState->itupprev;
+ else
+ {
+ IndexTuple postingtuple;
+ /* form a tuple with a posting list */
+ postingtuple = BTreeFormPostingTuple(compressState->itupprev,
+ compressState->ipd,
+ compressState->ntuples);
+ to_insert = postingtuple;
+ pfree(compressState->ipd);
+ }
+
+ _bt_buildadd(wstate, state, to_insert);
+
+ if (compressState->ntuples > 0)
+ pfree(to_insert);
+ compressState->ntuples = 0;
+}
+
+/*
+ * Save item pointer(s) of itup to the posting list in compressState.
+ * Helper function for _bt_load() and _bt_compress_one_page().
+ *
+ * Note: caller is responsible for size check to ensure that
+ * resulting tuple won't exceed BTMaxItemSize.
+ */
+void
+add_item_to_posting(BTCompressState *compressState, IndexTuple itup)
+{
+ int nposting = 0;
+
+ if (compressState->ntuples == 0)
+ {
+ compressState->ipd = palloc0(compressState->maxitemsize);
+
+ if (BTreeTupleIsPosting(compressState->itupprev))
+ {
+ /* if itupprev is posting, add all its TIDs to the posting list */
+ nposting = BTreeTupleGetNPosting(compressState->itupprev);
+ memcpy(compressState->ipd, BTreeTupleGetPosting(compressState->itupprev),
+ sizeof(ItemPointerData)*nposting);
+ compressState->ntuples += nposting;
+ }
+ else
+ {
+ memcpy(compressState->ipd, compressState->itupprev,
+ sizeof(ItemPointerData));
+ compressState->ntuples++;
+ }
+ }
+
+ if (BTreeTupleIsPosting(itup))
+ {
+ /* if tuple is posting, add all its TIDs to the posting list */
+ nposting = BTreeTupleGetNPosting(itup);
+ memcpy(compressState->ipd + compressState->ntuples,
+ BTreeTupleGetPosting(itup), sizeof(ItemPointerData)*nposting);
+ compressState->ntuples += nposting;
+ }
+ else
+ {
+ memcpy(compressState->ipd + compressState->ntuples, itup,
+ sizeof(ItemPointerData));
+ compressState->ntuples++;
+ }
+}
+
+/*
* Read tuples in correct sort order from tuplesort, and load them into
* btree leaves.
*/
@@ -1150,9 +1268,21 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
bool load1;
TupleDesc tupdes = RelationGetDescr(wstate->index);
int i,
- keysz = IndexRelationGetNumberOfKeyAttributes(wstate->index);
+ keysz = IndexRelationGetNumberOfKeyAttributes(wstate->index),
+ natts = IndexRelationGetNumberOfAttributes(wstate->index);
SortSupport sortKeys;
int64 tuples_done = 0;
+ bool use_compression = false;
+ BTCompressState *compressState = NULL;
+
+ /*
+ * Don't use compression for indexes with INCLUDEd columns,
+ * system indexes and unique indexes.
+ */
+ use_compression = ((IndexRelationGetNumberOfKeyAttributes(wstate->index) ==
+ IndexRelationGetNumberOfAttributes(wstate->index))
+ && (!IsSystemRelation(wstate->index))
+ && (!wstate->index->rd_index->indisunique));
if (merge)
{
@@ -1266,19 +1396,83 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
}
else
{
- /* merge is unnecessary */
- while ((itup = tuplesort_getindextuple(btspool->sortstate,
- true)) != NULL)
+ if (!use_compression)
{
- /* When we see first tuple, create first index page */
- if (state == NULL)
- state = _bt_pagestate(wstate, 0);
+ /* merge is unnecessary */
+ while ((itup = tuplesort_getindextuple(btspool->sortstate,
+ true)) != NULL)
+ {
+ /* When we see first tuple, create first index page */
+ if (state == NULL)
+ state = _bt_pagestate(wstate, 0);
- _bt_buildadd(wstate, state, itup);
+ _bt_buildadd(wstate, state, itup);
- /* Report progress */
- pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
- ++tuples_done);
+ /* Report progress */
+ pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
+ ++tuples_done);
+ }
+ }
+ else
+ {
+ /* init compress state needed to build posting tuples */
+ compressState = (BTCompressState *) palloc0(sizeof(BTCompressState));
+ compressState->ipd = NULL;
+ compressState->ntuples = 0;
+ compressState->itupprev = NULL;
+ compressState->maxitemsize = 0;
+ compressState->maxpostingsize = 0;
+
+ while ((itup = tuplesort_getindextuple(btspool->sortstate,
+ true)) != NULL)
+ {
+ /* When we see first tuple, create first index page */
+ if (state == NULL)
+ {
+ state = _bt_pagestate(wstate, 0);
+ compressState->maxitemsize = BTMaxItemSize(state->btps_page);
+ }
+
+ if (compressState->itupprev != NULL)
+ {
+ int n_equal_atts = _bt_keep_natts_fast(wstate->index,
+ compressState->itupprev, itup);
+
+ if (n_equal_atts > natts)
+ {
+ /* Tuples are equal. Create or update posting. */
+ if ((compressState->ntuples+1)*sizeof(ItemPointerData) < compressState->maxpostingsize)
+ add_item_to_posting(compressState, itup);
+ else
+ /* If posting is too big, insert it on page and continue.*/
+ insert_itupprev_to_page_buildadd(wstate, state, compressState);
+ }
+ else
+ {
+ /*
+ * Tuples are not equal. Insert itupprev into index.
+ * Save current tuple for the next iteration.
+ */
+ insert_itupprev_to_page_buildadd(wstate, state, compressState);
+ }
+ }
+
+ /*
+ * Save the tuple to compare it with the next one
+ * and maybe unite them into a posting tuple.
+ */
+ if (compressState->itupprev)
+ pfree(compressState->itupprev);
+ compressState->itupprev = CopyIndexTuple(itup);
+
+ /* compute max size of posting list */
+ compressState->maxpostingsize = compressState->maxitemsize -
+ IndexInfoFindDataOffset(compressState->itupprev->t_info) -
+ MAXALIGN(IndexTupleSize(compressState->itupprev));
+ }
+
+ /* Handle the last item */
+ insert_itupprev_to_page_buildadd(wstate, state, compressState);
}
}
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 93fab26..8b77b69 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -1787,7 +1787,9 @@ _bt_killitems(IndexScanDesc scan)
ItemId iid = PageGetItemId(page, offnum);
IndexTuple ituple = (IndexTuple) PageGetItem(page, iid);
- if (ItemPointerEquals(&ituple->t_tid, &kitem->heapTid))
+ /* No microvacuum for posting tuples */
+ if (!BTreeTupleIsPosting(ituple) &&
+ (ItemPointerEquals(&ituple->t_tid, &kitem->heapTid)))
{
/* found the item */
ItemIdMarkDead(iid);
@@ -2145,6 +2147,16 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
pivot = index_truncate_tuple(itupdesc, firstright, keepnatts);
+ if (BTreeTupleIsPosting(firstright))
+ {
+ BTreeTupleClearBtIsPosting(pivot);
+ BTreeTupleSetNAtts(pivot, keepnatts);
+ pivot->t_info &= ~INDEX_SIZE_MASK;
+ pivot->t_info |= BTreeTupleGetPostingOffset(firstright);
+ }
+
+ Assert(!BTreeTupleIsPosting(pivot));
+
/*
* If there is a distinguishing key attribute within new pivot tuple,
* there is no need to add an explicit heap TID attribute
@@ -2168,6 +2180,26 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
pfree(pivot);
pivot = tidpivot;
}
+ else if (BTreeTupleIsPosting(firstright))
+ {
+ /*
+ * No truncation was possible, since key attributes are all equal.
+ * But the tuple is a compressed tuple with a posting list,
+ * so we still must truncate it.
+ *
+ * It's necessary to add a heap TID attribute to the new pivot tuple.
+ */
+ newsize = BTreeTupleGetPostingOffset(firstright) + MAXALIGN(sizeof(ItemPointerData));
+ pivot = palloc0(newsize);
+ memcpy(pivot, firstright, BTreeTupleGetPostingOffset(firstright));
+
+ pivot->t_info &= ~INDEX_SIZE_MASK;
+ pivot->t_info |= newsize;
+ BTreeTupleClearBtIsPosting(pivot);
+ BTreeTupleSetAltHeapTID(pivot);
+
+ Assert(!BTreeTupleIsPosting(pivot));
+ }
else
{
/*
@@ -2205,7 +2237,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
*/
pivotheaptid = (ItemPointer) ((char *) pivot + newsize -
sizeof(ItemPointerData));
- ItemPointerCopy(&lastleft->t_tid, pivotheaptid);
+ ItemPointerCopy(BTreeTupleGetMaxTID(lastleft), pivotheaptid);
/*
* Lehman and Yao require that the downlink to the right page, which is to
@@ -2216,9 +2248,9 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
* tiebreaker.
*/
#ifndef DEBUG_NO_TRUNCATE
- Assert(ItemPointerCompare(&lastleft->t_tid, &firstright->t_tid) < 0);
- Assert(ItemPointerCompare(pivotheaptid, &lastleft->t_tid) >= 0);
- Assert(ItemPointerCompare(pivotheaptid, &firstright->t_tid) < 0);
+ Assert(ItemPointerCompare(BTreeTupleGetMaxTID(lastleft), BTreeTupleGetMinTID(firstright)) < 0);
+ Assert(ItemPointerCompare(pivotheaptid, BTreeTupleGetMinTID(lastleft)) >= 0);
+ Assert(ItemPointerCompare(pivotheaptid, BTreeTupleGetMinTID(firstright)) < 0);
#else
/*
@@ -2231,7 +2263,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
* attribute values along with lastleft's heap TID value when lastleft's
* TID happens to be greater than firstright's TID.
*/
- ItemPointerCopy(&firstright->t_tid, pivotheaptid);
+ ItemPointerCopy(BTreeTupleGetMinTID(firstright), pivotheaptid);
/*
* Pivot heap TID should never be fully equal to firstright. Note that
@@ -2240,7 +2272,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
*/
ItemPointerSetOffsetNumber(pivotheaptid,
OffsetNumberPrev(ItemPointerGetOffsetNumber(pivotheaptid)));
- Assert(ItemPointerCompare(pivotheaptid, &firstright->t_tid) < 0);
+ Assert(ItemPointerCompare(pivotheaptid, BTreeTupleGetMinTID(firstright)) < 0);
#endif
BTreeTupleSetNAtts(pivot, nkeyatts);
@@ -2330,6 +2362,10 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
* leaving excessive amounts of free space on either side of page split.
* Callers can rely on the fact that attributes considered equal here are
* definitely also equal according to _bt_keep_natts.
+ *
+ * To build a posting tuple we need to ensure that all attributes
+ * of both tuples are equal. Use this function to compare them.
+ * TODO: maybe it's worth renaming the function.
*/
int
_bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
@@ -2415,7 +2451,7 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
* Non-pivot tuples currently never use alternative heap TID
* representation -- even those within heapkeyspace indexes
*/
- if ((itup->t_info & INDEX_ALT_TID_MASK) != 0)
+ if (BTreeTupleIsPivot(itup))
return false;
/*
@@ -2470,7 +2506,7 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
* that to decide if the tuple is a pre-v11 tuple.
*/
return tupnatts == 0 ||
- ((itup->t_info & INDEX_ALT_TID_MASK) == 0 &&
+ (!BTreeTupleIsPivot(itup) &&
ItemPointerGetOffsetNumber(&(itup->t_tid)) == P_HIKEY);
}
else
@@ -2497,7 +2533,7 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
* heapkeyspace index pivot tuples, regardless of whether or not there are
* non-key attributes.
*/
- if ((itup->t_info & INDEX_ALT_TID_MASK) == 0)
+ if (!BTreeTupleIsPivot(itup))
return false;
/*
@@ -2549,6 +2585,7 @@ _bt_check_third_page(Relation rel, Relation heap, bool needheaptidspace,
if (!needheaptidspace && itemsz <= BTMaxItemSizeNoHeapTid(page))
return;
+ /* TODO correct error messages for posting tuples */
/*
* Internal page insertions cannot fail here, because that would mean that
* an earlier leaf level insertion that should have failed didn't
@@ -2575,3 +2612,59 @@ _bt_check_third_page(Relation rel, Relation heap, bool needheaptidspace,
"or use full text indexing."),
errtableconstraint(heap, RelationGetRelationName(rel))));
}
+
+/*
+ * Given a basic tuple that contains key datum and posting list,
+ * build a posting tuple.
+ *
+ * Basic tuple can be a posting tuple, but we only use key part of it,
+ * all ItemPointers must be passed via ipd.
+ *
+ * If nipd == 1 fallback to building a non-posting tuple.
+ * It is necessary to avoid storage overhead after posting tuple was vacuumed.
+ */
+IndexTuple
+BTreeFormPostingTuple(IndexTuple tuple, ItemPointerData *ipd, int nipd)
+{
+ uint32 keysize, newsize;
+ IndexTuple itup;
+
+ /* We only need key part of the tuple */
+ if (BTreeTupleIsPosting(tuple))
+ keysize = BTreeTupleGetPostingOffset(tuple);
+ else
+ keysize = IndexTupleSize(tuple);
+
+ Assert (nipd > 0);
+
+ /* Add space needed for posting list */
+ newsize = (nipd > 1) ?
+ SHORTALIGN(keysize) + sizeof(ItemPointerData) * nipd : keysize;
+
+ newsize = MAXALIGN(newsize);
+ itup = palloc0(newsize);
+ memcpy(itup, tuple, keysize);
+ itup->t_info &= ~INDEX_SIZE_MASK;
+ itup->t_info |= newsize;
+
+
+ if (nipd > 1)
+ {
+ /* Form posting tuple, fill posting fields */
+
+ /* Set meta info about the posting list */
+ itup->t_info |= INDEX_ALT_TID_MASK;
+ BTreeSetPostingMeta(itup, nipd, SHORTALIGN(keysize));
+
+ /* Copy posting list into the posting tuple */
+ memcpy(BTreeTupleGetPosting(itup), ipd,
+ sizeof(ItemPointerData) * nipd);
+ }
+ else
+ {
+ /* To finish building of a non-posting tuple, copy TID from ipd */
+ ItemPointerCopy(ipd, &itup->t_tid);
+ }
+
+ return itup;
+}
diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index 6532a25..16224b4 100644
--- a/src/backend/access/nbtree/nbtxlog.c
+++ b/src/backend/access/nbtree/nbtxlog.c
@@ -384,8 +384,8 @@ btree_xlog_vacuum(XLogReaderState *record)
Buffer buffer;
Page page;
BTPageOpaque opaque;
-#ifdef UNUSED
xl_btree_vacuum *xlrec = (xl_btree_vacuum *) XLogRecGetData(record);
+#ifdef UNUSED
/*
* This section of code is thought to be no longer needed, after analysis
@@ -476,14 +476,36 @@ btree_xlog_vacuum(XLogReaderState *record)
if (len > 0)
{
- OffsetNumber *unused;
- OffsetNumber *unend;
+ if (xlrec->nremaining)
+ {
+ int i;
+ OffsetNumber *remainingoffset;
+ IndexTuple remaining;
+ Size itemsz;
+
+ remainingoffset = (OffsetNumber *)
+ (ptr + xlrec->ndeleted * sizeof(OffsetNumber));
+ remaining = (IndexTuple) ((char *) remainingoffset +
+ xlrec->nremaining * sizeof(OffsetNumber));
- unused = (OffsetNumber *) ptr;
- unend = (OffsetNumber *) ((char *) ptr + len);
+ /* Handle posting tuples */
+ for (i = 0; i < xlrec->nremaining; i++)
+ {
+ PageIndexTupleDelete(page, remainingoffset[i]);
+
+ itemsz = MAXALIGN(IndexTupleSize(remaining));
+
+ if (PageAddItem(page, (Item) remaining, itemsz, remainingoffset[i],
+ false, false) == InvalidOffsetNumber)
+ elog(PANIC, "btree_xlog_vacuum: failed to add remaining item");
+
+ remaining = (IndexTuple)((char*) remaining + itemsz);
+ }
+ }
- if ((unend - unused) > 0)
- PageIndexMultiDelete(page, unused, unend - unused);
+
+ if (xlrec->ndeleted)
+ PageIndexMultiDelete(page, (OffsetNumber *) ptr, xlrec->ndeleted);
}
/*
diff --git a/src/include/access/itup.h b/src/include/access/itup.h
index 744ffb6..85ee040 100644
--- a/src/include/access/itup.h
+++ b/src/include/access/itup.h
@@ -141,6 +141,11 @@ typedef IndexAttributeBitMapData * IndexAttributeBitMap;
* On such a page, N tuples could take one MAXALIGN quantum less space than
* estimated here, seemingly allowing one more tuple than estimated here.
* But such a page always has at least MAXALIGN special space, so we're safe.
+ *
+ * Note: btree leaf pages may contain posting tuples, which store duplicates
+ * in a more effective way, so they may contain more tuples.
+ * Use MaxPostingIndexTuplesPerPage instead.
+
*/
#define MaxIndexTuplesPerPage \
((int) ((BLCKSZ - SizeOfPageHeaderData) / \
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index a3583f2..57ee21e 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -234,8 +234,7 @@ typedef struct BTMetaPageData
* t_tid | t_info | key values | INCLUDE columns, if any
*
* t_tid points to the heap TID, which is a tiebreaker key column as of
- * BTREE_VERSION 4. Currently, the INDEX_ALT_TID_MASK status bit is never
- * set for non-pivot tuples.
+ * BTREE_VERSION 4.
*
* All other types of index tuples ("pivot" tuples) only have key columns,
* since pivot tuples only exist to represent how the key space is
@@ -252,6 +251,39 @@ typedef struct BTMetaPageData
* omitted rather than truncated, since its representation is different to
* the non-pivot representation.)
*
+ * Non-pivot posting tuple format:
+ * t_tid | t_info | key values | INCLUDE columns, if any | posting_list[]
+ *
+ * In order to store duplicated keys more effectively,
+ * BTREE_VERSION 5 introduced new format of tuples - posting tuples.
+ * posting_list is an array of ItemPointerData.
+ *
+ * This type of compression never applies to system indexes, unique indexes
+ * or indexes with INCLUDEd columns.
+ *
+ * To distinguish posting tuples we use the INDEX_ALT_TID_MASK flag in t_info
+ * and the BT_IS_POSTING flag in t_tid.
+ * These flags redefine the content of the posting tuple's tid:
+ * - t_tid.ip_blkid contains the offset of the posting list.
+ * - t_tid offset field contains the number of posting items this tuple contains.
+ *
+ * The 12 least significant offset bits from t_tid are used to represent
+ * the number of posting items in posting tuples, leaving 4 status
+ * bits (BT_RESERVED_OFFSET_MASK bits), 3 of which that are reserved for
+ * future use.
+ * BT_N_POSTING_OFFSET_MASK is large enough to store any number of posting
+ * items, which is constrained by BTMaxItemSize.
+ *
+ * If page contains so many duplicates, that they do not fit into one posting
+ * tuple (bounded by BTMaxItemSize and ), page may contain several posting
+ * tuples with the same key.
+ * Also page can contain both posting and non-posting tuples with the same key.
+ * Currently, posting tuples always contain at least two TIDs in the posting
+ * list.
+ *
+ * Posting tuples always have the same number of attributes as the index has
+ * generally.
+ *
* Pivot tuple format:
*
* t_tid | t_info | key values | [heap TID]
@@ -281,23 +313,149 @@ typedef struct BTMetaPageData
* bits (BT_RESERVED_OFFSET_MASK bits), 3 of which that are reserved for
* future use. BT_N_KEYS_OFFSET_MASK should be large enough to store any
* number of columns/attributes <= INDEX_MAX_KEYS.
+ * BT_IS_POSTING bit must be unset for pivot tuples, since we use it
+ * to distinguish posting tuples from pivot tuples.
*
* Note well: The macros that deal with the number of attributes in tuples
- * assume that a tuple with INDEX_ALT_TID_MASK set must be a pivot tuple,
+ * assume that a tuple with INDEX_ALT_TID_MASK set must be a pivot tuple
+ * or non-pivot posting tuple,
* and that a tuple without INDEX_ALT_TID_MASK set must be a non-pivot
* tuple (or must have the same number of attributes as the index has
- * generally in the case of !heapkeyspace indexes). They will need to be
- * updated if non-pivot tuples ever get taught to use INDEX_ALT_TID_MASK
- * for something else.
+ * generally in the case of !heapkeyspace indexes).
*/
#define INDEX_ALT_TID_MASK INDEX_AM_RESERVED_BIT
/* Item pointer offset bits */
#define BT_RESERVED_OFFSET_MASK 0xF000
#define BT_N_KEYS_OFFSET_MASK 0x0FFF
+#define BT_N_POSTING_OFFSET_MASK 0x0FFF
#define BT_HEAP_TID_ATTR 0x1000
+#define BT_IS_POSTING 0x2000
+
+#define BTreeTupleIsPosting(itup) \
+ ( \
+ ((itup)->t_info & INDEX_ALT_TID_MASK && \
+ ((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) != 0))\
+ )
-/* Get/set downlink block number */
+#define BTreeTupleIsPivot(itup) \
+ ( \
+ ((itup)->t_info & INDEX_ALT_TID_MASK && \
+ ((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) == 0))\
+ )
+
+/*
+ * MaxPostingIndexTuplesPerPage is an upper bound on the number of tuples
+ * that can fit on one btree leaf page.
+ *
+ * Btree leaf pages may contain posting tuples, which store duplicates
+ * in a more effective way, so MaxPostingIndexTuplesPerPage is larger than
+ * MaxIndexTuplesPerPage.
+ *
+ * Each leaf page must contain at least three items, so estimate it as
+ * if we have three posting tuples with minimal size keys.
+ */
+#define MaxPostingIndexTuplesPerPage \
+ ((int) ((BLCKSZ - SizeOfPageHeaderData - \
+ 3*((MAXALIGN(sizeof(IndexTupleData) + 1) + sizeof(ItemIdData))) )) / \
+ (sizeof(ItemPointerData)))
+
+/*
+ * Btree-private state needed to build posting tuples.
+ * ipd is a posting list - an array of ItemPointerData.
+ *
+ * Iterating over tuples during index build or applying compression to a
+ * single page, we remember a tuple in itupprev, then compare the next one
+ * with it. If tuples are equal, save their TIDs in the posting list.
+ * ntuples contains the size of the posting list.
+ *
+ * Use maxitemsize and maxpostingsize to ensure that resulting posting tuple
+ * will satisfy BTMaxItemSize.
+ */
+typedef struct BTCompressState
+{
+ Size maxitemsize;
+ Size maxpostingsize;
+ IndexTuple itupprev;
+ int ntuples;
+ ItemPointerData *ipd;
+} BTCompressState;
+
+/* macros to work with posting tuples *BEGIN* */
+#define BTreeTupleSetBtIsPosting(itup) \
+ do { \
+ Assert((itup)->t_info & INDEX_ALT_TID_MASK); \
+ Assert(!((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) != 0)); \
+ ItemPointerSetOffsetNumber(&(itup)->t_tid, \
+ ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) | BT_IS_POSTING); \
+ } while(0)
+
+#define BTreeTupleClearBtIsPosting(itup) \
+ do { \
+ ItemPointerSetOffsetNumber(&(itup)->t_tid, \
+ ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & ~BT_IS_POSTING); \
+ } while(0)
+
+#define BTreeTupleGetNPosting(itup) \
+ ( \
+ ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_N_POSTING_OFFSET_MASK \
+ )
+
+#define BTreeTupleSetNPosting(itup, n) \
+ do { \
+ ItemPointerSetOffsetNumber(&(itup)->t_tid, (n) & BT_N_POSTING_OFFSET_MASK); \
+ BTreeTupleSetBtIsPosting(itup); \
+ } while(0)
+
+/*
+ * If tuple is posting, t_tid.ip_blkid contains offset of the posting list.
+ * Caller is responsible for checking BTreeTupleIsPosting to ensure that
+ * he will get what he expects
+ */
+#define BTreeTupleGetPostingOffset(itup) \
+ ItemPointerGetBlockNumberNoCheck(&((itup)->t_tid))
+#define BTreeTupleSetPostingOffset(itup, offset) \
+ ItemPointerSetBlockNumber(&((itup)->t_tid), (offset))
+
+#define BTreeSetPostingMeta(itup, nposting, off) \
+ do { \
+ BTreeTupleSetNPosting(itup, nposting); \
+ BTreeTupleSetPostingOffset(itup, off); \
+ } while(0)
+
+#define BTreeTupleGetPosting(itup) \
+ (ItemPointerData*) ((char*)(itup) + BTreeTupleGetPostingOffset(itup))
+#define BTreeTupleGetPostingN(itup,n) \
+ (ItemPointerData*) (BTreeTupleGetPosting(itup) + (n))
+
+/*
+ * Posting tuples always contain several TIDs.
+ * Functions that use the TID as a tiebreaker can use the two macros below
+ * to ensure the correct order of TID keys:
+ */
+#define BTreeTupleGetMinTID(itup) \
+ ( \
+ ((itup)->t_info & INDEX_ALT_TID_MASK && \
+ ((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING))) ? \
+ ( \
+ (ItemPointer) BTreeTupleGetPosting(itup) \
+ ) \
+ : \
+ (ItemPointer) &((itup)->t_tid) \
+ )
+#define BTreeTupleGetMaxTID(itup) \
+ ( \
+ ((itup)->t_info & INDEX_ALT_TID_MASK && \
+ ((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING))) ? \
+ ( \
+ (ItemPointer) (BTreeTupleGetPosting(itup) + (BTreeTupleGetNPosting(itup)-1)) \
+ ) \
+ : \
+ (ItemPointer) &((itup)->t_tid) \
+ )
+/* macros to work with posting tuples *END* */
+
+/* Get/set downlink block number */
#define BTreeInnerTupleGetDownLink(itup) \
ItemPointerGetBlockNumberNoCheck(&((itup)->t_tid))
#define BTreeInnerTupleSetDownLink(itup, blkno) \
@@ -326,15 +484,18 @@ typedef struct BTMetaPageData
*/
#define BTreeTupleGetNAtts(itup, rel) \
( \
- (itup)->t_info & INDEX_ALT_TID_MASK ? \
+ ((itup)->t_info & INDEX_ALT_TID_MASK && \
+ ((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) == 0)) ? \
( \
ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_N_KEYS_OFFSET_MASK \
) \
: \
IndexRelationGetNumberOfAttributes(rel) \
)
+
#define BTreeTupleSetNAtts(itup, n) \
do { \
+ Assert(!BTreeTupleIsPosting(itup)); \
(itup)->t_info |= INDEX_ALT_TID_MASK; \
ItemPointerSetOffsetNumber(&(itup)->t_tid, (n) & BT_N_KEYS_OFFSET_MASK); \
} while(0)
@@ -342,6 +503,8 @@ typedef struct BTMetaPageData
/*
* Get tiebreaker heap TID attribute, if any. Macro works with both pivot
* and non-pivot tuples, despite differences in how heap TID is represented.
+ *
+ * For non-pivot posting tuple it returns the first tid from posting list.
*/
#define BTreeTupleGetHeapTID(itup) \
( \
@@ -351,7 +514,10 @@ typedef struct BTMetaPageData
(ItemPointer) (((char *) (itup) + IndexTupleSize(itup)) - \
sizeof(ItemPointerData)) \
) \
- : (itup)->t_info & INDEX_ALT_TID_MASK ? NULL : (ItemPointer) &((itup)->t_tid) \
+ : (itup)->t_info & INDEX_ALT_TID_MASK ? \
+ (((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) != 0) ? \
+ (ItemPointer) BTreeTupleGetPosting(itup) : NULL) \
+ : (ItemPointer) &((itup)->t_tid) \
)
/*
* Set the heap TID attribute for a tuple that uses the INDEX_ALT_TID_MASK
@@ -360,6 +526,7 @@ typedef struct BTMetaPageData
#define BTreeTupleSetAltHeapTID(itup) \
do { \
Assert((itup)->t_info & INDEX_ALT_TID_MASK); \
+ Assert(!((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) != 0)); \
ItemPointerSetOffsetNumber(&(itup)->t_tid, \
ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) | BT_HEAP_TID_ATTR); \
} while(0)
@@ -567,6 +734,8 @@ typedef struct BTScanPosData
* location in the associated tuple storage workspace.
*/
int nextTupleOffset;
+ /* prevTupleOffset is for posting list handling*/
+ int prevTupleOffset;
/*
* The items array is always ordered in index order (ie, increasing
@@ -579,7 +748,7 @@ typedef struct BTScanPosData
int lastItem; /* last valid index in items[] */
int itemIndex; /* current index in items[] */
- BTScanPosItem items[MaxIndexTuplesPerPage]; /* MUST BE LAST */
+ BTScanPosItem items[MaxPostingIndexTuplesPerPage]; /* MUST BE LAST */
} BTScanPosData;
typedef BTScanPosData *BTScanPos;
@@ -763,6 +932,8 @@ extern void _bt_delitems_delete(Relation rel, Buffer buf,
OffsetNumber *itemnos, int nitems, Relation heapRel);
extern void _bt_delitems_vacuum(Relation rel, Buffer buf,
OffsetNumber *itemnos, int nitems,
+ OffsetNumber *remainingoffset,
+ IndexTuple *remaining, int nremaining,
BlockNumber lastBlockVacuumed);
extern int _bt_pagedel(Relation rel, Buffer buf);
@@ -813,7 +984,8 @@ extern bool _bt_check_natts(Relation rel, bool heapkeyspace, Page page,
OffsetNumber offnum);
extern void _bt_check_third_page(Relation rel, Relation heap,
bool needheaptidspace, Page page, IndexTuple newtup);
-
+extern IndexTuple BTreeFormPostingTuple(IndexTuple tuple,
+ ItemPointerData *ipd, int nipd);
/*
* prototypes for functions in nbtvalidate.c
*/
@@ -825,5 +997,6 @@ extern bool btvalidate(Oid opclassoid);
extern IndexBuildResult *btbuild(Relation heap, Relation index,
struct IndexInfo *indexInfo);
extern void _bt_parallel_build_main(dsm_segment *seg, shm_toc *toc);
-
+extern void add_item_to_posting(BTCompressState *compressState,
+ IndexTuple itup);
#endif /* NBTREE_H */
diff --git a/src/include/access/nbtxlog.h b/src/include/access/nbtxlog.h
index 9beccc8..c213bfa 100644
--- a/src/include/access/nbtxlog.h
+++ b/src/include/access/nbtxlog.h
@@ -172,11 +172,19 @@ typedef struct xl_btree_reuse_page
typedef struct xl_btree_vacuum
{
BlockNumber lastBlockVacuumed;
+ /*
+ * This field helps us to find beginning of the remaining tuples
+ * from postings which follow array of offset numbers.
+ */
+ uint32 nremaining;
+ uint32 ndeleted;
- /* TARGET OFFSET NUMBERS FOLLOW */
+ /* TARGET OFFSET NUMBERS FOLLOW (ndeleted values, if any) */
+ /* REMAINING OFFSET NUMBERS FOLLOW (nremaining values) */
+ /* REMAINING TUPLES TO INSERT FOLLOW (if nremaining > 0) */
} xl_btree_vacuum;
-#define SizeOfBtreeVacuum (offsetof(xl_btree_vacuum, lastBlockVacuumed) + sizeof(BlockNumber))
+#define SizeOfBtreeVacuum (offsetof(xl_btree_vacuum, ndeleted) + sizeof(uint32))
/*
* This is what we need to know about marking an empty branch for deletion.
On Thu, Jul 4, 2019 at 5:06 AM Anastasia Lubennikova
<a.lubennikova@postgrespro.ru> wrote:
i - number of distinct values in the index.
So i=1 means that all rows have the same key,
and i=10000000 means that all keys are different.
i / old size (MB) / new size (MB)
1 215 88
1000 215 90
100000 215 71
10000000 214 214
For more, see the attached diagram with test results.
I tried this on my own "UK land registry" test data [1], which was
originally used for the v12 nbtree work. My test case has a low
cardinality, multi-column text index. I chose this test case because
it was convenient for me.
On v12/master, the index is 1100MB. Whereas with your patch, it ends
up being 196MB -- over 5.5x smaller!
I also tried it out with the "Mouse genome informatics" database [2],
which was already improved considerably by the v12 work on duplicates.
This is helped tremendously by your patch. It's not quite 5.5x across
the board, of course. There are 187 indexes (on 28 tables), and almost
all of the indexes are smaller. Actually, *most* of the indexes are
*much* smaller. Very often 50% smaller.
I don't have time to do an in-depth analysis of these results today,
but clearly the patch is very effective on real world data. I think
that we tend to underestimate just how common indexes with a huge
number of duplicates are.
[1]: https://postgr.es/m/CAH2-Wzn_NAyK4pR0HRWO0StwHmxjP5qyu+X8vppt030XpqrO6w@mail.gmail.com
[2]: http://www.informatics.jax.org/software.shtml
--
Peter Geoghegan
On Thu, Jul 4, 2019 at 10:38 AM Peter Geoghegan <pg@bowt.ie> wrote:
I tried this on my own "UK land registry" test data [1], which was
originally used for the v12 nbtree work. My test case has a low
cardinality, multi-column text index. I chose this test case because
it was convenient for me.
On v12/master, the index is 1100MB. Whereas with your patch, it ends
up being 196MB -- over 5.5x smaller!
I also see a huge and consistent space saving for TPC-H. All 9 indexes
are significantly smaller. The lineitem orderkey index is "just" 1/3
smaller, which is the smallest improvement among TPC-H indexes in my
index bloat test case. The two largest indexes after the initial bulk
load are *much* smaller: the lineitem parts supplier index is ~2.7x
smaller, while the lineitem ship date index is a massive ~4.2x
smaller. Also, the orders customer key index is ~2.8x smaller, and the
order date index is ~2.43x smaller. Note that the test involved retail
insertions, not CREATE INDEX.
I haven't seen any regression in the size of any index so far,
including when the number of internal pages is all that we measure.
Actually, there seem to be cases where there is a noticeably larger
reduction in internal pages than in leaf pages, probably because of
interactions with suffix truncation.
This result is very impressive. We'll need to revisit what the right
trade-off is for the compression scheme, which Heikki had some
thoughts on when we left off 3 years ago, but that should be a lot
easier now. I am very encouraged by the fact that this relatively
simple approach already works quite nicely. It's also great to see
that bulk insertions with lots of compression are very clearly faster
with this latest revision of your patch, unlike earlier versions from
2016 that made those cases slower (though I haven't tested indexes
that don't really use compression). I think that this is because you
now do the compression lazily, at the point where it looks like we may
need to split the page. Previous versions of the patch had to perform
compression eagerly, just like GIN, which is not really appropriate
for nbtree.
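To make the lazy approach concrete, here is a toy sketch. It is not code from the patch: ToyPage, toy_dedup() and toy_insert() are hypothetical names, and a "page" is just a small sorted array of integer keys where each slot stands for one or more heap TIDs. The only point is that merging of duplicates is attempted when the page is full, rather than on every insertion.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_PAGE_SLOTS 4		/* hypothetical page capacity, in slots */

typedef struct ToySlot
{
	int			key;
	int			ntids;			/* number of heap TIDs this slot stands for */
} ToySlot;

typedef struct ToyPage
{
	int			nslots;
	ToySlot		slots[TOY_PAGE_SLOTS];
} ToyPage;

/* Merge adjacent slots that have equal keys; return number of slots freed. */
static int
toy_dedup(ToyPage *page)
{
	int			dst = 0;
	int			freed;

	if (page->nslots == 0)
		return 0;

	for (int src = 1; src < page->nslots; src++)
	{
		if (page->slots[src].key == page->slots[dst].key)
			page->slots[dst].ntids += page->slots[src].ntids;
		else
			page->slots[++dst] = page->slots[src];
	}

	freed = page->nslots - (dst + 1);
	page->nslots = dst + 1;
	return freed;
}

/*
 * Insert a key, keeping slots sorted.  Deduplication is only attempted
 * when the page is already full.  Returns false when merging freed
 * nothing, which is where a real index would have to split the page.
 */
static bool
toy_insert(ToyPage *page, int key)
{
	int			pos;

	if (page->nslots == TOY_PAGE_SLOTS && toy_dedup(page) == 0)
		return false;

	/* shift larger keys right to open a slot for the new key */
	pos = page->nslots;
	while (pos > 0 && page->slots[pos - 1].key > key)
	{
		page->slots[pos] = page->slots[pos - 1];
		pos--;
	}
	page->slots[pos].key = key;
	page->slots[pos].ntids = 1;
	page->nslots++;
	return true;
}
```

In this model a page holding only distinct keys pays for one failed merge pass before the split, and a page that never fills up pays nothing.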
--
Peter Geoghegan
On Thu, Jul 4, 2019 at 05:06:09PM -0700, Peter Geoghegan wrote:
This result is very impressive. We'll need to revisit what the right
trade-off is for the compression scheme, which Heikki had some
thoughts on when we left off 3 years ago, but that should be a lot
easier now. I am very encouraged by the fact that this relatively
simple approach already works quite nicely. It's also great to see
that bulk insertions with lots of compression are very clearly faster
with this latest revision of your patch, unlike earlier versions from
2016 that made those cases slower (though I haven't tested indexes
that don't really use compression). I think that this is because you
now do the compression lazily, at the point where it looks like we may
need to split the page. Previous versions of the patch had to perform
compression eagerly, just like GIN, which is not really appropriate
for nbtree.
I am also encouraged and am happy we can finally move this duplicate
optimization forward.
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ As you are, so once was I. As I am, so you will be. +
+ Ancient Roman grave inscription +
On Thu, Jul 4, 2019 at 5:06 AM Anastasia Lubennikova
<a.lubennikova@postgrespro.ru> wrote:
The new version of the patch is attached.
This version is even simpler than the previous one,
thanks to the recent btree design changes and all the feedback I received.
I consider it ready for review and testing.
I took a closer look at this patch, and have some general thoughts on
its design, and specific feedback on the implementation.
Preserving the *logical contents* of B-Tree indexes that use
compression is very important -- that should not change in a way that
outside code can notice. The heap TID itself should count as logical
contents here, since we want to be able to implement retail index
tuple deletion in the future. Even without retail index tuple
deletion, amcheck's "rootdescend" verification assumes that it can
find one specific tuple (which could now just be one specific "logical
tuple") using specific key values from the heap, including the heap
tuple's heap TID. This requirement makes things a bit harder for your
patch, because you have to deal with one or two edge-cases that you
currently don't handle: insertion of new duplicates that fall inside
the min/max range of some existing posting list. That should be rare
enough in practice, so the performance penalty won't be too bad. This
probably means that code within _bt_findinsertloc() and/or
_bt_binsrch_insert() will need to think about a logical tuple as a
distinct thing from a physical tuple, though that won't be necessary
in most places.
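The edge case can be sketched in isolation. The following is an illustration under simplifying assumptions, not a proposal for the actual nbtree code: ToyTid is a hypothetical stand-in for ItemPointerData, and the helpers only answer two questions, namely whether a new heap TID falls strictly inside the min/max range of an existing posting list, and at which position the list would have to be split to keep the logical tuples in heap TID order.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a heap TID (block number, offset number) */
typedef struct ToyTid
{
	uint32_t	block;
	uint16_t	offset;
} ToyTid;

/* Three-way comparison: block number first, then offset number */
static int
toy_tid_cmp(ToyTid a, ToyTid b)
{
	if (a.block != b.block)
		return (a.block < b.block) ? -1 : 1;
	if (a.offset != b.offset)
		return (a.offset < b.offset) ? -1 : 1;
	return 0;
}

/* True if newtid sorts strictly between the posting list's min and max TID */
static bool
toy_tid_in_posting_range(const ToyTid *posting, int ntids, ToyTid newtid)
{
	assert(ntids > 0);
	return toy_tid_cmp(posting[0], newtid) < 0 &&
		toy_tid_cmp(newtid, posting[ntids - 1]) < 0;
}

/*
 * Binary search for the index of the first TID in the (sorted) posting
 * list that is greater than newtid.  TIDs before that index belong to
 * the left of the new logical tuple, the rest to its right.
 */
static int
toy_posting_split_point(const ToyTid *posting, int ntids, ToyTid newtid)
{
	int			low = 0;
	int			high = ntids;

	while (low < high)
	{
		int			mid = low + (high - low) / 2;

		if (toy_tid_cmp(posting[mid], newtid) <= 0)
			low = mid + 1;
		else
			high = mid;
	}
	return low;
}
```

When the range check is false the new tuple can simply go before or after the physical posting tuple, which is the common case; only a true result forces the posting list to be treated as several logical tuples.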
The need to "preserve the logical contents" also means that the patch
will need to recognize when indexes are not safe as a target for
compression/deduplication (maybe we should call this feature
deduplication, so it's clear how it differs from TOAST?). For
example, if we have a case-insensitive ICU collation, then it is not
okay to treat an opclass-equal pair of text strings that use the
collation as having the same value when considering merging the two
into one. You don't actually do that in the patch, but you also don't
try to deal with the fact that such a pair of strings are equal, and
so must have their final positions determined by the heap TID column
(deduplication within _bt_compress_one_page() must respect that).
Possibly equal-but-distinct values seem like a problem that's not
worth truly fixing, but it will be necessary to store metadata about
whether or not we're willing to do deduplication in the meta page,
based on operator class and collation details. That seems like a
restriction that we're just going to have to accept, though I'm not
too worried about exactly what that will look like right now. We can
work it out later.
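A small illustration of "equal but distinct" values follows. It is only a sketch: toy_ci_cmp() is a hypothetical ASCII-only stand-in for an opclass comparator under a case-insensitive collation (a real ICU collation is far more involved), and toy_safe_to_merge() shows one possible stricter criterion, bitwise equality, rather than anything the patch currently implements.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical stand-in for an opclass comparator under a
 * case-insensitive collation (ASCII only).
 */
static int
toy_ci_cmp(const char *a, const char *b)
{
	while (*a && *b)
	{
		int			ca = tolower((unsigned char) *a);
		int			cb = tolower((unsigned char) *b);

		if (ca != cb)
			return ca - cb;
		a++;
		b++;
	}
	/* at least one string ended; the shorter one sorts first */
	return (unsigned char) *a - (unsigned char) *b;
}

/*
 * Two keys could share one posting list without changing what an
 * index-only scan returns only if their stored bytes are identical.
 */
static bool
toy_safe_to_merge(const char *a, const char *b)
{
	size_t		len = strlen(a);

	return len == strlen(b) && memcmp(a, b, len) == 0;
}
```

Here "Foo" and "foo" compare as equal under the comparator but are not byte-identical, so merging them into one tuple would silently discard one of the two representations.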
I think that the need to be careful about the logical contents of
indexes already causes bugs, even with "safe for compression" indexes.
For example, I can sometimes see an assertion failure
within _bt_truncate(), at the point where we check if heap TID values
are safe:
/*
* Lehman and Yao require that the downlink to the right page, which is to
* be inserted into the parent page in the second phase of a page split be
* a strict lower bound on items on the right page, and a non-strict upper
* bound for items on the left page. Assert that heap TIDs follow these
* invariants, since a heap TID value is apparently needed as a
* tiebreaker.
*/
#ifndef DEBUG_NO_TRUNCATE
Assert(ItemPointerCompare(BTreeTupleGetMaxTID(lastleft),
BTreeTupleGetMinTID(firstright)) < 0);
...
This bug is not that easy to see, but it will happen with a big index,
even without updates or deletes. I think that this happens because
compression can allow the "logical tuples" to be in the wrong heap TID
order when there are multiple posting lists for the same value. As I
said, I think that it's necessary to see a posting list as being
comprised of multiple logical tuples in the context of inserting new
tuples, even when you're not performing compression or splitting the
page. I also see that amcheck's bt_index_parent_check() function
fails, though bt_index_check() does not fail when I don't use any of
its extra verification options. (You haven't updated amcheck, but I
don't think that you need to update it for these basic checks to
continue to work.)
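The ordering problem can be reproduced in a toy model. This sketch assumes every tuple on the page has the same key values, encodes heap TIDs as plain integers, and uses hypothetical names throughout (ToyPostingTuple, toy_tid_order_ok(), toy_page_tid_order_ok()). The first helper mirrors the shape of the assertion quoted above: the max TID of the left tuple must be less than the min TID of the right one.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical physical tuple: an ascending array of encoded heap TIDs */
typedef struct ToyPostingTuple
{
	const uint64_t *tids;
	int			ntids;
} ToyPostingTuple;

/* Same shape as the quoted Assert: max TID of left < min TID of right */
static bool
toy_tid_order_ok(const ToyPostingTuple *lastleft,
				 const ToyPostingTuple *firstright)
{
	assert(lastleft->ntids > 0 && firstright->ntids > 0);
	return lastleft->tids[lastleft->ntids - 1] < firstright->tids[0];
}

/*
 * Check every adjacent pair of equal-keyed tuples on a page.  If this
 * fails, the logical tuples are not in heap TID order, even though each
 * individual posting list is sorted.
 */
static bool
toy_page_tid_order_ok(const ToyPostingTuple *tuples, int ntuples)
{
	for (int i = 1; i < ntuples; i++)
	{
		if (!toy_tid_order_ok(&tuples[i - 1], &tuples[i]))
			return false;
	}
	return true;
}
```

Two posting lists {10, 20, 30} and {25, 60} for the same value are each sorted, yet together they violate the invariant, which is the kind of state the assertion failure points at.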
Other feedback on specific things:
* A good way to assess whether or not the "logical tuple versus
physical tuple" thing works is to make sure that amcheck's
"rootdescend" verification works with a variety of indexes. As I said,
it has the same requirements for nbtree as retail index tuple deletion
will.
* _bt_findinsertloc() should not call _bt_compress_one_page() for
!heapkeyspace (version 3) indexes -- the second call to
_bt_compress_one_page() should be removed.
* Why can't compression be used on system catalog indexes? I
understand that they are not a compelling case, but we tend to do
things the same way with catalog tables and indexes unless there is a
very good reason not to (e.g. HOT, suffix truncation). I see that the
tests fail when that restriction is removed, but I don't think that
that has anything to do with system catalogs. I think that that's due
to a bug somewhere else. Why have this restriction at all?
* It looks like we could be less conservative in nbtsplitloc.c to good
effect. We know for sure that a posting list will be truncated down to
one heap TID even in the worst case, so we can safely assume that the
new high key will be a lot smaller than the firstright tuple that it
is based on when it has a posting list. We only have to keep one TID.
This will allow us to leave more tuples on the left half of the page
in certain cases, further improving space utilization.
* Don't you need to update nbtdesc.c?
* Maybe we could do compression with unique indexes when inserting
values with NULLs? Note that we now treat an insertion of a tuple with
NULLs into a unique index as if it wasn't even a unique index -- see
the "checkingunique" optimization at the beginning of _bt_doinsert().
Having many NULL values in a unique index is probably fairly common.
* It looks like amcheck's heapallindexed verification needs to have
normalization added, to avoid false positives. This situation is
specifically anticipated by existing comments above
bt_normalize_tuple(). Again, being careful about "logical versus
physical tuple" seems necessary.
* Doesn't the nbtsearch.c/_bt_readpage() code that deals with
backwards scans need to return posting lists backwards, not forwards?
It seems like a good idea to try to "preserve the logical contents"
here too, just to be conservative.
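On the backwards scan question in the last item, a toy sketch may help. It assumes, as in _bt_readpage(), that a backward scan fills items[] from the end of the array while visiting page items in descending order, and then consumes the array from its last entry down to its first. TIDs are plain integers and toy_save_posting_backward() is a hypothetical helper, not a replacement for _bt_savePostingitem().

```c
#include <assert.h>

#define TOY_MAX_ITEMS 8			/* hypothetical size of the items[] array */

/*
 * Save one posting list during a backward scan.  The list is walked from
 * its last TID to its first while itemIndex decreases, so that the filled
 * part of items[] ends up in ascending order overall.  Returns the new
 * itemIndex (the index of the first filled entry).
 */
static int
toy_save_posting_backward(int *items, int itemIndex,
						  const int *posting, int nposting)
{
	for (int i = nposting - 1; i >= 0; i--)
	{
		itemIndex--;
		assert(itemIndex >= 0);
		items[itemIndex] = posting[i];
	}
	return itemIndex;
}
```

With posting lists {1, 2, 3} and {4, 5} visited in reverse page order, items[] ends up holding 1..5 in ascending order, and the scan, reading from the last entry downwards, returns 5, 4, 3, 2, 1. Walking each posting list forwards while decrementing itemIndex would instead return 4, 5, 1, 2, 3.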
Within nbtsort.c:
* Is the new code in _bt_buildadd() actually needed? If so, why?
* insert_itupprev_to_page_buildadd() is only called within nbtsort.c,
and so should be static. The name also seems very long.
* add_item_to_posting() is called within both nbtsort.c and
nbtinsert.c, and so should remain non-static, but have a less generic
(and shorter) name. (Use the usual _bt_* style instead.)
* Is nbtsort.c the right place for these functions, anyway? (Maybe,
but maybe not, IMV.)
I ran pgindent on the patch, and made some small manual whitespace
adjustments; the result is attached. There are no real changes, but some of
the formatting in the original version you posted was hard to read.
Please work off this for your next revision.
--
Peter Geoghegan
Attachments:
0001-btree_compression_pg12_v1.patch-with-pg_indent-run.patch (application/octet-stream)
From b66157e0ec6aedca19bb4d91a67bff275780c11b Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Thu, 4 Jul 2019 09:48:51 -0700
Subject: [PATCH 1/4] btree_compression_pg12_v1.patch with pg_indent run
---
src/backend/access/nbtree/nbtinsert.c | 252 ++++++++++++++++++++++++++
src/backend/access/nbtree/nbtpage.c | 54 ++++++
src/backend/access/nbtree/nbtree.c | 143 ++++++++++++---
src/backend/access/nbtree/nbtsearch.c | 78 +++++++-
src/backend/access/nbtree/nbtsort.c | 228 +++++++++++++++++++++--
src/backend/access/nbtree/nbtutils.c | 119 +++++++++++-
src/backend/access/nbtree/nbtxlog.c | 35 +++-
src/include/access/itup.h | 5 +
src/include/access/nbtree.h | 197 ++++++++++++++++++--
src/include/access/nbtxlog.h | 13 +-
10 files changed, 1046 insertions(+), 78 deletions(-)
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index 602f8849d4..600dafe73a 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -20,6 +20,7 @@
#include "access/tableam.h"
#include "access/transam.h"
#include "access/xloginsert.h"
+#include "catalog/catalog.h"
#include "miscadmin.h"
#include "storage/lmgr.h"
#include "storage/predicate.h"
@@ -56,6 +57,8 @@ static void _bt_insert_parent(Relation rel, Buffer buf, Buffer rbuf,
static bool _bt_pgaddtup(Page page, Size itemsize, IndexTuple itup,
OffsetNumber itup_off);
static void _bt_vacuum_one_page(Relation rel, Buffer buffer, Relation heapRel);
+static bool insert_itupprev_to_page(Page page, BTCompressState *compressState);
+static void _bt_compress_one_page(Relation rel, Buffer buffer, Relation heapRel);
/*
* _bt_doinsert() -- Handle insertion of a single index tuple in the tree.
@@ -759,6 +762,12 @@ _bt_findinsertloc(Relation rel,
_bt_vacuum_one_page(rel, insertstate->buf, heapRel);
insertstate->bounds_valid = false;
}
+
+ /*
+ * If the target page is full, try to compress the page
+ */
+ if (PageGetFreeSpace(page) < insertstate->itemsz)
+ _bt_compress_one_page(rel, insertstate->buf, heapRel);
}
else
{
@@ -805,6 +814,11 @@ _bt_findinsertloc(Relation rel,
break; /* OK, now we have enough space */
}
+ /*
+ * Before considering moving right, try to compress the page
+ */
+ _bt_compress_one_page(rel, insertstate->buf, heapRel);
+
/*
* Nope, so check conditions (b) and (c) enumerated above
*
@@ -2286,3 +2300,241 @@ _bt_vacuum_one_page(Relation rel, Buffer buffer, Relation heapRel)
* the page.
*/
}
+
+/*
+ * Add new item (compressed or not) to the page, while compressing it.
+ * If insertion failed, return false.
+ * Caller should consider this as compression failure and
+ * leave page uncompressed.
+ */
+static bool
+insert_itupprev_to_page(Page page, BTCompressState *compressState)
+{
+ IndexTuple to_insert;
+ OffsetNumber offnum = PageGetMaxOffsetNumber(page);
+
+ if (compressState->ntuples == 0)
+ to_insert = compressState->itupprev;
+ else
+ {
+ IndexTuple postingtuple;
+
+ /* form a tuple with a posting list */
+ postingtuple = BTreeFormPostingTuple(compressState->itupprev,
+ compressState->ipd,
+ compressState->ntuples);
+ to_insert = postingtuple;
+ pfree(compressState->ipd);
+ }
+
+ /* Add the new item into the page */
+ offnum = OffsetNumberNext(offnum);
+
+ elog(DEBUG4, "insert_itupprev_to_page. compressState->ntuples %d IndexTupleSize %zu free %zu",
+ compressState->ntuples, IndexTupleSize(to_insert), PageGetFreeSpace(page));
+
+ if (PageAddItem(page, (Item) to_insert, IndexTupleSize(to_insert),
+ offnum, false, false) == InvalidOffsetNumber)
+ {
+ elog(DEBUG4, "insert_itupprev_to_page. failed");
+
+ /*
+ * This may happen if the tuple is bigger than the free space. Fall
+ * back to the uncompressed page case.
+ */
+ if (compressState->ntuples > 0)
+ pfree(to_insert);
+ return false;
+ }
+
+ if (compressState->ntuples > 0)
+ pfree(to_insert);
+ compressState->ntuples = 0;
+ return true;
+}
+
+/*
+ * Before splitting the page, try to compress items to free some space.
+ * If compression didn't succeed, buffer will contain old state of the page.
+ * This function should be called after lp_dead items
+ * were removed by _bt_vacuum_one_page().
+ */
+static void
+_bt_compress_one_page(Relation rel, Buffer buffer, Relation heapRel)
+{
+ OffsetNumber offnum,
+ minoff,
+ maxoff;
+ Page page = BufferGetPage(buffer);
+ Page newpage;
+ BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+ bool use_compression = false;
+ BTCompressState *compressState = NULL;
+ int n_posting_on_page = 0;
+ int natts = IndexRelationGetNumberOfAttributes(rel);
+
+ /*
+ * Don't use compression for indexes with INCLUDEd columns, system indexes
+ * and unique indexes.
+ */
+ use_compression = ((IndexRelationGetNumberOfKeyAttributes(rel) ==
+ IndexRelationGetNumberOfAttributes(rel))
+ && (!IsSystemRelation(rel))
+ && (!rel->rd_index->indisunique));
+ if (!use_compression)
+ return;
+
+ /* init compress state needed to build posting tuples */
+ compressState = (BTCompressState *) palloc0(sizeof(BTCompressState));
+ compressState->ipd = NULL;
+ compressState->ntuples = 0;
+ compressState->itupprev = NULL;
+ compressState->maxitemsize = BTMaxItemSize(page);
+ compressState->maxpostingsize = 0;
+
+ /*
+ * Scan over all items to see which ones can be compressed
+ */
+ minoff = P_FIRSTDATAKEY(opaque);
+ maxoff = PageGetMaxOffsetNumber(page);
+
+ /*
+ * Heuristic to avoid trying to compress a page that already contains
+ * mostly compressed items
+ */
+ for (offnum = minoff;
+ offnum <= maxoff;
+ offnum = OffsetNumberNext(offnum))
+ {
+ ItemId itemid = PageGetItemId(page, offnum);
+ IndexTuple item = (IndexTuple) PageGetItem(page, itemid);
+
+ if (BTreeTupleIsPosting(item))
+ n_posting_on_page++;
+ }
+
+ /*
+ * If we have fewer than 10 uncompressed items on the full page, it
+ * probably isn't worth compressing them.
+ */
+ if (maxoff - n_posting_on_page < 10)
+ return;
+
+ newpage = PageGetTempPageCopySpecial(page);
+ elog(DEBUG4, "_bt_compress_one_page rel: %s,blkno: %u",
+ RelationGetRelationName(rel), BufferGetBlockNumber(buffer));
+
+ /* Copy High Key if any */
+ if (!P_RIGHTMOST(opaque))
+ {
+ ItemId itemid = PageGetItemId(page, P_HIKEY);
+ Size itemsz = ItemIdGetLength(itemid);
+ IndexTuple item = (IndexTuple) PageGetItem(page, itemid);
+
+ if (PageAddItem(newpage, (Item) item, itemsz, P_HIKEY,
+ false, false) == InvalidOffsetNumber)
+ {
+ /*
+ * Should never happen. Anyway, fallback gently to scenario of
+ * incompressible page and just return from function.
+ */
+ elog(DEBUG4, "_bt_compress_one_page. failed to insert highkey to newpage");
+ return;
+ }
+ }
+
+ /*
+ * Iterate over tuples on the page, try to compress them into posting
+ * lists and insert into new page.
+ */
+ for (offnum = minoff;
+ offnum <= maxoff;
+ offnum = OffsetNumberNext(offnum))
+ {
+ ItemId itemId = PageGetItemId(page, offnum);
+ IndexTuple itup = (IndexTuple) PageGetItem(page, itemId);
+
+ /*
+ * We do not expect to meet any DEAD items, since this function is
+ * called right after _bt_vacuum_one_page(). If for some reason we
+ * found dead item, don't compress it, to allow upcoming microvacuum
+ * or vacuum clean it up.
+ */
+ if (ItemIdIsDead(itemId))
+ continue;
+
+ if (compressState->itupprev != NULL)
+ {
+ int n_equal_atts =
+ _bt_keep_natts_fast(rel, compressState->itupprev, itup);
+ int itup_ntuples = BTreeTupleIsPosting(itup) ?
+ BTreeTupleGetNPosting(itup) : 1;
+
+ if (n_equal_atts > natts)
+ {
+ /*
+ * When tuples are equal, create or update posting.
+ *
+ * If posting is too big, insert it on page and continue.
+ */
+ if (compressState->maxitemsize >
+ MAXALIGN(((IndexTupleSize(compressState->itupprev)
+ + (compressState->ntuples + itup_ntuples + 1) * sizeof(ItemPointerData)))))
+ {
+ add_item_to_posting(compressState, itup);
+ }
+ else if (!insert_itupprev_to_page(newpage, compressState))
+ {
+ elog(DEBUG4, "_bt_compress_one_page. failed to insert posting");
+ return;
+ }
+ }
+ else
+ {
+ /*
+ * Tuples are not equal. Insert itupprev into index. Save
+ * current tuple for the next iteration.
+ */
+ if (!insert_itupprev_to_page(newpage, compressState))
+ {
+ elog(DEBUG4, "_bt_compress_one_page. failed to insert posting");
+ return;
+ }
+ }
+ }
+
+ /*
+ * Copy the tuple into temp variable itupprev to compare it with the
+ * following tuple and maybe unite them into a posting tuple
+ */
+ if (compressState->itupprev)
+ pfree(compressState->itupprev);
+ compressState->itupprev = CopyIndexTuple(itup);
+
+ Assert(IndexTupleSize(compressState->itupprev) <= compressState->maxitemsize);
+ }
+
+ /* Handle the last item. */
+ if (!insert_itupprev_to_page(newpage, compressState))
+ {
+ elog(DEBUG4, "_bt_compress_one_page. failed to insert posting for last item");
+ return;
+ }
+
+ START_CRIT_SECTION();
+ PageRestoreTempPage(newpage, page);
+ MarkBufferDirty(buffer);
+
+ /* Log full page write */
+ if (RelationNeedsWAL(rel))
+ {
+ XLogRecPtr recptr;
+
+ recptr = log_newpage_buffer(buffer, true);
+ PageSetLSN(page, recptr);
+ }
+ END_CRIT_SECTION();
+
+ elog(DEBUG4, "_bt_compress_one_page. success");
+ return;
+}
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 50455db9af..dff506d595 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -1022,14 +1022,53 @@ _bt_page_recyclable(Page page)
void
_bt_delitems_vacuum(Relation rel, Buffer buf,
OffsetNumber *itemnos, int nitems,
+ OffsetNumber *remainingoffset,
+ IndexTuple *remaining, int nremaining,
BlockNumber lastBlockVacuumed)
{
Page page = BufferGetPage(buf);
BTPageOpaque opaque;
+ int i;
+ Size itemsz;
+ Size remaining_sz = 0;
+ char *remaining_buf = NULL;
+
+ /* XLOG stuff, buffer for remainings */
+ if (nremaining && RelationNeedsWAL(rel))
+ {
+ Size offset = 0;
+
+ for (i = 0; i < nremaining; i++)
+ remaining_sz += MAXALIGN(IndexTupleSize(remaining[i]));
+
+ remaining_buf = palloc0(remaining_sz);
+ for (i = 0; i < nremaining; i++)
+ {
+ itemsz = IndexTupleSize(remaining[i]);
+ memcpy(remaining_buf + offset, (char *) remaining[i], itemsz);
+ offset += MAXALIGN(itemsz);
+ }
+ Assert(offset == remaining_sz);
+ }
/* No ereport(ERROR) until changes are logged */
START_CRIT_SECTION();
+ /* Handle posting tuples here */
+ for (i = 0; i < nremaining; i++)
+ {
+ /* At first, delete the old tuple. */
+ PageIndexTupleDelete(page, remainingoffset[i]);
+
+ itemsz = IndexTupleSize(remaining[i]);
+ itemsz = MAXALIGN(itemsz);
+
+ /* Add tuple with remaining ItemPointers to the page. */
+ if (PageAddItem(page, (Item) remaining[i], itemsz, remainingoffset[i],
+ false, false) == InvalidOffsetNumber)
+ elog(ERROR, "failed to rewrite compressed item in index while doing vacuum");
+ }
+
/* Fix the page */
if (nitems > 0)
PageIndexMultiDelete(page, itemnos, nitems);
@@ -1059,6 +1098,8 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
xl_btree_vacuum xlrec_vacuum;
xlrec_vacuum.lastBlockVacuumed = lastBlockVacuumed;
+ xlrec_vacuum.nremaining = nremaining;
+ xlrec_vacuum.ndeleted = nitems;
XLogBeginInsert();
XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
@@ -1072,6 +1113,19 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
if (nitems > 0)
XLogRegisterBufData(0, (char *) itemnos, nitems * sizeof(OffsetNumber));
+ /*
+ * Here we should save offnums and remaining tuples themselves. It's
+ * important to restore them in correct order. At first, we must
+ * handle remaining tuples and only after that other deleted items.
+ */
+ if (nremaining > 0)
+ {
+ Assert(remaining_buf != NULL);
+ XLogRegisterBufData(0, (char *) remainingoffset,
+ nremaining * sizeof(OffsetNumber));
+ XLogRegisterBufData(0, remaining_buf, remaining_sz);
+ }
+
recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_VACUUM);
PageSetLSN(page, recptr);
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 4cfd5289ad..11e45c891d 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -97,6 +97,8 @@ static void btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
BTCycleId cycleid, TransactionId *oldestBtpoXact);
static void btvacuumpage(BTVacState *vstate, BlockNumber blkno,
BlockNumber orig_blkno);
+static ItemPointer btreevacuumPosting(BTVacState *vstate, IndexTuple itup,
+ int *nremaining);
/*
@@ -1069,7 +1071,8 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
RBM_NORMAL, info->strategy);
LockBufferForCleanup(buf);
_bt_checkpage(rel, buf);
- _bt_delitems_vacuum(rel, buf, NULL, 0, vstate.lastBlockVacuumed);
+ _bt_delitems_vacuum(rel, buf, NULL, 0, NULL, NULL, 0,
+ vstate.lastBlockVacuumed);
_bt_relbuf(rel, buf);
}
@@ -1193,6 +1196,9 @@ restart:
OffsetNumber offnum,
minoff,
maxoff;
+ IndexTuple remaining[MaxOffsetNumber];
+ OffsetNumber remainingoffset[MaxOffsetNumber];
+ int nremaining;
/*
* Trade in the initial read lock for a super-exclusive write lock on
@@ -1229,6 +1235,7 @@ restart:
* callback function.
*/
ndeletable = 0;
+ nremaining = 0;
minoff = P_FIRSTDATAKEY(opaque);
maxoff = PageGetMaxOffsetNumber(page);
if (callback)
@@ -1242,31 +1249,78 @@ restart:
itup = (IndexTuple) PageGetItem(page,
PageGetItemId(page, offnum));
- htup = &(itup->t_tid);
- /*
- * During Hot Standby we currently assume that
- * XLOG_BTREE_VACUUM records do not produce conflicts. That is
- * only true as long as the callback function depends only
- * upon whether the index tuple refers to heap tuples removed
- * in the initial heap scan. When vacuum starts it derives a
- * value of OldestXmin. Backends taking later snapshots could
- * have a RecentGlobalXmin with a later xid than the vacuum's
- * OldestXmin, so it is possible that row versions deleted
- * after OldestXmin could be marked as killed by other
- * backends. The callback function *could* look at the index
- * tuple state in isolation and decide to delete the index
- * tuple, though currently it does not. If it ever did, we
- * would need to reconsider whether XLOG_BTREE_VACUUM records
- * should cause conflicts. If they did cause conflicts they
- * would be fairly harsh conflicts, since we haven't yet
- * worked out a way to pass a useful value for
- * latestRemovedXid on the XLOG_BTREE_VACUUM records. This
- * applies to *any* type of index that marks index tuples as
- * killed.
- */
- if (callback(htup, callback_state))
- deletable[ndeletable++] = offnum;
+ if (BTreeTupleIsPosting(itup))
+ {
+ int nnewipd = 0;
+ ItemPointer newipd = NULL;
+
+ newipd = btreevacuumPosting(vstate, itup, &nnewipd);
+
+ if (nnewipd == 0)
+ {
+ /*
+ * All TIDs from posting list must be deleted, we can
+ * delete whole tuple in a regular way.
+ */
+ deletable[ndeletable++] = offnum;
+ }
+ else if (nnewipd == BTreeTupleGetNPosting(itup))
+ {
+ /*
+ * All TIDs from posting tuple must remain. Do
+ * nothing, just cleanup.
+ */
+ pfree(newipd);
+ }
+ else if (nnewipd < BTreeTupleGetNPosting(itup))
+ {
+ /* Some TIDs from posting tuple must remain. */
+ Assert(nnewipd > 0);
+ Assert(newipd != NULL);
+
+ /*
+ * Form new tuple that contains only remaining TIDs.
+ * Remember this tuple and the offset of the old tuple
+ * to update it in place.
+ */
+ remainingoffset[nremaining] = offnum;
+ remaining[nremaining] = BTreeFormPostingTuple(itup, newipd, nnewipd);
+ nremaining++;
+ pfree(newipd);
+
+ Assert(IndexTupleSize(itup) <= BTMaxItemSize(page));
+ }
+ }
+ else
+ {
+ htup = &(itup->t_tid);
+
+ /*
+ * During Hot Standby we currently assume that
+ * XLOG_BTREE_VACUUM records do not produce conflicts.
+ * That is only true as long as the callback function
+ * depends only upon whether the index tuple refers to
+ * heap tuples removed in the initial heap scan. When
+ * vacuum starts it derives a value of OldestXmin.
+ * Backends taking later snapshots could have a
+ * RecentGlobalXmin with a later xid than the vacuum's
+ * OldestXmin, so it is possible that row versions deleted
+ * after OldestXmin could be marked as killed by other
+ * backends. The callback function *could* look at the
+ * index tuple state in isolation and decide to delete the
+ * index tuple, though currently it does not. If it ever
+ * did, we would need to reconsider whether
+ * XLOG_BTREE_VACUUM records should cause conflicts. If
+ * they did cause conflicts they would be fairly harsh
+ * conflicts, since we haven't yet worked out a way to
+ * pass a useful value for latestRemovedXid on the
+ * XLOG_BTREE_VACUUM records. This applies to *any* type
+ * of index that marks index tuples as killed.
+ */
+ if (callback(htup, callback_state))
+ deletable[ndeletable++] = offnum;
+ }
}
}
@@ -1274,7 +1328,7 @@ restart:
* Apply any needed deletes. We issue just one _bt_delitems_vacuum()
* call per page, so as to minimize WAL traffic.
*/
- if (ndeletable > 0)
+ if (ndeletable > 0 || nremaining > 0)
{
/*
* Notice that the issued XLOG_BTREE_VACUUM WAL record includes
@@ -1291,6 +1345,7 @@ restart:
* that.
*/
_bt_delitems_vacuum(rel, buf, deletable, ndeletable,
+ remainingoffset, remaining, nremaining,
vstate->lastBlockVacuumed);
/*
@@ -1375,6 +1430,42 @@ restart:
}
}
+/*
+ * btreevacuumPosting() -- vacuums a posting tuple.
+ *
+ * Returns new palloc'd posting list with remaining items.
+ * Posting list size is returned via nremaining.
+ *
+ * If all items are dead,
+ * nremaining is 0 and resulting posting list is NULL.
+ */
+static ItemPointer
+btreevacuumPosting(BTVacState *vstate, IndexTuple itup, int *nremaining)
+{
+ int i,
+ remaining = 0;
+ int nitem = BTreeTupleGetNPosting(itup);
+ ItemPointer tmpitems = NULL,
+ items = BTreeTupleGetPosting(itup);
+
+ /*
+ * Check each tuple in the posting list, save alive tuples into tmpitems
+ */
+ for (i = 0; i < nitem; i++)
+ {
+ if (vstate->callback(items + i, vstate->callback_state))
+ continue;
+
+ if (tmpitems == NULL)
+ tmpitems = palloc(sizeof(ItemPointerData) * nitem);
+
+ tmpitems[remaining++] = items[i];
+ }
+
+ *nremaining = remaining;
+ return tmpitems;
+}
+
/*
* btcanreturn() -- Check whether btree indexes support index-only scans.
*
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index c655dadb96..1d36035253 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -30,6 +30,9 @@ static bool _bt_readpage(IndexScanDesc scan, ScanDirection dir,
OffsetNumber offnum);
static void _bt_saveitem(BTScanOpaque so, int itemIndex,
OffsetNumber offnum, IndexTuple itup);
+static void _bt_savePostingitem(BTScanOpaque so, int itemIndex,
+ OffsetNumber offnum, ItemPointer iptr,
+ IndexTuple itup, int i);
static bool _bt_steppage(IndexScanDesc scan, ScanDirection dir);
static bool _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir);
static bool _bt_parallel_readpage(IndexScanDesc scan, BlockNumber blkno,
@@ -1410,6 +1413,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
int itemIndex;
bool continuescan;
int indnatts;
+ int i;
/*
* We must have the buffer pinned and locked, but the usual macro can't be
@@ -1456,6 +1460,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
/* initialize tuple workspace to empty */
so->currPos.nextTupleOffset = 0;
+ so->currPos.prevTupleOffset = 0;
/*
* Now that the current page has been made consistent, the macro should be
@@ -1490,8 +1495,22 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
if (_bt_checkkeys(scan, itup, indnatts, dir, &continuescan))
{
/* tuple passes all scan key conditions, so remember it */
- _bt_saveitem(so, itemIndex, offnum, itup);
- itemIndex++;
+ if (BTreeTupleIsPosting(itup))
+ {
+ for (i = 0; i < BTreeTupleGetNPosting(itup); i++)
+ {
+ _bt_savePostingitem(so, itemIndex, offnum,
+ BTreeTupleGetPostingN(itup, i),
+ itup, i);
+ itemIndex++;
+ }
+ }
+ else
+ {
+ _bt_saveitem(so, itemIndex, offnum, itup);
+ itemIndex++;
+ }
+
}
/* When !continuescan, there can't be any more matches, so stop */
if (!continuescan)
@@ -1524,7 +1543,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
if (!continuescan)
so->currPos.moreRight = false;
- Assert(itemIndex <= MaxIndexTuplesPerPage);
+ Assert(itemIndex <= MaxPostingIndexTuplesPerPage);
so->currPos.firstItem = 0;
so->currPos.lastItem = itemIndex - 1;
so->currPos.itemIndex = 0;
@@ -1532,7 +1551,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
else
{
/* load items[] in descending order */
- itemIndex = MaxIndexTuplesPerPage;
+ itemIndex = MaxPostingIndexTuplesPerPage;
offnum = Min(offnum, maxoff);
@@ -1574,8 +1593,22 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
if (passes_quals && tuple_alive)
{
/* tuple passes all scan key conditions, so remember it */
- itemIndex--;
- _bt_saveitem(so, itemIndex, offnum, itup);
+ if (BTreeTupleIsPosting(itup))
+ {
+ for (i = 0; i < BTreeTupleGetNPosting(itup); i++)
+ {
+ itemIndex--;
+ _bt_savePostingitem(so, itemIndex, offnum,
+ BTreeTupleGetPostingN(itup, i),
+ itup, i);
+ }
+ }
+ else
+ {
+ itemIndex--;
+ _bt_saveitem(so, itemIndex, offnum, itup);
+ }
+
}
if (!continuescan)
{
@@ -1589,8 +1622,8 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
Assert(itemIndex >= 0);
so->currPos.firstItem = itemIndex;
- so->currPos.lastItem = MaxIndexTuplesPerPage - 1;
- so->currPos.itemIndex = MaxIndexTuplesPerPage - 1;
+ so->currPos.lastItem = MaxPostingIndexTuplesPerPage - 1;
+ so->currPos.itemIndex = MaxPostingIndexTuplesPerPage - 1;
}
return (so->currPos.firstItem <= so->currPos.lastItem);
@@ -1603,6 +1636,8 @@ _bt_saveitem(BTScanOpaque so, int itemIndex,
{
BTScanPosItem *currItem = &so->currPos.items[itemIndex];
+ Assert(!BTreeTupleIsPosting(itup));
+
currItem->heapTid = itup->t_tid;
currItem->indexOffset = offnum;
if (so->currTuples)
@@ -1615,6 +1650,33 @@ _bt_saveitem(BTScanOpaque so, int itemIndex,
}
}
+/* Save an index item into so->currPos.items[itemIndex] for posting tuples. */
+static void
+_bt_savePostingitem(BTScanOpaque so, int itemIndex, OffsetNumber offnum,
+ ItemPointer iptr, IndexTuple itup, int i)
+{
+ BTScanPosItem *currItem = &so->currPos.items[itemIndex];
+
+ currItem->heapTid = *iptr;
+ currItem->indexOffset = offnum;
+
+ if (so->currTuples)
+ {
+ if (i == 0)
+ {
+ /* save the key once; it is the same for all TIDs in the posting list */
+ Size itupsz = BTreeTupleGetPostingOffset(itup);
+
+ currItem->tupleOffset = so->currPos.nextTupleOffset;
+ memcpy(so->currTuples + so->currPos.nextTupleOffset, itup, itupsz);
+ so->currPos.nextTupleOffset += MAXALIGN(itupsz);
+ so->currPos.prevTupleOffset = currItem->tupleOffset;
+ }
+ else
+ currItem->tupleOffset = so->currPos.prevTupleOffset;
+ }
+}
+
/*
* _bt_steppage() -- Step to next page containing valid data for scan
*
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index d0b9013caf..955a6285ef 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -65,6 +65,7 @@
#include "access/xact.h"
#include "access/xlog.h"
#include "access/xloginsert.h"
+#include "catalog/catalog.h"
#include "catalog/index.h"
#include "commands/progress.h"
#include "miscadmin.h"
@@ -288,6 +289,9 @@ static void _bt_sortaddtup(Page page, Size itemsize,
static void _bt_buildadd(BTWriteState *wstate, BTPageState *state,
IndexTuple itup);
static void _bt_uppershutdown(BTWriteState *wstate, BTPageState *state);
+static void insert_itupprev_to_page_buildadd(BTWriteState *wstate,
+ BTPageState *state,
+ BTCompressState *compressState);
static void _bt_load(BTWriteState *wstate,
BTSpool *btspool, BTSpool *btspool2);
static void _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent,
@@ -972,6 +976,11 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
* only shift the line pointer array back and forth, and overwrite
* the tuple space previously occupied by oitup. This is fairly
* cheap.
+ *
+ * If lastleft tuple was a posting tuple, we'll truncate its
+ * posting list in _bt_truncate as well. Note that it is also
+ * applicable only to leaf pages, since internal pages never
+ * contain posting tuples.
*/
ii = PageGetItemId(opage, OffsetNumberPrev(last_off));
lastleft = (IndexTuple) PageGetItem(opage, ii);
@@ -1011,6 +1020,7 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
* the minimum key for the new page.
*/
state->btps_minkey = CopyIndexTuple(oitup);
+ Assert(!BTreeTupleIsPosting(state->btps_minkey));
/*
* Set the sibling links for both pages.
@@ -1050,8 +1060,36 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
if (last_off == P_HIKEY)
{
Assert(state->btps_minkey == NULL);
- state->btps_minkey = CopyIndexTuple(itup);
- /* _bt_sortaddtup() will perform full truncation later */
+
+ /*
+ * Stashed copy must be a non-posting tuple, with truncated posting
+ * list and correct t_tid since we're going to use it to build
+ * downlink.
+ */
+ if (BTreeTupleIsPosting(itup))
+ {
+ Size keytupsz;
+ IndexTuple keytup;
+
+ /*
+ * Form a key tuple that doesn't contain any posting list. NOTE: since
+ * we'll need the TID later, set t_tid to the first TID from the posting list.
+ */
+ keytupsz = BTreeTupleGetPostingOffset(itup);
+ keytup = palloc0(keytupsz);
+ memcpy(keytup, itup, keytupsz);
+
+ keytup->t_info &= ~INDEX_SIZE_MASK;
+ keytup->t_info |= keytupsz;
+ ItemPointerCopy(BTreeTupleGetPosting(itup), &keytup->t_tid);
+ state->btps_minkey = CopyIndexTuple(keytup);
+ pfree(keytup);
+ }
+ else
+ {
+ /* _bt_sortaddtup() will perform full truncation later */
+ state->btps_minkey = CopyIndexTuple(itup);
+ }
+
BTreeTupleSetNAtts(state->btps_minkey, 0);
}
@@ -1136,6 +1174,89 @@ _bt_uppershutdown(BTWriteState *wstate, BTPageState *state)
_bt_blwritepage(wstate, metapage, BTREE_METAPAGE);
}
+/*
+ * Add new tuple (posting or non-posting) to the page, while building index.
+ */
+static void
+insert_itupprev_to_page_buildadd(BTWriteState *wstate, BTPageState *state,
+ BTCompressState *compressState)
+{
+ IndexTuple to_insert;
+
+ /* Return, if there is no tuple to insert */
+ if (state == NULL)
+ return;
+
+ if (compressState->ntuples == 0)
+ to_insert = compressState->itupprev;
+ else
+ {
+ IndexTuple postingtuple;
+
+ /* form a tuple with a posting list */
+ postingtuple = BTreeFormPostingTuple(compressState->itupprev,
+ compressState->ipd,
+ compressState->ntuples);
+ to_insert = postingtuple;
+ pfree(compressState->ipd);
+ }
+
+ _bt_buildadd(wstate, state, to_insert);
+
+ if (compressState->ntuples > 0)
+ pfree(to_insert);
+ compressState->ntuples = 0;
+}
+
+/*
+ * Save item pointer(s) of itup to the posting list in compressState.
+ * Helper function for _bt_load() and _bt_compress_one_page().
+ *
+ * Note: caller is responsible for size check to ensure that
+ * resulting tuple won't exceed BTMaxItemSize.
+ */
+void
+add_item_to_posting(BTCompressState *compressState, IndexTuple itup)
+{
+ int nposting = 0;
+
+ if (compressState->ntuples == 0)
+ {
+ compressState->ipd = palloc0(compressState->maxitemsize);
+
+ if (BTreeTupleIsPosting(compressState->itupprev))
+ {
+ /* if itupprev is posting, add all its TIDs to the posting list */
+ nposting = BTreeTupleGetNPosting(compressState->itupprev);
+ memcpy(compressState->ipd, BTreeTupleGetPosting(compressState->itupprev),
+ sizeof(ItemPointerData) * nposting);
+ compressState->ntuples += nposting;
+ }
+ else
+ {
+ memcpy(compressState->ipd, compressState->itupprev,
+ sizeof(ItemPointerData));
+ compressState->ntuples++;
+ }
+ }
+
+ if (BTreeTupleIsPosting(itup))
+ {
+ /* if tuple is posting, add all its TIDs to the posting list */
+ nposting = BTreeTupleGetNPosting(itup);
+ memcpy(compressState->ipd + compressState->ntuples,
+ BTreeTupleGetPosting(itup),
+ sizeof(ItemPointerData) * nposting);
+ compressState->ntuples += nposting;
+ }
+ else
+ {
+ memcpy(compressState->ipd + compressState->ntuples, itup,
+ sizeof(ItemPointerData));
+ compressState->ntuples++;
+ }
+}
+
/*
* Read tuples in correct sort order from tuplesort, and load them into
* btree leaves.
@@ -1150,9 +1271,21 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
bool load1;
TupleDesc tupdes = RelationGetDescr(wstate->index);
int i,
- keysz = IndexRelationGetNumberOfKeyAttributes(wstate->index);
+ keysz = IndexRelationGetNumberOfKeyAttributes(wstate->index),
+ natts = IndexRelationGetNumberOfAttributes(wstate->index);
SortSupport sortKeys;
int64 tuples_done = 0;
+ bool use_compression = false;
+ BTCompressState *compressState = NULL;
+
+ /*
+ * Don't use compression for indexes with INCLUDEd columns, system indexes
+ * and unique indexes.
+ */
+ use_compression = ((IndexRelationGetNumberOfKeyAttributes(wstate->index) ==
+ IndexRelationGetNumberOfAttributes(wstate->index))
+ && (!IsSystemRelation(wstate->index))
+ && (!wstate->index->rd_index->indisunique));
if (merge)
{
@@ -1266,19 +1399,88 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
}
else
{
- /* merge is unnecessary */
- while ((itup = tuplesort_getindextuple(btspool->sortstate,
- true)) != NULL)
+ if (!use_compression)
{
- /* When we see first tuple, create first index page */
- if (state == NULL)
- state = _bt_pagestate(wstate, 0);
+ /* merge is unnecessary */
+ while ((itup = tuplesort_getindextuple(btspool->sortstate,
+ true)) != NULL)
+ {
+ /* When we see first tuple, create first index page */
+ if (state == NULL)
+ state = _bt_pagestate(wstate, 0);
- _bt_buildadd(wstate, state, itup);
+ _bt_buildadd(wstate, state, itup);
- /* Report progress */
- pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
- ++tuples_done);
+ /* Report progress */
+ pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
+ ++tuples_done);
+ }
+ }
+ else
+ {
+ /* init compress state needed to build posting tuples */
+ compressState = (BTCompressState *) palloc0(sizeof(BTCompressState));
+ compressState->ipd = NULL;
+ compressState->ntuples = 0;
+ compressState->itupprev = NULL;
+ compressState->maxitemsize = 0;
+ compressState->maxpostingsize = 0;
+
+ while ((itup = tuplesort_getindextuple(btspool->sortstate,
+ true)) != NULL)
+ {
+ /* When we see first tuple, create first index page */
+ if (state == NULL)
+ {
+ state = _bt_pagestate(wstate, 0);
+ compressState->maxitemsize = BTMaxItemSize(state->btps_page);
+ }
+
+ if (compressState->itupprev != NULL)
+ {
+ int n_equal_atts = _bt_keep_natts_fast(wstate->index,
+ compressState->itupprev, itup);
+
+ if (n_equal_atts > natts)
+ {
+ /*
+ * Tuples are equal. Add the TID to the pending posting
+ * list if it still fits; otherwise insert the pending
+ * tuple on the page and continue.
+ */
+ if ((compressState->ntuples + 1) * sizeof(ItemPointerData) <
+ compressState->maxpostingsize)
+ add_item_to_posting(compressState, itup);
+ else
+ insert_itupprev_to_page_buildadd(wstate, state, compressState);
+ }
+ else
+ {
+ /*
+ * Tuples are not equal. Insert itupprev into index.
+ * Save current tuple for the next iteration.
+ */
+ insert_itupprev_to_page_buildadd(wstate, state, compressState);
+ }
+ }
+
+ /*
+ * Save the tuple to compare it with the next one and maybe
+ * unite them into a posting tuple.
+ */
+ if (compressState->itupprev)
+ pfree(compressState->itupprev);
+ compressState->itupprev = CopyIndexTuple(itup);
+
+ /* compute max size of posting list */
+ compressState->maxpostingsize = compressState->maxitemsize -
+ IndexInfoFindDataOffset(compressState->itupprev->t_info) -
+ MAXALIGN(IndexTupleSize(compressState->itupprev));
+ }
+
+ /* Handle the last item */
+ insert_itupprev_to_page_buildadd(wstate, state, compressState);
}
}
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 93fab264ae..22ffcbc8be 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -1787,7 +1787,9 @@ _bt_killitems(IndexScanDesc scan)
ItemId iid = PageGetItemId(page, offnum);
IndexTuple ituple = (IndexTuple) PageGetItem(page, iid);
- if (ItemPointerEquals(&ituple->t_tid, &kitem->heapTid))
+ /* No microvacuum for posting tuples */
+ if (!BTreeTupleIsPosting(ituple) &&
+ (ItemPointerEquals(&ituple->t_tid, &kitem->heapTid)))
{
/* found the item */
ItemIdMarkDead(iid);
@@ -2145,6 +2147,16 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
pivot = index_truncate_tuple(itupdesc, firstright, keepnatts);
+ if (BTreeTupleIsPosting(firstright))
+ {
+ BTreeTupleClearBtIsPosting(pivot);
+ BTreeTupleSetNAtts(pivot, keepnatts);
+ pivot->t_info &= ~INDEX_SIZE_MASK;
+ pivot->t_info |= BTreeTupleGetPostingOffset(firstright);
+ }
+
+ Assert(!BTreeTupleIsPosting(pivot));
+
/*
* If there is a distinguishing key attribute within new pivot tuple,
* there is no need to add an explicit heap TID attribute
@@ -2168,6 +2180,27 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
pfree(pivot);
pivot = tidpivot;
}
+ else if (BTreeTupleIsPosting(firstright))
+ {
+ /*
+ * No truncation was possible, since key attributes are all equal. But
+ * the tuple is a compressed tuple with a posting list, so we still
+ * must truncate it.
+ *
+ * It's necessary to add a heap TID attribute to the new pivot tuple.
+ */
+ newsize = BTreeTupleGetPostingOffset(firstright) +
+ MAXALIGN(sizeof(ItemPointerData));
+ pivot = palloc0(newsize);
+ memcpy(pivot, firstright, BTreeTupleGetPostingOffset(firstright));
+
+ pivot->t_info &= ~INDEX_SIZE_MASK;
+ pivot->t_info |= newsize;
+ BTreeTupleClearBtIsPosting(pivot);
+ BTreeTupleSetAltHeapTID(pivot);
+
+ Assert(!BTreeTupleIsPosting(pivot));
+ }
else
{
/*
@@ -2205,7 +2238,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
*/
pivotheaptid = (ItemPointer) ((char *) pivot + newsize -
sizeof(ItemPointerData));
- ItemPointerCopy(&lastleft->t_tid, pivotheaptid);
+ ItemPointerCopy(BTreeTupleGetMaxTID(lastleft), pivotheaptid);
/*
* Lehman and Yao require that the downlink to the right page, which is to
@@ -2216,9 +2249,12 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
* tiebreaker.
*/
#ifndef DEBUG_NO_TRUNCATE
- Assert(ItemPointerCompare(&lastleft->t_tid, &firstright->t_tid) < 0);
- Assert(ItemPointerCompare(pivotheaptid, &lastleft->t_tid) >= 0);
- Assert(ItemPointerCompare(pivotheaptid, &firstright->t_tid) < 0);
+ Assert(ItemPointerCompare(BTreeTupleGetMaxTID(lastleft),
+ BTreeTupleGetMinTID(firstright)) < 0);
+ Assert(ItemPointerCompare(pivotheaptid,
+ BTreeTupleGetMinTID(lastleft)) >= 0);
+ Assert(ItemPointerCompare(pivotheaptid,
+ BTreeTupleGetMinTID(firstright)) < 0);
#else
/*
@@ -2231,7 +2267,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
* attribute values along with lastleft's heap TID value when lastleft's
* TID happens to be greater than firstright's TID.
*/
- ItemPointerCopy(&firstright->t_tid, pivotheaptid);
+ ItemPointerCopy(BTreeTupleGetMinTID(firstright), pivotheaptid);
/*
* Pivot heap TID should never be fully equal to firstright. Note that
@@ -2240,7 +2276,8 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
*/
ItemPointerSetOffsetNumber(pivotheaptid,
OffsetNumberPrev(ItemPointerGetOffsetNumber(pivotheaptid)));
- Assert(ItemPointerCompare(pivotheaptid, &firstright->t_tid) < 0);
+ Assert(ItemPointerCompare(pivotheaptid,
+ BTreeTupleGetMinTID(firstright)) < 0);
#endif
BTreeTupleSetNAtts(pivot, nkeyatts);
@@ -2330,6 +2367,10 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
* leaving excessive amounts of free space on either side of page split.
* Callers can rely on the fact that attributes considered equal here are
* definitely also equal according to _bt_keep_natts.
+ *
+ * To build a posting tuple we need to ensure that all attributes
+ * of both tuples are equal. Use this function to compare them.
+ * TODO: maybe it's worth renaming the function.
*/
int
_bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
@@ -2415,7 +2456,7 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
* Non-pivot tuples currently never use alternative heap TID
* representation -- even those within heapkeyspace indexes
*/
- if ((itup->t_info & INDEX_ALT_TID_MASK) != 0)
+ if (BTreeTupleIsPivot(itup))
return false;
/*
@@ -2470,7 +2511,7 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
* that to decide if the tuple is a pre-v11 tuple.
*/
return tupnatts == 0 ||
- ((itup->t_info & INDEX_ALT_TID_MASK) == 0 &&
+ (!BTreeTupleIsPivot(itup) &&
ItemPointerGetOffsetNumber(&(itup->t_tid)) == P_HIKEY);
}
else
@@ -2497,7 +2538,7 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
* heapkeyspace index pivot tuples, regardless of whether or not there are
* non-key attributes.
*/
- if ((itup->t_info & INDEX_ALT_TID_MASK) == 0)
+ if (!BTreeTupleIsPivot(itup))
return false;
/*
@@ -2549,6 +2590,8 @@ _bt_check_third_page(Relation rel, Relation heap, bool needheaptidspace,
if (!needheaptidspace && itemsz <= BTMaxItemSizeNoHeapTid(page))
return;
+ /* TODO correct error messages for posting tuples */
+
/*
* Internal page insertions cannot fail here, because that would mean that
* an earlier leaf level insertion that should have failed didn't
@@ -2575,3 +2618,59 @@ _bt_check_third_page(Relation rel, Relation heap, bool needheaptidspace,
"or use full text indexing."),
errtableconstraint(heap, RelationGetRelationName(rel))));
}
+
+/*
+ * Given a basic tuple that contains key datum and posting list,
+ * build a posting tuple.
+ *
+ * Basic tuple can be a posting tuple, but we only use key part of it,
+ * all ItemPointers must be passed via ipd.
+ *
+ * If nipd == 1, fall back to building a non-posting tuple.
+ * This is necessary to avoid storage overhead after a posting tuple was vacuumed.
+ */
+IndexTuple
+BTreeFormPostingTuple(IndexTuple tuple, ItemPointerData *ipd, int nipd)
+{
+ uint32 keysize,
+ newsize = 0;
+ IndexTuple itup;
+
+ /* We only need key part of the tuple */
+ if (BTreeTupleIsPosting(tuple))
+ keysize = BTreeTupleGetPostingOffset(tuple);
+ else
+ keysize = IndexTupleSize(tuple);
+
+ Assert(nipd > 0);
+
+ /* Add space needed for posting list, if any */
+ if (nipd > 1)
+ newsize = SHORTALIGN(keysize) + sizeof(ItemPointerData) * nipd;
+ else
+ newsize = keysize;
+
+ newsize = MAXALIGN(newsize);
+ itup = palloc0(newsize);
+ memcpy(itup, tuple, keysize);
+ itup->t_info &= ~INDEX_SIZE_MASK;
+ itup->t_info |= newsize;
+
+ if (nipd > 1)
+ {
+ /* Form posting tuple, fill posting fields */
+
+ /* Set meta info about the posting list */
+ itup->t_info |= INDEX_ALT_TID_MASK;
+ BTreeSetPostingMeta(itup, nipd, SHORTALIGN(keysize));
+
+ /* Copy posting list into the posting tuple */
+ memcpy(BTreeTupleGetPosting(itup), ipd,
+ sizeof(ItemPointerData) * nipd);
+ }
+ else
+ {
+ /* Finish the non-posting tuple: clear the flag, copy TID from ipd */
+ itup->t_info &= ~INDEX_ALT_TID_MASK;
+ ItemPointerCopy(ipd, &itup->t_tid);
+ }
+
+ return itup;
+}
diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index 3147ea4726..7daadc9cd5 100644
--- a/src/backend/access/nbtree/nbtxlog.c
+++ b/src/backend/access/nbtree/nbtxlog.c
@@ -384,8 +384,8 @@ btree_xlog_vacuum(XLogReaderState *record)
Buffer buffer;
Page page;
BTPageOpaque opaque;
-#ifdef UNUSED
xl_btree_vacuum *xlrec = (xl_btree_vacuum *) XLogRecGetData(record);
+#ifdef UNUSED
/*
* This section of code is thought to be no longer needed, after analysis
@@ -476,14 +476,35 @@ btree_xlog_vacuum(XLogReaderState *record)
if (len > 0)
{
- OffsetNumber *unused;
- OffsetNumber *unend;
+ if (xlrec->nremaining)
+ {
+ int i;
+ OffsetNumber *remainingoffset;
+ IndexTuple remaining;
+ Size itemsz;
- unused = (OffsetNumber *) ptr;
- unend = (OffsetNumber *) ((char *) ptr + len);
+ remainingoffset = (OffsetNumber *)
+ (ptr + xlrec->ndeleted * sizeof(OffsetNumber));
+ remaining = (IndexTuple) ((char *) remainingoffset +
+ xlrec->nremaining * sizeof(OffsetNumber));
- if ((unend - unused) > 0)
- PageIndexMultiDelete(page, unused, unend - unused);
+ /* Handle posting tuples */
+ for (i = 0; i < xlrec->nremaining; i++)
+ {
+ PageIndexTupleDelete(page, remainingoffset[i]);
+
+ itemsz = MAXALIGN(IndexTupleSize(remaining));
+
+ if (PageAddItem(page, (Item) remaining, itemsz, remainingoffset[i],
+ false, false) == InvalidOffsetNumber)
+ elog(PANIC, "btree_xlog_vacuum: failed to add remaining item");
+
+ remaining = (IndexTuple) ((char *) remaining + itemsz);
+ }
+ }
+
+ if (xlrec->ndeleted)
+ PageIndexMultiDelete(page, (OffsetNumber *) ptr, xlrec->ndeleted);
}
/*
diff --git a/src/include/access/itup.h b/src/include/access/itup.h
index 744ffb6c61..85ee040428 100644
--- a/src/include/access/itup.h
+++ b/src/include/access/itup.h
@@ -141,6 +141,11 @@ typedef IndexAttributeBitMapData * IndexAttributeBitMap;
* On such a page, N tuples could take one MAXALIGN quantum less space than
* estimated here, seemingly allowing one more tuple than estimated here.
* But such a page always has at least MAXALIGN special space, so we're safe.
+ *
+ * Note: btree leaf pages may contain posting tuples, which store duplicates
+ * more compactly, so such pages may hold more items than estimated here.
+ * Use MaxPostingIndexTuplesPerPage for them instead.
*/
#define MaxIndexTuplesPerPage \
((int) ((BLCKSZ - SizeOfPageHeaderData) / \
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index a3583f225b..0749e64b11 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -234,8 +234,7 @@ typedef struct BTMetaPageData
* t_tid | t_info | key values | INCLUDE columns, if any
*
* t_tid points to the heap TID, which is a tiebreaker key column as of
- * BTREE_VERSION 4. Currently, the INDEX_ALT_TID_MASK status bit is never
- * set for non-pivot tuples.
+ * BTREE_VERSION 4.
*
* All other types of index tuples ("pivot" tuples) only have key columns,
* since pivot tuples only exist to represent how the key space is
@@ -252,6 +251,39 @@ typedef struct BTMetaPageData
* omitted rather than truncated, since its representation is different to
* the non-pivot representation.)
*
+ * Non-pivot posting tuple format:
+ * t_tid | t_info | key values | INCLUDE columns, if any | posting_list[]
+ *
+ * In order to store duplicate keys more efficiently,
+ * BTREE_VERSION 5 introduced a new tuple format: posting tuples.
+ * posting_list is an array of ItemPointerData.
+ *
+ * This type of compression never applies to system indexes, unique indexes
+ * or indexes with INCLUDEd columns.
+ *
+ * To distinguish posting tuples we use the INDEX_ALT_TID_MASK flag in t_info
+ * and the BT_IS_POSTING flag in t_tid.
+ * These flags redefine the content of the posting tuple's t_tid:
+ * - t_tid.ip_blkid contains the offset of the posting list.
+ * - t_tid offset field contains the number of posting items this tuple contains.
+ *
+ * The 12 least significant offset bits from t_tid are used to represent
+ * the number of posting items in posting tuples, leaving 4 status
+ * bits (BT_RESERVED_OFFSET_MASK bits), 3 of which are reserved for
+ * future use.
+ * BT_N_POSTING_OFFSET_MASK is large enough to store any number of posting
+ * items, which is constrained by BTMaxItemSize.
+ *
+ * If a page contains so many duplicates that they do not fit into one posting
+ * tuple (bounded by BTMaxItemSize and ), the page may contain several posting
+ * tuples with the same key.
+ * A page can also contain both posting and non-posting tuples with the same key.
+ * Currently, posting tuples always contain at least two TIDs in the posting
+ * list.
+ *
+ * Posting tuples always have the same number of attributes as the index has
+ * generally.
+ *
* Pivot tuple format:
*
* t_tid | t_info | key values | [heap TID]
@@ -281,23 +313,148 @@ typedef struct BTMetaPageData
* bits (BT_RESERVED_OFFSET_MASK bits), 3 of which that are reserved for
* future use. BT_N_KEYS_OFFSET_MASK should be large enough to store any
* number of columns/attributes <= INDEX_MAX_KEYS.
+ * BT_IS_POSTING bit must be unset for pivot tuples, since we use it
+ * to distinguish posting tuples from pivot tuples.
*
* Note well: The macros that deal with the number of attributes in tuples
- * assume that a tuple with INDEX_ALT_TID_MASK set must be a pivot tuple,
- * and that a tuple without INDEX_ALT_TID_MASK set must be a non-pivot
- * tuple (or must have the same number of attributes as the index has
- * generally in the case of !heapkeyspace indexes). They will need to be
- * updated if non-pivot tuples ever get taught to use INDEX_ALT_TID_MASK
- * for something else.
+ * assume that a tuple with INDEX_ALT_TID_MASK set must be a pivot tuple or
+ * non-pivot posting tuple, and that a tuple without INDEX_ALT_TID_MASK set
+ * must be a non-pivot tuple (or must have the same number of attributes as
+ * the index has generally in the case of !heapkeyspace indexes).
*/
#define INDEX_ALT_TID_MASK INDEX_AM_RESERVED_BIT
/* Item pointer offset bits */
#define BT_RESERVED_OFFSET_MASK 0xF000
#define BT_N_KEYS_OFFSET_MASK 0x0FFF
+#define BT_N_POSTING_OFFSET_MASK 0x0FFF
#define BT_HEAP_TID_ATTR 0x1000
+#define BT_IS_POSTING 0x2000
-/* Get/set downlink block number */
+#define BTreeTupleIsPosting(itup) \
+ ( \
+ ((itup)->t_info & INDEX_ALT_TID_MASK && \
+ ((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) != 0))\
+ )
+
+#define BTreeTupleIsPivot(itup) \
+ ( \
+ ((itup)->t_info & INDEX_ALT_TID_MASK && \
+ ((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) == 0))\
+ )
+
+/*
+ * MaxPostingIndexTuplesPerPage is an upper bound on the number of tuples
+ * that can fit on one btree leaf page.
+ *
+ * Btree leaf pages may contain posting tuples, which store duplicates
+ * in a more effective way, so MaxPostingIndexTuplesPerPage is larger than
+ * MaxIndexTuplesPerPage.
+ *
+ * Each leaf page must contain at least three items, so estimate it as
+ * if we have three posting tuples with minimal size keys.
+ */
+#define MaxPostingIndexTuplesPerPage \
+ ((int) ((BLCKSZ - SizeOfPageHeaderData - \
+ 3 * (MAXALIGN(sizeof(IndexTupleData) + 1) + sizeof(ItemIdData))) / \
+ sizeof(ItemPointerData)))
+
+/*
+ * Btree-private state needed to build posting tuples.
+ * ipd is a posting list - an array of ItemPointerData.
+ *
+ * Iterating over tuples during index build or applying compression to a
+ * single page, we remember a tuple in itupprev, then compare the next one
+ * with it. If tuples are equal, save their TIDs in the posting list.
+ * ntuples contains the size of the posting list.
+ *
+ * Use maxitemsize and maxpostingsize to ensure that resulting posting tuple
+ * will satisfy BTMaxItemSize.
+ */
+typedef struct BTCompressState
+{
+ Size maxitemsize;
+ Size maxpostingsize;
+ IndexTuple itupprev;
+ int ntuples;
+ ItemPointerData *ipd;
+} BTCompressState;
+
+/* macros to work with posting tuples *BEGIN* */
+#define BTreeTupleSetBtIsPosting(itup) \
+ do { \
+ Assert((itup)->t_info & INDEX_ALT_TID_MASK); \
+ Assert(!((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) != 0)); \
+ ItemPointerSetOffsetNumber(&(itup)->t_tid, \
+ ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) | BT_IS_POSTING); \
+ } while(0)
+
+#define BTreeTupleClearBtIsPosting(itup) \
+ do { \
+ ItemPointerSetOffsetNumber(&(itup)->t_tid, \
+ ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & ~BT_IS_POSTING); \
+ } while(0)
+
+#define BTreeTupleGetNPosting(itup) \
+ ( \
+ ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_N_POSTING_OFFSET_MASK \
+ )
+
+#define BTreeTupleSetNPosting(itup, n) \
+ do { \
+ ItemPointerSetOffsetNumber(&(itup)->t_tid, (n) & BT_N_POSTING_OFFSET_MASK); \
+ BTreeTupleSetBtIsPosting(itup); \
+ } while(0)
+
+/*
+ * If tuple is posting, t_tid.ip_blkid contains offset of the posting list.
+ * The caller is responsible for checking BTreeTupleIsPosting first;
+ * otherwise the returned value is meaningless.
+ */
+#define BTreeTupleGetPostingOffset(itup) \
+ ItemPointerGetBlockNumberNoCheck(&((itup)->t_tid))
+#define BTreeTupleSetPostingOffset(itup, offset) \
+ ItemPointerSetBlockNumber(&((itup)->t_tid), (offset))
+
+#define BTreeSetPostingMeta(itup, nposting, off) \
+ do { \
+ BTreeTupleSetNPosting(itup, nposting); \
+ BTreeTupleSetPostingOffset(itup, off); \
+ } while(0)
+
+#define BTreeTupleGetPosting(itup) \
+ ((ItemPointerData *) ((char *) (itup) + BTreeTupleGetPostingOffset(itup)))
+#define BTreeTupleGetPostingN(itup,n) \
+ ((ItemPointerData *) (BTreeTupleGetPosting(itup) + (n)))
+
+/*
+ * Posting tuples always contain several TIDs.
+ * Functions that use the TID as a tiebreaker can use the two macros below
+ * to ensure the correct order of TID keys:
+ */
+#define BTreeTupleGetMinTID(itup) \
+ ( \
+ ((itup)->t_info & INDEX_ALT_TID_MASK && \
+ ((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING))) ? \
+ ( \
+ (ItemPointer) BTreeTupleGetPosting(itup) \
+ ) \
+ : \
+ (ItemPointer) &((itup)->t_tid) \
+ )
+#define BTreeTupleGetMaxTID(itup) \
+ ( \
+ ((itup)->t_info & INDEX_ALT_TID_MASK && \
+ ((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING))) ? \
+ ( \
+ (ItemPointer) (BTreeTupleGetPosting(itup) + (BTreeTupleGetNPosting(itup)-1)) \
+ ) \
+ : \
+ (ItemPointer) &((itup)->t_tid) \
+ )
+/* macros to work with posting tuples *END* */
+
+/* Get/set downlink block number */
#define BTreeInnerTupleGetDownLink(itup) \
ItemPointerGetBlockNumberNoCheck(&((itup)->t_tid))
#define BTreeInnerTupleSetDownLink(itup, blkno) \
@@ -326,7 +483,8 @@ typedef struct BTMetaPageData
*/
#define BTreeTupleGetNAtts(itup, rel) \
( \
- (itup)->t_info & INDEX_ALT_TID_MASK ? \
+ ((itup)->t_info & INDEX_ALT_TID_MASK && \
+ ((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) == 0)) ? \
( \
ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_N_KEYS_OFFSET_MASK \
) \
@@ -335,6 +493,7 @@ typedef struct BTMetaPageData
)
#define BTreeTupleSetNAtts(itup, n) \
do { \
+ Assert(!BTreeTupleIsPosting(itup)); \
(itup)->t_info |= INDEX_ALT_TID_MASK; \
ItemPointerSetOffsetNumber(&(itup)->t_tid, (n) & BT_N_KEYS_OFFSET_MASK); \
} while(0)
@@ -342,6 +501,8 @@ typedef struct BTMetaPageData
/*
* Get tiebreaker heap TID attribute, if any. Macro works with both pivot
* and non-pivot tuples, despite differences in how heap TID is represented.
+ *
+ * For a non-pivot posting tuple it returns the first TID from the posting list.
*/
#define BTreeTupleGetHeapTID(itup) \
( \
@@ -351,7 +512,10 @@ typedef struct BTMetaPageData
(ItemPointer) (((char *) (itup) + IndexTupleSize(itup)) - \
sizeof(ItemPointerData)) \
) \
- : (itup)->t_info & INDEX_ALT_TID_MASK ? NULL : (ItemPointer) &((itup)->t_tid) \
+ : (itup)->t_info & INDEX_ALT_TID_MASK ? \
+ (((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) != 0) ? \
+ (ItemPointer) BTreeTupleGetPosting(itup) : NULL) \
+ : (ItemPointer) &((itup)->t_tid) \
)
/*
* Set the heap TID attribute for a tuple that uses the INDEX_ALT_TID_MASK
@@ -360,6 +524,7 @@ typedef struct BTMetaPageData
#define BTreeTupleSetAltHeapTID(itup) \
do { \
Assert((itup)->t_info & INDEX_ALT_TID_MASK); \
+ Assert(!((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) != 0)); \
ItemPointerSetOffsetNumber(&(itup)->t_tid, \
ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) | BT_HEAP_TID_ATTR); \
} while(0)
@@ -567,6 +732,8 @@ typedef struct BTScanPosData
* location in the associated tuple storage workspace.
*/
int nextTupleOffset;
+ /* prevTupleOffset is for posting list handling */
+ int prevTupleOffset;
/*
* The items array is always ordered in index order (ie, increasing
@@ -579,7 +746,7 @@ typedef struct BTScanPosData
int lastItem; /* last valid index in items[] */
int itemIndex; /* current index in items[] */
- BTScanPosItem items[MaxIndexTuplesPerPage]; /* MUST BE LAST */
+ BTScanPosItem items[MaxPostingIndexTuplesPerPage]; /* MUST BE LAST */
} BTScanPosData;
typedef BTScanPosData *BTScanPos;
@@ -763,6 +930,8 @@ extern void _bt_delitems_delete(Relation rel, Buffer buf,
OffsetNumber *itemnos, int nitems, Relation heapRel);
extern void _bt_delitems_vacuum(Relation rel, Buffer buf,
OffsetNumber *itemnos, int nitems,
+ OffsetNumber *remainingoffset,
+ IndexTuple *remaining, int nremaining,
BlockNumber lastBlockVacuumed);
extern int _bt_pagedel(Relation rel, Buffer buf);
@@ -813,6 +982,8 @@ extern bool _bt_check_natts(Relation rel, bool heapkeyspace, Page page,
OffsetNumber offnum);
extern void _bt_check_third_page(Relation rel, Relation heap,
bool needheaptidspace, Page page, IndexTuple newtup);
+extern IndexTuple BTreeFormPostingTuple(IndexTuple tuple, ItemPointerData *ipd,
+ int nipd);
/*
* prototypes for functions in nbtvalidate.c
@@ -825,5 +996,7 @@ extern bool btvalidate(Oid opclassoid);
extern IndexBuildResult *btbuild(Relation heap, Relation index,
struct IndexInfo *indexInfo);
extern void _bt_parallel_build_main(dsm_segment *seg, shm_toc *toc);
+extern void add_item_to_posting(BTCompressState *compressState,
+ IndexTuple itup);
#endif /* NBTREE_H */
diff --git a/src/include/access/nbtxlog.h b/src/include/access/nbtxlog.h
index 9beccc86ea..6f60ca5f7b 100644
--- a/src/include/access/nbtxlog.h
+++ b/src/include/access/nbtxlog.h
@@ -173,10 +173,19 @@ typedef struct xl_btree_vacuum
{
BlockNumber lastBlockVacuumed;
- /* TARGET OFFSET NUMBERS FOLLOW */
+ /*
+ * This field helps us to find beginning of the remaining tuples from
+ * postings which follow array of offset numbers.
+ */
+ uint32 nremaining;
+ uint32 ndeleted;
+
+ /* REMAINING OFFSET NUMBERS FOLLOW (nremaining values) */
+ /* REMAINING TUPLES TO INSERT FOLLOW (if nremaining > 0) */
+ /* TARGET OFFSET NUMBERS FOLLOW (if any) */
} xl_btree_vacuum;
-#define SizeOfBtreeVacuum (offsetof(xl_btree_vacuum, lastBlockVacuumed) + sizeof(BlockNumber))
+#define SizeOfBtreeVacuum (offsetof(xl_btree_vacuum, ndeleted) + sizeof(uint32))
/*
* This is what we need to know about marking an empty branch for deletion.
--
2.17.1
On Sat, Jul 6, 2019 at 4:08 PM Peter Geoghegan <pg@bowt.ie> wrote:
I took a closer look at this patch, and have some general thoughts on
its design, and specific feedback on the implementation.
I have some high level concerns about how the patch might increase
contention, which could make queries slower. Apparently that is a real
problem in other systems that use MVCC when their bitmap index feature
is used -- they are never really supposed to be used with OLTP apps.
This patch makes nbtree behave rather a lot like a bitmap index.
That's not exactly true, because you're not storing a bitmap or
compressing the TID lists, but they're definitely quite similar. It's
easy to imagine a hybrid approach, that starts with a B-Tree with
deduplication/TID lists, and eventually becomes a bitmap index as more
duplicates are added [1].
It doesn't seem like it would be practical for these other MVCC
database systems to have standard B-Tree secondary indexes that
compress duplicates gracefully in the way that you propose to with
this patch, because lock contention would presumably be a big problem
for the same reason as it is with their bitmap indexes (whatever the
true reason actually is). Is it really possible to have something
that's adaptive, offering the best of both worlds?
Having dug into it some more, I think that the answer for us might
actually be "yes, we can have it both ways". Other database systems
that are also based on MVCC still probably use a limited form of index
locking, even in READ COMMITTED mode, though this isn't very widely
known. They need this for unique indexes, but they also need it for
transaction rollback, to remove old entries from the index when the
transaction must abort. The section "6.7 Standard Practice" from the
paper "Architecture of a Database System" [2] goes into this, saying:
"All production databases today support ACID transactions. As a rule,
they use write-ahead logging for durability, and two-phase locking for
concurrency control. An exception is PostgreSQL, which uses
multiversion concurrency control throughout."
I suggest reading "6.7 Standard Practice" in full.
Anyway, I think that *hundreds* or even *thousands* of rows are
effectively locked all at once when a bitmap index needs to be updated
in these other systems -- and I mean a heavyweight lock that lasts
until the xact commits or aborts, like a Postgres row lock. As I said,
this is necessary simply because the transaction might need to roll
back. Of course, your patch never needs to do anything like that --
the only risk is that buffer lock contention will be increased. Maybe
VACUUM isn't so bad after all!
Doing deduplication adaptively and automatically in nbtree seems like
it might play to the strengths of Postgres, while also ameliorating
its weaknesses. As the same paper goes on to say, it's actually quite
unusual that PostgreSQL has *transactional* full text search built in
(using GIN), and offers transactional, high concurrency spatial
indexing (using GiST). Actually, this is an additional advantage of
our "pure" approach to MVCC -- we can add new high concurrency,
transactional access methods relatively easily.
[1]: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.98.3159&rep=rep1&type=pdf
[2]: http://db.cs.berkeley.edu/papers/fntdb07-architecture.pdf
--
Peter Geoghegan
On Wed, Jul 10, 2019 at 09:53:04PM -0700, Peter Geoghegan wrote:
Anyway, I think that *hundreds* or even *thousands* of rows are
effectively locked all at once when a bitmap index needs to be updated
in these other systems -- and I mean a heavyweight lock that lasts
until the xact commits or aborts, like a Postgres row lock. As I said,
this is necessary simply because the transaction might need to roll
back. Of course, your patch never needs to do anything like that --
the only risk is that buffer lock contention will be increased. Maybe
VACUUM isn't so bad after all!
Doing deduplication adaptively and automatically in nbtree seems like
it might play to the strengths of Postgres, while also ameliorating
its weaknesses. As the same paper goes on to say, it's actually quite
unusual that PostgreSQL has *transactional* full text search built in
(using GIN), and offers transactional, high concurrency spatial
indexing (using GiST). Actually, this is an additional advantage of
our "pure" approach to MVCC -- we can add new high concurrency,
transactional access methods relatively easily.
Wow, I never thought of that. The only things I know we lock until
transaction end are rows we update (against concurrent updates), and
additions to unique indexes. By definition, indexes with many
duplicates are not unique, so that doesn't apply.
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ As you are, so once was I. As I am, so you will be. +
+ Ancient Roman grave inscription +
Hi Peter,
Thank you very much for your attention to this patch. Let me comment
on some points of your message.
On Sun, Jul 7, 2019 at 2:09 AM Peter Geoghegan <pg@bowt.ie> wrote:
On Thu, Jul 4, 2019 at 5:06 AM Anastasia Lubennikova
<a.lubennikova@postgrespro.ru> wrote:
The new version of the patch is attached.
This version is even simpler than the previous one,
thanks to the recent btree design changes and all the feedback I received.
I consider it ready for review and testing.
I took a closer look at this patch, and have some general thoughts on
its design, and specific feedback on the implementation.
Preserving the *logical contents* of B-Tree indexes that use
compression is very important -- that should not change in a way that
outside code can notice. The heap TID itself should count as logical
contents here, since we want to be able to implement retail index
tuple deletion in the future. Even without retail index tuple
deletion, amcheck's "rootdescend" verification assumes that it can
find one specific tuple (which could now just be one specific "logical
tuple") using specific key values from the heap, including the heap
tuple's heap TID. This requirement makes things a bit harder for your
patch, because you have to deal with one or two edge-cases that you
currently don't handle: insertion of new duplicates that fall inside
the min/max range of some existing posting list. That should be rare
enough in practice, so the performance penalty won't be too bad. This
probably means that code within _bt_findinsertloc() and/or
_bt_binsrch_insert() will need to think about a logical tuple as a
distinct thing from a physical tuple, though that won't be necessary
in most places.
Could you please elaborate more on preserving the logical contents? I
understand it as follows: "B-Tree should have the same structure
and invariants as if each TID in a posting list were a separate tuple".
So, if we imagine each TID becoming a separate tuple, it would be the
same B-tree, which just can magically sometimes store more tuples in
a page. Is my understanding correct? But outside code will still
notice changes as soon as it directly accesses B-tree pages (like
contrib/amcheck does). Do you mean we need an API for accessing
logical B-tree tuples or something?
The need to "preserve the logical contents" also means that the patch
will need to recognize when indexes are not safe as a target for
compression/deduplication (maybe we should call this feature
deduplication, so it's clear how it differs from TOAST?). For
example, if we have a case-insensitive ICU collation, then it is not
okay to treat an opclass-equal pair of text strings that use the
collation as having the same value when considering merging the two
into one. You don't actually do that in the patch, but you also don't
try to deal with the fact that such a pair of strings are equal, and
so must have their final positions determined by the heap TID column
(deduplication within _bt_compress_one_page() must respect that).
Possibly equal-but-distinct values seem like a problem that's not
worth truly fixing, but it will be necessary to store metadata about
whether or not we're willing to do deduplication in the meta page,
based on operator class and collation details. That seems like a
restriction that we're just going to have to accept, though I'm not
too worried about exactly what that will look like right now. We can
work it out later.
I think that in order to deduplicate "equal but distinct" values we
would at least need to give up index-only scans, because nothing
guarantees that values which are equal according to the B-tree opclass
are also the same for other operations and/or user output.
I think that the need to be careful about the logical contents of
indexes already causes bugs, even with "safe for compression" indexes.
For example, I can sometimes see an assertion failure
within _bt_truncate(), at the point where we check if heap TID values
are safe:
/*
* Lehman and Yao require that the downlink to the right page, which is to
* be inserted into the parent page in the second phase of a page split be
* a strict lower bound on items on the right page, and a non-strict upper
* bound for items on the left page. Assert that heap TIDs follow these
* invariants, since a heap TID value is apparently needed as a
* tiebreaker.
*/
#ifndef DEBUG_NO_TRUNCATE
Assert(ItemPointerCompare(BTreeTupleGetMaxTID(lastleft),
BTreeTupleGetMinTID(firstright)) < 0);
...
This bug is not that easy to see, but it will happen with a big index,
even without updates or deletes. I think that this happens because
compression can allow the "logical tuples" to be in the wrong heap TID
order when there are multiple posting lists for the same value. As I
said, I think that it's necessary to see a posting list as being
comprised of multiple logical tuples in the context of inserting new
tuples, even when you're not performing compression or splitting the
page. I also see that amcheck's bt_index_parent_check() function
fails, though bt_index_check() does not fail when I don't use any of
its extra verification options. (You haven't updated amcheck, but I
don't think that you need to update it for these basic checks to
continue to work.)
Do I understand correctly that the current patch may produce posting lists
of the same value with overlapping ranges of TIDs? If so, it's
definitely wrong.
* Maybe we could do compression with unique indexes when inserting
values with NULLs? Note that we now treat an insertion of a tuple with
NULLs into a unique index as if it wasn't even a unique index -- see
the "checkingunique" optimization at the beginning of _bt_doinsert().
Having many NULL values in a unique index is probably fairly common.
I think unique indexes may benefit from deduplication not only because
of NULL values. Non-HOT updates produce duplicates of non-NULL values
in unique indexes. And those duplicates can take significant space.
------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On Thu, Jul 11, 2019 at 7:53 AM Peter Geoghegan <pg@bowt.ie> wrote:
Anyway, I think that *hundreds* or even *thousands* of rows are
effectively locked all at once when a bitmap index needs to be updated
in these other systems -- and I mean a heavyweight lock that lasts
until the xact commits or aborts, like a Postgres row lock. As I said,
this is necessary simply because the transaction might need to roll
back. Of course, your patch never needs to do anything like that --
the only risk is that buffer lock contention will be increased. Maybe
VACUUM isn't so bad after all!
Doing deduplication adaptively and automatically in nbtree seems like
it might play to the strengths of Postgres, while also ameliorating
its weaknesses. As the same paper goes on to say, it's actually quite
unusual that PostgreSQL has *transactional* full text search built in
(using GIN), and offers transactional, high concurrency spatial
indexing (using GiST). Actually, this is an additional advantage of
our "pure" approach to MVCC -- we can add new high concurrency,
transactional access methods relatively easily.
Good finding, thank you!
BTW, I think deduplication could cause some small performance
degradation in some particular cases, because page-level locks become
more coarse-grained once pages hold more tuples. However, this
doesn't seem like something we should care much about. Providing an
option to turn deduplication off looks sufficient to me.
Regarding bitmap indexes themselves, I think our BRIN could provide them.
However, it would be useful to have opclass parameters to make them
tunable.
------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On Sun, 7 Jul 2019 at 01:08, Peter Geoghegan <pg@bowt.ie> wrote:
* Maybe we could do compression with unique indexes when inserting
values with NULLs? Note that we now treat an insertion of a tuple with
+1
I tried this patch and found the improvements impressive. However,
when I tried it with multi-column indexes it wasn't giving any
improvement. Is that a known limitation of the patch?
I am surprised to find that such a patch has been on the radar for
quite some years now and is not yet committed.
Going through the patch, here are a few comments from me,
/* Add the new item into the page */
+ offnum = OffsetNumberNext(offnum);
+
+ elog(DEBUG4, "insert_itupprev_to_page. compressState->ntuples %d
IndexTupleSize %zu free %zu",
+ compressState->ntuples, IndexTupleSize(to_insert), PageGetFreeSpace(page));
+
and other such DEBUG4 statements are meant to be removed, right...?
I ask because I didn't find any other such statements in this API, and
there are many in this patch, so I'm not sure how much they are needed.
/*
* If we have only 10 uncompressed items on the full page, it probably
* won't worth to compress them.
*/
if (maxoff - n_posting_on_page < 10)
return;
Is this a magic number...?
/*
* We do not expect to meet any DEAD items, since this function is
* called right after _bt_vacuum_one_page(). If for some reason we
* found dead item, don't compress it, to allow upcoming microvacuum
* or vacuum clean it up.
*/
if (ItemIdIsDead(itemId))
continue;
This makes me wonder about those 'some' reasons.
Caller is responsible for checking BTreeTupleIsPosting to ensure that
+ * he will get what he expects
This can be re-framed to make the caller more gender neutral.
Other than that, I am curious about the plans for its backward compatibility.
--
Regards,
Rafia Sabih
On Thu, Jul 11, 2019 at 7:30 AM Bruce Momjian <bruce@momjian.us> wrote:
Wow, I never thought of that. The only things I know we lock until
transaction end are rows we update (against concurrent updates), and
additions to unique indexes. By definition, indexes with many
duplicates are not unique, so that doesn't apply.
Right. Another advantage of their approach is that you can make
queries like this work:
UPDATE tab SET unique_col = unique_col + 1
This will not throw a unique violation error on most/all other DB
systems when the updated column (in this case "unique_col") has a
unique constraint/is the primary key. This behavior is actually
required by the SQL standard. An SQL statement is supposed to be
all-or-nothing, which Postgres doesn't quite manage here.
The section "6.6 Interdependencies of Transactional Storage" from the
paper "Architecture of a Database System" provides additional
background information (I should have suggested reading both 6.6 and
6.7 together).
--
Peter Geoghegan
On Thu, Jul 11, 2019 at 8:02 AM Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
Could you please elaborate more on preserving the logical contents? I
understand it as follows: "B-Tree should have the same structure
and invariants as if each TID in a posting list were a separate tuple".
That's exactly what I mean.
So, if we imagine each TID becoming a separate tuple, it would be the
same B-tree, which just can magically sometimes store more tuples in
a page. Is my understanding correct?
Yes.
But outside code will still
notice changes as soon as it directly accesses B-tree pages (like
contrib/amcheck does). Do you mean we need an API for accessing
logical B-tree tuples or something?
Well, contrib/amcheck isn't really outside code. But amcheck's
"rootdescend" option will still need to be able to supply a heap TID
as just another column, and get back zero or one logical tuples from
the index. This is important because retail index tuple deletion needs
to be able to think about logical tuples in the same way. I also think
that it might be useful for the planner to expect to get back
duplicates in heap TID order in the future (or in reverse order in the
case of a backwards scan). Query execution and VACUUM code outside of
nbtree should be able to pretend that there is no such thing as a
posting list.
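To make the "logical tuple" idea concrete, here is a minimal sketch of expanding one posting tuple into the logical tuples that outside code would see. It is not taken from the patch: the types Tid, PostingTuple and LogicalTuple and the function expand_posting() are hypothetical simplifications, and the key is reduced to a single int.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical simplified heap TID: (block number, offset within block). */
typedef struct { uint32_t block; uint16_t offset; } Tid;

/* Hypothetical physical tuple: one key plus a posting list of heap TIDs. */
typedef struct { int key; size_t ntids; const Tid *tids; } PostingTuple;

/* Hypothetical logical tuple: what code outside the index AM would see. */
typedef struct { int key; Tid tid; } LogicalTuple;

/*
 * Expand one posting tuple into its logical tuples, in posting-list order.
 * Returns the number of logical tuples written, or 0 if out[] is too small.
 */
size_t
expand_posting(const PostingTuple *ptup, LogicalTuple *out, size_t cap)
{
    assert(ptup != NULL && out != NULL);

    if (ptup->ntids > cap)
        return 0;

    for (size_t i = 0; i < ptup->ntids; i++)
    {
        out[i].key = ptup->key;     /* every logical tuple shares the key */
        out[i].tid = ptup->tids[i]; /* and carries exactly one heap TID */
    }
    return ptup->ntids;
}
```

If the posting list is kept in TID order, the expanded sequence is exactly what a tree without posting lists would contain for that key.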
The main thing that the patch is missing that is needed to "preserve
logical contents" is the ability to update/expand an *existing*
posting list due to a retail insertion of a new duplicate that happens
to be within the range of that existing posting list. This will
usually be a non-HOT update that doesn't change the value for the row
in the index -- that must change the posting list, even when there is
available space on the page without recompressing. We must still
occasionally be eager, like GIN always is, though in practice we'll
almost always add to posting lists in a lazy fashion, when it looks
like we might have to split the page -- the lazy approach seems to
perform best.
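As a rough illustration of adding a new duplicate inside the range of an existing posting list, the sketch below inserts a TID into a sorted array using binary search. The Tid type and the names tid_compare() and posting_insert_sorted() are assumptions made for this example; a real implementation would also have to handle page space, WAL logging and locking, none of which is shown.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical simplified heap TID: (block number, offset within block). */
typedef struct { uint32_t block; uint16_t offset; } Tid;

/* Order TIDs by block number, then by offset. */
int
tid_compare(Tid a, Tid b)
{
    if (a.block != b.block)
        return a.block < b.block ? -1 : 1;
    if (a.offset != b.offset)
        return a.offset < b.offset ? -1 : 1;
    return 0;
}

/*
 * Insert newtid into the sorted array tids[0 .. *ntids - 1], keeping it
 * sorted.  Returns false if the array is full or the TID is already there.
 */
bool
posting_insert_sorted(Tid *tids, size_t *ntids, size_t cap, Tid newtid)
{
    size_t lo = 0;
    size_t hi = *ntids;

    assert(tids != NULL && ntids != NULL);

    if (*ntids >= cap)
        return false;

    /* Binary search for the first position whose TID is >= newtid. */
    while (lo < hi)
    {
        size_t mid = lo + (hi - lo) / 2;
        int    c = tid_compare(tids[mid], newtid);

        if (c == 0)
            return false;
        if (c < 0)
            lo = mid + 1;
        else
            hi = mid;
    }

    /* Shift the tail right by one slot and place the new TID. */
    memmove(&tids[lo + 1], &tids[lo], (*ntids - lo) * sizeof(Tid));
    tids[lo] = newtid;
    (*ntids)++;
    return true;
}
```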
I think that in order to deduplicate "equal but distinct" values we
would at least need to give up index-only scans, because nothing
guarantees that values which are equal according to the B-tree opclass
are also the same for other operations and/or user output.
We can either prevent index-only scans in the case of affected
indexes, or prevent compression, or give the user a choice. I'm not
too worried about how that will work for users just yet.
Do I understand correctly that the current patch may produce posting lists
of the same value with overlapping ranges of TIDs? If so, it's
definitely wrong.
Yes, it can, since the assertion fails. It looks like the assertion
itself was changed to match what I expect, so I assume that this bug
will be fixed in the next version of the patch. It fails with a fairly
big index on text for me.
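The invariant being discussed can be sketched as a small check: for two adjacent posting lists with the same key value, the largest TID on the left must sort strictly before the smallest TID on the right. The types and function names below are hypothetical and only mirror the shape of the assertion quoted above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical simplified heap TID: (block number, offset within block). */
typedef struct { uint32_t block; uint16_t offset; } Tid;

/* Hypothetical posting list: ntids heap TIDs for one key value. */
typedef struct { size_t ntids; const Tid *tids; } Posting;

/* Order TIDs by block number, then by offset. */
int
tid_compare(Tid a, Tid b)
{
    if (a.block != b.block)
        return a.block < b.block ? -1 : 1;
    if (a.offset != b.offset)
        return a.offset < b.offset ? -1 : 1;
    return 0;
}

/* True if the TIDs inside one posting list are strictly increasing. */
bool
posting_is_sorted(const Posting *p)
{
    for (size_t i = 1; i < p->ntids; i++)
    {
        if (tid_compare(p->tids[i - 1], p->tids[i]) >= 0)
            return false;
    }
    return true;
}

/*
 * True if every TID in left sorts before every TID in right, that is,
 * max(left) < min(right).  Both lists must be sorted and non-empty.
 */
bool
posting_ranges_disjoint(const Posting *left, const Posting *right)
{
    assert(left->ntids > 0 && right->ntids > 0);
    return tid_compare(left->tids[left->ntids - 1], right->tids[0]) < 0;
}
```

Two lists such as {(1,1), (1,4)} followed by {(1,2), (2,1)} are each sorted internally but overlap, which is the situation the failing assertion detects.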
* Maybe we could do compression with unique indexes when inserting
values with NULLs? Note that we now treat an insertion of a tuple with
NULLs into a unique index as if it wasn't even a unique index -- see
the "checkingunique" optimization at the beginning of _bt_doinsert().
Having many NULL values in a unique index is probably fairly common.
I think unique indexes may benefit from deduplication not only because
of NULL values. Non-HOT updates produce duplicates of non-NULL values
in unique indexes. And those duplicates can take significant space.
I agree that we should definitely have an open mind about unique
indexes, even with non-NULL values. If we can prevent a page split by
deduplicating the contents of a unique index page, then we'll probably
win. Why not try? This will need to be tested.
--
Peter Geoghegan
On Thu, Jul 11, 2019 at 8:09 AM Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
BTW, I think deduplication could cause some small performance
degradation in some particular cases, because page-level locks become
more coarse-grained once pages hold more tuples. However, this
doesn't seem like something we should care much about. Providing an
option to turn deduplication off looks sufficient to me.
There was an issue like this with my v12 work on nbtree, with the
TPC-C indexes. They were always ~40% smaller, but there was a
regression when TPC-C was used with a small number of warehouses, when
the data could easily fit in memory (which is not allowed by the TPC-C
spec, in effect). TPC-C is very write-heavy, which combined with
everything else causes this problem. I wasn't doing anything too fancy
there -- the regression seemed to happen simply because the index was
smaller, not because of the overhead of doing page splits differently
or anything like that (there were far fewer splits).
I expect there to be some regression for workloads like this. I am
willing to accept that provided it's not too noticeable, and doesn't
have an impact on other workloads. I am optimistic about it.
Regarding bitmap indexes themselves, I think our BRIN could provide them.
However, it would be useful to have opclass parameters to make them
tunable.
I thought that we might implement them in nbtree myself. But we don't
need to decide now.
--
Peter Geoghegan
On Thu, Jul 11, 2019 at 8:34 AM Rafia Sabih <rafia.pghackers@gmail.com> wrote:
I tried this patch and found the improvements impressive. However,
when I tried it with multi-column indexes it wasn't giving any
improvement. Is that a known limitation of the patch?
It'll only deduplicate full duplicates. It works with multi-column
indexes, provided the entire set of values is duplicated -- not just a
prefix. Prefix compression is possible, but it's more complicated. It
seems to generally require the DBA to specify a prefix length,
expressed as a number of prefix columns.
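The difference between a full duplicate and a shared prefix can be sketched as follows. This is only an illustration with int columns; the function names shared_prefix_columns() and is_full_duplicate() are made up for the example, and real index keys would be compared with the opclass comparison functions instead of ==.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Count how many leading key columns of a and b are equal. */
size_t
shared_prefix_columns(const int *a, const int *b, size_t natts)
{
    size_t i = 0;

    assert(a != NULL && b != NULL);

    while (i < natts && a[i] == b[i])
        i++;
    return i;
}

/*
 * Two keys are candidates for deduplication only when every column
 * matches; sharing just a prefix is not enough.
 */
bool
is_full_duplicate(const int *a, const int *b, size_t natts)
{
    return shared_prefix_columns(a, b, natts) == natts;
}
```

With a three-column index, the keys (7, 1, 3) and (7, 1, 4) share a two-column prefix but are not full duplicates, so they would not be merged into one posting list.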
I am surprised to find that such a patch has been on the radar for
quite some years now and is not yet committed.
The v12 work on nbtree (making heap TID a tiebreaker column) seems to
have made the general approach a lot more effective. Compression is
performed lazily, not eagerly, which seems to work a lot better.
+ elog(DEBUG4, "insert_itupprev_to_page. compressState->ntuples %d IndexTupleSize %zu free %zu",
+ compressState->ntuples, IndexTupleSize(to_insert), PageGetFreeSpace(page));
+
and other such DEBUG4 statements are meant to be removed, right...?
I hope so too.
/*
* If we have only 10 uncompressed items on the full page, it probably
* won't worth to compress them.
*/
if (maxoff - n_posting_on_page < 10)
return;
Is this a magic number...?
I think that this should be a constant or something.
/*
* We do not expect to meet any DEAD items, since this function is
* called right after _bt_vacuum_one_page(). If for some reason we
* found dead item, don't compress it, to allow upcoming microvacuum
* or vacuum clean it up.
*/
if (ItemIdIsDead(itemId))
continue;
This makes me wonder about those 'some' reasons.
I think that this is just defensive. Note that _bt_vacuum_one_page()
is prepared to find no dead items, even when the BTP_HAS_GARBAGE flag
is set for the page.
Caller is responsible for checking BTreeTupleIsPosting to ensure that
+ * he will get what he expects
This can be re-framed to make the caller more gender neutral.
Agreed. I also don't like anthropomorphizing code like this.
Other than that, I am curious about the plans for its backward compatibility.
Me too. There is something about a new version 5 in comments in
nbtree.h, but the version number isn't changed. I think that we may be
able to get away with not increasing the B-Tree version from 4 to 5,
actually. Deduplication is performed lazily when it looks like we
might have to split the page, so there isn't any expectation that
tuples will either be compressed or uncompressed in any context.
--
Peter Geoghegan
On Thu, Jul 11, 2019 at 10:42 AM Peter Geoghegan <pg@bowt.ie> wrote:
I think unique indexes may benefit from deduplication not only because
of NULL values. Non-HOT updates produce duplicates of non-NULL values
in unique indexes. And those duplicates can take significant space.
I agree that we should definitely have an open mind about unique
indexes, even with non-NULL values. If we can prevent a page split by
deduplicating the contents of a unique index page, then we'll probably
win. Why not try? This will need to be tested.
I thought about this some more. I believe that the LP_DEAD bit setting
within _bt_check_unique() is generally more important than the more
complicated kill_prior_tuple mechanism for setting LP_DEAD bits, even
though the _bt_check_unique() thing can only be used with unique
indexes. Also, I have often thought that we don't do enough to take
advantage of the special characteristics of unique indexes -- they
really are quite different. I believe that other database systems do
this in various ways. Maybe we should too.
Unique indexes are special because there can only ever be zero or one
tuples of the same value that are visible to any possible MVCC
snapshot. Within the index AM, there is little difference between an
UPDATE by a transaction and a DELETE + INSERT of the same value by a
transaction. If there are 3 or 5 duplicates within a unique index,
then there is a strong chance that VACUUM could reclaim some of them,
given the chance. It is worth going to a little effort to find out.
In a traditional serial/bigserial primary key, the key space that is
typically "owned" by the left half of a rightmost page split describes
a range of about 366 items, with few or no gaps for other values that
didn't exist at the time of the split (i.e. the two pivot tuples on
each side cover a range that is equal to the number of items itself).
If the page ever splits again, the chance of it being due to non-HOT
updates is perhaps 100%. Maybe VACUUM just didn't get around to the
index in time, or maybe there is a long running xact, or whatever. If
we can delay page splits in indexes like this, then we could easily
prevent them from *ever* happening.
Our first line of defense against page splits within unique indexes
will probably always be LP_DEAD bits set within _bt_check_unique(),
because it costs so little -- same as today. We could also add a
second line of defense: deduplication -- same as with non-unique
indexes with the patch. But we can even add a third line of defense on
top of those two: more aggressive reclaiming of posting list space, by
going to the heap to check the visibility status of earlier posting
list entries. We can do this optimistically when there is no LP_DEAD
bit set, based on heuristics.
The high level principle here is that we can justify going to a small
amount of extra effort for the chance to avoid a page split, and maybe
even more than a small amount. Our chances of reversing the split by
merging pages later on are almost zero. The two halves of the split
will probably each get dirtied again and again in the future if we
cannot avoid it, plus we have to dirty the parent page, and the old
sibling page (to update its left link). In general, a page split is
already really expensive. We could do something like amortize the cost
of accessing the heap a second time for tuples that we won't have
considered setting the LP_DEAD bit on within _bt_check_unique() by
trying the *same* heap page a *second* time where possible (distinct
values are likely to be nearby on the same page). I think that an
approach like this could work quite well for many workloads. You only
pay a cost (visiting the heap an extra time) when it looks like you'll
get a benefit (not splitting the page).
As you know, Andres already changed nbtree to get an XID for conflict
purposes on the primary by visiting the heap a second time (see commit
558a9165e08), when we need to actually reclaim LP_DEAD space. I
anticipated that we could extend this to do more clever/eager/lazy
cleanup of additional items before that went in, which is a closely
related idea. See:
/messages/by-id/CAH2-Wznx8ZEuXu7BMr6cVpJ26G8OSqdVo6Lx_e3HSOOAU86YoQ@mail.gmail.com
I know that this is a bit hand-wavy; the details certainly need to be
worked out. However, it is not so different to the "ghost bit" design
that other systems use with their non-unique indexes (though this idea
applies specifically to unique indexes in our case). The main
difference is that we're going to the heap rather than to UNDO,
because that's where we store our visibility information. That doesn't
seem like such a big difference -- we are also reasonably confident
that we'll find that the TID is dead, even without LP_DEAD bits being
set, because we only do the extra stuff with unique indexes. And, we
do it lazily.
--
Peter Geoghegan
11.07.2019 21:19, Peter Geoghegan wrote:
On Thu, Jul 11, 2019 at 8:34 AM Rafia Sabih <rafia.pghackers@gmail.com> wrote:
Hi,
Peter, Rafia, thanks for the review. New version is attached.
+ elog(DEBUG4, "insert_itupprev_to_page. compressState->ntuples %d IndexTupleSize %zu free %zu",
+ compressState->ntuples, IndexTupleSize(to_insert), PageGetFreeSpace(page));
+
and other such DEBUG4 statements are meant to be removed, right...?
I hope so too.
Yes, these messages are only for debugging.
I haven't deleted them since this is still a work in progress
and it's handy to be able to print inner details.
Maybe I should also write a patch for pageinspect.
/*
* If we have only 10 uncompressed items on the full page, it probably
* won't worth to compress them.
*/
if (maxoff - n_posting_on_page < 10)
return;
Is this a magic number...?
I think that this should be a constant or something.
Fixed. Now this is a constant in nbtree.h. I'm not 100% sure about the
value.
Once the code stabilizes, we can benchmark it and find the optimal value.
/*
* We do not expect to meet any DEAD items, since this function is
* called right after _bt_vacuum_one_page(). If for some reason we
* found dead item, don't compress it, to allow upcoming microvacuum
* or vacuum clean it up.
*/
if (ItemIdIsDead(itemId))
continue;
This makes me wonder about those 'some' reasons.
I think that this is just defensive. Note that _bt_vacuum_one_page()
is prepared to find no dead items, even when the BTP_HAS_GARBAGE flag
is set for the page.
You are right, it is currently impossible to meet dead items in this function.
That could change in the future if, for example, _bt_vacuum_one_page
starts to behave lazily.
So this is just a sanity check. Maybe it's worth turning it into an Assert.
Caller is responsible for checking BTreeTupleIsPosting to ensure that
+ * he will get what he expects
This can be re-framed to make the caller more gender neutral.
Agreed. I also don't like anthropomorphizing code like this.
Fixed.
Other than that, I am curious about the plans for its backward compatibility.
Me too. There is something about a new version 5 in comments in
nbtree.h, but the version number isn't changed. I think that we may be
able to get away with not increasing the B-Tree version from 4 to 5,
actually. Deduplication is performed lazily when it looks like we
might have to split the page, so there isn't any expectation that
tuples will either be compressed or uncompressed in any context.
The current implementation is backward compatible.
To distinguish posting tuples, it adds only one new flag combination,
which was never possible before. The comment about version 5 has been
deleted.
I also added a patch for amcheck.
There is one major issue left: preserving TID order in posting lists.
As a start, I added a sort to the BTreeFormPostingTuple function.
It turned out not to be very helpful, because we cannot check this
invariant lazily.
I am now patching _bt_binsrch_insert() and _bt_insertonpg() to implement
insertion into the middle of the posting list. I'll send a new version
this week.
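The following is a minimal sketch of what "insertion into the middle of the posting list" could look like on a plain sorted array, assuming TIDs are ordered by block number and then by offset. SketchTid, sketch_tid_cmp() and sketch_posting_insert() are hypothetical names for this example; the real change to _bt_binsrch_insert() and _bt_insertonpg() is not shown in this email and may work differently.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Simplified stand-in for ItemPointerData: (block, offset). */
typedef struct SketchTid
{
    uint32_t block;
    uint16_t offset;
} SketchTid;

/* Three-way comparison: by block first, then by offset. */
static int
sketch_tid_cmp(const SketchTid *a, const SketchTid *b)
{
    if (a->block != b->block)
        return (a->block < b->block) ? -1 : 1;
    if (a->offset != b->offset)
        return (a->offset < b->offset) ? -1 : 1;
    return 0;
}

/*
 * Insert newtid into the sorted array list[0..n-1], keeping it sorted.
 * The caller must guarantee room for at least n + 1 entries.
 * Returns the new number of entries.
 */
static int
sketch_posting_insert(SketchTid *list, int n, SketchTid newtid)
{
    int lo = 0;
    int hi = n;

    assert(list != NULL && n >= 0);

    /* Binary search for the first entry that is not less than newtid. */
    while (lo < hi)
    {
        int mid = lo + (hi - lo) / 2;

        if (sketch_tid_cmp(&list[mid], &newtid) < 0)
            lo = mid + 1;
        else
            hi = mid;
    }

    /* Shift the tail right by one slot and place the new TID. */
    memmove(&list[lo + 1], &list[lo], (size_t) (n - lo) * sizeof(SketchTid));
    list[lo] = newtid;
    return n + 1;
}
```

Keeping the order at insertion time, rather than sorting when the posting tuple is formed, means the invariant holds for every tuple already on the page, so it can be checked at any point.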
--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachments:
0001-btree_compression_pg12_v2.patch (text/x-patch)
diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index 9126c18..2b05b1e 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -1033,12 +1033,34 @@ bt_target_page_check(BtreeCheckState *state)
{
IndexTuple norm;
- norm = bt_normalize_tuple(state, itup);
- bloom_add_element(state->filter, (unsigned char *) norm,
- IndexTupleSize(norm));
- /* Be tidy */
- if (norm != itup)
- pfree(norm);
+ if (BTreeTupleIsPosting(itup))
+ {
+ IndexTuple onetup;
+ int i;
+
+ /* Fingerprint all elements of posting tuple one by one */
+ for (i = 0; i < BTreeTupleGetNPosting(itup); i++)
+ {
+ onetup = BTreeGetNthTupleOfPosting(itup, i);
+
+ norm = bt_normalize_tuple(state, onetup);
+ bloom_add_element(state->filter, (unsigned char *) norm,
+ IndexTupleSize(norm));
+ /* Be tidy */
+ if (norm != onetup)
+ pfree(norm);
+ pfree(onetup);
+ }
+ }
+ else
+ {
+ norm = bt_normalize_tuple(state, itup);
+ bloom_add_element(state->filter, (unsigned char *) norm,
+ IndexTupleSize(norm));
+ /* Be tidy */
+ if (norm != itup)
+ pfree(norm);
+ }
}
/*
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index 602f884..26ddf32 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -20,6 +20,7 @@
#include "access/tableam.h"
#include "access/transam.h"
#include "access/xloginsert.h"
+#include "catalog/catalog.h"
#include "miscadmin.h"
#include "storage/lmgr.h"
#include "storage/predicate.h"
@@ -56,6 +57,8 @@ static void _bt_insert_parent(Relation rel, Buffer buf, Buffer rbuf,
static bool _bt_pgaddtup(Page page, Size itemsize, IndexTuple itup,
OffsetNumber itup_off);
static void _bt_vacuum_one_page(Relation rel, Buffer buffer, Relation heapRel);
+static bool insert_itupprev_to_page(Page page, BTCompressState *compressState);
+static void _bt_compress_one_page(Relation rel, Buffer buffer, Relation heapRel);
/*
* _bt_doinsert() -- Handle insertion of a single index tuple in the tree.
@@ -759,6 +762,12 @@ _bt_findinsertloc(Relation rel,
_bt_vacuum_one_page(rel, insertstate->buf, heapRel);
insertstate->bounds_valid = false;
}
+
+ /*
+ * If the target page is full, try to compress the page
+ */
+ if (PageGetFreeSpace(page) < insertstate->itemsz)
+ _bt_compress_one_page(rel, insertstate->buf, heapRel);
}
else
{
@@ -806,6 +815,11 @@ _bt_findinsertloc(Relation rel,
}
/*
+ * Before considering moving right, try to compress the page
+ */
+ _bt_compress_one_page(rel, insertstate->buf, heapRel);
+
+ /*
* Nope, so check conditions (b) and (c) enumerated above
*
* The earlier _bt_check_unique() call may well have established a
@@ -2286,3 +2300,241 @@ _bt_vacuum_one_page(Relation rel, Buffer buffer, Relation heapRel)
* the page.
*/
}
+
+/*
+ * Add new item (compressed or not) to the page, while compressing it.
+ * If insertion failed, return false.
+ * Caller should consider this as compression failure and
+ * leave page uncompressed.
+ */
+static bool
+insert_itupprev_to_page(Page page, BTCompressState *compressState)
+{
+ IndexTuple to_insert;
+ OffsetNumber offnum = PageGetMaxOffsetNumber(page);
+
+ if (compressState->ntuples == 0)
+ to_insert = compressState->itupprev;
+ else
+ {
+ IndexTuple postingtuple;
+
+ /* form a tuple with a posting list */
+ postingtuple = BTreeFormPostingTuple(compressState->itupprev,
+ compressState->ipd,
+ compressState->ntuples);
+ to_insert = postingtuple;
+ pfree(compressState->ipd);
+ }
+
+ /* Add the new item into the page */
+ offnum = OffsetNumberNext(offnum);
+
+ elog(DEBUG4, "insert_itupprev_to_page. compressState->ntuples %d IndexTupleSize %zu free %zu",
+ compressState->ntuples, IndexTupleSize(to_insert), PageGetFreeSpace(page));
+
+ if (PageAddItem(page, (Item) to_insert, IndexTupleSize(to_insert),
+ offnum, false, false) == InvalidOffsetNumber)
+ {
+ elog(DEBUG4, "insert_itupprev_to_page. failed");
+
+ /*
+ * This may happen if the tuple is bigger than the free space. Fall
+ * back to the uncompressed page case.
+ */
+ if (compressState->ntuples > 0)
+ pfree(to_insert);
+ return false;
+ }
+
+ if (compressState->ntuples > 0)
+ pfree(to_insert);
+ compressState->ntuples = 0;
+ return true;
+}
+
+/*
+ * Before splitting the page, try to compress items to free some space.
+ * If compression didn't succeed, buffer will contain old state of the page.
+ * This function should be called after lp_dead items
+ * were removed by _bt_vacuum_one_page().
+ */
+static void
+_bt_compress_one_page(Relation rel, Buffer buffer, Relation heapRel)
+{
+ OffsetNumber offnum,
+ minoff,
+ maxoff;
+ Page page = BufferGetPage(buffer);
+ Page newpage;
+ BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+ bool use_compression = false;
+ BTCompressState *compressState = NULL;
+ int n_posting_on_page = 0;
+ int natts = IndexRelationGetNumberOfAttributes(rel);
+
+ /*
+ * Don't use compression for indexes with INCLUDEd columns, system indexes
+ * and unique indexes.
+ */
+ use_compression = ((IndexRelationGetNumberOfKeyAttributes(rel) ==
+ IndexRelationGetNumberOfAttributes(rel))
+ && (!IsSystemRelation(rel))
+ && (!rel->rd_index->indisunique));
+ if (!use_compression)
+ return;
+
+ /* init compress state needed to build posting tuples */
+ compressState = (BTCompressState *) palloc0(sizeof(BTCompressState));
+ compressState->ipd = NULL;
+ compressState->ntuples = 0;
+ compressState->itupprev = NULL;
+ compressState->maxitemsize = BTMaxItemSize(page);
+ compressState->maxpostingsize = 0;
+
+ /*
+ * Scan over all items to see which ones can be compressed
+ */
+ minoff = P_FIRSTDATAKEY(opaque);
+ maxoff = PageGetMaxOffsetNumber(page);
+
+ /*
+ * Heuristic to avoid trying to compress a page that already contains
+ * mostly compressed items
+ */
+ for (offnum = minoff;
+ offnum <= maxoff;
+ offnum = OffsetNumberNext(offnum))
+ {
+ ItemId itemid = PageGetItemId(page, offnum);
+ IndexTuple item = (IndexTuple) PageGetItem(page, itemid);
+
+ if (BTreeTupleIsPosting(item))
+ n_posting_on_page++;
+ }
+
+ /*
+ * If we have only a few uncompressed items on the full page,
+ * it isn't worth compressing them
+ */
+ if (maxoff - n_posting_on_page < BT_COMPRESS_THRESHOLD)
+ return;
+
+ newpage = PageGetTempPageCopySpecial(page);
+ elog(DEBUG4, "_bt_compress_one_page rel: %s,blkno: %u",
+ RelationGetRelationName(rel), BufferGetBlockNumber(buffer));
+
+ /* Copy High Key if any */
+ if (!P_RIGHTMOST(opaque))
+ {
+ ItemId itemid = PageGetItemId(page, P_HIKEY);
+ Size itemsz = ItemIdGetLength(itemid);
+ IndexTuple item = (IndexTuple) PageGetItem(page, itemid);
+
+ if (PageAddItem(newpage, (Item) item, itemsz, P_HIKEY,
+ false, false) == InvalidOffsetNumber)
+ {
+ /*
+ * Should never happen. Anyway, fallback gently to scenario of
+ * incompressible page and just return from function.
+ */
+ elog(DEBUG4, "_bt_compress_one_page. failed to insert highkey to newpage");
+ return;
+ }
+ }
+
+ /*
+ * Iterate over tuples on the page, try to compress them into posting
+ * lists and insert into new page.
+ */
+ for (offnum = minoff;
+ offnum <= maxoff;
+ offnum = OffsetNumberNext(offnum))
+ {
+ ItemId itemId = PageGetItemId(page, offnum);
+ IndexTuple itup = (IndexTuple) PageGetItem(page, itemId);
+
+ /*
+ * We do not expect to meet any DEAD items, since this function is
+ * called right after _bt_vacuum_one_page(). If for some reason we
+ * found dead item, don't compress it, to allow upcoming microvacuum
+ * or vacuum clean it up.
+ */
+ if (ItemIdIsDead(itemId))
+ continue;
+
+ if (compressState->itupprev != NULL)
+ {
+ int n_equal_atts =
+ _bt_keep_natts_fast(rel, compressState->itupprev, itup);
+ int itup_ntuples = BTreeTupleIsPosting(itup) ?
+ BTreeTupleGetNPosting(itup) : 1;
+
+ if (n_equal_atts > natts)
+ {
+ /*
+ * When tuples are equal, create or update posting.
+ *
+ * If posting is too big, insert it on page and continue.
+ */
+ if (compressState->maxitemsize >
+ MAXALIGN(((IndexTupleSize(compressState->itupprev)
+ + (compressState->ntuples + itup_ntuples + 1) * sizeof(ItemPointerData)))))
+ {
+ add_item_to_posting(compressState, itup);
+ }
+ else if (!insert_itupprev_to_page(newpage, compressState))
+ {
+ elog(DEBUG4, "_bt_compress_one_page. failed to insert posting");
+ return;
+ }
+ }
+ else
+ {
+ /*
+ * Tuples are not equal. Insert itupprev into index. Save
+ * current tuple for the next iteration.
+ */
+ if (!insert_itupprev_to_page(newpage, compressState))
+ {
+ elog(DEBUG4, "_bt_compress_one_page. failed to insert posting");
+ return;
+ }
+ }
+ }
+
+ /*
+ * Copy the tuple into temp variable itupprev to compare it with the
+ * following tuple and maybe unite them into a posting tuple
+ */
+ if (compressState->itupprev)
+ pfree(compressState->itupprev);
+ compressState->itupprev = CopyIndexTuple(itup);
+
+ Assert(IndexTupleSize(compressState->itupprev) <= compressState->maxitemsize);
+ }
+
+ /* Handle the last item. */
+ if (!insert_itupprev_to_page(newpage, compressState))
+ {
+ elog(DEBUG4, "_bt_compress_one_page. failed to insert posting for last item");
+ return;
+ }
+
+ START_CRIT_SECTION();
+ PageRestoreTempPage(newpage, page);
+ MarkBufferDirty(buffer);
+
+ /* Log full page write */
+ if (RelationNeedsWAL(rel))
+ {
+ XLogRecPtr recptr;
+
+ recptr = log_newpage_buffer(buffer, true);
+ PageSetLSN(page, recptr);
+ }
+ END_CRIT_SECTION();
+
+ elog(DEBUG4, "_bt_compress_one_page. success");
+ return;
+}
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 50455db..dff506d 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -1022,14 +1022,53 @@ _bt_page_recyclable(Page page)
void
_bt_delitems_vacuum(Relation rel, Buffer buf,
OffsetNumber *itemnos, int nitems,
+ OffsetNumber *remainingoffset,
+ IndexTuple *remaining, int nremaining,
BlockNumber lastBlockVacuumed)
{
Page page = BufferGetPage(buf);
BTPageOpaque opaque;
+ int i;
+ Size itemsz;
+ Size remaining_sz = 0;
+ char *remaining_buf = NULL;
+
+ /* XLOG stuff, buffer for remainings */
+ if (nremaining && RelationNeedsWAL(rel))
+ {
+ Size offset = 0;
+
+ for (i = 0; i < nremaining; i++)
+ remaining_sz += MAXALIGN(IndexTupleSize(remaining[i]));
+
+ remaining_buf = palloc0(remaining_sz);
+ for (i = 0; i < nremaining; i++)
+ {
+ itemsz = IndexTupleSize(remaining[i]);
+ memcpy(remaining_buf + offset, (char *) remaining[i], itemsz);
+ offset += MAXALIGN(itemsz);
+ }
+ Assert(offset == remaining_sz);
+ }
/* No ereport(ERROR) until changes are logged */
START_CRIT_SECTION();
+ /* Handle posting tuples here */
+ for (i = 0; i < nremaining; i++)
+ {
+ /* At first, delete the old tuple. */
+ PageIndexTupleDelete(page, remainingoffset[i]);
+
+ itemsz = IndexTupleSize(remaining[i]);
+ itemsz = MAXALIGN(itemsz);
+
+ /* Add tuple with remaining ItemPointers to the page. */
+ if (PageAddItem(page, (Item) remaining[i], itemsz, remainingoffset[i],
+ false, false) == InvalidOffsetNumber)
+ elog(ERROR, "failed to rewrite compressed item in index while doing vacuum");
+ }
+
/* Fix the page */
if (nitems > 0)
PageIndexMultiDelete(page, itemnos, nitems);
@@ -1059,6 +1098,8 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
xl_btree_vacuum xlrec_vacuum;
xlrec_vacuum.lastBlockVacuumed = lastBlockVacuumed;
+ xlrec_vacuum.nremaining = nremaining;
+ xlrec_vacuum.ndeleted = nitems;
XLogBeginInsert();
XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
@@ -1072,6 +1113,19 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
if (nitems > 0)
XLogRegisterBufData(0, (char *) itemnos, nitems * sizeof(OffsetNumber));
+ /*
+ * Here we should save offnums and remaining tuples themselves. It's
+ * important to restore them in correct order. At first, we must
+ * handle remaining tuples and only after that other deleted items.
+ */
+ if (nremaining > 0)
+ {
+ Assert(remaining_buf != NULL);
+ XLogRegisterBufData(0, (char *) remainingoffset,
+ nremaining * sizeof(OffsetNumber));
+ XLogRegisterBufData(0, remaining_buf, remaining_sz);
+ }
+
recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_VACUUM);
PageSetLSN(page, recptr);
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 4cfd528..11e45c8 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -97,6 +97,8 @@ static void btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
BTCycleId cycleid, TransactionId *oldestBtpoXact);
static void btvacuumpage(BTVacState *vstate, BlockNumber blkno,
BlockNumber orig_blkno);
+static ItemPointer btreevacuumPosting(BTVacState *vstate, IndexTuple itup,
+ int *nremaining);
/*
@@ -1069,7 +1071,8 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
RBM_NORMAL, info->strategy);
LockBufferForCleanup(buf);
_bt_checkpage(rel, buf);
- _bt_delitems_vacuum(rel, buf, NULL, 0, vstate.lastBlockVacuumed);
+ _bt_delitems_vacuum(rel, buf, NULL, 0, NULL, NULL, 0,
+ vstate.lastBlockVacuumed);
_bt_relbuf(rel, buf);
}
@@ -1193,6 +1196,9 @@ restart:
OffsetNumber offnum,
minoff,
maxoff;
+ IndexTuple remaining[MaxOffsetNumber];
+ OffsetNumber remainingoffset[MaxOffsetNumber];
+ int nremaining;
/*
* Trade in the initial read lock for a super-exclusive write lock on
@@ -1229,6 +1235,7 @@ restart:
* callback function.
*/
ndeletable = 0;
+ nremaining = 0;
minoff = P_FIRSTDATAKEY(opaque);
maxoff = PageGetMaxOffsetNumber(page);
if (callback)
@@ -1242,31 +1249,78 @@ restart:
itup = (IndexTuple) PageGetItem(page,
PageGetItemId(page, offnum));
- htup = &(itup->t_tid);
- /*
- * During Hot Standby we currently assume that
- * XLOG_BTREE_VACUUM records do not produce conflicts. That is
- * only true as long as the callback function depends only
- * upon whether the index tuple refers to heap tuples removed
- * in the initial heap scan. When vacuum starts it derives a
- * value of OldestXmin. Backends taking later snapshots could
- * have a RecentGlobalXmin with a later xid than the vacuum's
- * OldestXmin, so it is possible that row versions deleted
- * after OldestXmin could be marked as killed by other
- * backends. The callback function *could* look at the index
- * tuple state in isolation and decide to delete the index
- * tuple, though currently it does not. If it ever did, we
- * would need to reconsider whether XLOG_BTREE_VACUUM records
- * should cause conflicts. If they did cause conflicts they
- * would be fairly harsh conflicts, since we haven't yet
- * worked out a way to pass a useful value for
- * latestRemovedXid on the XLOG_BTREE_VACUUM records. This
- * applies to *any* type of index that marks index tuples as
- * killed.
- */
- if (callback(htup, callback_state))
- deletable[ndeletable++] = offnum;
+ if (BTreeTupleIsPosting(itup))
+ {
+ int nnewipd = 0;
+ ItemPointer newipd = NULL;
+
+ newipd = btreevacuumPosting(vstate, itup, &nnewipd);
+
+ if (nnewipd == 0)
+ {
+ /*
+ * All TIDs from posting list must be deleted, we can
+ * delete whole tuple in a regular way.
+ */
+ deletable[ndeletable++] = offnum;
+ }
+ else if (nnewipd == BTreeTupleGetNPosting(itup))
+ {
+ /*
+ * All TIDs from posting tuple must remain. Do
+ * nothing, just cleanup.
+ */
+ pfree(newipd);
+ }
+ else if (nnewipd < BTreeTupleGetNPosting(itup))
+ {
+ /* Some TIDs from posting tuple must remain. */
+ Assert(nnewipd > 0);
+ Assert(newipd != NULL);
+
+ /*
+ * Form new tuple that contains only remaining TIDs.
+ * Remember this tuple and the offset of the old tuple
+ * to update it in place.
+ */
+ remainingoffset[nremaining] = offnum;
+ remaining[nremaining] = BTreeFormPostingTuple(itup, newipd, nnewipd);
+ nremaining++;
+ pfree(newipd);
+
+ Assert(IndexTupleSize(itup) <= BTMaxItemSize(page));
+ }
+ }
+ else
+ {
+ htup = &(itup->t_tid);
+
+ /*
+ * During Hot Standby we currently assume that
+ * XLOG_BTREE_VACUUM records do not produce conflicts.
+ * That is only true as long as the callback function
+ * depends only upon whether the index tuple refers to
+ * heap tuples removed in the initial heap scan. When
+ * vacuum starts it derives a value of OldestXmin.
+ * Backends taking later snapshots could have a
+ * RecentGlobalXmin with a later xid than the vacuum's
+ * OldestXmin, so it is possible that row versions deleted
+ * after OldestXmin could be marked as killed by other
+ * backends. The callback function *could* look at the
+ * index tuple state in isolation and decide to delete the
+ * index tuple, though currently it does not. If it ever
+ * did, we would need to reconsider whether
+ * XLOG_BTREE_VACUUM records should cause conflicts. If
+ * they did cause conflicts they would be fairly harsh
+ * conflicts, since we haven't yet worked out a way to
+ * pass a useful value for latestRemovedXid on the
+ * XLOG_BTREE_VACUUM records. This applies to *any* type
+ * of index that marks index tuples as killed.
+ */
+ if (callback(htup, callback_state))
+ deletable[ndeletable++] = offnum;
+ }
}
}
@@ -1274,7 +1328,7 @@ restart:
* Apply any needed deletes. We issue just one _bt_delitems_vacuum()
* call per page, so as to minimize WAL traffic.
*/
- if (ndeletable > 0)
+ if (ndeletable > 0 || nremaining > 0)
{
/*
* Notice that the issued XLOG_BTREE_VACUUM WAL record includes
@@ -1291,6 +1345,7 @@ restart:
* that.
*/
_bt_delitems_vacuum(rel, buf, deletable, ndeletable,
+ remainingoffset, remaining, nremaining,
vstate->lastBlockVacuumed);
/*
@@ -1376,6 +1431,42 @@ restart:
}
/*
+ * btreevacuumPosting() -- vacuums a posting tuple.
+ *
+ * Returns new palloc'd posting list with remaining items.
+ * Posting list size is returned via nremaining.
+ *
+ * If all items are dead,
+ * nremaining is 0 and resulting posting list is NULL.
+ */
+static ItemPointer
+btreevacuumPosting(BTVacState *vstate, IndexTuple itup, int *nremaining)
+{
+ int i,
+ remaining = 0;
+ int nitem = BTreeTupleGetNPosting(itup);
+ ItemPointer tmpitems = NULL,
+ items = BTreeTupleGetPosting(itup);
+
+ /*
+ * Check each tuple in the posting list, save alive tuples into tmpitems
+ */
+ for (i = 0; i < nitem; i++)
+ {
+ if (vstate->callback(items + i, vstate->callback_state))
+ continue;
+
+ if (tmpitems == NULL)
+ tmpitems = palloc(sizeof(ItemPointerData) * nitem);
+
+ tmpitems[remaining++] = items[i];
+ }
+
+ *nremaining = remaining;
+ return tmpitems;
+}
+
+/*
* btcanreturn() -- Check whether btree indexes support index-only scans.
*
* btrees always do, so this is trivial.
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index c655dad..49a1aae 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -30,6 +30,9 @@ static bool _bt_readpage(IndexScanDesc scan, ScanDirection dir,
OffsetNumber offnum);
static void _bt_saveitem(BTScanOpaque so, int itemIndex,
OffsetNumber offnum, IndexTuple itup);
+static void _bt_savePostingitem(BTScanOpaque so, int itemIndex,
+ OffsetNumber offnum, ItemPointer iptr,
+ IndexTuple itup, int i);
static bool _bt_steppage(IndexScanDesc scan, ScanDirection dir);
static bool _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir);
static bool _bt_parallel_readpage(IndexScanDesc scan, BlockNumber blkno,
@@ -665,6 +668,9 @@ _bt_compare(Relation rel,
* Use the heap TID attribute and scantid to try to break the tie. The
* rules are the same as any other key attribute -- only the
* representation differs.
+ * TODO when itup is a posting tuple, the check becomes more complex.
+ * we have an option that key nor smaller, nor larger than the tuple,
+ * but exactly in between of BTreeTupleGetMinTID to BTreeTupleGetMaxTID.
*/
heapTid = BTreeTupleGetHeapTID(itup);
if (key->scantid == NULL)
@@ -1410,6 +1416,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
int itemIndex;
bool continuescan;
int indnatts;
+ int i;
/*
* We must have the buffer pinned and locked, but the usual macro can't be
@@ -1456,6 +1463,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
/* initialize tuple workspace to empty */
so->currPos.nextTupleOffset = 0;
+ so->currPos.prevTupleOffset = 0;
/*
* Now that the current page has been made consistent, the macro should be
@@ -1490,8 +1498,22 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
if (_bt_checkkeys(scan, itup, indnatts, dir, &continuescan))
{
/* tuple passes all scan key conditions, so remember it */
- _bt_saveitem(so, itemIndex, offnum, itup);
- itemIndex++;
+ if (BTreeTupleIsPosting(itup))
+ {
+ for (i = 0; i < BTreeTupleGetNPosting(itup); i++)
+ {
+ _bt_savePostingitem(so, itemIndex, offnum,
+ BTreeTupleGetPostingN(itup, i),
+ itup, i);
+ itemIndex++;
+ }
+ }
+ else
+ {
+ _bt_saveitem(so, itemIndex, offnum, itup);
+ itemIndex++;
+ }
+
}
/* When !continuescan, there can't be any more matches, so stop */
if (!continuescan)
@@ -1524,7 +1546,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
if (!continuescan)
so->currPos.moreRight = false;
- Assert(itemIndex <= MaxIndexTuplesPerPage);
+ Assert(itemIndex <= MaxPostingIndexTuplesPerPage);
so->currPos.firstItem = 0;
so->currPos.lastItem = itemIndex - 1;
so->currPos.itemIndex = 0;
@@ -1532,7 +1554,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
else
{
/* load items[] in descending order */
- itemIndex = MaxIndexTuplesPerPage;
+ itemIndex = MaxPostingIndexTuplesPerPage;
offnum = Min(offnum, maxoff);
@@ -1574,8 +1596,22 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
if (passes_quals && tuple_alive)
{
/* tuple passes all scan key conditions, so remember it */
- itemIndex--;
- _bt_saveitem(so, itemIndex, offnum, itup);
+ if (BTreeTupleIsPosting(itup))
+ {
+ for (i = 0; i < BTreeTupleGetNPosting(itup); i++)
+ {
+ itemIndex--;
+ _bt_savePostingitem(so, itemIndex, offnum,
+ BTreeTupleGetPostingN(itup, i),
+ itup, i);
+ }
+ }
+ else
+ {
+ itemIndex--;
+ _bt_saveitem(so, itemIndex, offnum, itup);
+ }
+
}
if (!continuescan)
{
@@ -1589,8 +1625,8 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
Assert(itemIndex >= 0);
so->currPos.firstItem = itemIndex;
- so->currPos.lastItem = MaxIndexTuplesPerPage - 1;
- so->currPos.itemIndex = MaxIndexTuplesPerPage - 1;
+ so->currPos.lastItem = MaxPostingIndexTuplesPerPage - 1;
+ so->currPos.itemIndex = MaxPostingIndexTuplesPerPage - 1;
}
return (so->currPos.firstItem <= so->currPos.lastItem);
@@ -1603,6 +1639,8 @@ _bt_saveitem(BTScanOpaque so, int itemIndex,
{
BTScanPosItem *currItem = &so->currPos.items[itemIndex];
+ Assert(!BTreeTupleIsPosting(itup));
+
currItem->heapTid = itup->t_tid;
currItem->indexOffset = offnum;
if (so->currTuples)
@@ -1615,6 +1653,33 @@ _bt_saveitem(BTScanOpaque so, int itemIndex,
}
}
+/* Save an index item into so->currPos.items[itemIndex] for posting tuples. */
+static void
+_bt_savePostingitem(BTScanOpaque so, int itemIndex, OffsetNumber offnum,
+ ItemPointer iptr, IndexTuple itup, int i)
+{
+ BTScanPosItem *currItem = &so->currPos.items[itemIndex];
+
+ currItem->heapTid = *iptr;
+ currItem->indexOffset = offnum;
+
+ if (so->currTuples)
+ {
+ if (i == 0)
+ {
+ /* save key. the same for all tuples in the posting */
+ Size itupsz = BTreeTupleGetPostingOffset(itup);
+
+ currItem->tupleOffset = so->currPos.nextTupleOffset;
+ memcpy(so->currTuples + so->currPos.nextTupleOffset, itup, itupsz);
+ so->currPos.nextTupleOffset += MAXALIGN(itupsz);
+ so->currPos.prevTupleOffset = currItem->tupleOffset;
+ }
+ else
+ currItem->tupleOffset = so->currPos.prevTupleOffset;
+ }
+}
+
/*
* _bt_steppage() -- Step to next page containing valid data for scan
*
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index d0b9013..955a628 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -65,6 +65,7 @@
#include "access/xact.h"
#include "access/xlog.h"
#include "access/xloginsert.h"
+#include "catalog/catalog.h"
#include "catalog/index.h"
#include "commands/progress.h"
#include "miscadmin.h"
@@ -288,6 +289,9 @@ static void _bt_sortaddtup(Page page, Size itemsize,
static void _bt_buildadd(BTWriteState *wstate, BTPageState *state,
IndexTuple itup);
static void _bt_uppershutdown(BTWriteState *wstate, BTPageState *state);
+static void insert_itupprev_to_page_buildadd(BTWriteState *wstate,
+ BTPageState *state,
+ BTCompressState *compressState);
static void _bt_load(BTWriteState *wstate,
BTSpool *btspool, BTSpool *btspool2);
static void _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent,
@@ -972,6 +976,11 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
* only shift the line pointer array back and forth, and overwrite
* the tuple space previously occupied by oitup. This is fairly
* cheap.
+ *
+ * If lastleft tuple was a posting tuple, we'll truncate its
+ * posting list in _bt_truncate as well. Note that it is also
+ * applicable only to leaf pages, since internal pages never
+ * contain posting tuples.
*/
ii = PageGetItemId(opage, OffsetNumberPrev(last_off));
lastleft = (IndexTuple) PageGetItem(opage, ii);
@@ -1011,6 +1020,7 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
* the minimum key for the new page.
*/
state->btps_minkey = CopyIndexTuple(oitup);
+ Assert(!BTreeTupleIsPosting(state->btps_minkey));
/*
* Set the sibling links for both pages.
@@ -1050,8 +1060,36 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
if (last_off == P_HIKEY)
{
Assert(state->btps_minkey == NULL);
- state->btps_minkey = CopyIndexTuple(itup);
- /* _bt_sortaddtup() will perform full truncation later */
+
+ /*
+ * Stashed copy must be a non-posting tuple, with truncated posting
+ * list and correct t_tid since we're going to use it to build
+ * downlink.
+ */
+ if (BTreeTupleIsPosting(itup))
+ {
+ Size keytupsz;
+ IndexTuple keytup;
+
+ /*
+ * Form key tuple, that doesn't contain any ipd. NOTE: since we'll
+ * need TID later, set t_tid to the first t_tid from posting list.
+ */
+ keytupsz = BTreeTupleGetPostingOffset(itup);
+ keytup = palloc0(keytupsz);
+ memcpy(keytup, itup, keytupsz);
+
+ keytup->t_info &= ~INDEX_SIZE_MASK;
+ keytup->t_info |= keytupsz;
+ ItemPointerCopy(BTreeTupleGetPosting(itup), &keytup->t_tid);
+ state->btps_minkey = CopyIndexTuple(keytup);
+ pfree(keytup);
+ }
+ else
+ state->btps_minkey = CopyIndexTuple(itup); /* _bt_sortaddtup() will
+ * perform full
+ * truncation later */
+
BTreeTupleSetNAtts(state->btps_minkey, 0);
}
@@ -1137,6 +1175,89 @@ _bt_uppershutdown(BTWriteState *wstate, BTPageState *state)
}
/*
+ * Add new tuple (posting or non-posting) to the page, while building index.
+ */
+void
+insert_itupprev_to_page_buildadd(BTWriteState *wstate, BTPageState *state,
+ BTCompressState *compressState)
+{
+ IndexTuple to_insert;
+
+ /* Return, if there is no tuple to insert */
+ if (state == NULL)
+ return;
+
+ if (compressState->ntuples == 0)
+ to_insert = compressState->itupprev;
+ else
+ {
+ IndexTuple postingtuple;
+
+ /* form a tuple with a posting list */
+ postingtuple = BTreeFormPostingTuple(compressState->itupprev,
+ compressState->ipd,
+ compressState->ntuples);
+ to_insert = postingtuple;
+ pfree(compressState->ipd);
+ }
+
+ _bt_buildadd(wstate, state, to_insert);
+
+ if (compressState->ntuples > 0)
+ pfree(to_insert);
+ compressState->ntuples = 0;
+}
+
+/*
+ * Save item pointer(s) of itup to the posting list in compressState.
+ * Helper function for bt_load() and _bt_compress_one_page().
+ *
+ * Note: caller is responsible for size check to ensure that
+ * resulting tuple won't exceed BTMaxItemSize.
+ */
+void
+add_item_to_posting(BTCompressState *compressState, IndexTuple itup)
+{
+ int nposting = 0;
+
+ if (compressState->ntuples == 0)
+ {
+ compressState->ipd = palloc0(compressState->maxitemsize);
+
+ if (BTreeTupleIsPosting(compressState->itupprev))
+ {
+ /* if itupprev is posting, add all its TIDs to the posting list */
+ nposting = BTreeTupleGetNPosting(compressState->itupprev);
+ memcpy(compressState->ipd, BTreeTupleGetPosting(compressState->itupprev),
+ sizeof(ItemPointerData) * nposting);
+ compressState->ntuples += nposting;
+ }
+ else
+ {
+ memcpy(compressState->ipd, compressState->itupprev,
+ sizeof(ItemPointerData));
+ compressState->ntuples++;
+ }
+ }
+
+ if (BTreeTupleIsPosting(itup))
+ {
+ /* if tuple is posting, add all its TIDs to the posting list */
+ nposting = BTreeTupleGetNPosting(itup);
+ memcpy(compressState->ipd + compressState->ntuples,
+ BTreeTupleGetPosting(itup),
+ sizeof(ItemPointerData) * nposting);
+ compressState->ntuples += nposting;
+ }
+ else
+ {
+ memcpy(compressState->ipd + compressState->ntuples, itup,
+ sizeof(ItemPointerData));
+ compressState->ntuples++;
+ }
+}
+
+/*
* Read tuples in correct sort order from tuplesort, and load them into
* btree leaves.
*/
@@ -1150,9 +1271,21 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
bool load1;
TupleDesc tupdes = RelationGetDescr(wstate->index);
int i,
- keysz = IndexRelationGetNumberOfKeyAttributes(wstate->index);
+ keysz = IndexRelationGetNumberOfKeyAttributes(wstate->index),
+ natts = IndexRelationGetNumberOfAttributes(wstate->index);
SortSupport sortKeys;
int64 tuples_done = 0;
+ bool use_compression = false;
+ BTCompressState *compressState = NULL;
+
+ /*
+ * Don't use compression for indexes with INCLUDEd columns, system indexes
+ * and unique indexes.
+ */
+ use_compression = ((IndexRelationGetNumberOfKeyAttributes(wstate->index) ==
+ IndexRelationGetNumberOfAttributes(wstate->index))
+ && (!IsSystemRelation(wstate->index))
+ && (!wstate->index->rd_index->indisunique));
if (merge)
{
@@ -1266,19 +1399,88 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
}
else
{
- /* merge is unnecessary */
- while ((itup = tuplesort_getindextuple(btspool->sortstate,
- true)) != NULL)
+ if (!use_compression)
{
- /* When we see first tuple, create first index page */
- if (state == NULL)
- state = _bt_pagestate(wstate, 0);
+ /* merge is unnecessary */
+ while ((itup = tuplesort_getindextuple(btspool->sortstate,
+ true)) != NULL)
+ {
+ /* When we see first tuple, create first index page */
+ if (state == NULL)
+ state = _bt_pagestate(wstate, 0);
- _bt_buildadd(wstate, state, itup);
+ _bt_buildadd(wstate, state, itup);
- /* Report progress */
- pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
- ++tuples_done);
+ /* Report progress */
+ pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
+ ++tuples_done);
+ }
+ }
+ else
+ {
+ /* init compress state needed to build posting tuples */
+ compressState = (BTCompressState *) palloc0(sizeof(BTCompressState));
+ compressState->ipd = NULL;
+ compressState->ntuples = 0;
+ compressState->itupprev = NULL;
+ compressState->maxitemsize = 0;
+ compressState->maxpostingsize = 0;
+
+ while ((itup = tuplesort_getindextuple(btspool->sortstate,
+ true)) != NULL)
+ {
+ /* When we see first tuple, create first index page */
+ if (state == NULL)
+ {
+ state = _bt_pagestate(wstate, 0);
+ compressState->maxitemsize = BTMaxItemSize(state->btps_page);
+ }
+
+ if (compressState->itupprev != NULL)
+ {
+ int n_equal_atts = _bt_keep_natts_fast(wstate->index,
+ compressState->itupprev, itup);
+
+ if (n_equal_atts > natts)
+ {
+ /*
+ * Tuples are equal. Create or update posting.
+ *
+ * Else, if the posting list is too big, insert it on the page and
+ * continue.
+ */
+ if ((compressState->ntuples + 1) * sizeof(ItemPointerData) <
+ compressState->maxpostingsize)
+ add_item_to_posting(compressState, itup);
+ else
+ insert_itupprev_to_page_buildadd(wstate, state, compressState);
+ }
+ else
+ {
+ /*
+ * Tuples are not equal. Insert itupprev into index.
+ * Save current tuple for the next iteration.
+ */
+ insert_itupprev_to_page_buildadd(wstate, state, compressState);
+ }
+ }
+
+ /*
+ * Save the tuple to compare it with the next one and maybe
+ * unite them into a posting tuple.
+ */
+ if (compressState->itupprev)
+ pfree(compressState->itupprev);
+ compressState->itupprev = CopyIndexTuple(itup);
+
+ /* compute max size of posting list */
+ compressState->maxpostingsize = compressState->maxitemsize -
+ IndexInfoFindDataOffset(compressState->itupprev->t_info) -
+ MAXALIGN(IndexTupleSize(compressState->itupprev));
+ }
+
+ /* Handle the last item */
+ insert_itupprev_to_page_buildadd(wstate, state, compressState);
}
}
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 93fab26..0da6fa8 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -1787,7 +1787,9 @@ _bt_killitems(IndexScanDesc scan)
ItemId iid = PageGetItemId(page, offnum);
IndexTuple ituple = (IndexTuple) PageGetItem(page, iid);
- if (ItemPointerEquals(&ituple->t_tid, &kitem->heapTid))
+ /* No microvacuum for posting tuples */
+ if (!BTreeTupleIsPosting(ituple) &&
+ (ItemPointerEquals(&ituple->t_tid, &kitem->heapTid)))
{
/* found the item */
ItemIdMarkDead(iid);
@@ -2145,6 +2147,16 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
pivot = index_truncate_tuple(itupdesc, firstright, keepnatts);
+ if (BTreeTupleIsPosting(firstright))
+ {
+ BTreeTupleClearBtIsPosting(pivot);
+ BTreeTupleSetNAtts(pivot, keepnatts);
+ pivot->t_info &= ~INDEX_SIZE_MASK;
+ pivot->t_info |= BTreeTupleGetPostingOffset(firstright);
+ }
+
+ Assert(!BTreeTupleIsPosting(pivot));
+
/*
* If there is a distinguishing key attribute within new pivot tuple,
* there is no need to add an explicit heap TID attribute
@@ -2168,6 +2180,27 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
pfree(pivot);
pivot = tidpivot;
}
+ else if (BTreeTupleIsPosting(firstright))
+ {
+ /*
+ * No truncation was possible, since key attributes are all equal. But
+ * the tuple is a compressed tuple with a posting list, so we still
+ * must truncate it.
+ *
+ * It's necessary to add a heap TID attribute to the new pivot tuple.
+ */
+ newsize = BTreeTupleGetPostingOffset(firstright) +
+ MAXALIGN(sizeof(ItemPointerData));
+ pivot = palloc0(newsize);
+ memcpy(pivot, firstright, BTreeTupleGetPostingOffset(firstright));
+
+ pivot->t_info &= ~INDEX_SIZE_MASK;
+ pivot->t_info |= newsize;
+ BTreeTupleClearBtIsPosting(pivot);
+ BTreeTupleSetAltHeapTID(pivot);
+
+ Assert(!BTreeTupleIsPosting(pivot));
+ }
else
{
/*
@@ -2205,7 +2238,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
*/
pivotheaptid = (ItemPointer) ((char *) pivot + newsize -
sizeof(ItemPointerData));
- ItemPointerCopy(&lastleft->t_tid, pivotheaptid);
+ ItemPointerCopy(BTreeTupleGetMaxTID(lastleft), pivotheaptid);
/*
* Lehman and Yao require that the downlink to the right page, which is to
@@ -2216,9 +2249,12 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
* tiebreaker.
*/
#ifndef DEBUG_NO_TRUNCATE
- Assert(ItemPointerCompare(&lastleft->t_tid, &firstright->t_tid) < 0);
- Assert(ItemPointerCompare(pivotheaptid, &lastleft->t_tid) >= 0);
- Assert(ItemPointerCompare(pivotheaptid, &firstright->t_tid) < 0);
+ Assert(ItemPointerCompare(BTreeTupleGetMaxTID(lastleft),
+ BTreeTupleGetMinTID(firstright)) < 0);
+ Assert(ItemPointerCompare(pivotheaptid,
+ BTreeTupleGetMinTID(lastleft)) >= 0);
+ Assert(ItemPointerCompare(pivotheaptid,
+ BTreeTupleGetMinTID(firstright)) < 0);
#else
/*
@@ -2231,7 +2267,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
* attribute values along with lastleft's heap TID value when lastleft's
* TID happens to be greater than firstright's TID.
*/
- ItemPointerCopy(&firstright->t_tid, pivotheaptid);
+ ItemPointerCopy(BTreeTupleGetMinTID(firstright), pivotheaptid);
/*
* Pivot heap TID should never be fully equal to firstright. Note that
@@ -2240,7 +2276,8 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
*/
ItemPointerSetOffsetNumber(pivotheaptid,
OffsetNumberPrev(ItemPointerGetOffsetNumber(pivotheaptid)));
- Assert(ItemPointerCompare(pivotheaptid, &firstright->t_tid) < 0);
+ Assert(ItemPointerCompare(pivotheaptid,
+ BTreeTupleGetMinTID(firstright)) < 0);
#endif
BTreeTupleSetNAtts(pivot, nkeyatts);
@@ -2330,6 +2367,10 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
* leaving excessive amounts of free space on either side of page split.
* Callers can rely on the fact that attributes considered equal here are
* definitely also equal according to _bt_keep_natts.
+ *
+ * To build a posting tuple we need to ensure that all attributes
+ * of both tuples are equal. Use this function to compare them.
+ * TODO: maybe it's worth renaming the function.
*/
int
_bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
@@ -2415,7 +2456,7 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
* Non-pivot tuples currently never use alternative heap TID
* representation -- even those within heapkeyspace indexes
*/
- if ((itup->t_info & INDEX_ALT_TID_MASK) != 0)
+ if (BTreeTupleIsPivot(itup))
return false;
/*
@@ -2470,7 +2511,7 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
* that to decide if the tuple is a pre-v11 tuple.
*/
return tupnatts == 0 ||
- ((itup->t_info & INDEX_ALT_TID_MASK) == 0 &&
+ (!BTreeTupleIsPivot(itup) &&
ItemPointerGetOffsetNumber(&(itup->t_tid)) == P_HIKEY);
}
else
@@ -2497,7 +2538,7 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
* heapkeyspace index pivot tuples, regardless of whether or not there are
* non-key attributes.
*/
- if ((itup->t_info & INDEX_ALT_TID_MASK) == 0)
+ if (!BTreeTupleIsPivot(itup))
return false;
/*
@@ -2549,6 +2590,8 @@ _bt_check_third_page(Relation rel, Relation heap, bool needheaptidspace,
if (!needheaptidspace && itemsz <= BTMaxItemSizeNoHeapTid(page))
return;
+ /* TODO correct error messages for posting tuples */
+
/*
* Internal page insertions cannot fail here, because that would mean that
* an earlier leaf level insertion that should have failed didn't
@@ -2575,3 +2618,79 @@ _bt_check_third_page(Relation rel, Relation heap, bool needheaptidspace,
"or use full text indexing."),
errtableconstraint(heap, RelationGetRelationName(rel))));
}
+
+/*
+ * Given a basic tuple that contains key datum and posting list,
+ * build a posting tuple.
+ *
+ * Basic tuple can be a posting tuple, but we only use key part of it,
+ * all ItemPointers must be passed via ipd.
+ *
+ * If nipd == 1, fall back to building a non-posting tuple.
+ * This is necessary to avoid storage overhead after a posting tuple was vacuumed.
+ */
+IndexTuple
+BTreeFormPostingTuple(IndexTuple tuple, ItemPointerData *ipd, int nipd)
+{
+ uint32 keysize,
+ newsize = 0;
+ IndexTuple itup;
+
+ /* We only need key part of the tuple */
+ if (BTreeTupleIsPosting(tuple))
+ keysize = BTreeTupleGetPostingOffset(tuple);
+ else
+ keysize = IndexTupleSize(tuple);
+
+ Assert(nipd > 0);
+
+ /* Add space needed for posting list */
+ if (nipd > 1)
+ newsize = SHORTALIGN(keysize) + sizeof(ItemPointerData) * nipd;
+ else
+ newsize = keysize;
+
+ newsize = MAXALIGN(newsize);
+ itup = palloc0(newsize);
+ memcpy(itup, tuple, keysize);
+ itup->t_info &= ~INDEX_SIZE_MASK;
+ itup->t_info |= newsize;
+
+ if (nipd > 1)
+ {
+ /* Form posting tuple, fill posting fields */
+
+ /* Set meta info about the posting list */
+ itup->t_info |= INDEX_ALT_TID_MASK;
+ BTreeSetPostingMeta(itup, nipd, SHORTALIGN(keysize));
+
+ /* sort the list to preserve TID order invariant */
+ qsort((void *) ipd, nipd, sizeof(ItemPointerData),
+ (int (*) (const void *, const void *)) ItemPointerCompare);
+
+ /* Copy posting list into the posting tuple */
+ memcpy(BTreeTupleGetPosting(itup), ipd,
+ sizeof(ItemPointerData) * nipd);
+ }
+ else
+ {
+ /* To finish building of a non-posting tuple, copy TID from ipd */
+ itup->t_info &= ~INDEX_ALT_TID_MASK;
+ ItemPointerCopy(ipd, &itup->t_tid);
+ }
+
+ return itup;
+}
+
+/*
+ * Opposite of BTreeFormPostingTuple.
+ * Returns a regular tuple that contains the key; the TID of the new tuple
+ * is the nth TID of the original tuple's posting list.
+ * The result tuple is palloc'd in the caller's context.
+ */
+IndexTuple
+BTreeGetNthTupleOfPosting(IndexTuple tuple, int n)
+{
+ Assert(BTreeTupleIsPosting(tuple));
+ return BTreeFormPostingTuple(tuple, BTreeTupleGetPostingN(tuple, n), 1);
+}
diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index 3147ea4..7daadc9 100644
--- a/src/backend/access/nbtree/nbtxlog.c
+++ b/src/backend/access/nbtree/nbtxlog.c
@@ -384,8 +384,8 @@ btree_xlog_vacuum(XLogReaderState *record)
Buffer buffer;
Page page;
BTPageOpaque opaque;
-#ifdef UNUSED
xl_btree_vacuum *xlrec = (xl_btree_vacuum *) XLogRecGetData(record);
+#ifdef UNUSED
/*
* This section of code is thought to be no longer needed, after analysis
@@ -476,14 +476,35 @@ btree_xlog_vacuum(XLogReaderState *record)
if (len > 0)
{
- OffsetNumber *unused;
- OffsetNumber *unend;
+ if (xlrec->nremaining)
+ {
+ int i;
+ OffsetNumber *remainingoffset;
+ IndexTuple remaining;
+ Size itemsz;
+
+ remainingoffset = (OffsetNumber *)
+ (ptr + xlrec->ndeleted * sizeof(OffsetNumber));
+ remaining = (IndexTuple) ((char *) remainingoffset +
+ xlrec->nremaining * sizeof(OffsetNumber));
- unused = (OffsetNumber *) ptr;
- unend = (OffsetNumber *) ((char *) ptr + len);
+ /* Handle posting tuples */
+ for (i = 0; i < xlrec->nremaining; i++)
+ {
+ PageIndexTupleDelete(page, remainingoffset[i]);
+
+ itemsz = MAXALIGN(IndexTupleSize(remaining));
+
+ if (PageAddItem(page, (Item) remaining, itemsz, remainingoffset[i],
+ false, false) == InvalidOffsetNumber)
+ elog(PANIC, "btree_xlog_vacuum: failed to add remaining item");
+
+ remaining = (IndexTuple) ((char *) remaining + itemsz);
+ }
+ }
- if ((unend - unused) > 0)
- PageIndexMultiDelete(page, unused, unend - unused);
+ if (xlrec->ndeleted)
+ PageIndexMultiDelete(page, (OffsetNumber *) ptr, xlrec->ndeleted);
}
/*
diff --git a/src/include/access/itup.h b/src/include/access/itup.h
index 744ffb6..85ee040 100644
--- a/src/include/access/itup.h
+++ b/src/include/access/itup.h
@@ -141,6 +141,11 @@ typedef IndexAttributeBitMapData * IndexAttributeBitMap;
* On such a page, N tuples could take one MAXALIGN quantum less space than
* estimated here, seemingly allowing one more tuple than estimated here.
* But such a page always has at least MAXALIGN special space, so we're safe.
+ *
+ * Note: btree leaf pages may contain posting tuples, which store duplicates
+ * in a more effective way, so they may contain more tuples.
+ * Use MaxPostingIndexTuplesPerPage instead.
+
*/
#define MaxIndexTuplesPerPage \
((int) ((BLCKSZ - SizeOfPageHeaderData) / \
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index a3583f2..7d0d456 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -234,8 +234,7 @@ typedef struct BTMetaPageData
* t_tid | t_info | key values | INCLUDE columns, if any
*
* t_tid points to the heap TID, which is a tiebreaker key column as of
- * BTREE_VERSION 4. Currently, the INDEX_ALT_TID_MASK status bit is never
- * set for non-pivot tuples.
+ * BTREE_VERSION 4.
*
* All other types of index tuples ("pivot" tuples) only have key columns,
* since pivot tuples only exist to represent how the key space is
@@ -252,6 +251,39 @@ typedef struct BTMetaPageData
* omitted rather than truncated, since its representation is different to
* the non-pivot representation.)
*
+ * Non-pivot posting tuple format:
+ * t_tid | t_info | key values | INCLUDE columns, if any | posting_list[]
+ *
+ * In order to store duplicated keys more efficiently,
+ * we use a special tuple format: posting tuples.
+ * posting_list is an array of ItemPointerData.
+ *
+ * This type of compression never applies to system indexes, unique indexes
+ * or indexes with INCLUDEd columns.
+ *
+ * To distinguish posting tuples we use the INDEX_ALT_TID_MASK flag in t_info
+ * and the BT_IS_POSTING flag in t_tid.
+ * These flags redefine the content of the posting tuple's tid:
+ * - t_tid.ip_blkid contains the offset of the posting list.
+ * - t_tid offset field contains the number of posting items this tuple contains
+ *
+ * The 12 least significant offset bits from t_tid are used to represent
+ * the number of posting items in posting tuples, leaving 4 status
+ * bits (BT_RESERVED_OFFSET_MASK bits), 3 of which are reserved for
+ * future use.
+ * BT_N_POSTING_OFFSET_MASK is large enough to store any number of posting
+ * items, which is constrained by BTMaxItemSize.
+ *
+ * If a page contains so many duplicates that they do not fit into one posting
+ * tuple (bounded by BTMaxItemSize and ), the page may contain several posting
+ * tuples with the same key.
+ * A page can also contain both posting and non-posting tuples with the same key.
+ * Currently, posting tuples always contain at least two TIDs in the posting
+ * list.
+ *
+ * Posting tuples always have the same number of attributes as the index has
+ * generally.
+ *
* Pivot tuple format:
*
* t_tid | t_info | key values | [heap TID]
@@ -281,23 +313,157 @@ typedef struct BTMetaPageData
* bits (BT_RESERVED_OFFSET_MASK bits), 3 of which that are reserved for
* future use. BT_N_KEYS_OFFSET_MASK should be large enough to store any
* number of columns/attributes <= INDEX_MAX_KEYS.
+ * BT_IS_POSTING bit must be unset for pivot tuples, since we use it
+ * to distinguish posting tuples from pivot tuples.
*
* Note well: The macros that deal with the number of attributes in tuples
- * assume that a tuple with INDEX_ALT_TID_MASK set must be a pivot tuple,
- * and that a tuple without INDEX_ALT_TID_MASK set must be a non-pivot
- * tuple (or must have the same number of attributes as the index has
- * generally in the case of !heapkeyspace indexes). They will need to be
- * updated if non-pivot tuples ever get taught to use INDEX_ALT_TID_MASK
- * for something else.
+ * assume that a tuple with INDEX_ALT_TID_MASK set must be a pivot tuple or
+ * non-pivot posting tuple, and that a tuple without INDEX_ALT_TID_MASK set
+ * must be a non-pivot tuple (or must have the same number of attributes as
+ * the index has generally in the case of !heapkeyspace indexes).
*/
#define INDEX_ALT_TID_MASK INDEX_AM_RESERVED_BIT
/* Item pointer offset bits */
#define BT_RESERVED_OFFSET_MASK 0xF000
#define BT_N_KEYS_OFFSET_MASK 0x0FFF
+#define BT_N_POSTING_OFFSET_MASK 0x0FFF
#define BT_HEAP_TID_ATTR 0x1000
+#define BT_IS_POSTING 0x2000
+
+#define BTreeTupleIsPosting(itup) \
+ ( \
+ ((itup)->t_info & INDEX_ALT_TID_MASK && \
+ ((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) != 0))\
+ )
+
+#define BTreeTupleIsPivot(itup) \
+ ( \
+ ((itup)->t_info & INDEX_ALT_TID_MASK && \
+ ((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) == 0))\
+ )
+
+/*
+ * MaxPostingIndexTuplesPerPage is an upper bound on the number of tuples
+ * that can fit on one btree leaf page.
+ *
+ * Btree leaf pages may contain posting tuples, which store duplicates
+ * in a more efficient way, so MaxPostingIndexTuplesPerPage is larger than
+ * MaxIndexTuplesPerPage.
+ *
+ * Each leaf page must contain at least three items, so estimate it as
+ * if we have three posting tuples with minimal size keys.
+ */
+#define MaxPostingIndexTuplesPerPage \
+ ((int) ((BLCKSZ - SizeOfPageHeaderData - \
+ 3*((MAXALIGN(sizeof(IndexTupleData) + 1) + sizeof(ItemIdData))) )) / \
+ (sizeof(ItemPointerData)))
-/* Get/set downlink block number */
+/*
+ * Btree-private state needed to build posting tuples.
+ * ipd is a posting list - an array of ItemPointerData.
+ *
+ * Iterating over tuples during index build or applying compression to a
+ * single page, we remember a tuple in itupprev, then compare the next one
+ * with it. If tuples are equal, save their TIDs in the posting list.
+ * ntuples contains the size of the posting list.
+ *
+ * Use maxitemsize and maxpostingsize to ensure that resulting posting tuple
+ * will satisfy BTMaxItemSize.
+ */
+typedef struct BTCompressState
+{
+ Size maxitemsize;
+ Size maxpostingsize;
+ IndexTuple itupprev;
+ int ntuples;
+ ItemPointerData *ipd;
+} BTCompressState;
+
+/*
+ * For use in _bt_compress_one_page().
+ * If there are only a few uncompressed items on a page,
+ * it isn't worth applying compression.
+ * Currently it is just a magic number;
+ * proper benchmarking will probably help to choose a better value.
+ */
+#define BT_COMPRESS_THRESHOLD 10
+
+/* macros to work with posting tuples *BEGIN* */
+#define BTreeTupleSetBtIsPosting(itup) \
+ do { \
+ Assert((itup)->t_info & INDEX_ALT_TID_MASK); \
+ Assert(!((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) != 0)); \
+ ItemPointerSetOffsetNumber(&(itup)->t_tid, \
+ ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) | BT_IS_POSTING); \
+ } while(0)
+
+#define BTreeTupleClearBtIsPosting(itup) \
+ do { \
+ ItemPointerSetOffsetNumber(&(itup)->t_tid, \
+ ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & ~BT_IS_POSTING); \
+ } while(0)
+
+#define BTreeTupleGetNPosting(itup) \
+ ( \
+ ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_N_POSTING_OFFSET_MASK \
+ )
+
+#define BTreeTupleSetNPosting(itup, n) \
+ do { \
+ ItemPointerSetOffsetNumber(&(itup)->t_tid, (n) & BT_N_POSTING_OFFSET_MASK); \
+ BTreeTupleSetBtIsPosting(itup); \
+ } while(0)
+
+/*
+ * If tuple is posting, t_tid.ip_blkid contains offset of the posting list.
+ * Caller is responsible for checking BTreeTupleIsPosting to ensure that
+ * it will get what it expects.
+ */
+#define BTreeTupleGetPostingOffset(itup) \
+ ItemPointerGetBlockNumberNoCheck(&((itup)->t_tid))
+#define BTreeTupleSetPostingOffset(itup, offset) \
+ ItemPointerSetBlockNumber(&((itup)->t_tid), (offset))
+
+#define BTreeSetPostingMeta(itup, nposting, off) \
+ do { \
+ BTreeTupleSetNPosting(itup, nposting); \
+ BTreeTupleSetPostingOffset(itup, off); \
+ } while(0)
+
+#define BTreeTupleGetPosting(itup) \
+ (ItemPointerData*) ((char*)(itup) + BTreeTupleGetPostingOffset(itup))
+#define BTreeTupleGetPostingN(itup,n) \
+ (ItemPointerData*) (BTreeTupleGetPosting(itup) + (n))
+
+/*
+ * Posting tuples always contain several TIDs.
+ * Functions that use the TID as a tiebreaker can use the two macros below
+ * to ensure the correct order of TID keys:
+ */
+#define BTreeTupleGetMinTID(itup) \
+ ( \
+ ((itup)->t_info & INDEX_ALT_TID_MASK && \
+ ((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING))) ? \
+ ( \
+ (ItemPointer) BTreeTupleGetPosting(itup) \
+ ) \
+ : \
+ (ItemPointer) &((itup)->t_tid) \
+ )
+#define BTreeTupleGetMaxTID(itup) \
+ ( \
+ ((itup)->t_info & INDEX_ALT_TID_MASK && \
+ ((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING))) ? \
+ ( \
+ (ItemPointer) (BTreeTupleGetPosting(itup) + (BTreeTupleGetNPosting(itup)-1)) \
+ ) \
+ : \
+ (ItemPointer) &((itup)->t_tid) \
+ )
+/* macros to work with posting tuples *END* */
+
+/* Get/set downlink block number */
#define BTreeInnerTupleGetDownLink(itup) \
ItemPointerGetBlockNumberNoCheck(&((itup)->t_tid))
#define BTreeInnerTupleSetDownLink(itup, blkno) \
@@ -326,7 +492,8 @@ typedef struct BTMetaPageData
*/
#define BTreeTupleGetNAtts(itup, rel) \
( \
- (itup)->t_info & INDEX_ALT_TID_MASK ? \
+ ((itup)->t_info & INDEX_ALT_TID_MASK && \
+ ((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) == 0)) ? \
( \
ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_N_KEYS_OFFSET_MASK \
) \
@@ -335,6 +502,7 @@ typedef struct BTMetaPageData
)
#define BTreeTupleSetNAtts(itup, n) \
do { \
+ Assert(!BTreeTupleIsPosting(itup)); \
(itup)->t_info |= INDEX_ALT_TID_MASK; \
ItemPointerSetOffsetNumber(&(itup)->t_tid, (n) & BT_N_KEYS_OFFSET_MASK); \
} while(0)
@@ -342,6 +510,8 @@ typedef struct BTMetaPageData
/*
* Get tiebreaker heap TID attribute, if any. Macro works with both pivot
* and non-pivot tuples, despite differences in how heap TID is represented.
+ *
+ * For non-pivot posting tuple it returns the first tid from posting list.
*/
#define BTreeTupleGetHeapTID(itup) \
( \
@@ -351,7 +521,10 @@ typedef struct BTMetaPageData
(ItemPointer) (((char *) (itup) + IndexTupleSize(itup)) - \
sizeof(ItemPointerData)) \
) \
- : (itup)->t_info & INDEX_ALT_TID_MASK ? NULL : (ItemPointer) &((itup)->t_tid) \
+ : (itup)->t_info & INDEX_ALT_TID_MASK ? \
+ (((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) != 0) ? \
+ (ItemPointer) BTreeTupleGetPosting(itup) : NULL) \
+ : (ItemPointer) &((itup)->t_tid) \
)
/*
* Set the heap TID attribute for a tuple that uses the INDEX_ALT_TID_MASK
@@ -360,6 +533,7 @@ typedef struct BTMetaPageData
#define BTreeTupleSetAltHeapTID(itup) \
do { \
Assert((itup)->t_info & INDEX_ALT_TID_MASK); \
+ Assert(!((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) != 0)); \
ItemPointerSetOffsetNumber(&(itup)->t_tid, \
ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) | BT_HEAP_TID_ATTR); \
} while(0)
@@ -567,6 +741,8 @@ typedef struct BTScanPosData
* location in the associated tuple storage workspace.
*/
int nextTupleOffset;
+ /* prevTupleOffset is for posting list handling */
+ int prevTupleOffset;
/*
* The items array is always ordered in index order (ie, increasing
@@ -579,7 +755,7 @@ typedef struct BTScanPosData
int lastItem; /* last valid index in items[] */
int itemIndex; /* current index in items[] */
- BTScanPosItem items[MaxIndexTuplesPerPage]; /* MUST BE LAST */
+ BTScanPosItem items[MaxPostingIndexTuplesPerPage]; /* MUST BE LAST */
} BTScanPosData;
typedef BTScanPosData *BTScanPos;
@@ -763,6 +939,8 @@ extern void _bt_delitems_delete(Relation rel, Buffer buf,
OffsetNumber *itemnos, int nitems, Relation heapRel);
extern void _bt_delitems_vacuum(Relation rel, Buffer buf,
OffsetNumber *itemnos, int nitems,
+ OffsetNumber *remainingoffset,
+ IndexTuple *remaining, int nremaining,
BlockNumber lastBlockVacuumed);
extern int _bt_pagedel(Relation rel, Buffer buf);
@@ -813,6 +991,9 @@ extern bool _bt_check_natts(Relation rel, bool heapkeyspace, Page page,
OffsetNumber offnum);
extern void _bt_check_third_page(Relation rel, Relation heap,
bool needheaptidspace, Page page, IndexTuple newtup);
+extern IndexTuple BTreeFormPostingTuple(IndexTuple tuple, ItemPointerData *ipd,
+ int nipd);
+extern IndexTuple BTreeGetNthTupleOfPosting(IndexTuple tuple, int n);
/*
* prototypes for functions in nbtvalidate.c
@@ -825,5 +1006,7 @@ extern bool btvalidate(Oid opclassoid);
extern IndexBuildResult *btbuild(Relation heap, Relation index,
struct IndexInfo *indexInfo);
extern void _bt_parallel_build_main(dsm_segment *seg, shm_toc *toc);
+extern void add_item_to_posting(BTCompressState *compressState,
+ IndexTuple itup);
#endif /* NBTREE_H */
diff --git a/src/include/access/nbtxlog.h b/src/include/access/nbtxlog.h
index 9beccc8..6f60ca5 100644
--- a/src/include/access/nbtxlog.h
+++ b/src/include/access/nbtxlog.h
@@ -173,10 +173,19 @@ typedef struct xl_btree_vacuum
{
BlockNumber lastBlockVacuumed;
- /* TARGET OFFSET NUMBERS FOLLOW */
+ /*
+ * This field helps us find the beginning of the remaining tuples from
+ * postings, which follow the array of offset numbers.
+ */
+ uint32 nremaining;
+ uint32 ndeleted;
+
+ /* REMAINING OFFSET NUMBERS FOLLOW (nremaining values) */
+ /* REMAINING TUPLES TO INSERT FOLLOW (if nremaining > 0) */
+ /* TARGET OFFSET NUMBERS FOLLOW (if any) */
} xl_btree_vacuum;
-#define SizeOfBtreeVacuum (offsetof(xl_btree_vacuum, lastBlockVacuumed) + sizeof(BlockNumber))
+#define SizeOfBtreeVacuum (offsetof(xl_btree_vacuum, ndeleted) + sizeof(uint32))
/*
* This is what we need to know about marking an empty branch for deletion.
17.07.2019 19:36, Anastasia Lubennikova:
There is one major issue left - preserving TID order in posting lists.
For a start, I added a sort into the BTreeFormPostingTuple function.
It turned out to be not very helpful, because we cannot check this
invariant lazily.
Now I am working on patching _bt_binsrch_insert() and _bt_insertonpg() to
implement insertion into the middle of the posting list. I'll send a new
version this week.
Patch 0002 (must be applied on top of 0001) preserves the correct TID
order inside a posting list when new tuples are inserted.
This version passes all regression tests, including the amcheck test.
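
To illustrate what "preserving TID order on insertion" means, here is a minimal standalone C sketch. It is not code from the patch: DemoTid, demo_tid_compare and demo_posting_insert are hypothetical stand-ins for ItemPointerData, ItemPointerCompare and the insertion logic, and the sketch works on a plain array rather than on an index tuple on a page.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical simplified stand-in for ItemPointerData: (block, offset). */
typedef struct
{
	uint32_t	block;
	uint16_t	offset;
} DemoTid;

/* Order by block number first, then by offset within the block. */
static int
demo_tid_compare(const DemoTid *a, const DemoTid *b)
{
	if (a->block != b->block)
		return (a->block < b->block) ? -1 : 1;
	if (a->offset != b->offset)
		return (a->offset < b->offset) ? -1 : 1;
	return 0;
}

/*
 * Insert newtid into the sorted array list[0..n-1], keeping it sorted.
 * The caller must provide room for one more element.
 * Returns the new number of elements.
 */
static int
demo_posting_insert(DemoTid *list, int n, DemoTid newtid)
{
	int			lo = 0;
	int			hi = n;

	/* Binary search for the first element greater than newtid. */
	while (lo < hi)
	{
		int			mid = lo + (hi - lo) / 2;

		if (demo_tid_compare(&list[mid], &newtid) <= 0)
			lo = mid + 1;
		else
			hi = mid;
	}

	/* Shift the tail right by one slot and place the new TID. */
	memmove(&list[lo + 1], &list[lo], (size_t) (n - lo) * sizeof(DemoTid));
	list[lo] = newtid;
	return n + 1;
}
```

With this sketch, inserting (1,4) into the list (1,2), (1,5), (3,1) places it in the middle, giving (1,2), (1,4), (1,5), (3,1), so no later sort is needed.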
I also used the following script to test insertion into the posting list:
set client_min_messages to debug4;
drop table tbl;
create table tbl (i1 int, i2 int);
insert into tbl select 1, i from generate_series(0,1000) as i;
insert into tbl select 1, i from generate_series(0,1000) as i;
create index idx on tbl (i1);
delete from tbl where i2 <500;
vacuum tbl ;
insert into tbl select 1, i from generate_series(1001, 1500) as i;
The last insert triggers several such insertions, which can be seen in
the debug messages.
I suppose this is not the final version of the patch yet,
so I left some debug messages and TODO comments to ease review.
Please, in your review, pay particular attention to the usage of
BTreeTupleGetHeapTID.
For posting tuples it returns the first TID from the posting list, like
BTreeTupleGetMinTID, but some callers may not be ready for that and may
want BTreeTupleGetMaxTID instead.
Incorrect usage of these macros may cause subtle bugs that are probably
not covered by tests. So, please double-check it.
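
To show why the choice between the minimum and maximum TID matters, here is a small standalone C sketch. It is only an illustration under simplifying assumptions: DemoTid, DemoTuple and the demo_* functions are hypothetical and merely mimic the idea behind BTreeTupleGetMinTID and BTreeTupleGetMaxTID; they are not the macros from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical simplified stand-in for ItemPointerData: (block, offset). */
typedef struct
{
	uint32_t	block;
	uint16_t	offset;
} DemoTid;

/* Hypothetical tuple: a single TID, or a sorted posting list of TIDs. */
typedef struct
{
	bool		is_posting;
	DemoTid		tid;		/* used when !is_posting */
	const DemoTid *posting;	/* sorted ascending, used when is_posting */
	int			nposting;
} DemoTuple;

static int
demo_tid_compare(const DemoTid *a, const DemoTid *b)
{
	if (a->block != b->block)
		return (a->block < b->block) ? -1 : 1;
	if (a->offset != b->offset)
		return (a->offset < b->offset) ? -1 : 1;
	return 0;
}

/* Lowest TID of the tuple: first posting item, or the plain TID. */
static const DemoTid *
demo_min_tid(const DemoTuple *t)
{
	return t->is_posting ? &t->posting[0] : &t->tid;
}

/* Highest TID of the tuple: last posting item, or the plain TID. */
static const DemoTid *
demo_max_tid(const DemoTuple *t)
{
	return t->is_posting ? &t->posting[t->nposting - 1] : &t->tid;
}

/* True only if every TID in left sorts before every TID in right. */
static bool
demo_strictly_before(const DemoTuple *left, const DemoTuple *right)
{
	return demo_tid_compare(demo_max_tid(left), demo_min_tid(right)) < 0;
}
```

In this sketch, a tuple holding (1,1), (1,9) and a tuple holding (1,5), (2,1) compare as "left before right" if only the first TIDs are compared, yet their TID ranges overlap. A caller that needs an upper bound therefore has to look at the maximum TID of the left tuple, which is the pattern the _bt_truncate assertions in the patch follow.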
Next week I'm going to check performance and try to find specific
scenarios where this feature can lead to degradation, and measure it, to
understand whether we need to make this deduplication optional.
--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachments:
0001-btree_compression_pg12_v2.patch (text/x-patch)
diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index 9126c18..2b05b1e 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -1033,12 +1033,34 @@ bt_target_page_check(BtreeCheckState *state)
{
IndexTuple norm;
- norm = bt_normalize_tuple(state, itup);
- bloom_add_element(state->filter, (unsigned char *) norm,
- IndexTupleSize(norm));
- /* Be tidy */
- if (norm != itup)
- pfree(norm);
+ if (BTreeTupleIsPosting(itup))
+ {
+ IndexTuple onetup;
+ int i;
+
+ /* Fingerprint all elements of posting tuple one by one */
+ for (i = 0; i < BTreeTupleGetNPosting(itup); i++)
+ {
+ onetup = BTreeGetNthTupleOfPosting(itup, i);
+
+ norm = bt_normalize_tuple(state, onetup);
+ bloom_add_element(state->filter, (unsigned char *) norm,
+ IndexTupleSize(norm));
+ /* Be tidy */
+ if (norm != onetup)
+ pfree(norm);
+ pfree(onetup);
+ }
+ }
+ else
+ {
+ norm = bt_normalize_tuple(state, itup);
+ bloom_add_element(state->filter, (unsigned char *) norm,
+ IndexTupleSize(norm));
+ /* Be tidy */
+ if (norm != itup)
+ pfree(norm);
+ }
}
/*
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index 602f884..26ddf32 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -20,6 +20,7 @@
#include "access/tableam.h"
#include "access/transam.h"
#include "access/xloginsert.h"
+#include "catalog/catalog.h"
#include "miscadmin.h"
#include "storage/lmgr.h"
#include "storage/predicate.h"
@@ -56,6 +57,8 @@ static void _bt_insert_parent(Relation rel, Buffer buf, Buffer rbuf,
static bool _bt_pgaddtup(Page page, Size itemsize, IndexTuple itup,
OffsetNumber itup_off);
static void _bt_vacuum_one_page(Relation rel, Buffer buffer, Relation heapRel);
+static bool insert_itupprev_to_page(Page page, BTCompressState *compressState);
+static void _bt_compress_one_page(Relation rel, Buffer buffer, Relation heapRel);
/*
* _bt_doinsert() -- Handle insertion of a single index tuple in the tree.
@@ -759,6 +762,12 @@ _bt_findinsertloc(Relation rel,
_bt_vacuum_one_page(rel, insertstate->buf, heapRel);
insertstate->bounds_valid = false;
}
+
+ /*
+ * If the target page is full, try to compress the page
+ */
+ if (PageGetFreeSpace(page) < insertstate->itemsz)
+ _bt_compress_one_page(rel, insertstate->buf, heapRel);
}
else
{
@@ -806,6 +815,11 @@ _bt_findinsertloc(Relation rel,
}
/*
+ * Before considering moving right, try to compress the page
+ */
+ _bt_compress_one_page(rel, insertstate->buf, heapRel);
+
+ /*
* Nope, so check conditions (b) and (c) enumerated above
*
* The earlier _bt_check_unique() call may well have established a
@@ -2286,3 +2300,241 @@ _bt_vacuum_one_page(Relation rel, Buffer buffer, Relation heapRel)
* the page.
*/
}
+
+/*
+ * Add new item (compressed or not) to the page, while compressing it.
+ * If insertion failed, return false.
+ * Caller should consider this as compression failure and
+ * leave page uncompressed.
+ */
+static bool
+insert_itupprev_to_page(Page page, BTCompressState *compressState)
+{
+ IndexTuple to_insert;
+ OffsetNumber offnum = PageGetMaxOffsetNumber(page);
+
+ if (compressState->ntuples == 0)
+ to_insert = compressState->itupprev;
+ else
+ {
+ IndexTuple postingtuple;
+
+ /* form a tuple with a posting list */
+ postingtuple = BTreeFormPostingTuple(compressState->itupprev,
+ compressState->ipd,
+ compressState->ntuples);
+ to_insert = postingtuple;
+ pfree(compressState->ipd);
+ }
+
+ /* Add the new item into the page */
+ offnum = OffsetNumberNext(offnum);
+
+ elog(DEBUG4, "insert_itupprev_to_page. compressState->ntuples %d IndexTupleSize %zu free %zu",
+ compressState->ntuples, IndexTupleSize(to_insert), PageGetFreeSpace(page));
+
+ if (PageAddItem(page, (Item) to_insert, IndexTupleSize(to_insert),
+ offnum, false, false) == InvalidOffsetNumber)
+ {
+ elog(DEBUG4, "insert_itupprev_to_page. failed");
+
+ /*
+ * This may happen if the tuple is bigger than the free space. Fall back
+ * to the uncompressed page case.
+ */
+ if (compressState->ntuples > 0)
+ pfree(to_insert);
+ return false;
+ }
+
+ if (compressState->ntuples > 0)
+ pfree(to_insert);
+ compressState->ntuples = 0;
+ return true;
+}
+
+/*
+ * Before splitting the page, try to compress items to free some space.
+ * If compression didn't succeed, buffer will contain old state of the page.
+ * This function should be called after lp_dead items
+ * were removed by _bt_vacuum_one_page().
+ */
+static void
+_bt_compress_one_page(Relation rel, Buffer buffer, Relation heapRel)
+{
+ OffsetNumber offnum,
+ minoff,
+ maxoff;
+ Page page = BufferGetPage(buffer);
+ Page newpage;
+ BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+ bool use_compression = false;
+ BTCompressState *compressState = NULL;
+ int n_posting_on_page = 0;
+ int natts = IndexRelationGetNumberOfAttributes(rel);
+
+ /*
+ * Don't use compression for indexes with INCLUDEd columns, system indexes
+ * and unique indexes.
+ */
+ use_compression = ((IndexRelationGetNumberOfKeyAttributes(rel) ==
+ IndexRelationGetNumberOfAttributes(rel))
+ && (!IsSystemRelation(rel))
+ && (!rel->rd_index->indisunique));
+ if (!use_compression)
+ return;
+
+ /* init compress state needed to build posting tuples */
+ compressState = (BTCompressState *) palloc0(sizeof(BTCompressState));
+ compressState->ipd = NULL;
+ compressState->ntuples = 0;
+ compressState->itupprev = NULL;
+ compressState->maxitemsize = BTMaxItemSize(page);
+ compressState->maxpostingsize = 0;
+
+ /*
+ * Scan over all items to see which ones can be compressed
+ */
+ minoff = P_FIRSTDATAKEY(opaque);
+ maxoff = PageGetMaxOffsetNumber(page);
+
+ /*
+ * Heuristic to avoid trying to compress a page that already contains
+ * mostly compressed items
+ */
+ for (offnum = minoff;
+ offnum <= maxoff;
+ offnum = OffsetNumberNext(offnum))
+ {
+ ItemId itemid = PageGetItemId(page, P_HIKEY);
+ IndexTuple item = (IndexTuple) PageGetItem(page, itemid);
+
+ if (BTreeTupleIsPosting(item))
+ n_posting_on_page++;
+ }
+
+ /*
+ * If we have only a few uncompressed items on the full page,
+ * it isn't worth compressing them
+ */
+ if (maxoff - n_posting_on_page < BT_COMPRESS_THRESHOLD)
+ return;
+
+ newpage = PageGetTempPageCopySpecial(page);
+ elog(DEBUG4, "_bt_compress_one_page rel: %s,blkno: %u",
+ RelationGetRelationName(rel), BufferGetBlockNumber(buffer));
+
+ /* Copy High Key if any */
+ if (!P_RIGHTMOST(opaque))
+ {
+ ItemId itemid = PageGetItemId(page, P_HIKEY);
+ Size itemsz = ItemIdGetLength(itemid);
+ IndexTuple item = (IndexTuple) PageGetItem(page, itemid);
+
+ if (PageAddItem(newpage, (Item) item, itemsz, P_HIKEY,
+ false, false) == InvalidOffsetNumber)
+ {
+ /*
+ * Should never happen. Anyway, fall back gracefully to the
+ * incompressible-page scenario and just return from the function.
+ */
+ elog(DEBUG4, "_bt_compress_one_page. failed to insert highkey to newpage");
+ return;
+ }
+ }
+
+ /*
+ * Iterate over tuples on the page, try to compress them into posting
+ * lists and insert into new page.
+ */
+ for (offnum = minoff;
+ offnum <= maxoff;
+ offnum = OffsetNumberNext(offnum))
+ {
+ ItemId itemId = PageGetItemId(page, offnum);
+ IndexTuple itup = (IndexTuple) PageGetItem(page, itemId);
+
+ /*
+ * We do not expect to meet any DEAD items, since this function is
+ * called right after _bt_vacuum_one_page(). If for some reason we
+ * do find a dead item, don't compress it, so that an upcoming
+ * microvacuum or vacuum can clean it up.
+ */
+ if (ItemIdIsDead(itemId))
+ continue;
+
+ if (compressState->itupprev != NULL)
+ {
+ int n_equal_atts =
+ _bt_keep_natts_fast(rel, compressState->itupprev, itup);
+ int itup_ntuples = BTreeTupleIsPosting(itup) ?
+ BTreeTupleGetNPosting(itup) : 1;
+
+ if (n_equal_atts > natts)
+ {
+ /*
+ * When tuples are equal, create or update posting.
+ *
+ * If posting is too big, insert it on page and continue.
+ */
+ if (compressState->maxitemsize >
+ MAXALIGN(((IndexTupleSize(compressState->itupprev)
+ + (compressState->ntuples + itup_ntuples + 1) * sizeof(ItemPointerData)))))
+ {
+ add_item_to_posting(compressState, itup);
+ }
+ else if (!insert_itupprev_to_page(newpage, compressState))
+ {
+ elog(DEBUG4, "_bt_compress_one_page. failed to insert posting");
+ return;
+ }
+ }
+ else
+ {
+ /*
+ * Tuples are not equal. Insert itupprev into the new page. Save
+ * current tuple for the next iteration.
+ */
+ if (!insert_itupprev_to_page(newpage, compressState))
+ {
+ elog(DEBUG4, "_bt_compress_one_page. failed to insert posting");
+ return;
+ }
+ }
+ }
+
+ /*
+ * Copy the tuple into temp variable itupprev to compare it with the
+ * following tuple and maybe unite them into a posting tuple
+ */
+ if (compressState->itupprev)
+ pfree(compressState->itupprev);
+ compressState->itupprev = CopyIndexTuple(itup);
+
+ Assert(IndexTupleSize(compressState->itupprev) <= compressState->maxitemsize);
+ }
+
+ /* Handle the last item. */
+ if (!insert_itupprev_to_page(newpage, compressState))
+ {
+ elog(DEBUG4, "_bt_compress_one_page. failed to insert posting for last item");
+ return;
+ }
+
+ START_CRIT_SECTION();
+ PageRestoreTempPage(newpage, page);
+ MarkBufferDirty(buffer);
+
+ /* Log full page write */
+ if (RelationNeedsWAL(rel))
+ {
+ XLogRecPtr recptr;
+
+ recptr = log_newpage_buffer(buffer, true);
+ PageSetLSN(page, recptr);
+ }
+ END_CRIT_SECTION();
+
+ elog(DEBUG4, "_bt_compress_one_page. success");
+ return;
+}
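
For readers following the loop in _bt_compress_one_page() above, here is a minimal standalone sketch of the same idea: walk items in key order and fold each run of equal keys into one entry carrying a list of TIDs, starting a new entry when the list is full. This is not the patch's code. Entry, Group, MAX_TIDS and group_duplicates() are hypothetical names, keys are plain ints, and the limit is a TID count standing in for the BTMaxItemSize check.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_TIDS 4		/* hypothetical cap, stands in for BTMaxItemSize */

/* One uncompressed index item: a key and a single heap pointer. */
typedef struct Entry
{
	int			key;
	unsigned	tid;
} Entry;

/* One compressed item: a key and the TIDs of all its duplicates. */
typedef struct Group
{
	int			key;
	size_t		ntids;
	unsigned	tids[MAX_TIDS];
} Group;

/*
 * Fold runs of equal keys in the sorted array "in" into groups.
 * "out" must have room for n groups.  Returns the number of groups written.
 */
static size_t
group_duplicates(const Entry *in, size_t n, Group *out)
{
	size_t		ngroups = 0;

	for (size_t i = 0; i < n; i++)
	{
		Group	   *cur = (ngroups > 0) ? &out[ngroups - 1] : NULL;

		/* input must be sorted by key, as items on a B-tree page are */
		assert(i == 0 || in[i - 1].key <= in[i].key);

		if (cur == NULL || cur->key != in[i].key || cur->ntids == MAX_TIDS)
		{
			/* key changed or the current list is full: start a new group */
			cur = &out[ngroups++];
			cur->key = in[i].key;
			cur->ntids = 0;
		}
		cur->tids[cur->ntids++] = in[i].tid;
	}
	return ngroups;
}
```

As in the patch, a run longer than the limit is simply split into several groups with the same key, so the output stays in key order.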
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 50455db..dff506d 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -1022,14 +1022,53 @@ _bt_page_recyclable(Page page)
void
_bt_delitems_vacuum(Relation rel, Buffer buf,
OffsetNumber *itemnos, int nitems,
+ OffsetNumber *remainingoffset,
+ IndexTuple *remaining, int nremaining,
BlockNumber lastBlockVacuumed)
{
Page page = BufferGetPage(buf);
BTPageOpaque opaque;
+ int i;
+ Size itemsz;
+ Size remaining_sz = 0;
+ char *remaining_buf = NULL;
+
+ /* XLOG stuff: build a buffer holding the remaining tuples */
+ if (nremaining && RelationNeedsWAL(rel))
+ {
+ Size offset = 0;
+
+ for (i = 0; i < nremaining; i++)
+ remaining_sz += MAXALIGN(IndexTupleSize(remaining[i]));
+
+ remaining_buf = palloc0(remaining_sz);
+ for (i = 0; i < nremaining; i++)
+ {
+ itemsz = IndexTupleSize(remaining[i]);
+ memcpy(remaining_buf + offset, (char *) remaining[i], itemsz);
+ offset += MAXALIGN(itemsz);
+ }
+ Assert(offset == remaining_sz);
+ }
/* No ereport(ERROR) until changes are logged */
START_CRIT_SECTION();
+ /* Handle posting tuples here */
+ for (i = 0; i < nremaining; i++)
+ {
+ /* First, delete the old tuple. */
+ PageIndexTupleDelete(page, remainingoffset[i]);
+
+ itemsz = IndexTupleSize(remaining[i]);
+ itemsz = MAXALIGN(itemsz);
+
+ /* Add tuple with remaining ItemPointers to the page. */
+ if (PageAddItem(page, (Item) remaining[i], itemsz, remainingoffset[i],
+ false, false) == InvalidOffsetNumber)
+ elog(PANIC, "failed to rewrite compressed item in index while doing vacuum");
+ }
+
/* Fix the page */
if (nitems > 0)
PageIndexMultiDelete(page, itemnos, nitems);
@@ -1059,6 +1098,8 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
xl_btree_vacuum xlrec_vacuum;
xlrec_vacuum.lastBlockVacuumed = lastBlockVacuumed;
+ xlrec_vacuum.nremaining = nremaining;
+ xlrec_vacuum.ndeleted = nitems;
XLogBeginInsert();
XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
@@ -1072,6 +1113,19 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
if (nitems > 0)
XLogRegisterBufData(0, (char *) itemnos, nitems * sizeof(OffsetNumber));
+ /*
+ * Save the offsets and the remaining tuples themselves. They must
+ * be restored in the correct order: first handle the remaining
+ * tuples, and only then the other deleted items.
+ */
+ if (nremaining > 0)
+ {
+ Assert(remaining_buf != NULL);
+ XLogRegisterBufData(0, (char *) remainingoffset,
+ nremaining * sizeof(OffsetNumber));
+ XLogRegisterBufData(0, remaining_buf, remaining_sz);
+ }
+
recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_VACUUM);
PageSetLSN(page, recptr);
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 4cfd528..11e45c8 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -97,6 +97,8 @@ static void btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
BTCycleId cycleid, TransactionId *oldestBtpoXact);
static void btvacuumpage(BTVacState *vstate, BlockNumber blkno,
BlockNumber orig_blkno);
+static ItemPointer btreevacuumPosting(BTVacState *vstate, IndexTuple itup,
+ int *nremaining);
/*
@@ -1069,7 +1071,8 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
RBM_NORMAL, info->strategy);
LockBufferForCleanup(buf);
_bt_checkpage(rel, buf);
- _bt_delitems_vacuum(rel, buf, NULL, 0, vstate.lastBlockVacuumed);
+ _bt_delitems_vacuum(rel, buf, NULL, 0, NULL, NULL, 0,
+ vstate.lastBlockVacuumed);
_bt_relbuf(rel, buf);
}
@@ -1193,6 +1196,9 @@ restart:
OffsetNumber offnum,
minoff,
maxoff;
+ IndexTuple remaining[MaxOffsetNumber];
+ OffsetNumber remainingoffset[MaxOffsetNumber];
+ int nremaining;
/*
* Trade in the initial read lock for a super-exclusive write lock on
@@ -1229,6 +1235,7 @@ restart:
* callback function.
*/
ndeletable = 0;
+ nremaining = 0;
minoff = P_FIRSTDATAKEY(opaque);
maxoff = PageGetMaxOffsetNumber(page);
if (callback)
@@ -1242,31 +1249,78 @@ restart:
itup = (IndexTuple) PageGetItem(page,
PageGetItemId(page, offnum));
- htup = &(itup->t_tid);
- /*
- * During Hot Standby we currently assume that
- * XLOG_BTREE_VACUUM records do not produce conflicts. That is
- * only true as long as the callback function depends only
- * upon whether the index tuple refers to heap tuples removed
- * in the initial heap scan. When vacuum starts it derives a
- * value of OldestXmin. Backends taking later snapshots could
- * have a RecentGlobalXmin with a later xid than the vacuum's
- * OldestXmin, so it is possible that row versions deleted
- * after OldestXmin could be marked as killed by other
- * backends. The callback function *could* look at the index
- * tuple state in isolation and decide to delete the index
- * tuple, though currently it does not. If it ever did, we
- * would need to reconsider whether XLOG_BTREE_VACUUM records
- * should cause conflicts. If they did cause conflicts they
- * would be fairly harsh conflicts, since we haven't yet
- * worked out a way to pass a useful value for
- * latestRemovedXid on the XLOG_BTREE_VACUUM records. This
- * applies to *any* type of index that marks index tuples as
- * killed.
- */
- if (callback(htup, callback_state))
- deletable[ndeletable++] = offnum;
+ if (BTreeTupleIsPosting(itup))
+ {
+ int nnewipd = 0;
+ ItemPointer newipd = NULL;
+
+ newipd = btreevacuumPosting(vstate, itup, &nnewipd);
+
+ if (nnewipd == 0)
+ {
+ /*
+ * All TIDs in the posting list must be deleted, so we can
+ * delete the whole tuple in the regular way.
+ */
+ deletable[ndeletable++] = offnum;
+ }
+ else if (nnewipd == BTreeTupleGetNPosting(itup))
+ {
+ /*
+ * All TIDs from the posting tuple must remain. Do
+ * nothing, just clean up.
+ */
+ pfree(newipd);
+ }
+ else if (nnewipd < BTreeTupleGetNPosting(itup))
+ {
+ /* Some TIDs from posting tuple must remain. */
+ Assert(nnewipd > 0);
+ Assert(newipd != NULL);
+
+ /*
+ * Form new tuple that contains only remaining TIDs.
+ * Remember this tuple and the offset of the old tuple
+ * to update it in place.
+ */
+ remainingoffset[nremaining] = offnum;
+ remaining[nremaining] = BTreeFormPostingTuple(itup, newipd, nnewipd);
+ nremaining++;
+ pfree(newipd);
+
+ Assert(IndexTupleSize(itup) <= BTMaxItemSize(page));
+ }
+ }
+ else
+ {
+ htup = &(itup->t_tid);
+
+ /*
+ * During Hot Standby we currently assume that
+ * XLOG_BTREE_VACUUM records do not produce conflicts.
+ * That is only true as long as the callback function
+ * depends only upon whether the index tuple refers to
+ * heap tuples removed in the initial heap scan. When
+ * vacuum starts it derives a value of OldestXmin.
+ * Backends taking later snapshots could have a
+ * RecentGlobalXmin with a later xid than the vacuum's
+ * OldestXmin, so it is possible that row versions deleted
+ * after OldestXmin could be marked as killed by other
+ * backends. The callback function *could* look at the
+ * index tuple state in isolation and decide to delete the
+ * index tuple, though currently it does not. If it ever
+ * did, we would need to reconsider whether
+ * XLOG_BTREE_VACUUM records should cause conflicts. If
+ * they did cause conflicts they would be fairly harsh
+ * conflicts, since we haven't yet worked out a way to
+ * pass a useful value for latestRemovedXid on the
+ * XLOG_BTREE_VACUUM records. This applies to *any* type
+ * of index that marks index tuples as killed.
+ */
+ if (callback(htup, callback_state))
+ deletable[ndeletable++] = offnum;
+ }
}
}
@@ -1274,7 +1328,7 @@ restart:
* Apply any needed deletes. We issue just one _bt_delitems_vacuum()
* call per page, so as to minimize WAL traffic.
*/
- if (ndeletable > 0)
+ if (ndeletable > 0 || nremaining > 0)
{
/*
* Notice that the issued XLOG_BTREE_VACUUM WAL record includes
@@ -1291,6 +1345,7 @@ restart:
* that.
*/
_bt_delitems_vacuum(rel, buf, deletable, ndeletable,
+ remainingoffset, remaining, nremaining,
vstate->lastBlockVacuumed);
/*
@@ -1376,6 +1431,42 @@ restart:
}
/*
+ * btreevacuumPosting() -- vacuums a posting tuple.
+ *
+ * Returns new palloc'd posting list with remaining items.
+ * Posting list size is returned via nremaining.
+ *
+ * If all items are dead,
+ * nremaining is 0 and resulting posting list is NULL.
+ */
+static ItemPointer
+btreevacuumPosting(BTVacState *vstate, IndexTuple itup, int *nremaining)
+{
+ int i,
+ remaining = 0;
+ int nitem = BTreeTupleGetNPosting(itup);
+ ItemPointer tmpitems = NULL,
+ items = BTreeTupleGetPosting(itup);
+
+ /*
+ * Check each TID in the posting list; save live TIDs into tmpitems
+ */
+ for (i = 0; i < nitem; i++)
+ {
+ if (vstate->callback(items + i, vstate->callback_state))
+ continue;
+
+ if (tmpitems == NULL)
+ tmpitems = palloc(sizeof(ItemPointerData) * nitem);
+
+ tmpitems[remaining++] = items[i];
+ }
+
+ *nremaining = remaining;
+ return tmpitems;
+}
+
+/*
* btcanreturn() -- Check whether btree indexes support index-only scans.
*
* btrees always do, so this is trivial.
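
The three-way decision that btvacuumpage() makes on the result of btreevacuumPosting() can be shown in isolation. The sketch below is only an illustration under simplified assumptions: Tid, is_dead_fn, PostingAction, vacuum_posting() and dead_on_block() are hypothetical names, and the caller supplies the output array instead of the function palloc'ing it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-in for ItemPointerData. */
typedef struct Tid
{
	unsigned	block;
	unsigned short offset;
} Tid;

/* Stand-in for the bulk-delete callback: returns true if the TID is dead. */
typedef bool (*is_dead_fn) (const Tid *tid, void *state);

typedef enum
{
	POSTING_DELETE_ALL,			/* no TID survives: delete the whole tuple */
	POSTING_KEEP_ALL,			/* every TID survives: leave the tuple alone */
	POSTING_REWRITE				/* some survive: rewrite tuple with "kept" */
} PostingAction;

/*
 * Copy the live TIDs of a posting list into "kept" (room for nitems entries)
 * and report what the caller should do with the tuple.
 */
static PostingAction
vacuum_posting(const Tid *items, size_t nitems, is_dead_fn is_dead,
			   void *state, Tid *kept, size_t *nkept)
{
	size_t		n = 0;

	assert(nitems > 0);			/* a posting list is never empty */

	for (size_t i = 0; i < nitems; i++)
	{
		if (!is_dead(&items[i], state))
			kept[n++] = items[i];
	}

	*nkept = n;
	if (n == 0)
		return POSTING_DELETE_ALL;
	if (n == nitems)
		return POSTING_KEEP_ALL;
	return POSTING_REWRITE;
}

/* Example callback: every TID on the block number in *state is dead. */
static bool
dead_on_block(const Tid *tid, void *state)
{
	return tid->block == *(const unsigned *) state;
}
```

Only the POSTING_REWRITE case needs a new tuple and an entry in the remaining[] arrays; the other two cases reuse the existing delete path or do nothing.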
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index c655dad..49a1aae 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -30,6 +30,9 @@ static bool _bt_readpage(IndexScanDesc scan, ScanDirection dir,
OffsetNumber offnum);
static void _bt_saveitem(BTScanOpaque so, int itemIndex,
OffsetNumber offnum, IndexTuple itup);
+static void _bt_savePostingitem(BTScanOpaque so, int itemIndex,
+ OffsetNumber offnum, ItemPointer iptr,
+ IndexTuple itup, int i);
static bool _bt_steppage(IndexScanDesc scan, ScanDirection dir);
static bool _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir);
static bool _bt_parallel_readpage(IndexScanDesc scan, BlockNumber blkno,
@@ -665,6 +668,9 @@ _bt_compare(Relation rel,
* Use the heap TID attribute and scantid to try to break the tie. The
* rules are the same as any other key attribute -- only the
* representation differs.
+ * TODO: when itup is a posting tuple, the check becomes more complex.
+ * The key may be neither smaller nor larger than the tuple, but fall
+ * between BTreeTupleGetMinTID and BTreeTupleGetMaxTID.
*/
heapTid = BTreeTupleGetHeapTID(itup);
if (key->scantid == NULL)
@@ -1410,6 +1416,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
int itemIndex;
bool continuescan;
int indnatts;
+ int i;
/*
* We must have the buffer pinned and locked, but the usual macro can't be
@@ -1456,6 +1463,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
/* initialize tuple workspace to empty */
so->currPos.nextTupleOffset = 0;
+ so->currPos.prevTupleOffset = 0;
/*
* Now that the current page has been made consistent, the macro should be
@@ -1490,8 +1498,22 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
if (_bt_checkkeys(scan, itup, indnatts, dir, &continuescan))
{
/* tuple passes all scan key conditions, so remember it */
- _bt_saveitem(so, itemIndex, offnum, itup);
- itemIndex++;
+ if (BTreeTupleIsPosting(itup))
+ {
+ for (i = 0; i < BTreeTupleGetNPosting(itup); i++)
+ {
+ _bt_savePostingitem(so, itemIndex, offnum,
+ BTreeTupleGetPostingN(itup, i),
+ itup, i);
+ itemIndex++;
+ }
+ }
+ else
+ {
+ _bt_saveitem(so, itemIndex, offnum, itup);
+ itemIndex++;
+ }
+
}
/* When !continuescan, there can't be any more matches, so stop */
if (!continuescan)
@@ -1524,7 +1546,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
if (!continuescan)
so->currPos.moreRight = false;
- Assert(itemIndex <= MaxIndexTuplesPerPage);
+ Assert(itemIndex <= MaxPostingIndexTuplesPerPage);
so->currPos.firstItem = 0;
so->currPos.lastItem = itemIndex - 1;
so->currPos.itemIndex = 0;
@@ -1532,7 +1554,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
else
{
/* load items[] in descending order */
- itemIndex = MaxIndexTuplesPerPage;
+ itemIndex = MaxPostingIndexTuplesPerPage;
offnum = Min(offnum, maxoff);
@@ -1574,8 +1596,22 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
if (passes_quals && tuple_alive)
{
/* tuple passes all scan key conditions, so remember it */
- itemIndex--;
- _bt_saveitem(so, itemIndex, offnum, itup);
+ if (BTreeTupleIsPosting(itup))
+ {
+ for (i = 0; i < BTreeTupleGetNPosting(itup); i++)
+ {
+ itemIndex--;
+ _bt_savePostingitem(so, itemIndex, offnum,
+ BTreeTupleGetPostingN(itup, i),
+ itup, i);
+ }
+ }
+ else
+ {
+ itemIndex--;
+ _bt_saveitem(so, itemIndex, offnum, itup);
+ }
+
}
if (!continuescan)
{
@@ -1589,8 +1625,8 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
Assert(itemIndex >= 0);
so->currPos.firstItem = itemIndex;
- so->currPos.lastItem = MaxIndexTuplesPerPage - 1;
- so->currPos.itemIndex = MaxIndexTuplesPerPage - 1;
+ so->currPos.lastItem = MaxPostingIndexTuplesPerPage - 1;
+ so->currPos.itemIndex = MaxPostingIndexTuplesPerPage - 1;
}
return (so->currPos.firstItem <= so->currPos.lastItem);
@@ -1603,6 +1639,8 @@ _bt_saveitem(BTScanOpaque so, int itemIndex,
{
BTScanPosItem *currItem = &so->currPos.items[itemIndex];
+ Assert(!BTreeTupleIsPosting(itup));
+
currItem->heapTid = itup->t_tid;
currItem->indexOffset = offnum;
if (so->currTuples)
@@ -1615,6 +1653,33 @@ _bt_saveitem(BTScanOpaque so, int itemIndex,
}
}
+/* Save an index item into so->currPos.items[itemIndex] for posting tuples. */
+static void
+_bt_savePostingitem(BTScanOpaque so, int itemIndex, OffsetNumber offnum,
+ ItemPointer iptr, IndexTuple itup, int i)
+{
+ BTScanPosItem *currItem = &so->currPos.items[itemIndex];
+
+ currItem->heapTid = *iptr;
+ currItem->indexOffset = offnum;
+
+ if (so->currTuples)
+ {
+ if (i == 0)
+ {
+ /* Save the key; it is the same for all TIDs in the posting list */
+ Size itupsz = BTreeTupleGetPostingOffset(itup);
+
+ currItem->tupleOffset = so->currPos.nextTupleOffset;
+ memcpy(so->currTuples + so->currPos.nextTupleOffset, itup, itupsz);
+ so->currPos.nextTupleOffset += MAXALIGN(itupsz);
+ so->currPos.prevTupleOffset = currItem->tupleOffset;
+ }
+ else
+ currItem->tupleOffset = so->currPos.prevTupleOffset;
+ }
+}
+
/*
* _bt_steppage() -- Step to next page containing valid data for scan
*
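
The way _bt_savePostingitem() lets all TIDs of one posting tuple share a single stored copy of the key can be sketched outside of PostgreSQL as follows. This is a simplified illustration, not the patch's code: ScanItem, ScanPos, save_posting() and the two capacity constants are hypothetical, TIDs are plain integers, and the workspace offset is not MAXALIGN'ed as it is in the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_ITEMS 16			/* hypothetical capacity of the items array */
#define WORKSPACE_SIZE 64		/* hypothetical size of the key workspace */

/* One scan result: a heap pointer plus where its key lives in the workspace. */
typedef struct ScanItem
{
	unsigned	tid;
	size_t		key_offset;
} ScanItem;

typedef struct ScanPos
{
	ScanItem	items[MAX_ITEMS];
	size_t		nitems;
	char		workspace[WORKSPACE_SIZE];
	size_t		next_offset;	/* first free byte in workspace */
} ScanPos;

/*
 * Expand one posting tuple into ntids scan items.  The key is copied into
 * the workspace once, and every item points at that single copy.
 */
static void
save_posting(ScanPos *pos, const char *key, size_t keylen,
			 const unsigned *tids, size_t ntids)
{
	size_t		key_offset = pos->next_offset;

	assert(pos->nitems + ntids <= MAX_ITEMS);
	assert(key_offset + keylen <= WORKSPACE_SIZE);

	/* store the key once for the whole posting list */
	memcpy(pos->workspace + key_offset, key, keylen);
	pos->next_offset += keylen;

	for (size_t i = 0; i < ntids; i++)
	{
		pos->items[pos->nitems].tid = tids[i];
		pos->items[pos->nitems].key_offset = key_offset;
		pos->nitems++;
	}
}
```

The items array still grows by one entry per TID, which is why the patch replaces MaxIndexTuplesPerPage with the larger MaxPostingIndexTuplesPerPage bound in _bt_readpage().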
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index d0b9013..955a628 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -65,6 +65,7 @@
#include "access/xact.h"
#include "access/xlog.h"
#include "access/xloginsert.h"
+#include "catalog/catalog.h"
#include "catalog/index.h"
#include "commands/progress.h"
#include "miscadmin.h"
@@ -288,6 +289,9 @@ static void _bt_sortaddtup(Page page, Size itemsize,
static void _bt_buildadd(BTWriteState *wstate, BTPageState *state,
IndexTuple itup);
static void _bt_uppershutdown(BTWriteState *wstate, BTPageState *state);
+static void insert_itupprev_to_page_buildadd(BTWriteState *wstate,
+ BTPageState *state,
+ BTCompressState *compressState);
static void _bt_load(BTWriteState *wstate,
BTSpool *btspool, BTSpool *btspool2);
static void _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent,
@@ -972,6 +976,11 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
* only shift the line pointer array back and forth, and overwrite
* the tuple space previously occupied by oitup. This is fairly
* cheap.
+ *
+ * If lastleft tuple was a posting tuple, we'll truncate its
+ * posting list in _bt_truncate as well. Note that it is also
+ * applicable only to leaf pages, since internal pages never
+ * contain posting tuples.
*/
ii = PageGetItemId(opage, OffsetNumberPrev(last_off));
lastleft = (IndexTuple) PageGetItem(opage, ii);
@@ -1011,6 +1020,7 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
* the minimum key for the new page.
*/
state->btps_minkey = CopyIndexTuple(oitup);
+ Assert(!BTreeTupleIsPosting(state->btps_minkey));
/*
* Set the sibling links for both pages.
@@ -1050,8 +1060,36 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
if (last_off == P_HIKEY)
{
Assert(state->btps_minkey == NULL);
- state->btps_minkey = CopyIndexTuple(itup);
- /* _bt_sortaddtup() will perform full truncation later */
+
+ /*
+ * The stashed copy must be a non-posting tuple, with its posting
+ * list truncated and a correct t_tid, since we're going to use it
+ * to build a downlink.
+ */
+ if (BTreeTupleIsPosting(itup))
+ {
+ Size keytupsz;
+ IndexTuple keytup;
+
+ /*
+ * Form a key tuple that doesn't contain any ipd. NOTE: since we'll
+ * need the TID later, set t_tid to the first TID from the posting list.
+ */
+ keytupsz = BTreeTupleGetPostingOffset(itup);
+ keytup = palloc0(keytupsz);
+ memcpy(keytup, itup, keytupsz);
+
+ keytup->t_info &= ~INDEX_SIZE_MASK;
+ keytup->t_info |= keytupsz;
+ ItemPointerCopy(BTreeTupleGetPosting(itup), &keytup->t_tid);
+ state->btps_minkey = CopyIndexTuple(keytup);
+ pfree(keytup);
+ }
+ else
+ {
+ /* _bt_sortaddtup() will perform full truncation later */
+ state->btps_minkey = CopyIndexTuple(itup);
+ }
BTreeTupleSetNAtts(state->btps_minkey, 0);
}
@@ -1137,6 +1175,89 @@ _bt_uppershutdown(BTWriteState *wstate, BTPageState *state)
}
/*
+ * Add new tuple (posting or non-posting) to the page, while building index.
+ */
+static void
+insert_itupprev_to_page_buildadd(BTWriteState *wstate, BTPageState *state,
+ BTCompressState *compressState)
+{
+ IndexTuple to_insert;
+
+ /* Return, if there is no tuple to insert */
+ if (state == NULL)
+ return;
+
+ if (compressState->ntuples == 0)
+ to_insert = compressState->itupprev;
+ else
+ {
+ IndexTuple postingtuple;
+
+ /* form a tuple with a posting list */
+ postingtuple = BTreeFormPostingTuple(compressState->itupprev,
+ compressState->ipd,
+ compressState->ntuples);
+ to_insert = postingtuple;
+ pfree(compressState->ipd);
+ }
+
+ _bt_buildadd(wstate, state, to_insert);
+
+ if (compressState->ntuples > 0)
+ pfree(to_insert);
+ compressState->ntuples = 0;
+}
+
+/*
+ * Save item pointer(s) of itup to the posting list in compressState.
+ * Helper function for _bt_load() and _bt_compress_one_page().
+ *
+ * Note: caller is responsible for size check to ensure that
+ * resulting tuple won't exceed BTMaxItemSize.
+ */
+void
+add_item_to_posting(BTCompressState *compressState, IndexTuple itup)
+{
+ int nposting = 0;
+
+ if (compressState->ntuples == 0)
+ {
+ compressState->ipd = palloc0(compressState->maxitemsize);
+
+ if (BTreeTupleIsPosting(compressState->itupprev))
+ {
+ /* if itupprev is posting, add all its TIDs to the posting list */
+ nposting = BTreeTupleGetNPosting(compressState->itupprev);
+ memcpy(compressState->ipd, BTreeTupleGetPosting(compressState->itupprev),
+ sizeof(ItemPointerData) * nposting);
+ compressState->ntuples += nposting;
+ }
+ else
+ {
+ memcpy(compressState->ipd, &compressState->itupprev->t_tid,
+ sizeof(ItemPointerData));
+ compressState->ntuples++;
+ }
+ }
+
+ if (BTreeTupleIsPosting(itup))
+ {
+ /* if tuple is posting, add all its TIDs to the posting list */
+ nposting = BTreeTupleGetNPosting(itup);
+ memcpy(compressState->ipd + compressState->ntuples,
+ BTreeTupleGetPosting(itup),
+ sizeof(ItemPointerData) * nposting);
+ compressState->ntuples += nposting;
+ }
+ else
+ {
+ memcpy(compressState->ipd + compressState->ntuples, &itup->t_tid,
+ sizeof(ItemPointerData));
+ compressState->ntuples++;
+ }
+}
+
+/*
* Read tuples in correct sort order from tuplesort, and load them into
* btree leaves.
*/
@@ -1150,9 +1271,21 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
bool load1;
TupleDesc tupdes = RelationGetDescr(wstate->index);
int i,
- keysz = IndexRelationGetNumberOfKeyAttributes(wstate->index);
+ keysz = IndexRelationGetNumberOfKeyAttributes(wstate->index),
+ natts = IndexRelationGetNumberOfAttributes(wstate->index);
SortSupport sortKeys;
int64 tuples_done = 0;
+ bool use_compression = false;
+ BTCompressState *compressState = NULL;
+
+ /*
+ * Don't use compression for indexes with INCLUDEd columns, system indexes
+ * and unique indexes.
+ */
+ use_compression = ((IndexRelationGetNumberOfKeyAttributes(wstate->index) ==
+ IndexRelationGetNumberOfAttributes(wstate->index))
+ && (!IsSystemRelation(wstate->index))
+ && (!wstate->index->rd_index->indisunique));
if (merge)
{
@@ -1266,19 +1399,88 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
}
else
{
- /* merge is unnecessary */
- while ((itup = tuplesort_getindextuple(btspool->sortstate,
- true)) != NULL)
+ if (!use_compression)
{
- /* When we see first tuple, create first index page */
- if (state == NULL)
- state = _bt_pagestate(wstate, 0);
+ /* merge is unnecessary */
+ while ((itup = tuplesort_getindextuple(btspool->sortstate,
+ true)) != NULL)
+ {
+ /* When we see first tuple, create first index page */
+ if (state == NULL)
+ state = _bt_pagestate(wstate, 0);
- _bt_buildadd(wstate, state, itup);
+ _bt_buildadd(wstate, state, itup);
- /* Report progress */
- pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
- ++tuples_done);
+ /* Report progress */
+ pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
+ ++tuples_done);
+ }
+ }
+ else
+ {
+ /* init compress state needed to build posting tuples */
+ compressState = (BTCompressState *) palloc0(sizeof(BTCompressState));
+ compressState->ipd = NULL;
+ compressState->ntuples = 0;
+ compressState->itupprev = NULL;
+ compressState->maxitemsize = 0;
+ compressState->maxpostingsize = 0;
+
+ while ((itup = tuplesort_getindextuple(btspool->sortstate,
+ true)) != NULL)
+ {
+ /* When we see first tuple, create first index page */
+ if (state == NULL)
+ {
+ state = _bt_pagestate(wstate, 0);
+ compressState->maxitemsize = BTMaxItemSize(state->btps_page);
+ }
+
+ if (compressState->itupprev != NULL)
+ {
+ int n_equal_atts = _bt_keep_natts_fast(wstate->index,
+ compressState->itupprev, itup);
+
+ if (n_equal_atts > natts)
+ {
+ /*
+ * Tuples are equal. Create or update posting.
+ *
+ * Otherwise, if the posting list is too big, insert it
+ * on the page and continue.
+ */
+ if ((compressState->ntuples + 1) * sizeof(ItemPointerData) <
+ compressState->maxpostingsize)
+ add_item_to_posting(compressState, itup);
+ else
+ insert_itupprev_to_page_buildadd(wstate, state, compressState);
+ }
+ else
+ {
+ /*
+ * Tuples are not equal. Insert itupprev into index.
+ * Save current tuple for the next iteration.
+ */
+ insert_itupprev_to_page_buildadd(wstate, state, compressState);
+ }
+ }
+
+ /*
+ * Save the tuple to compare it with the next one and maybe
+ * unite them into a posting tuple.
+ */
+ if (compressState->itupprev)
+ pfree(compressState->itupprev);
+ compressState->itupprev = CopyIndexTuple(itup);
+
+ /* compute max size of posting list */
+ compressState->maxpostingsize = compressState->maxitemsize -
+ IndexInfoFindDataOffset(compressState->itupprev->t_info) -
+ MAXALIGN(IndexTupleSize(compressState->itupprev));
+ }
+
+ /* Handle the last item */
+ insert_itupprev_to_page_buildadd(wstate, state, compressState);
}
}
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 93fab26..0da6fa8 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -1787,7 +1787,9 @@ _bt_killitems(IndexScanDesc scan)
ItemId iid = PageGetItemId(page, offnum);
IndexTuple ituple = (IndexTuple) PageGetItem(page, iid);
- if (ItemPointerEquals(&ituple->t_tid, &kitem->heapTid))
+ /* No microvacuum for posting tuples */
+ if (!BTreeTupleIsPosting(ituple) &&
+ (ItemPointerEquals(&ituple->t_tid, &kitem->heapTid)))
{
/* found the item */
ItemIdMarkDead(iid);
@@ -2145,6 +2147,16 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
pivot = index_truncate_tuple(itupdesc, firstright, keepnatts);
+ if (BTreeTupleIsPosting(firstright))
+ {
+ BTreeTupleClearBtIsPosting(pivot);
+ BTreeTupleSetNAtts(pivot, keepnatts);
+ pivot->t_info &= ~INDEX_SIZE_MASK;
+ pivot->t_info |= BTreeTupleGetPostingOffset(firstright);
+ }
+
+ Assert(!BTreeTupleIsPosting(pivot));
+
/*
* If there is a distinguishing key attribute within new pivot tuple,
* there is no need to add an explicit heap TID attribute
@@ -2168,6 +2180,27 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
pfree(pivot);
pivot = tidpivot;
}
+ else if (BTreeTupleIsPosting(firstright))
+ {
+ /*
+ * No truncation was possible, since key attributes are all equal. But
+ * the tuple is a compressed tuple with a posting list, so we still
+ * must truncate it.
+ *
+ * It's necessary to add a heap TID attribute to the new pivot tuple.
+ */
+ newsize = BTreeTupleGetPostingOffset(firstright) +
+ MAXALIGN(sizeof(ItemPointerData));
+ pivot = palloc0(newsize);
+ memcpy(pivot, firstright, BTreeTupleGetPostingOffset(firstright));
+
+ pivot->t_info &= ~INDEX_SIZE_MASK;
+ pivot->t_info |= newsize;
+ BTreeTupleClearBtIsPosting(pivot);
+ BTreeTupleSetAltHeapTID(pivot);
+
+ Assert(!BTreeTupleIsPosting(pivot));
+ }
else
{
/*
@@ -2205,7 +2238,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
*/
pivotheaptid = (ItemPointer) ((char *) pivot + newsize -
sizeof(ItemPointerData));
- ItemPointerCopy(&lastleft->t_tid, pivotheaptid);
+ ItemPointerCopy(BTreeTupleGetMaxTID(lastleft), pivotheaptid);
/*
* Lehman and Yao require that the downlink to the right page, which is to
@@ -2216,9 +2249,12 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
* tiebreaker.
*/
#ifndef DEBUG_NO_TRUNCATE
- Assert(ItemPointerCompare(&lastleft->t_tid, &firstright->t_tid) < 0);
- Assert(ItemPointerCompare(pivotheaptid, &lastleft->t_tid) >= 0);
- Assert(ItemPointerCompare(pivotheaptid, &firstright->t_tid) < 0);
+ Assert(ItemPointerCompare(BTreeTupleGetMaxTID(lastleft),
+ BTreeTupleGetMinTID(firstright)) < 0);
+ Assert(ItemPointerCompare(pivotheaptid,
+ BTreeTupleGetMinTID(lastleft)) >= 0);
+ Assert(ItemPointerCompare(pivotheaptid,
+ BTreeTupleGetMinTID(firstright)) < 0);
#else
/*
@@ -2231,7 +2267,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
* attribute values along with lastleft's heap TID value when lastleft's
* TID happens to be greater than firstright's TID.
*/
- ItemPointerCopy(&firstright->t_tid, pivotheaptid);
+ ItemPointerCopy(BTreeTupleGetMinTID(firstright), pivotheaptid);
/*
* Pivot heap TID should never be fully equal to firstright. Note that
@@ -2240,7 +2276,8 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
*/
ItemPointerSetOffsetNumber(pivotheaptid,
OffsetNumberPrev(ItemPointerGetOffsetNumber(pivotheaptid)));
- Assert(ItemPointerCompare(pivotheaptid, &firstright->t_tid) < 0);
+ Assert(ItemPointerCompare(pivotheaptid,
+ BTreeTupleGetMinTID(firstright)) < 0);
#endif
BTreeTupleSetNAtts(pivot, nkeyatts);
@@ -2330,6 +2367,10 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
* leaving excessive amounts of free space on either side of page split.
* Callers can rely on the fact that attributes considered equal here are
* definitely also equal according to _bt_keep_natts.
+ *
+ * To build a posting tuple we need to ensure that all attributes
+ * of both tuples are equal. Use this function to compare them.
+ * TODO: consider renaming the function.
*/
int
_bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
@@ -2415,7 +2456,7 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
* Non-pivot tuples currently never use alternative heap TID
* representation -- even those within heapkeyspace indexes
*/
- if ((itup->t_info & INDEX_ALT_TID_MASK) != 0)
+ if (BTreeTupleIsPivot(itup))
return false;
/*
@@ -2470,7 +2511,7 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
* that to decide if the tuple is a pre-v11 tuple.
*/
return tupnatts == 0 ||
- ((itup->t_info & INDEX_ALT_TID_MASK) == 0 &&
+ (!BTreeTupleIsPivot(itup) &&
ItemPointerGetOffsetNumber(&(itup->t_tid)) == P_HIKEY);
}
else
@@ -2497,7 +2538,7 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
* heapkeyspace index pivot tuples, regardless of whether or not there are
* non-key attributes.
*/
- if ((itup->t_info & INDEX_ALT_TID_MASK) == 0)
+ if (!BTreeTupleIsPivot(itup))
return false;
/*
@@ -2549,6 +2590,8 @@ _bt_check_third_page(Relation rel, Relation heap, bool needheaptidspace,
if (!needheaptidspace && itemsz <= BTMaxItemSizeNoHeapTid(page))
return;
+ /* TODO: correct error messages for posting tuples */
+
/*
* Internal page insertions cannot fail here, because that would mean that
* an earlier leaf level insertion that should have failed didn't
@@ -2575,3 +2618,79 @@ _bt_check_third_page(Relation rel, Relation heap, bool needheaptidspace,
"or use full text indexing."),
errtableconstraint(heap, RelationGetRelationName(rel))));
}
+
+/*
+ * Given a base tuple that contains the key datum, and a posting list,
+ * build a posting tuple.
+ *
+ * The base tuple may itself be a posting tuple, but only its key part is
+ * used; all ItemPointers must be passed via ipd.
+ *
+ * If nipd == 1, fall back to building a non-posting tuple.
+ * This avoids storage overhead after a posting tuple has been vacuumed.
+ */
+IndexTuple
+BTreeFormPostingTuple(IndexTuple tuple, ItemPointerData *ipd, int nipd)
+{
+ uint32 keysize,
+ newsize = 0;
+ IndexTuple itup;
+
+ /* We only need key part of the tuple */
+ if (BTreeTupleIsPosting(tuple))
+ keysize = BTreeTupleGetPostingOffset(tuple);
+ else
+ keysize = IndexTupleSize(tuple);
+
+ Assert(nipd > 0);
+
+ /* Add space needed for posting list */
+ if (nipd > 1)
+ newsize = SHORTALIGN(keysize) + sizeof(ItemPointerData) * nipd;
+ else
+ newsize = keysize;
+
+ newsize = MAXALIGN(newsize);
+ itup = palloc0(newsize);
+ memcpy(itup, tuple, keysize);
+ itup->t_info &= ~INDEX_SIZE_MASK;
+ itup->t_info |= newsize;
+
+ if (nipd > 1)
+ {
+ /* Form posting tuple, fill posting fields */
+
+ /* Set meta info about the posting list */
+ itup->t_info |= INDEX_ALT_TID_MASK;
+ BTreeSetPostingMeta(itup, nipd, SHORTALIGN(keysize));
+
+ /* sort the list to preserve TID order invariant */
+ qsort((void *) ipd, nipd, sizeof(ItemPointerData),
+ (int (*) (const void *, const void *)) ItemPointerCompare);
+
+ /* Copy posting list into the posting tuple */
+ memcpy(BTreeTupleGetPosting(itup), ipd,
+ sizeof(ItemPointerData) * nipd);
+ }
+ else
+ {
+ /* To finish building of a non-posting tuple, copy TID from ipd */
+ itup->t_info &= ~INDEX_ALT_TID_MASK;
+ ItemPointerCopy(ipd, &itup->t_tid);
+ }
+
+ return itup;
+}
+
+/*
+ * Opposite of BTreeFormPostingTuple.
+ * Returns a regular tuple that contains the key;
+ * its TID is the nth TID of the original tuple's posting list.
+ * The result tuple is palloc'd in the caller's context.
+ */
+IndexTuple
+BTreeGetNthTupleOfPosting(IndexTuple tuple, int n)
+{
+ Assert(BTreeTupleIsPosting(tuple));
+ return BTreeFormPostingTuple(tuple, BTreeTupleGetPostingN(tuple, n), 1);
+}
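
A note for readers following BTreeFormPostingTuple: the size arithmetic and the sort step can be tried outside the server. The sketch below is only a standalone model. SketchTid, the sketch_* names, the 6-byte TID size and the 8-byte MAXALIGN are assumptions made for illustration, not code from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for ItemPointerData: block number plus offset. */
typedef struct SketchTid
{
	uint32_t	block;
	uint16_t	offset;
} SketchTid;

/* Assumed: on-disk TID is 6 bytes, MAXALIGN is 8 (typical 64-bit build). */
#define SKETCH_TID_SIZE			6
#define SKETCH_SHORTALIGN(len)	(((size_t) (len) + 1) & ~(size_t) 1)
#define SKETCH_MAXALIGN(len)	(((size_t) (len) + 7) & ~(size_t) 7)

/* Order TIDs by block number first, then by offset within the block. */
int
sketch_tid_compare(const void *a, const void *b)
{
	const SketchTid *x = (const SketchTid *) a;
	const SketchTid *y = (const SketchTid *) b;

	if (x->block != y->block)
		return x->block < y->block ? -1 : 1;
	if (x->offset != y->offset)
		return x->offset < y->offset ? -1 : 1;
	return 0;
}

/* Sort a TID array so that the posting list keeps ascending TID order. */
void
sketch_sort_tids(SketchTid *tids, int n)
{
	qsort(tids, (size_t) n, sizeof(SketchTid), sketch_tid_compare);
}

/*
 * Mirror of the size computation: a single TID keeps the plain tuple size,
 * several TIDs are appended after the short-aligned key part.
 */
size_t
sketch_posting_tuple_size(size_t keysize, int nipd)
{
	size_t		newsize;

	assert(nipd > 0);
	if (nipd > 1)
		newsize = SKETCH_SHORTALIGN(keysize) + SKETCH_TID_SIZE * (size_t) nipd;
	else
		newsize = keysize;

	return SKETCH_MAXALIGN(newsize);
}
```

Under these assumptions a 16-byte key part with three TIDs needs 16 + 18 = 34 bytes, which rounds up to 40; with one TID the tuple stays at 16 bytes.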
diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index 3147ea4..7daadc9 100644
--- a/src/backend/access/nbtree/nbtxlog.c
+++ b/src/backend/access/nbtree/nbtxlog.c
@@ -384,8 +384,8 @@ btree_xlog_vacuum(XLogReaderState *record)
Buffer buffer;
Page page;
BTPageOpaque opaque;
-#ifdef UNUSED
xl_btree_vacuum *xlrec = (xl_btree_vacuum *) XLogRecGetData(record);
+#ifdef UNUSED
/*
* This section of code is thought to be no longer needed, after analysis
@@ -476,14 +476,35 @@ btree_xlog_vacuum(XLogReaderState *record)
if (len > 0)
{
- OffsetNumber *unused;
- OffsetNumber *unend;
+ if (xlrec->nremaining)
+ {
+ int i;
+ OffsetNumber *remainingoffset;
+ IndexTuple remaining;
+ Size itemsz;
+
+ remainingoffset = (OffsetNumber *)
+ (ptr + xlrec->ndeleted * sizeof(OffsetNumber));
+ remaining = (IndexTuple) ((char *) remainingoffset +
+ xlrec->nremaining * sizeof(OffsetNumber));
- unused = (OffsetNumber *) ptr;
- unend = (OffsetNumber *) ((char *) ptr + len);
+ /* Handle posting tuples */
+ for (i = 0; i < xlrec->nremaining; i++)
+ {
+ PageIndexTupleDelete(page, remainingoffset[i]);
+
+ itemsz = MAXALIGN(IndexTupleSize(remaining));
+
+ if (PageAddItem(page, (Item) remaining, itemsz, remainingoffset[i],
+ false, false) == InvalidOffsetNumber)
+ elog(PANIC, "btree_xlog_vacuum: failed to add remaining item");
+
+ remaining = (IndexTuple) ((char *) remaining + itemsz);
+ }
+ }
- if ((unend - unused) > 0)
- PageIndexMultiDelete(page, unused, unend - unused);
+ if (xlrec->ndeleted)
+ PageIndexMultiDelete(page, (OffsetNumber *) ptr, xlrec->ndeleted);
}
/*
diff --git a/src/include/access/itup.h b/src/include/access/itup.h
index 744ffb6..85ee040 100644
--- a/src/include/access/itup.h
+++ b/src/include/access/itup.h
@@ -141,6 +141,11 @@ typedef IndexAttributeBitMapData * IndexAttributeBitMap;
* On such a page, N tuples could take one MAXALIGN quantum less space than
* estimated here, seemingly allowing one more tuple than estimated here.
* But such a page always has at least MAXALIGN special space, so we're safe.
+ *
+ * Note: btree leaf pages may contain posting tuples, which store duplicates
+ * more efficiently, so such pages may hold more tuples than estimated here.
+ * Use MaxPostingIndexTuplesPerPage for them instead.
+
*/
#define MaxIndexTuplesPerPage \
((int) ((BLCKSZ - SizeOfPageHeaderData) / \
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index a3583f2..7d0d456 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -234,8 +234,7 @@ typedef struct BTMetaPageData
* t_tid | t_info | key values | INCLUDE columns, if any
*
* t_tid points to the heap TID, which is a tiebreaker key column as of
- * BTREE_VERSION 4. Currently, the INDEX_ALT_TID_MASK status bit is never
- * set for non-pivot tuples.
+ * BTREE_VERSION 4.
*
* All other types of index tuples ("pivot" tuples) only have key columns,
* since pivot tuples only exist to represent how the key space is
@@ -252,6 +251,39 @@ typedef struct BTMetaPageData
* omitted rather than truncated, since its representation is different to
* the non-pivot representation.)
*
+ * Non-pivot posting tuple format:
+ * t_tid | t_info | key values | INCLUDE columns, if any | posting_list[]
+ *
+ * To store duplicate keys more efficiently,
+ * we use a special tuple format: posting tuples.
+ * posting_list is an array of ItemPointerData.
+ *
+ * This type of compression never applies to system indexes, unique indexes
+ * or indexes with INCLUDEd columns.
+ *
+ * To distinguish posting tuples we use the INDEX_ALT_TID_MASK flag in t_info
+ * and the BT_IS_POSTING flag in t_tid.
+ * These flags redefine the content of the posting tuple's t_tid:
+ * - t_tid.ip_blkid contains the offset of the posting list.
+ * - t_tid offset field contains the number of posting items in this tuple.
+ *
+ * The 12 least significant offset bits from t_tid are used to represent
+ * the number of posting items in posting tuples, leaving 4 status
+ * bits (BT_RESERVED_OFFSET_MASK bits), 3 of which are reserved for
+ * future use.
+ * BT_N_POSTING_OFFSET_MASK is large enough to store any number of posting
+ * items, which is constrained by BTMaxItemSize.
+ *
+ * If a page contains so many duplicates that they do not fit into one posting
+ * tuple (bounded by BTMaxItemSize and ), the page may contain several posting
+ * tuples with the same key.
+ * A page can also contain both posting and non-posting tuples with the same key.
+ * Currently, posting tuples always contain at least two TIDs in the posting
+ * list.
+ *
+ * Posting tuples always have the same number of attributes as the index has
+ * generally.
+ *
* Pivot tuple format:
*
* t_tid | t_info | key values | [heap TID]
@@ -281,23 +313,157 @@ typedef struct BTMetaPageData
* bits (BT_RESERVED_OFFSET_MASK bits), 3 of which that are reserved for
* future use. BT_N_KEYS_OFFSET_MASK should be large enough to store any
* number of columns/attributes <= INDEX_MAX_KEYS.
+ * The BT_IS_POSTING bit must be unset for pivot tuples, since we use it
+ * to distinguish posting tuples from pivot tuples.
*
* Note well: The macros that deal with the number of attributes in tuples
- * assume that a tuple with INDEX_ALT_TID_MASK set must be a pivot tuple,
- * and that a tuple without INDEX_ALT_TID_MASK set must be a non-pivot
- * tuple (or must have the same number of attributes as the index has
- * generally in the case of !heapkeyspace indexes). They will need to be
- * updated if non-pivot tuples ever get taught to use INDEX_ALT_TID_MASK
- * for something else.
+ * assume that a tuple with INDEX_ALT_TID_MASK set must be a pivot tuple or
+ * non-pivot posting tuple, and that a tuple without INDEX_ALT_TID_MASK set
+ * must be a non-pivot tuple (or must have the same number of attributes as
+ * the index has generally in the case of !heapkeyspace indexes).
*/
#define INDEX_ALT_TID_MASK INDEX_AM_RESERVED_BIT
/* Item pointer offset bits */
#define BT_RESERVED_OFFSET_MASK 0xF000
#define BT_N_KEYS_OFFSET_MASK 0x0FFF
+#define BT_N_POSTING_OFFSET_MASK 0x0FFF
#define BT_HEAP_TID_ATTR 0x1000
+#define BT_IS_POSTING 0x2000
+
+#define BTreeTupleIsPosting(itup) \
+ ( \
+ ((itup)->t_info & INDEX_ALT_TID_MASK && \
+ ((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) != 0))\
+ )
+
+#define BTreeTupleIsPivot(itup) \
+ ( \
+ ((itup)->t_info & INDEX_ALT_TID_MASK && \
+ ((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) == 0))\
+ )
+
+/*
+ * MaxPostingIndexTuplesPerPage is an upper bound on the number of tuples
+ * that can fit on one btree leaf page.
+ *
+ * Btree leaf pages may contain posting tuples, which store duplicates
+ * more efficiently, so MaxPostingIndexTuplesPerPage is larger than
+ * MaxIndexTuplesPerPage.
+ *
+ * Each leaf page must contain at least three items, so estimate it as
+ * if we have three posting tuples with minimal size keys.
+ */
+#define MaxPostingIndexTuplesPerPage \
+ ((int) ((BLCKSZ - SizeOfPageHeaderData - \
+ 3*((MAXALIGN(sizeof(IndexTupleData) + 1) + sizeof(ItemIdData))) )) / \
+ (sizeof(ItemPointerData)))
-/* Get/set downlink block number */
+/*
+ * Btree-private state needed to build posting tuples.
+ * ipd is a posting list - an array of ItemPointerData.
+ *
+ * While iterating over tuples during index build, or while compressing a
+ * single page, we remember a tuple in itupprev and compare the next one
+ * with it. If the tuples are equal, we save their TIDs in the posting list.
+ * ntuples contains the size of the posting list.
+ *
+ * Use maxitemsize and maxpostingsize to ensure that resulting posting tuple
+ * will satisfy BTMaxItemSize.
+ */
+typedef struct BTCompressState
+{
+ Size maxitemsize;
+ Size maxpostingsize;
+ IndexTuple itupprev;
+ int ntuples;
+ ItemPointerData *ipd;
+} BTCompressState;
+
+/*
+ * For use in _bt_compress_one_page().
+ * If there are only a few uncompressed items on a page,
+ * compression isn't worth applying.
+ * Currently this is just a magic number;
+ * proper benchmarking will probably help to choose a better value.
+ */
+#define BT_COMPRESS_THRESHOLD 10
+
+/* macros to work with posting tuples *BEGIN* */
+#define BTreeTupleSetBtIsPosting(itup) \
+ do { \
+ Assert((itup)->t_info & INDEX_ALT_TID_MASK); \
+ Assert(!((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) != 0)); \
+ ItemPointerSetOffsetNumber(&(itup)->t_tid, \
+ ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) | BT_IS_POSTING); \
+ } while(0)
+
+#define BTreeTupleClearBtIsPosting(itup) \
+ do { \
+ ItemPointerSetOffsetNumber(&(itup)->t_tid, \
+ ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & ~BT_IS_POSTING); \
+ } while(0)
+
+#define BTreeTupleGetNPosting(itup) \
+ ( \
+ ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_N_POSTING_OFFSET_MASK \
+ )
+
+#define BTreeTupleSetNPosting(itup, n) \
+ do { \
+ ItemPointerSetOffsetNumber(&(itup)->t_tid, (n) & BT_N_POSTING_OFFSET_MASK); \
+ BTreeTupleSetBtIsPosting(itup); \
+ } while(0)
+
+/*
+ * If the tuple is a posting tuple, t_tid.ip_blkid contains the offset of the
+ * posting list. The caller is responsible for checking BTreeTupleIsPosting
+ * first, to ensure that the returned value is meaningful.
+ */
+#define BTreeTupleGetPostingOffset(itup) \
+ ItemPointerGetBlockNumberNoCheck(&((itup)->t_tid))
+#define BTreeTupleSetPostingOffset(itup, offset) \
+ ItemPointerSetBlockNumber(&((itup)->t_tid), (offset))
+
+#define BTreeSetPostingMeta(itup, nposting, off) \
+ do { \
+ BTreeTupleSetNPosting(itup, nposting); \
+ BTreeTupleSetPostingOffset(itup, off); \
+ } while(0)
+
+#define BTreeTupleGetPosting(itup) \
+ (ItemPointerData*) ((char*)(itup) + BTreeTupleGetPostingOffset(itup))
+#define BTreeTupleGetPostingN(itup,n) \
+ (ItemPointerData*) (BTreeTupleGetPosting(itup) + (n))
+
+/*
+ * Posting tuples always contain several TIDs.
+ * Functions that use the TID as a tiebreaker can use the two macros
+ * below to ensure correct ordering of TID keys:
+ */
+#define BTreeTupleGetMinTID(itup) \
+ ( \
+ ((itup)->t_info & INDEX_ALT_TID_MASK && \
+ ((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING))) ? \
+ ( \
+ (ItemPointer) BTreeTupleGetPosting(itup) \
+ ) \
+ : \
+ (ItemPointer) &((itup)->t_tid) \
+ )
+#define BTreeTupleGetMaxTID(itup) \
+ ( \
+ ((itup)->t_info & INDEX_ALT_TID_MASK && \
+ ((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING))) ? \
+ ( \
+ (ItemPointer) (BTreeTupleGetPosting(itup) + (BTreeTupleGetNPosting(itup)-1)) \
+ ) \
+ : \
+ (ItemPointer) &((itup)->t_tid) \
+ )
+/* macros to work with posting tuples *END* */
+
+/* Get/set downlink block number */
#define BTreeInnerTupleGetDownLink(itup) \
ItemPointerGetBlockNumberNoCheck(&((itup)->t_tid))
#define BTreeInnerTupleSetDownLink(itup, blkno) \
@@ -326,7 +492,8 @@ typedef struct BTMetaPageData
*/
#define BTreeTupleGetNAtts(itup, rel) \
( \
- (itup)->t_info & INDEX_ALT_TID_MASK ? \
+ ((itup)->t_info & INDEX_ALT_TID_MASK && \
+ ((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) == 0)) ? \
( \
ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_N_KEYS_OFFSET_MASK \
) \
@@ -335,6 +502,7 @@ typedef struct BTMetaPageData
)
#define BTreeTupleSetNAtts(itup, n) \
do { \
+ Assert(!BTreeTupleIsPosting(itup)); \
(itup)->t_info |= INDEX_ALT_TID_MASK; \
ItemPointerSetOffsetNumber(&(itup)->t_tid, (n) & BT_N_KEYS_OFFSET_MASK); \
} while(0)
@@ -342,6 +510,8 @@ typedef struct BTMetaPageData
/*
* Get tiebreaker heap TID attribute, if any. Macro works with both pivot
* and non-pivot tuples, despite differences in how heap TID is represented.
+ *
+ * For non-pivot posting tuple it returns the first tid from posting list.
*/
#define BTreeTupleGetHeapTID(itup) \
( \
@@ -351,7 +521,10 @@ typedef struct BTMetaPageData
(ItemPointer) (((char *) (itup) + IndexTupleSize(itup)) - \
sizeof(ItemPointerData)) \
) \
- : (itup)->t_info & INDEX_ALT_TID_MASK ? NULL : (ItemPointer) &((itup)->t_tid) \
+ : (itup)->t_info & INDEX_ALT_TID_MASK ? \
+ (((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) != 0) ? \
+ (ItemPointer) BTreeTupleGetPosting(itup) : NULL) \
+ : (ItemPointer) &((itup)->t_tid) \
)
/*
* Set the heap TID attribute for a tuple that uses the INDEX_ALT_TID_MASK
@@ -360,6 +533,7 @@ typedef struct BTMetaPageData
#define BTreeTupleSetAltHeapTID(itup) \
do { \
Assert((itup)->t_info & INDEX_ALT_TID_MASK); \
+ Assert(!((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) != 0)); \
ItemPointerSetOffsetNumber(&(itup)->t_tid, \
ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) | BT_HEAP_TID_ATTR); \
} while(0)
@@ -567,6 +741,8 @@ typedef struct BTScanPosData
* location in the associated tuple storage workspace.
*/
int nextTupleOffset;
+ /* prevTupleOffset is for posting list handling */
+ int prevTupleOffset;
/*
* The items array is always ordered in index order (ie, increasing
@@ -579,7 +755,7 @@ typedef struct BTScanPosData
int lastItem; /* last valid index in items[] */
int itemIndex; /* current index in items[] */
- BTScanPosItem items[MaxIndexTuplesPerPage]; /* MUST BE LAST */
+ BTScanPosItem items[MaxPostingIndexTuplesPerPage]; /* MUST BE LAST */
} BTScanPosData;
typedef BTScanPosData *BTScanPos;
@@ -763,6 +939,8 @@ extern void _bt_delitems_delete(Relation rel, Buffer buf,
OffsetNumber *itemnos, int nitems, Relation heapRel);
extern void _bt_delitems_vacuum(Relation rel, Buffer buf,
OffsetNumber *itemnos, int nitems,
+ OffsetNumber *remainingoffset,
+ IndexTuple *remaining, int nremaining,
BlockNumber lastBlockVacuumed);
extern int _bt_pagedel(Relation rel, Buffer buf);
@@ -813,6 +991,9 @@ extern bool _bt_check_natts(Relation rel, bool heapkeyspace, Page page,
OffsetNumber offnum);
extern void _bt_check_third_page(Relation rel, Relation heap,
bool needheaptidspace, Page page, IndexTuple newtup);
+extern IndexTuple BTreeFormPostingTuple(IndexTuple tuple, ItemPointerData *ipd,
+ int nipd);
+extern IndexTuple BTreeGetNthTupleOfPosting(IndexTuple tuple, int n);
/*
* prototypes for functions in nbtvalidate.c
@@ -825,5 +1006,7 @@ extern bool btvalidate(Oid opclassoid);
extern IndexBuildResult *btbuild(Relation heap, Relation index,
struct IndexInfo *indexInfo);
extern void _bt_parallel_build_main(dsm_segment *seg, shm_toc *toc);
+extern void add_item_to_posting(BTCompressState *compressState,
+ IndexTuple itup);
#endif /* NBTREE_H */
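
To make the flag encoding in the nbtree.h hunks above easier to check, here is a small standalone model of how BTreeTupleIsPosting and BTreeTupleIsPivot separate the three tuple kinds. The two mask values are copied from the patch; SketchHeader, SketchKind and the sketch_* functions are hypothetical names used only for this illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Mask values as defined in the patch. */
#define SKETCH_BT_N_POSTING_OFFSET_MASK	0x0FFF
#define SKETCH_BT_IS_POSTING			0x2000

/* Hypothetical flattened header: only the two fields the macros inspect. */
typedef struct SketchHeader
{
	bool		alt_tid;		/* stands in for t_info & INDEX_ALT_TID_MASK */
	uint16_t	tid_offset;		/* stands in for the offset half of t_tid */
} SketchHeader;

typedef enum SketchKind
{
	SKETCH_PLAIN,				/* ordinary non-pivot tuple */
	SKETCH_PIVOT,				/* alt-TID set, BT_IS_POSTING clear */
	SKETCH_POSTING				/* alt-TID set, BT_IS_POSTING set */
} SketchKind;

/* Classify a tuple header using the same two tests as the macros. */
SketchKind
sketch_classify(SketchHeader h)
{
	if (!h.alt_tid)
		return SKETCH_PLAIN;
	return (h.tid_offset & SKETCH_BT_IS_POSTING) ? SKETCH_POSTING : SKETCH_PIVOT;
}

/* Build a posting header: item count in the low 12 bits, plus the flag. */
SketchHeader
sketch_make_posting(int nposting)
{
	SketchHeader h;

	h.alt_tid = true;
	h.tid_offset = (uint16_t) ((nposting & SKETCH_BT_N_POSTING_OFFSET_MASK) |
							   SKETCH_BT_IS_POSTING);
	return h;
}

/* Read the item count back, ignoring the status bits. */
int
sketch_nposting(SketchHeader h)
{
	return h.tid_offset & SKETCH_BT_N_POSTING_OFFSET_MASK;
}
```

The 12-bit count field tops out at 4095 items; the header comment relies on BTMaxItemSize keeping real posting lists below that.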
diff --git a/src/include/access/nbtxlog.h b/src/include/access/nbtxlog.h
index 9beccc8..6f60ca5 100644
--- a/src/include/access/nbtxlog.h
+++ b/src/include/access/nbtxlog.h
@@ -173,10 +173,19 @@ typedef struct xl_btree_vacuum
{
BlockNumber lastBlockVacuumed;
- /* TARGET OFFSET NUMBERS FOLLOW */
+ /*
+ * This field helps us find the beginning of the remaining tuples from
+ * posting lists, which follow the array of offset numbers.
+ */
+ uint32 nremaining;
+ uint32 ndeleted;
+
+ /* REMAINING OFFSET NUMBERS FOLLOW (nremaining values) */
+ /* REMAINING TUPLES TO INSERT FOLLOW (if nremaining > 0) */
+ /* TARGET OFFSET NUMBERS FOLLOW (if any) */
} xl_btree_vacuum;
-#define SizeOfBtreeVacuum (offsetof(xl_btree_vacuum, lastBlockVacuumed) + sizeof(BlockNumber))
+#define SizeOfBtreeVacuum (offsetof(xl_btree_vacuum, ndeleted) + sizeof(uint32))
/*
* This is what we need to know about marking an empty branch for deletion.
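
For anyone checking the WAL format: the redo hunk in btree_xlog_vacuum reads the deleted offsets from the start of the block data, then the remaining offsets, then the remaining tuples. The sketch below only models that pointer arithmetic as the redo code performs it. SketchVacuumLayout and sketch_vacuum_layout are hypothetical names, and OffsetNumber is assumed to be a 16-bit type.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed stand-in for OffsetNumber (a 16-bit unsigned type). */
typedef uint16_t SketchOffsetNumber;

/* Byte positions of the three areas inside the registered block data. */
typedef struct SketchVacuumLayout
{
	size_t		deleted_start;				/* ndeleted offset numbers */
	size_t		remaining_offsets_start;	/* nremaining offset numbers */
	size_t		remaining_tuples_start;		/* nremaining rebuilt tuples */
} SketchVacuumLayout;

/* Compute the area boundaries the same way the redo routine walks them. */
SketchVacuumLayout
sketch_vacuum_layout(uint32_t ndeleted, uint32_t nremaining)
{
	SketchVacuumLayout layout;

	layout.deleted_start = 0;
	layout.remaining_offsets_start =
		(size_t) ndeleted * sizeof(SketchOffsetNumber);
	layout.remaining_tuples_start = layout.remaining_offsets_start +
		(size_t) nremaining * sizeof(SketchOffsetNumber);
	return layout;
}
```

With three deleted items and two rewritten posting tuples, the remaining offsets start at byte 6 and the tuples at byte 10.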
Attachment: 0002-btree_compression_pg12_v2.patch (text/x-patch)
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index 26ddf32..c7bb25a 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -42,6 +42,17 @@ static OffsetNumber _bt_findinsertloc(Relation rel,
BTStack stack,
Relation heapRel);
static void _bt_stepright(Relation rel, BTInsertState insertstate, BTStack stack);
+static void _bt_delete_and_insert(Relation rel,
+ Buffer buf,
+ IndexTuple newitup,
+ OffsetNumber newitemoff);
+static void _bt_insertonpg_in_posting(Relation rel, BTScanInsert itup_key,
+ Buffer buf,
+ Buffer cbuf,
+ BTStack stack,
+ IndexTuple itup,
+ OffsetNumber newitemoff,
+ bool split_only_page, int in_posting_offset);
static void _bt_insertonpg(Relation rel, BTScanInsert itup_key,
Buffer buf,
Buffer cbuf,
@@ -51,7 +62,7 @@ static void _bt_insertonpg(Relation rel, BTScanInsert itup_key,
bool split_only_page);
static Buffer _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf,
Buffer cbuf, OffsetNumber newitemoff, Size newitemsz,
- IndexTuple newitem);
+ IndexTuple newitem, int in_posting_offset);
static void _bt_insert_parent(Relation rel, Buffer buf, Buffer rbuf,
BTStack stack, bool is_root, bool is_only);
static bool _bt_pgaddtup(Page page, Size itemsize, IndexTuple itup,
@@ -300,10 +311,17 @@ top:
* search bounds established within _bt_check_unique when insertion is
* checkingunique.
*/
+ insertstate.in_posting_offset = 0;
newitemoff = _bt_findinsertloc(rel, &insertstate, checkingunique,
stack, heapRel);
- _bt_insertonpg(rel, itup_key, insertstate.buf, InvalidBuffer, stack,
- itup, newitemoff, false);
+
+ if (insertstate.in_posting_offset)
+ _bt_insertonpg_in_posting(rel, itup_key, insertstate.buf,
+ InvalidBuffer, stack, itup, newitemoff,
+ false, insertstate.in_posting_offset);
+ else
+ _bt_insertonpg(rel, itup_key, insertstate.buf, InvalidBuffer,
+ stack, itup, newitemoff, false);
}
else
{
@@ -914,6 +932,162 @@ _bt_stepright(Relation rel, BTInsertState insertstate, BTStack stack)
insertstate->bounds_valid = false;
}
+/*
+ * Delete the tuple at offset newitemoff and insert newitup at the same offset.
+ * All free space checks must have been done before calling this function.
+ *
+ * Used when updating a posting tuple.
+ */
+static void
+_bt_delete_and_insert(Relation rel,
+ Buffer buf,
+ IndexTuple newitup,
+ OffsetNumber newitemoff)
+{
+ Page page = BufferGetPage(buf);
+ Size newitupsz = IndexTupleSize(newitup);
+
+ newitupsz = MAXALIGN(newitupsz);
+
+ START_CRIT_SECTION();
+
+ PageIndexTupleDelete(page, newitemoff);
+
+ if (!_bt_pgaddtup(page, newitupsz, newitup, newitemoff))
+ elog(ERROR, "failed to insert compressed item in index \"%s\"",
+ RelationGetRelationName(rel));
+
+ MarkBufferDirty(buf);
+
+ /* Xlog stuff */
+ if (RelationNeedsWAL(rel))
+ {
+ xl_btree_insert xlrec;
+ XLogRecPtr recptr;
+ BTPageOpaque pageop = (BTPageOpaque) PageGetSpecialPointer(page);
+
+ xlrec.offnum = newitemoff;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, SizeOfBtreeInsert);
+
+ Assert(P_ISLEAF(pageop));
+
+ /*
+ * Force full page write to keep code simple
+ * TODO: think of using XLOG_BTREE_INSERT_LEAF with a new tuple's data
+ */
+ XLogRegisterBuffer(0, buf, REGBUF_STANDARD | REGBUF_FORCE_IMAGE);
+ recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_INSERT_LEAF);
+ PageSetLSN(page, recptr);
+ }
+
+ END_CRIT_SECTION();
+}
+
+/*
+ * _bt_insertonpg_in_posting() --
+ * Insert a tuple on a particular page in the index
+ * (compression aware version).
+ *
+ * If the new tuple's key is equal to the key of a posting tuple that already
+ * exists on the page and its TID falls inside the min/max range of the
+ * existing posting list, update the posting tuple.
+ *
+ * This can only happen on a leaf page.
+ *
+ * newitemoff - offset of the posting tuple we must update
+ * in_posting_offset - position of the new tuple's TID in the posting list
+ *
+ * If necessary, split the page.
+ */
+static void
+_bt_insertonpg_in_posting(Relation rel,
+ BTScanInsert itup_key,
+ Buffer buf,
+ Buffer cbuf,
+ BTStack stack,
+ IndexTuple itup,
+ OffsetNumber newitemoff,
+ bool split_only_page,
+ int in_posting_offset)
+{
+ IndexTuple oldtup;
+ IndexTuple lefttup;
+ IndexTuple righttup;
+ ItemPointerData *ipd;
+ IndexTuple newitup;
+ Page page;
+ int nipd, nipd_right;
+
+ page = BufferGetPage(buf);
+ /* get old posting tuple */
+ oldtup = (IndexTuple) PageGetItem(page, PageGetItemId(page, newitemoff));
+ Assert(BTreeTupleIsPosting(oldtup));
+ nipd = BTreeTupleGetNPosting(oldtup);
+
+ /*
+ * First, check whether the new item pointer fits into the tuple's posting
+ * list and into the page. If either check fails, split the posting tuple.
+ */
+ if ((BTMaxItemSize(page) < (IndexTupleSize(oldtup) + sizeof(ItemIdData))) ||
+ PageGetFreeSpace(page) < IndexTupleSize(oldtup) + sizeof(ItemPointerData))
+ {
+ /*
+ * Split the posting tuple into two parts.
+ * The left tuple contains all item pointers less than the new one,
+ * and the right tuple contains the new item pointer and all to the right.
+ * TODO: we can probably come up with a cleverer algorithm.
+ */
+ lefttup = BTreeFormPostingTuple(oldtup, BTreeTupleGetPosting(oldtup), in_posting_offset);
+
+ nipd_right = nipd - in_posting_offset + 1;
+ ipd = palloc0(sizeof(ItemPointerData)*(nipd_right));
+ /* insert new item pointer */
+ memcpy(ipd, itup, sizeof(ItemPointerData));
+ /* copy item pointers from old tuple */
+ memcpy(ipd+1,
+ BTreeTupleGetPostingN(oldtup, in_posting_offset),
+ sizeof(ItemPointerData)*(nipd-in_posting_offset));
+
+ righttup = BTreeFormPostingTuple(oldtup, ipd, nipd_right);
+
+ /*
+ * Replace the old tuple on the page with the left tuple,
+ * and insert the right tuple using the ordinary _bt_insertonpg() function.
+ * If a split is required, _bt_insertonpg will handle it.
+ */
+ _bt_delete_and_insert(rel, buf, lefttup, newitemoff);
+ _bt_insertonpg(rel, itup_key, buf, InvalidBuffer,
+ stack, righttup, newitemoff, false);
+
+ pfree(ipd);
+ pfree(lefttup);
+ pfree(righttup);
+ }
+ else
+ {
+ ipd = palloc0(sizeof(ItemPointerData)*(nipd + 1));
+
+ /* copy the item pointers that precede the new TID from the old tuple */
+ memcpy(ipd, BTreeTupleGetPosting(oldtup), sizeof(ItemPointerData)*in_posting_offset);
+ /* add item pointer of the new tuple into ipd */
+ memcpy(ipd+in_posting_offset, itup, sizeof(ItemPointerData));
+ /* copy the item pointers that follow the new TID from the old tuple */
+ memcpy(ipd+in_posting_offset+1,
+ BTreeTupleGetPostingN(oldtup, in_posting_offset),
+ sizeof(ItemPointerData)*(nipd-in_posting_offset));
+
+ newitup = BTreeFormPostingTuple(itup, ipd, nipd+1);
+
+ _bt_delete_and_insert(rel, buf, newitup, newitemoff);
+
+ pfree(ipd);
+ pfree(newitup);
+ _bt_relbuf(rel, buf);
+ }
+}
+
/*----------
* _bt_insertonpg() -- Insert a tuple on a particular page in the index.
*
@@ -1010,7 +1184,7 @@ _bt_insertonpg(Relation rel,
BlockNumberIsValid(RelationGetTargetBlock(rel))));
/* split the buffer into left and right halves */
- rbuf = _bt_split(rel, itup_key, buf, cbuf, newitemoff, itemsz, itup);
+ rbuf = _bt_split(rel, itup_key, buf, cbuf, newitemoff, itemsz, itup, 0);
PredicateLockPageSplit(rel,
BufferGetBlockNumber(buf),
BufferGetBlockNumber(rbuf));
@@ -1228,7 +1402,8 @@ _bt_insertonpg(Relation rel,
*/
static Buffer
_bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
- OffsetNumber newitemoff, Size newitemsz, IndexTuple newitem)
+ OffsetNumber newitemoff, Size newitemsz, IndexTuple newitem,
+ int in_posting_offset)
{
Buffer rbuf;
Page origpage;
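
The two branches of _bt_insertonpg_in_posting boil down to array surgery on the posting list. Below is a minimal standalone sketch of both cases, assuming each TID is modelled as one 64-bit integer; SketchTid and the sketch_* functions are hypothetical and not part of the patch.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical: one integer per heap TID, ordered like the real TIDs. */
typedef uint64_t SketchTid;

/*
 * Merge case: copy old[0..pos), then the new TID, then old[pos..n).
 * dst must have room for n + 1 entries. Returns the new item count.
 */
int
sketch_posting_insert(SketchTid *dst, const SketchTid *old, int n,
					  int pos, SketchTid newtid)
{
	assert(pos >= 0 && pos <= n);
	memcpy(dst, old, sizeof(SketchTid) * (size_t) pos);
	dst[pos] = newtid;
	memcpy(dst + pos + 1, old + pos, sizeof(SketchTid) * (size_t) (n - pos));
	return n + 1;
}

/*
 * Split case: the left tuple keeps old[0..pos) unchanged, so only the
 * right list is built here: the new TID followed by old[pos..n).
 * right must have room for n - pos + 1 entries. Returns its item count.
 */
int
sketch_posting_split_right(SketchTid *right, const SketchTid *old, int n,
						   int pos, SketchTid newtid)
{
	assert(pos >= 0 && pos <= n);
	right[0] = newtid;
	memcpy(right + 1, old + pos, sizeof(SketchTid) * (size_t) (n - pos));
	return n - pos + 1;
}
```

For an old list {10, 20, 40} and a new TID 30 at position 2, the merge case yields {10, 20, 30, 40}; the split case leaves {10, 20} on the left and builds {30, 40} on the right.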
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 49a1aae..58a050f 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -507,7 +507,7 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
/* We have low <= mid < high, so mid points at a real slot */
- result = _bt_compare(rel, key, page, mid);
+ result = _bt_compare_posting(rel, key, page, mid, &(insertstate->in_posting_offset));
if (result >= cmpval)
low = mid + 1;
@@ -536,6 +536,45 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
return low;
}
+/*
+ * Compare insertion-type scankey to tuple on a page,
+ * taking into account posting tuples.
+ * If the key of the posting tuple is equal to scankey,
+ * find the exact position inside the posting list,
+ * using TID as an extra attribute.
+ */
+int32
+_bt_compare_posting(Relation rel,
+ BTScanInsert key,
+ Page page,
+ OffsetNumber offnum,
+ int *in_posting_offset)
+{
+ IndexTuple itup = (IndexTuple) PageGetItem(page,
+ PageGetItemId(page, offnum));
+ int result = _bt_compare(rel, key, page, offnum);
+ if (BTreeTupleIsPosting(itup) && result == 0)
+ {
+ int low, high, mid, res;
+
+ low = 0;
+ high = BTreeTupleGetNPosting(itup);
+
+ while (high > low)
+ {
+ mid = low + ((high - low) / 2);
+ res = ItemPointerCompare(key->scantid, BTreeTupleGetPostingN(itup, mid));
+
+ if (res == -1)
+ high = mid;
+ else
+ low = mid + 1;
+ }
+ *in_posting_offset = low;	/* low == high is the insert position */
+ }
+ return result;
+}
+
/*----------
* _bt_compare() -- Compare insertion-type scankey to tuple on a page.
*
@@ -668,64 +707,112 @@ _bt_compare(Relation rel,
* Use the heap TID attribute and scantid to try to break the tie. The
* rules are the same as any other key attribute -- only the
* representation differs.
- * TODO when itup is a posting tuple, the check becomes more complex.
- * we have an option that key nor smaller, nor larger than the tuple,
- * but exactly in between of BTreeTupleGetMinTID to BTreeTupleGetMaxTID.
+ *
+ * When itup is a posting tuple, the check becomes more complex.
+ * The scankey's TID may fall within the TID range of the tuple's
+ * posting list.
+ * _bt_compare() is multipurpose, so in that case it just returns 0 to
+ * indicate that the key matches the tuple at this offset.
+ * Use the _bt_compare_posting() wrapper function to handle this case:
+ * it rechecks the posting tuple and finds the exact position of the
+ * scankey.
*/
- heapTid = BTreeTupleGetHeapTID(itup);
- if (key->scantid == NULL)
+ if (!BTreeTupleIsPosting(itup))
{
+ heapTid = BTreeTupleGetHeapTID(itup);
+ if (key->scantid == NULL)
+ {
+ /*
+ * Most searches have a scankey that is considered greater than a
+ * truncated pivot tuple if and when the scankey has equal values for
+ * attributes up to and including the least significant untruncated
+ * attribute in tuple.
+ *
+ * For example, if an index has the minimum two attributes (single
+ * user key attribute, plus heap TID attribute), and a page's high key
+ * is ('foo', -inf), and scankey is ('foo', <omitted>), the search
+ * will not descend to the page to the left. The search will descend
+ * right instead. The truncated attribute in pivot tuple means that
+ * all non-pivot tuples on the page to the left are strictly < 'foo',
+ * so it isn't necessary to descend left. In other words, search
+ * doesn't have to descend left because it isn't interested in a match
+ * that has a heap TID value of -inf.
+ *
+ * However, some searches (pivotsearch searches) actually require that
+ * we descend left when this happens. -inf is treated as a possible
+ * match for omitted scankey attribute(s). This is needed by page
+ * deletion, which must re-find leaf pages that are targets for
+ * deletion using their high keys.
+ *
+ * Note: the heap TID part of the test ensures that scankey is being
+ * compared to a pivot tuple with one or more truncated key
+ * attributes.
+ *
+ * Note: pg_upgrade'd !heapkeyspace indexes must always descend to the
+ * left here, since they have no heap TID attribute (and cannot have
+ * any -inf key values in any case, since truncation can only remove
+ * non-key attributes). !heapkeyspace searches must always be
+ * prepared to deal with matches on both sides of the pivot once the
+ * leaf level is reached.
+ */
+ if (key->heapkeyspace && !key->pivotsearch &&
+ key->keysz == ntupatts && heapTid == NULL)
+ return 1;
+
+ /* All provided scankey arguments found to be equal */
+ return 0;
+ }
+
/*
- * Most searches have a scankey that is considered greater than a
- * truncated pivot tuple if and when the scankey has equal values for
- * attributes up to and including the least significant untruncated
- * attribute in tuple.
- *
- * For example, if an index has the minimum two attributes (single
- * user key attribute, plus heap TID attribute), and a page's high key
- * is ('foo', -inf), and scankey is ('foo', <omitted>), the search
- * will not descend to the page to the left. The search will descend
- * right instead. The truncated attribute in pivot tuple means that
- * all non-pivot tuples on the page to the left are strictly < 'foo',
- * so it isn't necessary to descend left. In other words, search
- * doesn't have to descend left because it isn't interested in a match
- * that has a heap TID value of -inf.
- *
- * However, some searches (pivotsearch searches) actually require that
- * we descend left when this happens. -inf is treated as a possible
- * match for omitted scankey attribute(s). This is needed by page
- * deletion, which must re-find leaf pages that are targets for
- * deletion using their high keys.
- *
- * Note: the heap TID part of the test ensures that scankey is being
- * compared to a pivot tuple with one or more truncated key
- * attributes.
- *
- * Note: pg_upgrade'd !heapkeyspace indexes must always descend to the
- * left here, since they have no heap TID attribute (and cannot have
- * any -inf key values in any case, since truncation can only remove
- * non-key attributes). !heapkeyspace searches must always be
- * prepared to deal with matches on both sides of the pivot once the
- * leaf level is reached.
- */
- if (key->heapkeyspace && !key->pivotsearch &&
- key->keysz == ntupatts && heapTid == NULL)
+ * Treat truncated heap TID as minus infinity, since scankey has a key
+ * attribute value (scantid) that would otherwise be compared directly
+ */
+ Assert(key->keysz == IndexRelationGetNumberOfKeyAttributes(rel));
+ if (heapTid == NULL)
return 1;
- /* All provided scankey arguments found to be equal */
- return 0;
+ Assert(ntupatts >= IndexRelationGetNumberOfKeyAttributes(rel));
+ return ItemPointerCompare(key->scantid, heapTid);
}
+ else
+ {
+ heapTid = BTreeTupleGetMinTID(itup);
+ if (key->scantid != NULL && heapTid != NULL)
+ {
+ int cmp = ItemPointerCompare(key->scantid, heapTid);
+ if (cmp == -1 || cmp == 0)
+ {
+ elog(DEBUG4, "offnum %d Scankey (%u,%u) is less than posting tuple (%u,%u)",
+ offnum, ItemPointerGetBlockNumberNoCheck(key->scantid),
+ ItemPointerGetOffsetNumberNoCheck(key->scantid),
+ ItemPointerGetBlockNumberNoCheck(heapTid),
+ ItemPointerGetOffsetNumberNoCheck(heapTid));
+ return cmp;
+ }
- /*
- * Treat truncated heap TID as minus infinity, since scankey has a key
- * attribute value (scantid) that would otherwise be compared directly
- */
- Assert(key->keysz == IndexRelationGetNumberOfKeyAttributes(rel));
- if (heapTid == NULL)
- return 1;
+ heapTid = BTreeTupleGetMaxTID(itup);
+ cmp = ItemPointerCompare(key->scantid, heapTid);
+ if (cmp == 1)
+ {
+ elog(DEBUG4, "offnum %d Scankey (%u,%u) is greater than posting tuple (%u,%u)",
+ offnum, ItemPointerGetBlockNumberNoCheck(key->scantid),
+ ItemPointerGetOffsetNumberNoCheck(key->scantid),
+ ItemPointerGetBlockNumberNoCheck(heapTid),
+ ItemPointerGetOffsetNumberNoCheck(heapTid));
+ return cmp;
+ }
- Assert(ntupatts >= IndexRelationGetNumberOfKeyAttributes(rel));
- return ItemPointerCompare(key->scantid, heapTid);
+ /* if we got here, scantid is in between the posting items of the tuple */
+ elog(DEBUG4, "offnum %d Scankey (%u,%u) is between posting items (%u,%u) and (%u,%u)",
+ offnum, ItemPointerGetBlockNumberNoCheck(key->scantid),
+ ItemPointerGetOffsetNumberNoCheck(key->scantid),
+ ItemPointerGetBlockNumberNoCheck(BTreeTupleGetMinTID(itup)),
+ ItemPointerGetOffsetNumberNoCheck(BTreeTupleGetMinTID(itup)),
+ ItemPointerGetBlockNumberNoCheck(heapTid),
+ ItemPointerGetOffsetNumberNoCheck(heapTid));
+ return 0;
+ }
+ }
}
/*
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 7d0d456..918043f 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -675,6 +675,13 @@ typedef struct BTInsertStateData
Buffer buf;
/*
+ * if _bt_binsrch_insert() found the location
+ * inside existing posting list,
+ * save the position inside the list.
+ */
+ int in_posting_offset;
+
+ /*
* Cache of bounds within the current buffer. Only used for insertions
* where _bt_check_unique is called. See _bt_binsrch_insert and
* _bt_findinsertloc for details.
@@ -953,6 +960,8 @@ extern Buffer _bt_moveright(Relation rel, BTScanInsert key, Buffer buf,
bool forupdate, BTStack stack, int access, Snapshot snapshot);
extern OffsetNumber _bt_binsrch_insert(Relation rel, BTInsertState insertstate);
extern int32 _bt_compare(Relation rel, BTScanInsert key, Page page, OffsetNumber offnum);
+extern int32 _bt_compare_posting(Relation rel, BTScanInsert key, Page page,
+ OffsetNumber offnum, int *in_posting_offset);
extern bool _bt_first(IndexScanDesc scan, ScanDirection dir);
extern bool _bt_next(IndexScanDesc scan, ScanDirection dir);
extern Buffer _bt_get_endpoint(Relation rel, uint32 level, bool rightmost,
On Fri, Jul 19, 2019 at 10:53 AM Anastasia Lubennikova
<a.lubennikova@postgrespro.ru> wrote:
Patch 0002 (must be applied on top of 0001) implements preserving of
correct TID order
inside posting list when inserting new tuples.
This version passes all regression tests including amcheck test.
I also used following script to test insertion into the posting list:
Nice!
I suppose it is not the final version of the patch yet,
so I left some debug messages and TODO comments to ease review.
I'm fine with leaving them in. I have sometimes distributed a separate
patch with debug messages, but now that I think about it, that
probably wasn't a good use of time.
You will probably want to remove at least some of the debug messages
during performance testing. I'm thinking of code that appears in very
tight inner loops, such as the _bt_compare() code.
Please, in your review, pay particular attention to usage of
BTreeTupleGetHeapTID.
For posting tuples it returns the first tid from posting list like
BTreeTupleGetMinTID,
but maybe some callers are not ready for that and want
BTreeTupleGetMaxTID instead.
Incorrect usage of these macros may cause some subtle bugs,
which are probably not covered by tests. So, please double-check it.
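
To make the min/max distinction concrete, here is a minimal standalone sketch of comparing a scan TID against the range of heap TIDs covered by one posting tuple. It is not the patch's code: DemoTid, demo_tid_compare() and demo_compare_to_posting_range() are hypothetical stand-ins for ItemPointerData, ItemPointerCompare() and the posting branch of _bt_compare(), and the return convention (-1 before the list, 0 anywhere inside it, 1 after it) is only my reading of the diff quoted above.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for ItemPointerData: (block, offset) of a heap tuple. */
typedef struct DemoTid
{
	uint32_t	block;
	uint16_t	offset;
} DemoTid;

/* Three-way comparison, block number first, then offset. Returns -1, 0 or 1. */
static int
demo_tid_compare(const DemoTid *a, const DemoTid *b)
{
	if (a->block != b->block)
		return (a->block < b->block) ? -1 : 1;
	if (a->offset != b->offset)
		return (a->offset < b->offset) ? -1 : 1;
	return 0;
}

/*
 * Compare scantid against a posting list that covers [min, max].
 * Returns -1 if scantid sorts before the list, 1 if it sorts after it,
 * and 0 if it falls anywhere inside the covered range (bounds included).
 */
static int
demo_compare_to_posting_range(const DemoTid *scantid,
							  const DemoTid *min, const DemoTid *max)
{
	/* Assumption: the caller passes a well-formed range */
	assert(demo_tid_compare(min, max) <= 0);

	if (demo_tid_compare(scantid, min) < 0)
		return -1;
	if (demo_tid_compare(scantid, max) > 0)
		return 1;
	return 0;
}
```

A caller that only ever looks at the minimum TID would treat every scantid greater than the minimum as sorting after the whole tuple, which is the kind of subtle bug the paragraph above warns about.
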
One testing strategy that I plan to use for the patch is to
deliberately corrupt a compressed index in a subtle way using
pg_hexedit, and then see if amcheck detects the problem. For example,
I may swap the order of two TIDs in the middle of a posting list,
which is something that is unlikely to produce wrong answers to
queries, and won't even be detected by the "heapallindexed" check, but
is still wrong. If we can detect very subtle, adversarial corruption
like this, then we can detect any real-world problem.
Once we have confidence in amcheck's ability to detect problems with
posting lists in general, we can use it in many different contexts
without much thought. For example, we'll probably need to do long
running benchmarks to validate the performance of the patch. It's easy
to add amcheck testing at the end of each run. Every benchmark is now
also a correctness/stress test, for free.
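
The "swapped TIDs" corruption described above is the kind of thing a strict ordering check catches. The following is a minimal standalone sketch of such a check, not amcheck code: DemoTid and both functions are hypothetical names, and the posting list is modelled as a plain array rather than the on-page representation.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for ItemPointerData: (block, offset) of a heap tuple. */
typedef struct DemoTid
{
	uint32_t	block;
	uint16_t	offset;
} DemoTid;

/* Three-way comparison, block number first, then offset. Returns -1, 0 or 1. */
static int
demo_tid_compare(const DemoTid *a, const DemoTid *b)
{
	if (a->block != b->block)
		return (a->block < b->block) ? -1 : 1;
	if (a->offset != b->offset)
		return (a->offset < b->offset) ? -1 : 1;
	return 0;
}

/*
 * Return the index of the first TID that is not strictly greater than its
 * predecessor, or -1 if the whole array is in strictly ascending order.
 * Duplicates count as corruption, just like out-of-order entries.
 */
static int
demo_first_misordered_tid(const DemoTid *tids, int ntids)
{
	assert(ntids >= 0);

	for (int i = 1; i < ntids; i++)
	{
		if (demo_tid_compare(&tids[i], &tids[i - 1]) <= 0)
			return i;
	}
	return -1;
}
```

Two adjacent TIDs swapped in the middle of an otherwise valid list still point at the same set of heap tuples, so queries keep returning correct answers; only a pairwise comparison like this one notices the problem.
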
Next week I'm going to check performance and try to find specific
scenarios where this
feature can lead to degradation and measure it, to understand if we need
to make this deduplication optional.
Sounds good, though I think it might be a bit too early to decide
whether or not it needs to be enabled by default. For one thing, the
approach to WAL-logging within _bt_compress_one_page() is probably
fairly inefficient, which may be a problem for certain workloads. It's
okay to leave it that way for now, because it is not relevant to the
core design of the patch. I'm sure that _bt_compress_one_page() can be
carefully optimized when the time comes.
My current focus is not on the raw performance itself. For now, I am
focussed on making sure that the compression works well, and that the
resulting indexes "look nice" in general. FWIW, the first few versions
of my v12 work on nbtree didn't actually make *anything* go faster. It
took a couple of months to fix the more important regressions, and a
few more months to fix all of them. I think that the work on this
patch may develop in a similar way. I am willing to accept regressions
in the unoptimized code during development because it seems likely
that you have the right idea about the data structure itself, which is
the one thing that I *really* care about. Once you get that right, the
remaining problems are very likely to either be fixable with further
work on optimizing specific code, or a price that users will mostly be
happy to pay to get the benefits.
--
Peter Geoghegan
On Fri, Jul 19, 2019 at 12:32 PM Peter Geoghegan <pg@bowt.ie> wrote:
On Fri, Jul 19, 2019 at 10:53 AM Anastasia Lubennikova
<a.lubennikova@postgrespro.ru> wrote:
Patch 0002 (must be applied on top of 0001) implements preserving of
correct TID order
inside posting list when inserting new tuples.
This version passes all regression tests including amcheck test.
I also used following script to test insertion into the posting list:
Nice!
Hmm. So, the attached test case fails amcheck verification for me with
the latest version of the patch:
$ psql -f amcheck-compress-test.sql
DROP TABLE
CREATE TABLE
CREATE INDEX
CREATE EXTENSION
INSERT 0 2001
psql:amcheck-compress-test.sql:6: ERROR: down-link lower bound
invariant violated for index "idx_desc_nl"
DETAIL: Parent block=3 child index tid=(2,2) parent page lsn=10/F87A3438.
Note that this test only has an INSERT statement. You have to use
bt_index_parent_check() to see the problem -- bt_index_check() will
not detect the problem.
--
Peter Geoghegan
Attachments:
On Fri, Jul 19, 2019 at 7:24 PM Peter Geoghegan <pg@bowt.ie> wrote:
Hmm. So, the attached test case fails amcheck verification for me with
the latest version of the patch:
Attached is a revised version of your v2 that fixes this issue -- I'll
call this v3. In general, my goal for the revision was to make sure
that all of my old tests from the v12 work passed, and to make sure
that amcheck can detect almost any possible problem. I tested the
amcheck changes by corrupting random state in a test index using
pg_hexedit, then making sure that amcheck actually complained in each
case.
I also fixed one or two bugs in passing, including the bug that caused
an assertion failure in _bt_truncate(). That was down to a subtle
off-by-one issue within _bt_insertonpg_in_posting(). Overall, I didn't
make that many changes to your v2. There are probably some things
about the patch that I still don't understand, or things that I have
misunderstood.
Other changes:
* We now support system catalog indexes. There is no reason not to support them.
* Removed unnecessary code from _bt_buildadd().
* Added my own new DEBUG4 trace to _bt_insertonpg_in_posting(), which
I used to fix that bug I mentioned. I agree that we should keep the
DEBUG4 traces around until the overall design settles down. I found
the ones that you added helpful, too.
* Added quite a few new assertions. For example, we still need to
support !heapkeyspace (pre Postgres 12) nbtree indexes, but we cannot
let them use compression -- new defensive assertions were added to
make this break loudly.
* Changed the custom binary search code within _bt_compare_posting()
to look more like _bt_binsrch() and _bt_binsrch_insert(). Do you know
of any reason not to do it that way?
* Added quite a few "FIXME"/"XXX" comments at various points, to
indicate where I have general concerns that need more discussion.
* Included my own pageinspect hack to visualize the minimum TIDs in
posting lists. It's broken out into a separate patch file. The code is
very rough, but it might help someone else, so I thought I'd include
it.
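
For readers who have not looked at _bt_binsrch(), the binary search item in the list above can be illustrated with a small standalone sketch. This is not the code from the patch: DemoTid and the function names are hypothetical, the posting list is modelled as a plain sorted array, and the loop simply follows the usual "low/high" lower-bound pattern.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for ItemPointerData: (block, offset) of a heap tuple. */
typedef struct DemoTid
{
	uint32_t	block;
	uint16_t	offset;
} DemoTid;

/* Three-way comparison, block number first, then offset. Returns -1, 0 or 1. */
static int
demo_tid_compare(const DemoTid *a, const DemoTid *b)
{
	if (a->block != b->block)
		return (a->block < b->block) ? -1 : 1;
	if (a->offset != b->offset)
		return (a->offset < b->offset) ? -1 : 1;
	return 0;
}

/*
 * Find the offset within a sorted posting list at which newtid belongs.
 *
 * Loop invariant: every slot before "low" holds a TID < newtid, and every
 * slot at or after "high" holds a TID >= newtid.  The loop ends when the two
 * meet, so the result is the first slot whose TID is >= newtid (or ntids if
 * newtid sorts after everything).
 */
static int
demo_posting_insert_offset(const DemoTid *tids, int ntids,
						   const DemoTid *newtid)
{
	int			low = 0;
	int			high = ntids;

	assert(ntids >= 0);

	while (high > low)
	{
		int			mid = low + (high - low) / 2;

		if (demo_tid_compare(newtid, &tids[mid]) > 0)
			low = mid + 1;
		else
			high = mid;
	}

	return low;
}
```

A result of 0 or ntids means the new TID falls outside the existing list; anything in between means the list has to be opened up at that position.
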
I also have some new concerns about the code in the patch that I will
point out now (though only as things to think about -- I am unsure of
the right solutions myself):
* It's a bad sign that compression involves calls to PageAddItem()
that are allowed to fail (we just give up on compression when that
happens). For one thing, no existing call to PageAddItem() in
Postgres is ever expected to fail -- if one does fail we get a "can't
happen" error that suggests corruption. It was a good idea to take
this approach to get the patch to work, and to prove the general idea,
but we now need to fully work out all the details about the use of
space. This includes complicated new questions around how alignment is
supposed to work.
Alignment in nbtree is already complicated today -- you're supposed to
MAXALIGN() everything in nbtree, so that the MAXALIGN() within
bufpage.c routines cannot be different to the lp_len/IndexTupleSize()
length (note that heapam can have tuples whose lp_len isn't aligned,
so nbtree could do it differently if it proved useful). Code within
nbtsplitloc.c fully understands the space requirements for the
bufpage.c routines, and is very careful about it. (The bufpage.c
details are supposed to be totally hidden from code like
nbtsplitloc.c, but I guess that that ideal isn't quite possible in
reality. Code comments don't really explain the situation today.)
I'm not sure what it would look like for this patch to be as precise
about free space as nbtsplitloc.c already is, even though that seems
desirable (I just know that it would mean you would trust
PageAddItem() to work in all cases). The patch is different to what we
already have today in that it tries to add *less than* a single
MAXALIGN() quantum at a time in some places (when a posting list needs
to grow by one item). The devil is in the details.
* As you know, the current approach to WAL logging is very
inefficient. It's okay for now, but we'll need a fine-grained approach
for the patch to be committable. I think that this is subtly related to
the last item (i.e. the one about alignment). I have done basic
performance tests using unlogged tables. The patch seems to make big
INSERT queries run as fast as or faster than before when inserting
into unlogged tables, which is a very good start.
* Since we can now split a posting list in two, we may also have to
reconsider BTMaxItemSize, or some similar mechanism that worries about
extreme cases where it becomes impossible to split because even two
pages are not enough to fit everything. Think of what happens when
there is a tuple with a single large datum, that gets split in two
(the tuple is split, not the page), with each half receiving its own
copy of the datum. I haven't proven to myself that this is broken, but
that may just be because I haven't spent any time on it. OTOH, maybe
you already have it right, in which case it seems like it should be
explained somewhere. Possibly in nbtree.h. This is tricky stuff.
* I agree with all of your existing TODO items -- most of them seem
very important to me.
* Do we really need to keep BTreeTupleGetHeapTID(), now that we have
BTreeTupleGetMinTID()? Can't we combine the two macros into one, so
that callers don't need to think about the pivot vs posting list thing
themselves? See the new code added to _bt_mkscankey() by v3, for
example. It now handles both cases/macros at once, in order to keep
its amcheck caller happy. amcheck's verify_nbtree.c received similar
ugly code in v3.
* We should at least experiment with applying compression when
inserting into unique indexes. Like Alexander, I think that
compression in unique indexes might work well, given how they must
work in Postgres.
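
The BTMaxItemSize concern in the list above (a posting tuple with one large datum being split in two, with each half keeping its own copy of the datum) is easy to see with some arithmetic. The sketch below is only a simplified model, not the patch's on-disk format: it assumes 8-byte MAXALIGN, 6-byte heap TIDs, and a tuple size of MAXALIGN(key part + TIDs), where "key part" means the already-aligned header plus key data. All names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: 8-byte maximum alignment, as on common 64-bit platforms */
#define DEMO_MAXALIGN(len)	(((size_t) (len) + 7) & ~((size_t) 7))
/* sizeof(ItemPointerData) is 6 bytes */
#define DEMO_TID_SIZE		6

/* Simplified model: aligned key part followed by ntids heap TIDs */
static size_t
demo_posting_size(size_t keypart, size_t ntids)
{
	assert(keypart == DEMO_MAXALIGN(keypart));
	return DEMO_MAXALIGN(keypart + ntids * DEMO_TID_SIZE);
}

/*
 * Extra space consumed when one posting tuple with ntids TIDs is split into
 * two posting tuples holding nleft and (ntids - nleft) TIDs.  Each half
 * carries its own copy of the key part, so the overhead is roughly keypart.
 */
static size_t
demo_split_overhead(size_t keypart, size_t ntids, size_t nleft)
{
	assert(nleft > 0 && nleft < ntids);
	return demo_posting_size(keypart, nleft) +
		demo_posting_size(keypart, ntids - nleft) -
		demo_posting_size(keypart, ntids);
}
```

With a 2000-byte key part and 100 TIDs, the single tuple takes 2600 bytes under this model, while the two halves of an even split take 2304 bytes each, 2008 bytes more in total. That is the kind of growth a limit like BTMaxItemSize would have to account for.
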
My next steps will be to study the design of the
_bt_insertonpg_in_posting() stuff some more. It seems like you already
have the right general idea there, but I would like to come up with a
way of making _bt_insertonpg_in_posting() understand how to work with
space on the page with total certainty, much like nbtsplitloc.c does
today. This should allow us to make WAL-logging more
precise/incremental.
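
As a rough illustration of why growing a posting list by one item is awkward for free space accounting: a heap TID is 6 bytes, but page space is consumed in MAXALIGN() quanta, so adding one TID sometimes costs a full quantum and sometimes costs nothing. The sketch below uses the same simplified model as nothing in the patch itself: 8-byte MAXALIGN and a tuple size of MAXALIGN(key part + TIDs) are assumptions, and every name is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumption: 8-byte maximum alignment, as on common 64-bit platforms */
#define DEMO_MAXALIGN(len)	(((size_t) (len) + 7) & ~((size_t) 7))
/* sizeof(ItemPointerData) is 6 bytes */
#define DEMO_TID_SIZE		6

/* Simplified model: aligned key part followed by ntids heap TIDs */
static size_t
demo_posting_size(size_t keypart, size_t ntids)
{
	assert(keypart == DEMO_MAXALIGN(keypart));
	return DEMO_MAXALIGN(keypart + ntids * DEMO_TID_SIZE);
}

/* Bytes of page space actually consumed by adding one more TID */
static size_t
demo_growth_for_one_tid(size_t keypart, size_t ntids)
{
	return demo_posting_size(keypart, ntids + 1) -
		demo_posting_size(keypart, ntids);
}

/* Can the posting tuple grow by one TID given freespace bytes on the page? */
static bool
demo_can_grow_in_place(size_t freespace, size_t keypart, size_t ntids)
{
	return demo_growth_for_one_tid(keypart, ntids) <= freespace;
}
```

Under this model, a tuple with a 16-byte key part grows from 24 to 32 bytes when going from one TID to two, but stays at 40 bytes when going from three TIDs to four. A precise space calculation would need to know which case applies before touching the page, instead of trying PageAddItem() and giving up when it fails.
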
--
Peter Geoghegan
Attachments:
v3-0002-DEBUG-Add-pageinspect-instrumentation.patch (application/x-patch)
From bfa3121169f98d9bc8b8cce71502b98814c90f1f Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Mon, 10 Sep 2018 19:53:51 -0700
Subject: [PATCH v3 2/2] DEBUG: Add pageinspect instrumentation.
Have pageinspect display user-visible attribute values.
This patch is not proposed for inclusion in PostgreSQL; it's included
for the convenience of reviewers.
The following query can be used with this hacked pageinspect, which
visualizes the internal pages:
"""
with recursive index_details as (
select
'my_test_index'::text idx
),
size_in_pages_index as (
select
(pg_relation_size(idx::regclass) / (2^13))::int4 size_pages
from
index_details
),
page_stats as (
select
index_details.*,
stats.*
from
index_details,
size_in_pages_index,
lateral (select i from generate_series(1, size_pages - 1) i) series,
lateral (select * from bt_page_stats(idx, i)) stats),
internal_page_stats as (
select
*
from
page_stats
where
type != 'l'),
meta_stats as (
select
*
from
index_details s,
lateral (select * from bt_metap(s.idx)) meta),
internal_items as (
select
*
from
internal_page_stats
order by
btpo desc),
-- XXX: Note ordering dependency within this CTE, on internal_items
ordered_internal_items(item, blk, level) as (
select
1,
blkno,
btpo
from
internal_items
where
btpo_prev = 0
and btpo = (select level from meta_stats)
union
select
case when level = btpo then o.item + 1 else 1 end,
blkno,
btpo
from
internal_items i,
ordered_internal_items o
where
i.btpo_prev = o.blk or (btpo_prev = 0 and btpo = o.level - 1)
)
select
--idx,
btpo as level,
item as l_item,
blkno,
--btpo_prev,
--btpo_next,
btpo_flags,
type,
live_items,
dead_items,
avg_item_size,
page_size,
free_size,
-- Only non-rightmost pages have high key. Show heap TID for both pivot and non-pivot tuples here.
case when btpo_next != 0 then (select data || coalesce(', (htid)=(''' || htid || ''')', '')
from bt_page_items(idx, blkno) where itemoffset = 1) end as highkey
from
ordered_internal_items o
join internal_items i on o.blk = i.blkno
order by btpo desc, item;
"""
---
contrib/pageinspect/btreefuncs.c | 65 +++++++++++++++----
contrib/pageinspect/expected/btree.out | 3 +-
contrib/pageinspect/pageinspect--1.6--1.7.sql | 22 +++++++
3 files changed, 76 insertions(+), 14 deletions(-)
diff --git a/contrib/pageinspect/btreefuncs.c b/contrib/pageinspect/btreefuncs.c
index 8d27c9b0f6..30e2865076 100644
--- a/contrib/pageinspect/btreefuncs.c
+++ b/contrib/pageinspect/btreefuncs.c
@@ -29,6 +29,7 @@
#include "pageinspect.h"
+#include "access/genam.h"
#include "access/nbtree.h"
#include "access/relation.h"
#include "catalog/namespace.h"
@@ -243,6 +244,7 @@ bt_page_stats(PG_FUNCTION_ARGS)
*/
struct user_args
{
+ Relation rel;
Page page;
OffsetNumber offset;
};
@@ -254,9 +256,9 @@ struct user_args
* ------------------------------------------------------
*/
static Datum
-bt_page_print_tuples(FuncCallContext *fctx, Page page, OffsetNumber offset)
+bt_page_print_tuples(FuncCallContext *fctx, Page page, OffsetNumber offset, Relation rel)
{
- char *values[6];
+ char *values[7];
HeapTuple tuple;
ItemId id;
IndexTuple itup;
@@ -265,6 +267,7 @@ bt_page_print_tuples(FuncCallContext *fctx, Page page, OffsetNumber offset)
int dlen;
char *dump;
char *ptr;
+ ItemPointer htid;
id = PageGetItemId(page, offset);
@@ -283,16 +286,51 @@ bt_page_print_tuples(FuncCallContext *fctx, Page page, OffsetNumber offset)
values[j++] = psprintf("%c", IndexTupleHasVarwidths(itup) ? 't' : 'f');
ptr = (char *) itup + IndexInfoFindDataOffset(itup->t_info);
- dlen = IndexTupleSize(itup) - IndexInfoFindDataOffset(itup->t_info);
- dump = palloc0(dlen * 3 + 1);
- values[j] = dump;
- for (off = 0; off < dlen; off++)
+ if (rel)
{
- if (off > 0)
- *dump++ = ' ';
- sprintf(dump, "%02x", *(ptr + off) & 0xff);
- dump += 2;
+ TupleDesc itupdesc = RelationGetDescr(rel);
+ Datum datvalues[INDEX_MAX_KEYS];
+ bool isnull[INDEX_MAX_KEYS];
+ int natts;
+ int indnkeyatts = rel->rd_index->indnkeyatts;
+
+ natts = BTreeTupleGetNAtts(itup, rel);
+
+ itupdesc->natts = Min(indnkeyatts, natts);
+ memset(&isnull, 0xFF, sizeof(isnull));
+ index_deform_tuple(itup, itupdesc, datvalues, isnull);
+ rel->rd_index->indnkeyatts = natts;
+ values[j++] = BuildIndexValueDescription(rel, datvalues, isnull);
+ itupdesc->natts = IndexRelationGetNumberOfAttributes(rel);
+ rel->rd_index->indnkeyatts = indnkeyatts;
}
+ else
+ {
+ dlen = IndexTupleSize(itup) - IndexInfoFindDataOffset(itup->t_info);
+ dump = palloc0(dlen * 3 + 1);
+ values[j++] = dump;
+ for (off = 0; off < dlen; off++)
+ {
+ if (off > 0)
+ *dump++ = ' ';
+ sprintf(dump, "%02x", *(ptr + off) & 0xff);
+ dump += 2;
+ }
+ }
+
+ if (!rel || !_bt_heapkeyspace(rel))
+ htid = NULL;
+ else if (!BTreeTupleIsPivot(itup))
+ htid = BTreeTupleGetMinTID(itup);
+ else
+ htid = BTreeTupleGetHeapTID(itup);
+
+ if (htid)
+ values[j] = psprintf("(%u,%u)",
+ ItemPointerGetBlockNumberNoCheck(htid),
+ ItemPointerGetOffsetNumberNoCheck(htid));
+ else
+ values[j] = NULL;
tuple = BuildTupleFromCStrings(fctx->attinmeta, values);
@@ -366,11 +404,11 @@ bt_page_items(PG_FUNCTION_ARGS)
uargs = palloc(sizeof(struct user_args));
+ uargs->rel = rel;
uargs->page = palloc(BLCKSZ);
memcpy(uargs->page, BufferGetPage(buffer), BLCKSZ);
UnlockReleaseBuffer(buffer);
- relation_close(rel, AccessShareLock);
uargs->offset = FirstOffsetNumber;
@@ -397,12 +435,13 @@ bt_page_items(PG_FUNCTION_ARGS)
if (fctx->call_cntr < fctx->max_calls)
{
- result = bt_page_print_tuples(fctx, uargs->page, uargs->offset);
+ result = bt_page_print_tuples(fctx, uargs->page, uargs->offset, uargs->rel);
uargs->offset++;
SRF_RETURN_NEXT(fctx, result);
}
else
{
+ relation_close(uargs->rel, AccessShareLock);
pfree(uargs->page);
pfree(uargs);
SRF_RETURN_DONE(fctx);
@@ -482,7 +521,7 @@ bt_page_items_bytea(PG_FUNCTION_ARGS)
if (fctx->call_cntr < fctx->max_calls)
{
- result = bt_page_print_tuples(fctx, uargs->page, uargs->offset);
+ result = bt_page_print_tuples(fctx, uargs->page, uargs->offset, NULL);
uargs->offset++;
SRF_RETURN_NEXT(fctx, result);
}
diff --git a/contrib/pageinspect/expected/btree.out b/contrib/pageinspect/expected/btree.out
index 07c2dcd771..067e73f21a 100644
--- a/contrib/pageinspect/expected/btree.out
+++ b/contrib/pageinspect/expected/btree.out
@@ -40,7 +40,8 @@ ctid | (0,1)
itemlen | 16
nulls | f
vars | f
-data | 01 00 00 00 00 00 00 01
+data | (a)=(72057594037927937)
+htid | (0,1)
SELECT * FROM bt_page_items('test1_a_idx', 2);
ERROR: block number out of range
diff --git a/contrib/pageinspect/pageinspect--1.6--1.7.sql b/contrib/pageinspect/pageinspect--1.6--1.7.sql
index 2433a21af2..9acbad1589 100644
--- a/contrib/pageinspect/pageinspect--1.6--1.7.sql
+++ b/contrib/pageinspect/pageinspect--1.6--1.7.sql
@@ -24,3 +24,25 @@ CREATE FUNCTION bt_metap(IN relname text,
OUT last_cleanup_num_tuples real)
AS 'MODULE_PATHNAME', 'bt_metap'
LANGUAGE C STRICT PARALLEL SAFE;
+
+--
+-- bt_page_items()
+--
+DROP FUNCTION bt_page_items(IN relname text, IN blkno int4,
+ OUT itemoffset smallint,
+ OUT ctid tid,
+ OUT itemlen smallint,
+ OUT nulls bool,
+ OUT vars bool,
+ OUT data text);
+CREATE FUNCTION bt_page_items(IN relname text, IN blkno int4,
+ OUT itemoffset smallint,
+ OUT ctid tid,
+ OUT itemlen smallint,
+ OUT nulls bool,
+ OUT vars bool,
+ OUT data text,
+ OUT htid tid)
+RETURNS SETOF record
+AS 'MODULE_PATHNAME', 'bt_page_items'
+LANGUAGE C STRICT PARALLEL SAFE;
--
2.17.1
v3-0001-Compression-deduplication-in-nbtree.patch (application/x-patch)
From 1f5d732152bfbee6008249a9619d9e80f868e7f8 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Fri, 19 Jul 2019 18:57:31 -0700
Subject: [PATCH v3 1/2] Compression/deduplication in nbtree.
Version with some revisions by me.
---
contrib/amcheck/verify_nbtree.c | 140 ++++++--
src/backend/access/nbtree/nbtinsert.c | 455 +++++++++++++++++++++++-
src/backend/access/nbtree/nbtpage.c | 53 +++
src/backend/access/nbtree/nbtree.c | 142 ++++++--
src/backend/access/nbtree/nbtsearch.c | 283 ++++++++++++---
src/backend/access/nbtree/nbtsort.c | 197 +++++++++-
src/backend/access/nbtree/nbtsplitloc.c | 7 +
src/backend/access/nbtree/nbtutils.c | 173 ++++++++-
src/backend/access/nbtree/nbtxlog.c | 35 +-
src/backend/access/rmgrdesc/nbtdesc.c | 6 +-
src/include/access/itup.h | 5 +
src/include/access/nbtree.h | 215 ++++++++++-
src/include/access/nbtxlog.h | 13 +-
13 files changed, 1571 insertions(+), 153 deletions(-)
diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index 55a3a4bbe0..19239410ff 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -889,6 +889,7 @@ bt_target_page_check(BtreeCheckState *state)
size_t tupsize;
BTScanInsert skey;
bool lowersizelimit;
+ ItemPointer scantid;
CHECK_FOR_INTERRUPTS();
@@ -959,29 +960,79 @@ bt_target_page_check(BtreeCheckState *state)
/*
* Readonly callers may optionally verify that non-pivot tuples can
- * each be found by an independent search that starts from the root
+ * each be found by an independent search that starts from the root.
+ * Note that we deliberately don't do individual searches for each
+ * "logical" posting list tuple, since the posting list itself is
+ * validated by other checks.
*/
if (state->rootdescend && P_ISLEAF(topaque) &&
!bt_rootdescend(state, itup))
{
char *itid,
*htid;
+ ItemPointer tid = BTreeTupleGetMinTID(itup);
itid = psprintf("(%u,%u)", state->targetblock, offset);
htid = psprintf("(%u,%u)",
- ItemPointerGetBlockNumber(&(itup->t_tid)),
- ItemPointerGetOffsetNumber(&(itup->t_tid)));
+ ItemPointerGetBlockNumber(tid),
+ ItemPointerGetOffsetNumber(tid));
ereport(ERROR,
(errcode(ERRCODE_INDEX_CORRUPTED),
errmsg("could not find tuple using search from root page in index \"%s\"",
RelationGetRelationName(state->rel)),
- errdetail_internal("Index tid=%s points to heap tid=%s page lsn=%X/%X.",
+ errdetail_internal("Index tid=%s min heap tid=%s page lsn=%X/%X.",
itid, htid,
(uint32) (state->targetlsn >> 32),
(uint32) state->targetlsn)));
}
+ /*
+ * If tuple is actually a posting list, make sure posting list TIDs
+ * are in order.
+ *
+ * FIXME: The calls to BTreeGetNthTupleOfPosting() allocate memory,
+ * and are probably relatively expensive. We should at least try to
+ * make this happen at the same point that optional heapallindexed
+ * verification needs to loop through each posting list.
+ */
+ if (BTreeTupleIsPosting(itup))
+ {
+ IndexTuple onetup;
+ ItemPointerData last;
+
+ ItemPointerCopy(BTreeTupleGetMinTID(itup), &last);
+
+ for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
+ {
+ onetup = BTreeGetNthTupleOfPosting(itup, i);
+
+ if (ItemPointerCompare(&onetup->t_tid, &last) <= 0)
+ {
+ char *itid,
+ *htid;
+
+ itid = psprintf("(%u,%u)", state->targetblock, offset);
+ htid = psprintf("(%u,%u)",
+ ItemPointerGetBlockNumberNoCheck(&(onetup->t_tid)),
+ ItemPointerGetOffsetNumberNoCheck(&(onetup->t_tid)));
+
+ ereport(ERROR,
+ (errcode(ERRCODE_INDEX_CORRUPTED),
+ errmsg("posting list heap TIDs out of order in index \"%s\"",
+ RelationGetRelationName(state->rel)),
+ errdetail_internal("Index tid=%s min heap tid=%s page lsn=%X/%X.",
+ itid, htid,
+ (uint32) (state->targetlsn >> 32),
+ (uint32) state->targetlsn)));
+ }
+
+ ItemPointerCopy(&onetup->t_tid, &last);
+ /* Be tidy */
+ pfree(onetup);
+ }
+ }
+
/* Build insertion scankey for current page offset */
skey = bt_mkscankey_pivotsearch(state->rel, itup);
@@ -1039,12 +1090,33 @@ bt_target_page_check(BtreeCheckState *state)
{
IndexTuple norm;
- norm = bt_normalize_tuple(state, itup);
- bloom_add_element(state->filter, (unsigned char *) norm,
- IndexTupleSize(norm));
- /* Be tidy */
- if (norm != itup)
- pfree(norm);
+ if (BTreeTupleIsPosting(itup))
+ {
+ IndexTuple onetup;
+
+ /* Fingerprint all elements of posting tuple one by one */
+ for (int i = 0; i < BTreeTupleGetNPosting(itup); i++)
+ {
+ onetup = BTreeGetNthTupleOfPosting(itup, i);
+
+ norm = bt_normalize_tuple(state, onetup);
+ bloom_add_element(state->filter, (unsigned char *) norm,
+ IndexTupleSize(norm));
+ /* Be tidy */
+ if (norm != onetup)
+ pfree(norm);
+ pfree(onetup);
+ }
+ }
+ else
+ {
+ norm = bt_normalize_tuple(state, itup);
+ bloom_add_element(state->filter, (unsigned char *) norm,
+ IndexTupleSize(norm));
+ /* Be tidy */
+ if (norm != itup)
+ pfree(norm);
+ }
}
/*
@@ -1052,7 +1124,8 @@ bt_target_page_check(BtreeCheckState *state)
*
* If there is a high key (if this is not the rightmost page on its
* entire level), check that high key actually is upper bound on all
- * page items.
+ * page items. If this is a posting list tuple, we'll need to set
+ * scantid to be highest TID in posting list.
*
* We prefer to check all items against high key rather than checking
* just the last and trusting that the operator class obeys the
@@ -1092,6 +1165,9 @@ bt_target_page_check(BtreeCheckState *state)
* tuple. (See also: "Notes About Data Representation" in the nbtree
* README.)
*/
+ scantid = skey->scantid;
+ if (!BTreeTupleIsPivot(itup))
+ skey->scantid = BTreeTupleGetMaxTID(itup);
if (!P_RIGHTMOST(topaque) &&
!(P_ISLEAF(topaque) ? invariant_leq_offset(state, skey, P_HIKEY) :
invariant_l_offset(state, skey, P_HIKEY)))
@@ -1115,6 +1191,7 @@ bt_target_page_check(BtreeCheckState *state)
(uint32) (state->targetlsn >> 32),
(uint32) state->targetlsn)));
}
+ skey->scantid = scantid;
/*
* * Item order check *
@@ -1129,11 +1206,16 @@ bt_target_page_check(BtreeCheckState *state)
*htid,
*nitid,
*nhtid;
+ ItemPointer tid;
itid = psprintf("(%u,%u)", state->targetblock, offset);
+ if (!BTreeTupleIsPivot(itup))
+ tid = BTreeTupleGetMinTID(itup);
+ else
+ tid = BTreeTupleGetHeapTID(itup);
htid = psprintf("(%u,%u)",
- ItemPointerGetBlockNumberNoCheck(&(itup->t_tid)),
- ItemPointerGetOffsetNumberNoCheck(&(itup->t_tid)));
+ ItemPointerGetBlockNumberNoCheck(tid),
+ ItemPointerGetOffsetNumberNoCheck(tid));
nitid = psprintf("(%u,%u)", state->targetblock,
OffsetNumberNext(offset));
@@ -1142,9 +1224,14 @@ bt_target_page_check(BtreeCheckState *state)
state->target,
OffsetNumberNext(offset));
itup = (IndexTuple) PageGetItem(state->target, itemid);
+
+ if (!BTreeTupleIsPivot(itup))
+ tid = BTreeTupleGetMinTID(itup);
+ else
+ tid = BTreeTupleGetHeapTID(itup);
nhtid = psprintf("(%u,%u)",
- ItemPointerGetBlockNumberNoCheck(&(itup->t_tid)),
- ItemPointerGetOffsetNumberNoCheck(&(itup->t_tid)));
+ ItemPointerGetBlockNumberNoCheck(tid),
+ ItemPointerGetOffsetNumberNoCheck(tid));
ereport(ERROR,
(errcode(ERRCODE_INDEX_CORRUPTED),
@@ -1154,10 +1241,10 @@ bt_target_page_check(BtreeCheckState *state)
"higher index tid=%s (points to %s tid=%s) "
"page lsn=%X/%X.",
itid,
- P_ISLEAF(topaque) ? "heap" : "index",
+ P_ISLEAF(topaque) ? "min heap" : "index",
htid,
nitid,
- P_ISLEAF(topaque) ? "heap" : "index",
+ P_ISLEAF(topaque) ? "min heap" : "index",
nhtid,
(uint32) (state->targetlsn >> 32),
(uint32) state->targetlsn)));
@@ -1918,10 +2005,11 @@ bt_tuple_present_callback(Relation index, HeapTuple htup, Datum *values,
* verification. In particular, it won't try to normalize opclass-equal
* datums with potentially distinct representations (e.g., btree/numeric_ops
* index datums will not get their display scale normalized-away here).
- * Normalization may need to be expanded to handle more cases in the future,
- * though. For example, it's possible that non-pivot tuples could in the
- * future have alternative logically equivalent representations due to using
- * the INDEX_ALT_TID_MASK bit to implement intelligent deduplication.
+ * Caller does normalization for non-pivot tuples that have their own posting
+ * list, since dummy CREATE INDEX callback code generates new tuples with the
+ * same normalized representation. Compression is performed
+ * opportunistically, and in general there is no guarantee about how or when
+ * compression will be applied.
*/
static IndexTuple
bt_normalize_tuple(BtreeCheckState *state, IndexTuple itup)
@@ -2525,14 +2613,20 @@ static inline ItemPointer
BTreeTupleGetHeapTIDCareful(BtreeCheckState *state, IndexTuple itup,
bool nonpivot)
{
- ItemPointer result = BTreeTupleGetHeapTID(itup);
+ ItemPointer result;
BlockNumber targetblock = state->targetblock;
- if (result == NULL && nonpivot)
+ if (BTreeTupleIsPivot(itup) == nonpivot)
ereport(ERROR,
(errcode(ERRCODE_INDEX_CORRUPTED),
errmsg("block %u or its right sibling block or child block in index \"%s\" contains non-pivot tuple that lacks a heap TID",
targetblock, RelationGetRelationName(state->rel))));
+ /* XXX: Again, I wonder if we need both of these macros... */
+ if (!BTreeTupleIsPivot(itup))
+ result = BTreeTupleGetMinTID(itup);
+ else
+ result = BTreeTupleGetHeapTID(itup);
+
return result;
}
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index 602f8849d4..b6407b80b6 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -41,6 +41,17 @@ static OffsetNumber _bt_findinsertloc(Relation rel,
BTStack stack,
Relation heapRel);
static void _bt_stepright(Relation rel, BTInsertState insertstate, BTStack stack);
+static void _bt_delete_and_insert(Relation rel,
+ Buffer buf,
+ IndexTuple newitup,
+ OffsetNumber newitemoff);
+static void _bt_insertonpg_in_posting(Relation rel, BTScanInsert itup_key,
+ Buffer buf,
+ Buffer cbuf,
+ BTStack stack,
+ IndexTuple itup,
+ OffsetNumber newitemoff,
+ bool split_only_page, int in_posting_offset);
static void _bt_insertonpg(Relation rel, BTScanInsert itup_key,
Buffer buf,
Buffer cbuf,
@@ -56,6 +67,8 @@ static void _bt_insert_parent(Relation rel, Buffer buf, Buffer rbuf,
static bool _bt_pgaddtup(Page page, Size itemsize, IndexTuple itup,
OffsetNumber itup_off);
static void _bt_vacuum_one_page(Relation rel, Buffer buffer, Relation heapRel);
+static bool insert_itupprev_to_page(Page page, BTCompressState *compressState);
+static void _bt_compress_one_page(Relation rel, Buffer buffer, Relation heapRel);
/*
* _bt_doinsert() -- Handle insertion of a single index tuple in the tree.
@@ -297,10 +310,17 @@ top:
* search bounds established within _bt_check_unique when insertion is
* checkingunique.
*/
+ insertstate.in_posting_offset = 0;
newitemoff = _bt_findinsertloc(rel, &insertstate, checkingunique,
stack, heapRel);
- _bt_insertonpg(rel, itup_key, insertstate.buf, InvalidBuffer, stack,
- itup, newitemoff, false);
+
+ if (insertstate.in_posting_offset)
+ _bt_insertonpg_in_posting(rel, itup_key, insertstate.buf,
+ InvalidBuffer, stack, itup, newitemoff,
+ false, insertstate.in_posting_offset);
+ else
+ _bt_insertonpg(rel, itup_key, insertstate.buf, InvalidBuffer,
+ stack, itup, newitemoff, false);
}
else
{
@@ -759,6 +779,12 @@ _bt_findinsertloc(Relation rel,
_bt_vacuum_one_page(rel, insertstate->buf, heapRel);
insertstate->bounds_valid = false;
}
+
+ /*
+ * If the target page is full, try to compress the page
+ */
+ if (PageGetFreeSpace(page) < insertstate->itemsz)
+ _bt_compress_one_page(rel, insertstate->buf, heapRel);
}
else
{
@@ -900,6 +926,191 @@ _bt_stepright(Relation rel, BTInsertState insertstate, BTStack stack)
insertstate->bounds_valid = false;
}
+/*
+ * Delete tuple on newitemoff offset and insert newitup at the same offset.
+ * All checks of free space must have been done before calling this function.
+ *
+ * For use in posting tuple's update.
+ */
+static void
+_bt_delete_and_insert(Relation rel,
+ Buffer buf,
+ IndexTuple newitup,
+ OffsetNumber newitemoff)
+{
+ Page page = BufferGetPage(buf);
+ Size newitupsz = IndexTupleSize(newitup);
+
+ newitupsz = MAXALIGN(newitupsz);
+
+ START_CRIT_SECTION();
+
+ PageIndexTupleDelete(page, newitemoff);
+
+ if (!_bt_pgaddtup(page, newitupsz, newitup, newitemoff))
+ elog(ERROR, "failed to insert compressed item in index \"%s\"",
+ RelationGetRelationName(rel));
+
+ MarkBufferDirty(buf);
+
+ /* Xlog stuff */
+ if (RelationNeedsWAL(rel))
+ {
+ xl_btree_insert xlrec;
+ XLogRecPtr recptr;
+
+ xlrec.offnum = newitemoff;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, SizeOfBtreeInsert);
+
+ Assert(P_ISLEAF((BTPageOpaque) PageGetSpecialPointer(page)));
+
+ /*
+ * Force full page write to keep code simple
+ *
+ * TODO: think of using XLOG_BTREE_INSERT_LEAF with a new tuple's data
+ */
+ XLogRegisterBuffer(0, buf, REGBUF_STANDARD | REGBUF_FORCE_IMAGE);
+ recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_INSERT_LEAF);
+ PageSetLSN(page, recptr);
+ }
+
+ END_CRIT_SECTION();
+}
+
+/*
+ * _bt_insertonpg_in_posting() --
+ * Insert a tuple on a particular page in the index
+ * (compression aware version).
+ *
+ * If new tuple's key is equal to the key of a posting tuple that already
+ * exists on the page and its TID falls inside the min/max range of
+ * existing posting list, update the posting tuple.
+ *
+ * This can only happen on a leaf page.
+ *
+ * newitemoff - offset of the posting tuple we must update
+ * in_posting_offset - position of the new tuple's TID in posting list
+ *
+ * If necessary, split the page.
+ */
+static void
+_bt_insertonpg_in_posting(Relation rel,
+ BTScanInsert itup_key,
+ Buffer buf,
+ Buffer cbuf,
+ BTStack stack,
+ IndexTuple itup,
+ OffsetNumber newitemoff,
+ bool split_only_page,
+ int in_posting_offset)
+{
+ IndexTuple origtup;
+ IndexTuple lefttup;
+ IndexTuple righttup;
+ ItemPointerData *ipd;
+ IndexTuple newitup;
+ Page page;
+ int nipd,
+ nipd_right;
+
+ page = BufferGetPage(buf);
+ /* get old posting tuple */
+ origtup = (IndexTuple) PageGetItem(page, PageGetItemId(page, newitemoff));
+ Assert(BTreeTupleIsPosting(origtup));
+ nipd = BTreeTupleGetNPosting(origtup);
+ Assert(in_posting_offset < nipd);
+ Assert(itup_key->scantid != NULL);
+ Assert(itup_key->heapkeyspace);
+
+ elog(DEBUG4, "(%u,%u) is min, (%u,%u) is max, (%u,%u) is new",
+ ItemPointerGetBlockNumberNoCheck(BTreeTupleGetMinTID(origtup)),
+ ItemPointerGetOffsetNumberNoCheck(BTreeTupleGetMinTID(origtup)),
+ ItemPointerGetBlockNumberNoCheck(BTreeTupleGetMaxTID(origtup)),
+ ItemPointerGetOffsetNumberNoCheck(BTreeTupleGetMaxTID(origtup)),
+ ItemPointerGetBlockNumberNoCheck(BTreeTupleGetMaxTID(itup)),
+ ItemPointerGetOffsetNumberNoCheck(BTreeTupleGetMaxTID(itup)));
+
+ /*
+ * First, check whether the new item pointer fits into the tuple's
+ * posting list.
+ *
+ * Also check whether the new item pointer fits on the page.
+ *
+ * If either check fails, the posting tuple must be split.
+ *
+ * XXX: Think some more about alignment - pg
+ */
+ if (BTMaxItemSize(page) < MAXALIGN(IndexTupleSize(origtup)) + MAXALIGN(sizeof(ItemPointerData)) ||
+ PageGetFreeSpace(page) < MAXALIGN(IndexTupleSize(origtup)) + MAXALIGN(sizeof(ItemPointerData)))
+ {
+ /*
+ * Split posting tuple into two halves.
+ *
+ * Left tuple contains all item pointers less than the new one and
+ * right tuple contains new item pointer and all to the right.
+ *
+ * TODO: Probably we can come up with a more clever algorithm.
+ */
+ lefttup = BTreeFormPostingTuple(origtup, BTreeTupleGetPosting(origtup),
+ in_posting_offset);
+
+ nipd_right = nipd - in_posting_offset + 1;
+ ipd = palloc0(sizeof(ItemPointerData) * nipd_right);
+ /* insert new item pointer */
+ memcpy(ipd, itup, sizeof(ItemPointerData));
+ /* copy item pointers from original tuple that belong on right */
+ memcpy(ipd + 1,
+ BTreeTupleGetPostingN(origtup, in_posting_offset),
+ sizeof(ItemPointerData) * (nipd - in_posting_offset));
+
+ righttup = BTreeFormPostingTuple(origtup, ipd, nipd_right);
+ elog(DEBUG4, "inserting inside posting list with split due to no space orig elements %d new off %d",
+ nipd, in_posting_offset);
+
+ Assert(ItemPointerCompare(BTreeTupleGetMaxTID(lefttup),
+ BTreeTupleGetMinTID(righttup)) < 0);
+
+ /*
+ * Replace the old tuple with the left tuple on the page.
+ *
+ * Then insert the right tuple using the ordinary _bt_insertonpg()
+ * function. If a split is required, _bt_insertonpg() will handle it.
+ */
+ _bt_delete_and_insert(rel, buf, lefttup, newitemoff);
+ _bt_insertonpg(rel, itup_key, buf, InvalidBuffer,
+ stack, righttup, newitemoff + 1, false);
+
+ pfree(ipd);
+ pfree(lefttup);
+ pfree(righttup);
+ }
+ else
+ {
+ ipd = palloc0(sizeof(ItemPointerData) * (nipd + 1));
+ elog(DEBUG4, "inserting inside posting list due to apparent overlap");
+
+ /* copy item pointers from original tuple into ipd */
+ memcpy(ipd, BTreeTupleGetPosting(origtup),
+ sizeof(ItemPointerData) * in_posting_offset);
+ /* add item pointer of the new tuple into ipd */
+ memcpy(ipd + in_posting_offset, itup, sizeof(ItemPointerData));
+ /* copy item pointers from old tuple into ipd */
+ memcpy(ipd + in_posting_offset + 1,
+ BTreeTupleGetPostingN(origtup, in_posting_offset),
+ sizeof(ItemPointerData) * (nipd - in_posting_offset));
+
+ newitup = BTreeFormPostingTuple(itup, ipd, nipd + 1);
+
+ _bt_delete_and_insert(rel, buf, newitup, newitemoff);
+
+ pfree(ipd);
+ pfree(newitup);
+ _bt_relbuf(rel, buf);
+ }
+}
+
/*----------
* _bt_insertonpg() -- Insert a tuple on a particular page in the index.
*
@@ -2286,3 +2497,243 @@ _bt_vacuum_one_page(Relation rel, Buffer buffer, Relation heapRel)
* the page.
*/
}
+
+/*
+ * Add new item (compressed or not) to the page, while compressing it.
+ * If insertion failed, return false.
+ * Caller should consider this as compression failure and
+ * leave page uncompressed.
+ */
+static bool
+insert_itupprev_to_page(Page page, BTCompressState *compressState)
+{
+ IndexTuple to_insert;
+ OffsetNumber offnum = PageGetMaxOffsetNumber(page);
+
+ if (compressState->ntuples == 0)
+ to_insert = compressState->itupprev;
+ else
+ {
+ IndexTuple postingtuple;
+
+ /* form a tuple with a posting list */
+ postingtuple = BTreeFormPostingTuple(compressState->itupprev,
+ compressState->ipd,
+ compressState->ntuples);
+ to_insert = postingtuple;
+ pfree(compressState->ipd);
+ }
+
+ /* Add the new item into the page */
+ offnum = OffsetNumberNext(offnum);
+
+ elog(DEBUG4, "insert_itupprev_to_page. compressState->ntuples %d IndexTupleSize %zu free %zu",
+ compressState->ntuples, IndexTupleSize(to_insert), PageGetFreeSpace(page));
+
+ if (PageAddItem(page, (Item) to_insert, IndexTupleSize(to_insert),
+ offnum, false, false) == InvalidOffsetNumber)
+ {
+ elog(DEBUG4, "insert_itupprev_to_page. failed");
+
+ /*
+ * This may happen if the tuple is bigger than the free space.
+ * Fall back to the uncompressed page case.
+ */
+ if (compressState->ntuples > 0)
+ pfree(to_insert);
+
+ return false;
+ }
+
+ if (compressState->ntuples > 0)
+ pfree(to_insert);
+ compressState->ntuples = 0;
+
+ return true;
+}
+
+/*
+ * Before splitting the page, try to compress items to free some space.
+ * If compression didn't succeed, buffer will contain old state of the page.
+ * This function should be called after lp_dead items
+ * were removed by _bt_vacuum_one_page().
+ */
+static void
+_bt_compress_one_page(Relation rel, Buffer buffer, Relation heapRel)
+{
+ OffsetNumber offnum,
+ minoff,
+ maxoff;
+ Page page = BufferGetPage(buffer);
+ Page newpage;
+ BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+ bool use_compression = false;
+ BTCompressState *compressState = NULL;
+ int n_posting_on_page = 0;
+ int natts = IndexRelationGetNumberOfAttributes(rel);
+
+ /*
+ * Don't use compression for indexes with INCLUDEd columns and unique
+ * indexes.
+ */
+ use_compression = (IndexRelationGetNumberOfKeyAttributes(rel) ==
+ IndexRelationGetNumberOfAttributes(rel) &&
+ !rel->rd_index->indisunique);
+ if (!use_compression)
+ return;
+
+ /* init compress state needed to build posting tuples */
+ compressState = (BTCompressState *) palloc0(sizeof(BTCompressState));
+ compressState->ipd = NULL;
+ compressState->ntuples = 0;
+ compressState->itupprev = NULL;
+ compressState->maxitemsize = BTMaxItemSize(page);
+ compressState->maxpostingsize = 0;
+
+ /*
+ * Scan over all items to see which ones can be compressed
+ */
+ minoff = P_FIRSTDATAKEY(opaque);
+ maxoff = PageGetMaxOffsetNumber(page);
+
+ /*
+ * Heuristic to avoid trying to compress a page that already contains
+ * mostly compressed items
+ */
+ for (offnum = minoff;
+ offnum <= maxoff;
+ offnum = OffsetNumberNext(offnum))
+ {
+ ItemId itemid = PageGetItemId(page, offnum);
+ IndexTuple item = (IndexTuple) PageGetItem(page, itemid);
+
+ if (BTreeTupleIsPosting(item))
+ n_posting_on_page++;
+ }
+
+ /*
+ * If we have only a few uncompressed items on the full page, it isn't
+ * worth compressing them
+ */
+ if (maxoff - n_posting_on_page < BT_COMPRESS_THRESHOLD)
+ return;
+
+ newpage = PageGetTempPageCopySpecial(page);
+ elog(DEBUG4, "_bt_compress_one_page rel: %s,blkno: %u",
+ RelationGetRelationName(rel), BufferGetBlockNumber(buffer));
+
+ /* Copy High Key if any */
+ if (!P_RIGHTMOST(opaque))
+ {
+ ItemId itemid = PageGetItemId(page, P_HIKEY);
+ Size itemsz = ItemIdGetLength(itemid);
+ IndexTuple item = (IndexTuple) PageGetItem(page, itemid);
+
+ if (PageAddItem(newpage, (Item) item, itemsz, P_HIKEY,
+ false, false) == InvalidOffsetNumber)
+ {
+ /*
+ * Should never happen. Anyway, fallback gently to scenario of
+ * incompressible page and just return from function.
+ */
+ elog(DEBUG4, "_bt_compress_one_page. failed to insert highkey to newpage");
+ return;
+ }
+ }
+
+ /*
+ * Iterate over tuples on the page, try to compress them into posting
+ * lists and insert into new page.
+ */
+ for (offnum = minoff;
+ offnum <= maxoff;
+ offnum = OffsetNumberNext(offnum))
+ {
+ ItemId itemId = PageGetItemId(page, offnum);
+ IndexTuple itup = (IndexTuple) PageGetItem(page, itemId);
+
+ /*
+ * We do not expect to meet any DEAD items, since this function is
+ * called right after _bt_vacuum_one_page(). If for some reason we
+ * find a dead item, don't compress it, so that an upcoming
+ * microvacuum or vacuum can clean it up.
+ */
+ if (ItemIdIsDead(itemId))
+ continue;
+
+ if (compressState->itupprev != NULL)
+ {
+ int n_equal_atts =
+ _bt_keep_natts_fast(rel, compressState->itupprev, itup);
+ int itup_ntuples = BTreeTupleIsPosting(itup) ?
+ BTreeTupleGetNPosting(itup) : 1;
+
+ if (n_equal_atts > natts)
+ {
+ /*
+ * When tuples are equal, create or update posting.
+ *
+ * If posting is too big, insert it on page and continue.
+ */
+ if (compressState->maxitemsize >
+ MAXALIGN(((IndexTupleSize(compressState->itupprev)
+ + (compressState->ntuples + itup_ntuples + 1) * sizeof(ItemPointerData)))))
+ {
+ _bt_add_posting_item(compressState, itup);
+ }
+ else if (!insert_itupprev_to_page(newpage, compressState))
+ {
+ elog(DEBUG4, "_bt_compress_one_page. failed to insert posting");
+ return;
+ }
+ }
+ else
+ {
+ /*
+ * Tuples are not equal. Insert itupprev into index. Save
+ * current tuple for the next iteration.
+ */
+ if (!insert_itupprev_to_page(newpage, compressState))
+ {
+ elog(DEBUG4, "_bt_compress_one_page. failed to insert posting");
+ return;
+ }
+ }
+ }
+
+ /*
+ * Copy the tuple into temp variable itupprev to compare it with the
+ * following tuple and maybe unite them into a posting tuple
+ */
+ if (compressState->itupprev)
+ pfree(compressState->itupprev);
+ compressState->itupprev = CopyIndexTuple(itup);
+
+ Assert(IndexTupleSize(compressState->itupprev) <= compressState->maxitemsize);
+ }
+
+ /* Handle the last item. */
+ if (!insert_itupprev_to_page(newpage, compressState))
+ {
+ elog(DEBUG4, "_bt_compress_one_page. failed to insert posting for last item");
+ return;
+ }
+
+ START_CRIT_SECTION();
+
+ PageRestoreTempPage(newpage, page);
+ MarkBufferDirty(buffer);
+
+ /* Log full page write */
+ if (RelationNeedsWAL(rel))
+ {
+ XLogRecPtr recptr;
+
+ recptr = log_newpage_buffer(buffer, true);
+ PageSetLSN(page, recptr);
+ }
+
+ END_CRIT_SECTION();
+
+ elog(DEBUG4, "_bt_compress_one_page. success");
+}
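A note on the page compression pass above: the core idea can be shown without any page machinery. The following is only a minimal sketch in plain C, not the patch's code. The Entry and Posting types and the compress_entries() function are hypothetical, and the size limit is expressed as a maximum TID count (max_tids) instead of the byte limit (maxitemsize) that the patch checks.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for one plain index tuple: a key plus one heap TID. */
typedef struct
{
	int			key;
	unsigned	tid;
} Entry;

/* Hypothetical stand-in for a posting tuple: a run of entries sharing a key. */
typedef struct
{
	int			key;
	size_t		first;		/* index of the first entry in the run */
	size_t		ntids;		/* number of TIDs merged into this posting */
} Posting;

/*
 * Merge consecutive entries with equal keys into postings that hold at most
 * max_tids TIDs each.  "in" must already be sorted by (key, tid), as items
 * on a leaf page are.  "out" must have room for n elements.
 *
 * Returns the number of postings written to "out".
 */
size_t
compress_entries(const Entry *in, size_t n, size_t max_tids, Posting *out)
{
	size_t		nout = 0;

	assert(max_tids >= 1);

	for (size_t i = 0; i < n; i++)
	{
		if (nout > 0 &&
			out[nout - 1].key == in[i].key &&
			out[nout - 1].ntids < max_tids)
		{
			/* Same key and still room: extend the pending posting. */
			out[nout - 1].ntids++;
		}
		else
		{
			/* Different key, or pending posting is full: start a new one. */
			out[nout].key = in[i].key;
			out[nout].first = i;
			out[nout].ntids = 1;
			nout++;
		}
	}

	return nout;
}
```

With six entries whose keys are 1, 1, 1, 2, 3, 3 and max_tids = 2, this produces four postings: two for key 1 (the second one starts because the first is full), one for key 2, and one for key 3.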
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 5962126743..707a5d0fdb 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -983,14 +983,52 @@ _bt_page_recyclable(Page page)
void
_bt_delitems_vacuum(Relation rel, Buffer buf,
OffsetNumber *itemnos, int nitems,
+ OffsetNumber *remainingoffset,
+ IndexTuple *remaining, int nremaining,
BlockNumber lastBlockVacuumed)
{
Page page = BufferGetPage(buf);
BTPageOpaque opaque;
+ Size itemsz;
+ Size remaining_sz = 0;
+ char *remaining_buf = NULL;
+
+ /* XLOG stuff, buffer for remainings */
+ if (nremaining && RelationNeedsWAL(rel))
+ {
+ Size offset = 0;
+
+ for (int i = 0; i < nremaining; i++)
+ remaining_sz += MAXALIGN(IndexTupleSize(remaining[i]));
+
+ remaining_buf = palloc0(remaining_sz);
+ for (int i = 0; i < nremaining; i++)
+ {
+ itemsz = IndexTupleSize(remaining[i]);
+ memcpy(remaining_buf + offset, (char *) remaining[i], itemsz);
+ offset += MAXALIGN(itemsz);
+ }
+ Assert(offset == remaining_sz);
+ }
/* No ereport(ERROR) until changes are logged */
START_CRIT_SECTION();
+ /* Handle posting tuples here */
+ for (int i = 0; i < nremaining; i++)
+ {
+ /* At first, delete the old tuple. */
+ PageIndexTupleDelete(page, remainingoffset[i]);
+
+ itemsz = IndexTupleSize(remaining[i]);
+ itemsz = MAXALIGN(itemsz);
+
+ /* Add tuple with remaining ItemPointers to the page. */
+ if (PageAddItem(page, (Item) remaining[i], itemsz, remainingoffset[i],
+ false, false) == InvalidOffsetNumber)
+ elog(ERROR, "failed to rewrite compressed item in index while doing vacuum");
+ }
+
/* Fix the page */
if (nitems > 0)
PageIndexMultiDelete(page, itemnos, nitems);
@@ -1020,6 +1058,8 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
xl_btree_vacuum xlrec_vacuum;
xlrec_vacuum.lastBlockVacuumed = lastBlockVacuumed;
+ xlrec_vacuum.nremaining = nremaining;
+ xlrec_vacuum.ndeleted = nitems;
XLogBeginInsert();
XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
@@ -1033,6 +1073,19 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
if (nitems > 0)
XLogRegisterBufData(0, (char *) itemnos, nitems * sizeof(OffsetNumber));
+ /*
+ * Here we must save the offsets and the remaining tuples themselves.
+ * It's important to restore them in the correct order: first handle
+ * the remaining tuples, and only after that the other deleted items.
+ */
+ if (nremaining > 0)
+ {
+ Assert(remaining_buf != NULL);
+ XLogRegisterBufData(0, (char *) remainingoffset,
+ nremaining * sizeof(OffsetNumber));
+ XLogRegisterBufData(0, remaining_buf, remaining_sz);
+ }
+
recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_VACUUM);
PageSetLSN(page, recptr);
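A note on the WAL buffer built at the top of _bt_delitems_vacuum() above: the remaining tuples are packed back to back, each starting at an aligned offset, so that replay can walk the buffer using the tuple sizes. Below is a minimal standalone sketch of that packing, under stated assumptions. The Item type, pack_items(), and the ALIGN8 macro are hypothetical; ALIGN8 assumes 8-byte alignment, which is what MAXALIGN gives on common 64-bit platforms, and calloc() stands in for palloc0().

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Assumed 8-byte alignment; a stand-in for MAXALIGN. */
#define ALIGN8(len) (((size_t) (len) + 7) & ~(size_t) 7)

/* Hypothetical variable-length item, standing in for an IndexTuple. */
typedef struct
{
	const void *data;
	size_t		size;
} Item;

/*
 * Pack nitems variable-length items into one zero-filled buffer.  Each item
 * starts at an 8-byte-aligned offset, and the padding bytes stay zero.
 *
 * Returns a malloc'd buffer; the total (aligned) size is stored in *total.
 */
char *
pack_items(const Item *items, int nitems, size_t *total)
{
	size_t		need = 0;
	size_t		off = 0;
	char	   *buf;

	/* First pass: compute the aligned total size. */
	for (int i = 0; i < nitems; i++)
		need += ALIGN8(items[i].size);

	buf = calloc(need > 0 ? need : 1, 1);
	if (buf == NULL)
		abort();

	/* Second pass: copy each item, advancing by its aligned size. */
	for (int i = 0; i < nitems; i++)
	{
		memcpy(buf + off, items[i].data, items[i].size);
		off += ALIGN8(items[i].size);
	}

	assert(off == need);
	*total = need;
	return buf;
}
```

A reader of the buffer can recover item boundaries the same way, by advancing ALIGN8(size) bytes after each item, provided each item records its own size as an index tuple does.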
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 4cfd5289ad..22fb228b81 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -97,6 +97,8 @@ static void btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
BTCycleId cycleid, TransactionId *oldestBtpoXact);
static void btvacuumpage(BTVacState *vstate, BlockNumber blkno,
BlockNumber orig_blkno);
+static ItemPointer btreevacuumPosting(BTVacState *vstate, IndexTuple itup,
+ int *nremaining);
/*
@@ -1069,7 +1071,8 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
RBM_NORMAL, info->strategy);
LockBufferForCleanup(buf);
_bt_checkpage(rel, buf);
- _bt_delitems_vacuum(rel, buf, NULL, 0, vstate.lastBlockVacuumed);
+ _bt_delitems_vacuum(rel, buf, NULL, 0, NULL, NULL, 0,
+ vstate.lastBlockVacuumed);
_bt_relbuf(rel, buf);
}
@@ -1193,6 +1196,9 @@ restart:
OffsetNumber offnum,
minoff,
maxoff;
+ IndexTuple remaining[MaxOffsetNumber];
+ OffsetNumber remainingoffset[MaxOffsetNumber];
+ int nremaining;
/*
* Trade in the initial read lock for a super-exclusive write lock on
@@ -1229,6 +1235,7 @@ restart:
* callback function.
*/
ndeletable = 0;
+ nremaining = 0;
minoff = P_FIRSTDATAKEY(opaque);
maxoff = PageGetMaxOffsetNumber(page);
if (callback)
@@ -1242,31 +1249,78 @@ restart:
itup = (IndexTuple) PageGetItem(page,
PageGetItemId(page, offnum));
- htup = &(itup->t_tid);
- /*
- * During Hot Standby we currently assume that
- * XLOG_BTREE_VACUUM records do not produce conflicts. That is
- * only true as long as the callback function depends only
- * upon whether the index tuple refers to heap tuples removed
- * in the initial heap scan. When vacuum starts it derives a
- * value of OldestXmin. Backends taking later snapshots could
- * have a RecentGlobalXmin with a later xid than the vacuum's
- * OldestXmin, so it is possible that row versions deleted
- * after OldestXmin could be marked as killed by other
- * backends. The callback function *could* look at the index
- * tuple state in isolation and decide to delete the index
- * tuple, though currently it does not. If it ever did, we
- * would need to reconsider whether XLOG_BTREE_VACUUM records
- * should cause conflicts. If they did cause conflicts they
- * would be fairly harsh conflicts, since we haven't yet
- * worked out a way to pass a useful value for
- * latestRemovedXid on the XLOG_BTREE_VACUUM records. This
- * applies to *any* type of index that marks index tuples as
- * killed.
- */
- if (callback(htup, callback_state))
- deletable[ndeletable++] = offnum;
+ if (BTreeTupleIsPosting(itup))
+ {
+ int nnewipd = 0;
+ ItemPointer newipd = NULL;
+
+ newipd = btreevacuumPosting(vstate, itup, &nnewipd);
+
+ if (nnewipd == 0)
+ {
+ /*
+ * All TIDs from posting list must be deleted, we can
+ * delete whole tuple in a regular way.
+ */
+ deletable[ndeletable++] = offnum;
+ }
+ else if (nnewipd == BTreeTupleGetNPosting(itup))
+ {
+ /*
+ * All TIDs from posting tuple must remain. Do
+ * nothing, just cleanup.
+ */
+ pfree(newipd);
+ }
+ else if (nnewipd < BTreeTupleGetNPosting(itup))
+ {
+ /* Some TIDs from posting tuple must remain. */
+ Assert(nnewipd > 0);
+ Assert(newipd != NULL);
+
+ /*
+ * Form new tuple that contains only remaining TIDs.
+ * Remember this tuple and the offset of the old tuple
+ * to update it in place.
+ */
+ remainingoffset[nremaining] = offnum;
+ remaining[nremaining] = BTreeFormPostingTuple(itup, newipd, nnewipd);
+ nremaining++;
+ pfree(newipd);
+
+ Assert(IndexTupleSize(itup) <= BTMaxItemSize(page));
+ }
+ }
+ else
+ {
+ htup = &(itup->t_tid);
+
+ /*
+ * During Hot Standby we currently assume that
+ * XLOG_BTREE_VACUUM records do not produce conflicts.
+ * That is only true as long as the callback function
+ * depends only upon whether the index tuple refers to
+ * heap tuples removed in the initial heap scan. When
+ * vacuum starts it derives a value of OldestXmin.
+ * Backends taking later snapshots could have a
+ * RecentGlobalXmin with a later xid than the vacuum's
+ * OldestXmin, so it is possible that row versions deleted
+ * after OldestXmin could be marked as killed by other
+ * backends. The callback function *could* look at the
+ * index tuple state in isolation and decide to delete the
+ * index tuple, though currently it does not. If it ever
+ * did, we would need to reconsider whether
+ * XLOG_BTREE_VACUUM records should cause conflicts. If
+ * they did cause conflicts they would be fairly harsh
+ * conflicts, since we haven't yet worked out a way to
+ * pass a useful value for latestRemovedXid on the
+ * XLOG_BTREE_VACUUM records. This applies to *any* type
+ * of index that marks index tuples as killed.
+ */
+ if (callback(htup, callback_state))
+ deletable[ndeletable++] = offnum;
+ }
}
}
@@ -1274,7 +1328,7 @@ restart:
* Apply any needed deletes. We issue just one _bt_delitems_vacuum()
* call per page, so as to minimize WAL traffic.
*/
- if (ndeletable > 0)
+ if (ndeletable > 0 || nremaining > 0)
{
/*
* Notice that the issued XLOG_BTREE_VACUUM WAL record includes
@@ -1291,6 +1345,7 @@ restart:
* that.
*/
_bt_delitems_vacuum(rel, buf, deletable, ndeletable,
+ remainingoffset, remaining, nremaining,
vstate->lastBlockVacuumed);
/*
@@ -1375,6 +1430,41 @@ restart:
}
}
+/*
+ * btreevacuumPosting() -- vacuums a posting tuple.
+ *
+ * Returns new palloc'd posting list with remaining items.
+ * Posting list size is returned via nremaining.
+ *
+ * If all items are dead,
+ * nremaining is 0 and resulting posting list is NULL.
+ */
+static ItemPointer
+btreevacuumPosting(BTVacState *vstate, IndexTuple itup, int *nremaining)
+{
+ int remaining = 0;
+ int nitem = BTreeTupleGetNPosting(itup);
+ ItemPointer tmpitems = NULL,
+ items = BTreeTupleGetPosting(itup);
+
+ /*
+ * Check each TID in the posting list; save live TIDs into tmpitems
+ */
+ for (int i = 0; i < nitem; i++)
+ {
+ if (vstate->callback(items + i, vstate->callback_state))
+ continue;
+
+ if (tmpitems == NULL)
+ tmpitems = palloc(sizeof(ItemPointerData) * nitem);
+
+ tmpitems[remaining++] = items[i];
+ }
+
+ *nremaining = remaining;
+ return tmpitems;
+}
+
/*
* btcanreturn() -- Check whether btree indexes support index-only scans.
*
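A note on the vacuum logic for posting tuples in the nbtree.c changes above: btreevacuumPosting() filters the posting list through the callback, and btvacuumpage() then picks one of three actions depending on how many TIDs survive. The sketch below restates that flow in standalone C. It is an illustration only: the Tid type, filter_posting(), classify(), the VacuumAction enum, and the dead_in_block() callback are all hypothetical, and malloc() stands in for palloc().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for a heap TID: (block, offset). */
typedef struct
{
	unsigned	block;
	unsigned short offset;
} Tid;

/* Hypothetical callback type: returns true when the TID is dead. */
typedef bool (*DeadCallback) (const Tid *tid, void *state);

/* The three outcomes the caller distinguishes. */
typedef enum
{
	DELETE_WHOLE_TUPLE,		/* no TIDs remain */
	KEEP_UNCHANGED,			/* all TIDs remain */
	REWRITE_IN_PLACE		/* some TIDs remain */
} VacuumAction;

/*
 * Copy the TIDs for which is_dead() returns false into a new array.
 * The array is allocated lazily, so the result is NULL (and *nremaining
 * is 0) when every TID is dead.
 */
Tid *
filter_posting(const Tid *items, int nitems,
			   DeadCallback is_dead, void *state, int *nremaining)
{
	Tid		   *kept = NULL;
	int			n = 0;

	for (int i = 0; i < nitems; i++)
	{
		if (is_dead(&items[i], state))
			continue;

		if (kept == NULL)
		{
			kept = malloc(sizeof(Tid) * nitems);
			if (kept == NULL)
				abort();
		}
		kept[n++] = items[i];
	}

	*nremaining = n;
	return kept;
}

/* Decide what to do with a posting tuple given the surviving TID count. */
VacuumAction
classify(int nremaining, int nitems)
{
	assert(nremaining >= 0 && nremaining <= nitems);

	if (nremaining == 0)
		return DELETE_WHOLE_TUPLE;
	if (nremaining == nitems)
		return KEEP_UNCHANGED;
	return REWRITE_IN_PLACE;
}

/* Example callback: treats every TID in one given block as dead. */
bool
dead_in_block(const Tid *tid, void *state)
{
	return tid->block == *(unsigned *) state;
}
```

Only the REWRITE_IN_PLACE case needs a new tuple to be formed and logged; the other two cases reuse the existing delete path or leave the page alone.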
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index c655dadb96..3e53675c82 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -30,6 +30,9 @@ static bool _bt_readpage(IndexScanDesc scan, ScanDirection dir,
OffsetNumber offnum);
static void _bt_saveitem(BTScanOpaque so, int itemIndex,
OffsetNumber offnum, IndexTuple itup);
+static void _bt_savePostingitem(BTScanOpaque so, int itemIndex,
+ OffsetNumber offnum, ItemPointer iptr,
+ IndexTuple itup, int i);
static bool _bt_steppage(IndexScanDesc scan, ScanDirection dir);
static bool _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir);
static bool _bt_parallel_readpage(IndexScanDesc scan, BlockNumber blkno,
@@ -504,7 +507,8 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
/* We have low <= mid < high, so mid points at a real slot */
- result = _bt_compare(rel, key, page, mid);
+ result = _bt_compare_posting(rel, key, page, mid,
+ &(insertstate->in_posting_offset));
if (result >= cmpval)
low = mid + 1;
@@ -533,6 +537,55 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
return low;
}
+/*
+ * Compare insertion-type scankey to tuple on a page,
+ * taking into account posting tuples.
+ * If the key of the posting tuple is equal to scankey,
+ * find exact position inside the posting list,
+ * using TID as extra attribute.
+ */
+int32
+_bt_compare_posting(Relation rel,
+ BTScanInsert key,
+ Page page,
+ OffsetNumber offnum,
+ int *in_posting_offset)
+{
+ IndexTuple itup;
+ int result;
+
+ itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
+ result = _bt_compare(rel, key, page, offnum);
+
+ if (BTreeTupleIsPosting(itup) && result == 0)
+ {
+ int low,
+ high,
+ mid,
+ res;
+
+ low = 0;
+ /* "high" is past end of posting list for loop invariant */
+ high = BTreeTupleGetNPosting(itup);
+
+ while (high > low)
+ {
+ mid = low + ((high - low) / 2);
+ res = ItemPointerCompare(key->scantid,
+ BTreeTupleGetPostingN(itup, mid));
+
+ if (res >= 1)
+ low = mid + 1;
+ else
+ high = mid;
+ }
+
+ *in_posting_offset = high;
+ }
+
+ return result;
+}
+
/*----------
* _bt_compare() -- Compare insertion-type scankey to tuple on a page.
*
@@ -665,61 +718,120 @@ _bt_compare(Relation rel,
* Use the heap TID attribute and scantid to try to break the tie. The
* rules are the same as any other key attribute -- only the
* representation differs.
+ *
+ * When itup is a posting tuple, the check becomes more complex. It is
+ * possible that the scankey belongs to the tuple's posting list TID
+ * range.
+ *
+ * _bt_compare() is multipurpose, so it just returns 0 to indicate that the
+ * key matches the tuple at this offset.
+ *
+ * Use special _bt_compare_posting() wrapper function to handle this case
+ * and perform recheck for posting tuple, finding exact position of the
+ * scankey.
*/
- heapTid = BTreeTupleGetHeapTID(itup);
- if (key->scantid == NULL)
+ if (!BTreeTupleIsPosting(itup))
{
+ heapTid = BTreeTupleGetHeapTID(itup);
+ if (key->scantid == NULL)
+ {
+ /*
+ * Most searches have a scankey that is considered greater than a
+ * truncated pivot tuple if and when the scankey has equal values
+ * for attributes up to and including the least significant
+ * untruncated attribute in tuple.
+ *
+ * For example, if an index has the minimum two attributes (single
+ * user key attribute, plus heap TID attribute), and a page's high
+ * key is ('foo', -inf), and scankey is ('foo', <omitted>), the
+ * search will not descend to the page to the left. The search
+ * will descend right instead. The truncated attribute in pivot
+ * tuple means that all non-pivot tuples on the page to the left
+ * are strictly < 'foo', so it isn't necessary to descend left. In
+ * other words, search doesn't have to descend left because it
+ * isn't interested in a match that has a heap TID value of -inf.
+ *
+ * However, some searches (pivotsearch searches) actually require
+ * that we descend left when this happens. -inf is treated as a
+ * possible match for omitted scankey attribute(s). This is
+ * needed by page deletion, which must re-find leaf pages that are
+ * targets for deletion using their high keys.
+ *
+ * Note: the heap TID part of the test ensures that scankey is
+ * being compared to a pivot tuple with one or more truncated key
+ * attributes.
+ *
+ * Note: pg_upgrade'd !heapkeyspace indexes must always descend to
+ * the left here, since they have no heap TID attribute (and
+ * cannot have any -inf key values in any case, since truncation
+ * can only remove non-key attributes). !heapkeyspace searches
+ * must always be prepared to deal with matches on both sides of
+ * the pivot once the leaf level is reached.
+ */
+ if (key->heapkeyspace && !key->pivotsearch &&
+ key->keysz == ntupatts && heapTid == NULL)
+ return 1;
+
+ /* All provided scankey arguments found to be equal */
+ return 0;
+ }
+
/*
- * Most searches have a scankey that is considered greater than a
- * truncated pivot tuple if and when the scankey has equal values for
- * attributes up to and including the least significant untruncated
- * attribute in tuple.
- *
- * For example, if an index has the minimum two attributes (single
- * user key attribute, plus heap TID attribute), and a page's high key
- * is ('foo', -inf), and scankey is ('foo', <omitted>), the search
- * will not descend to the page to the left. The search will descend
- * right instead. The truncated attribute in pivot tuple means that
- * all non-pivot tuples on the page to the left are strictly < 'foo',
- * so it isn't necessary to descend left. In other words, search
- * doesn't have to descend left because it isn't interested in a match
- * that has a heap TID value of -inf.
- *
- * However, some searches (pivotsearch searches) actually require that
- * we descend left when this happens. -inf is treated as a possible
- * match for omitted scankey attribute(s). This is needed by page
- * deletion, which must re-find leaf pages that are targets for
- * deletion using their high keys.
- *
- * Note: the heap TID part of the test ensures that scankey is being
- * compared to a pivot tuple with one or more truncated key
- * attributes.
- *
- * Note: pg_upgrade'd !heapkeyspace indexes must always descend to the
- * left here, since they have no heap TID attribute (and cannot have
- * any -inf key values in any case, since truncation can only remove
- * non-key attributes). !heapkeyspace searches must always be
- * prepared to deal with matches on both sides of the pivot once the
- * leaf level is reached.
+ * Treat truncated heap TID as minus infinity, since scankey has a key
+ * attribute value (scantid) that would otherwise be compared directly
*/
- if (key->heapkeyspace && !key->pivotsearch &&
- key->keysz == ntupatts && heapTid == NULL)
+ Assert(key->keysz == IndexRelationGetNumberOfKeyAttributes(rel));
+ if (heapTid == NULL)
return 1;
- /* All provided scankey arguments found to be equal */
- return 0;
+ Assert(ntupatts >= IndexRelationGetNumberOfKeyAttributes(rel));
+ return ItemPointerCompare(key->scantid, heapTid);
+ }
+ else
+ {
+ heapTid = BTreeTupleGetMinTID(itup);
+ if (key->scantid != NULL && heapTid != NULL)
+ {
+ int cmp = ItemPointerCompare(key->scantid, heapTid);
+
+ if (cmp == -1 || cmp == 0)
+ {
+ elog(DEBUG4, "offnum %d Scankey (%u,%u) is less than or equal to posting tuple (%u,%u)",
+ offnum, ItemPointerGetBlockNumberNoCheck(key->scantid),
+ ItemPointerGetOffsetNumberNoCheck(key->scantid),
+ ItemPointerGetBlockNumberNoCheck(heapTid),
+ ItemPointerGetOffsetNumberNoCheck(heapTid));
+ return cmp;
+ }
+
+ heapTid = BTreeTupleGetMaxTID(itup);
+ cmp = ItemPointerCompare(key->scantid, heapTid);
+ if (cmp == 1)
+ {
+ elog(DEBUG4, "offnum %d Scankey (%u,%u) is greater than posting tuple (%u,%u)",
+ offnum, ItemPointerGetBlockNumberNoCheck(key->scantid),
+ ItemPointerGetOffsetNumberNoCheck(key->scantid),
+ ItemPointerGetBlockNumberNoCheck(heapTid),
+ ItemPointerGetOffsetNumberNoCheck(heapTid));
+ return cmp;
+ }
+
+ /*
+ * If we got here, scantid falls between the posting items of the
+ * tuple
+ */
+ elog(DEBUG4, "offnum %d Scankey (%u,%u) is between posting items (%u,%u) and (%u,%u)",
+ offnum, ItemPointerGetBlockNumberNoCheck(key->scantid),
+ ItemPointerGetOffsetNumberNoCheck(key->scantid),
+ ItemPointerGetBlockNumberNoCheck(BTreeTupleGetMinTID(itup)),
+ ItemPointerGetOffsetNumberNoCheck(BTreeTupleGetMinTID(itup)),
+ ItemPointerGetBlockNumberNoCheck(heapTid),
+ ItemPointerGetOffsetNumberNoCheck(heapTid));
+ return 0;
+ }
}
- /*
- * Treat truncated heap TID as minus infinity, since scankey has a key
- * attribute value (scantid) that would otherwise be compared directly
- */
- Assert(key->keysz == IndexRelationGetNumberOfKeyAttributes(rel));
- if (heapTid == NULL)
- return 1;
-
- Assert(ntupatts >= IndexRelationGetNumberOfKeyAttributes(rel));
- return ItemPointerCompare(key->scantid, heapTid);
+ return 0;
}
/*
@@ -1456,6 +1568,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
/* initialize tuple workspace to empty */
so->currPos.nextTupleOffset = 0;
+ so->currPos.prevTupleOffset = 0;
/*
* Now that the current page has been made consistent, the macro should be
@@ -1490,8 +1603,22 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
if (_bt_checkkeys(scan, itup, indnatts, dir, &continuescan))
{
/* tuple passes all scan key conditions, so remember it */
- _bt_saveitem(so, itemIndex, offnum, itup);
- itemIndex++;
+ if (!BTreeTupleIsPosting(itup))
+ {
+ _bt_saveitem(so, itemIndex, offnum, itup);
+ itemIndex++;
+ }
+ else
+ {
+ /* Return posting list "logical" tuples */
+ for (int i = 0; i < BTreeTupleGetNPosting(itup); i++)
+ {
+ _bt_savePostingitem(so, itemIndex, offnum,
+ BTreeTupleGetPostingN(itup, i),
+ itup, i);
+ itemIndex++;
+ }
+ }
}
/* When !continuescan, there can't be any more matches, so stop */
if (!continuescan)
@@ -1524,7 +1651,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
if (!continuescan)
so->currPos.moreRight = false;
- Assert(itemIndex <= MaxIndexTuplesPerPage);
+ Assert(itemIndex <= MaxPostingIndexTuplesPerPage);
so->currPos.firstItem = 0;
so->currPos.lastItem = itemIndex - 1;
so->currPos.itemIndex = 0;
@@ -1532,7 +1659,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
else
{
/* load items[] in descending order */
- itemIndex = MaxIndexTuplesPerPage;
+ itemIndex = MaxPostingIndexTuplesPerPage;
offnum = Min(offnum, maxoff);
@@ -1574,8 +1701,23 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
if (passes_quals && tuple_alive)
{
/* tuple passes all scan key conditions, so remember it */
- itemIndex--;
- _bt_saveitem(so, itemIndex, offnum, itup);
+ if (!BTreeTupleIsPosting(itup))
+ {
+ itemIndex--;
+ _bt_saveitem(so, itemIndex, offnum, itup);
+ }
+ else
+ {
+ /* Return posting list "logical" tuples */
+ /* XXX: Maybe this loop should be backwards? */
+ for (int i = 0; i < BTreeTupleGetNPosting(itup); i++)
+ {
+ itemIndex--;
+ _bt_savePostingitem(so, itemIndex, offnum,
+ BTreeTupleGetPostingN(itup, i),
+ itup, i);
+ }
+ }
}
if (!continuescan)
{
@@ -1589,8 +1731,8 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
Assert(itemIndex >= 0);
so->currPos.firstItem = itemIndex;
- so->currPos.lastItem = MaxIndexTuplesPerPage - 1;
- so->currPos.itemIndex = MaxIndexTuplesPerPage - 1;
+ so->currPos.lastItem = MaxPostingIndexTuplesPerPage - 1;
+ so->currPos.itemIndex = MaxPostingIndexTuplesPerPage - 1;
}
return (so->currPos.firstItem <= so->currPos.lastItem);
@@ -1603,6 +1745,8 @@ _bt_saveitem(BTScanOpaque so, int itemIndex,
{
BTScanPosItem *currItem = &so->currPos.items[itemIndex];
+ Assert(!BTreeTupleIsPosting(itup));
+
currItem->heapTid = itup->t_tid;
currItem->indexOffset = offnum;
if (so->currTuples)
@@ -1615,6 +1759,33 @@ _bt_saveitem(BTScanOpaque so, int itemIndex,
}
}
+/* Save an index item into so->currPos.items[itemIndex] for posting tuples. */
+static void
+_bt_savePostingitem(BTScanOpaque so, int itemIndex, OffsetNumber offnum,
+ ItemPointer iptr, IndexTuple itup, int i)
+{
+ BTScanPosItem *currItem = &so->currPos.items[itemIndex];
+
+ currItem->heapTid = *iptr;
+ currItem->indexOffset = offnum;
+
+ if (so->currTuples)
+ {
+ if (i == 0)
+ {
+ /* Save the key; it is the same for all TIDs in the posting list */
+ Size itupsz = BTreeTupleGetPostingOffset(itup);
+
+ currItem->tupleOffset = so->currPos.nextTupleOffset;
+ memcpy(so->currTuples + so->currPos.nextTupleOffset, itup, itupsz);
+ so->currPos.nextTupleOffset += MAXALIGN(itupsz);
+ so->currPos.prevTupleOffset = currItem->tupleOffset;
+ }
+ else
+ currItem->tupleOffset = so->currPos.prevTupleOffset;
+ }
+}
+
/*
* _bt_steppage() -- Step to next page containing valid data for scan
*
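
A note on the _bt_readpage() changes above: the following is a minimal standalone sketch of how one posting tuple expands into several "logical" scan items that all refer to a single saved copy of the key. The names Tid, ScanItem and expand_posting() are hypothetical simplifications made up for illustration; they are not nbtree structures, and the sketch only covers the forward-scan case.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for ItemPointerData and BTScanPosItem. */
typedef struct { unsigned block; unsigned short offset; } Tid;
typedef struct
{
    Tid            heap_tid;      /* heap TID returned to the executor */
    unsigned short index_offset;  /* offset of the index tuple on the page */
    size_t         tuple_offset;  /* where the shared key copy was saved */
} ScanItem;

/*
 * Append one scan item per TID of a posting list, in forward-scan order.
 * Every item points at the same saved key copy (key_offset), similar to how
 * the patch reuses prevTupleOffset for each TID after the first one.
 * Returns the next free slot in items[].
 */
static int
expand_posting(ScanItem *items, int item_index, int capacity,
               unsigned short index_offset, size_t key_offset,
               const Tid *posting, int nposting)
{
    for (int i = 0; i < nposting; i++)
    {
        assert(item_index < capacity);
        items[item_index].heap_tid = posting[i];
        items[item_index].index_offset = index_offset;
        items[item_index].tuple_offset = key_offset;
        item_index++;
    }
    return item_index;
}
```

Because a single index tuple can now produce many items, the items[] array has to be sized for the number of TIDs on a page rather than the number of index tuples, which is presumably why the patch switches the bounds to MaxPostingIndexTuplesPerPage.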
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index d0b9013caf..5545465f92 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -288,6 +288,8 @@ static void _bt_sortaddtup(Page page, Size itemsize,
static void _bt_buildadd(BTWriteState *wstate, BTPageState *state,
IndexTuple itup);
static void _bt_uppershutdown(BTWriteState *wstate, BTPageState *state);
+static void _bt_buildadd_posting(BTWriteState *wstate, BTPageState *state,
+ BTCompressState *compressState);
static void _bt_load(BTWriteState *wstate,
BTSpool *btspool, BTSpool *btspool2);
static void _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent,
@@ -972,6 +974,11 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
* only shift the line pointer array back and forth, and overwrite
* the tuple space previously occupied by oitup. This is fairly
* cheap.
+ *
+ * If the lastleft tuple was a posting tuple, we'll truncate its
+ * posting list in _bt_truncate as well. Note that this applies
+ * only to leaf pages, since internal pages never contain posting
+ * tuples.
*/
ii = PageGetItemId(opage, OffsetNumberPrev(last_off));
lastleft = (IndexTuple) PageGetItem(opage, ii);
@@ -1011,6 +1018,7 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
* the minimum key for the new page.
*/
state->btps_minkey = CopyIndexTuple(oitup);
+ Assert(BTreeTupleIsPivot(state->btps_minkey));
/*
* Set the sibling links for both pages.
@@ -1052,6 +1060,7 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
Assert(state->btps_minkey == NULL);
state->btps_minkey = CopyIndexTuple(itup);
/* _bt_sortaddtup() will perform full truncation later */
+ BTreeTupleClearBtIsPosting(state->btps_minkey);
BTreeTupleSetNAtts(state->btps_minkey, 0);
}
@@ -1136,6 +1145,91 @@ _bt_uppershutdown(BTWriteState *wstate, BTPageState *state)
_bt_blwritepage(wstate, metapage, BTREE_METAPAGE);
}
+/*
+ * Add a new tuple (posting or non-posting) to the page while building the index.
+ */
+static void
+_bt_buildadd_posting(BTWriteState *wstate, BTPageState *state,
+ BTCompressState *compressState)
+{
+ IndexTuple to_insert;
+
+ /* Return if there is no tuple to insert */
+ if (state == NULL)
+ return;
+
+ if (compressState->ntuples == 0)
+ to_insert = compressState->itupprev;
+ else
+ {
+ IndexTuple postingtuple;
+
+ /* form a tuple with a posting list */
+ postingtuple = BTreeFormPostingTuple(compressState->itupprev,
+ compressState->ipd,
+ compressState->ntuples);
+ to_insert = postingtuple;
+ pfree(compressState->ipd);
+ }
+
+ _bt_buildadd(wstate, state, to_insert);
+
+ if (compressState->ntuples > 0)
+ pfree(to_insert);
+ compressState->ntuples = 0;
+}
+
+/*
+ * Save item pointer(s) of itup to the posting list in compressState.
+ *
+ * Helper function for _bt_load() and _bt_compress_one_page().
+ *
+ * Note: caller is responsible for size check to ensure that resulting tuple
+ * won't exceed BTMaxItemSize.
+ */
+void
+_bt_add_posting_item(BTCompressState *compressState, IndexTuple itup)
+{
+ int nposting = 0;
+
+ if (compressState->ntuples == 0)
+ {
+ compressState->ipd = palloc0(compressState->maxitemsize);
+
+ if (BTreeTupleIsPosting(compressState->itupprev))
+ {
+ /* if itupprev is posting, add all its TIDs to the posting list */
+ nposting = BTreeTupleGetNPosting(compressState->itupprev);
+ memcpy(compressState->ipd,
+ BTreeTupleGetPosting(compressState->itupprev),
+ sizeof(ItemPointerData) * nposting);
+ compressState->ntuples += nposting;
+ }
+ else
+ {
+ memcpy(compressState->ipd, compressState->itupprev,
+ sizeof(ItemPointerData));
+ compressState->ntuples++;
+ }
+ }
+
+ if (BTreeTupleIsPosting(itup))
+ {
+ /* if tuple is posting, add all its TIDs to the posting list */
+ nposting = BTreeTupleGetNPosting(itup);
+ memcpy(compressState->ipd + compressState->ntuples,
+ BTreeTupleGetPosting(itup),
+ sizeof(ItemPointerData) * nposting);
+ compressState->ntuples += nposting;
+ }
+ else
+ {
+ memcpy(compressState->ipd + compressState->ntuples, itup,
+ sizeof(ItemPointerData));
+ compressState->ntuples++;
+ }
+}
+
/*
* Read tuples in correct sort order from tuplesort, and load them into
* btree leaves.
@@ -1150,9 +1244,20 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
bool load1;
TupleDesc tupdes = RelationGetDescr(wstate->index);
int i,
- keysz = IndexRelationGetNumberOfKeyAttributes(wstate->index);
+ keysz = IndexRelationGetNumberOfKeyAttributes(wstate->index),
+ natts = IndexRelationGetNumberOfAttributes(wstate->index);
SortSupport sortKeys;
int64 tuples_done = 0;
+ bool use_compression = false;
+ BTCompressState *compressState = NULL;
+
+ /*
+ * Don't use compression for indexes with INCLUDEd columns and unique
+ * indexes.
+ */
+ use_compression = (IndexRelationGetNumberOfKeyAttributes(wstate->index) ==
+ IndexRelationGetNumberOfAttributes(wstate->index) &&
+ !wstate->index->rd_index->indisunique);
if (merge)
{
@@ -1266,19 +1371,89 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
}
else
{
- /* merge is unnecessary */
- while ((itup = tuplesort_getindextuple(btspool->sortstate,
- true)) != NULL)
+ if (!use_compression)
{
- /* When we see first tuple, create first index page */
- if (state == NULL)
- state = _bt_pagestate(wstate, 0);
+ /* merge is unnecessary */
+ while ((itup = tuplesort_getindextuple(btspool->sortstate,
+ true)) != NULL)
+ {
+ /* When we see first tuple, create first index page */
+ if (state == NULL)
+ state = _bt_pagestate(wstate, 0);
- _bt_buildadd(wstate, state, itup);
+ _bt_buildadd(wstate, state, itup);
- /* Report progress */
- pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
- ++tuples_done);
+ /* Report progress */
+ pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
+ ++tuples_done);
+ }
+ }
+ else
+ {
+ /* init compress state needed to build posting tuples */
+ compressState = (BTCompressState *) palloc0(sizeof(BTCompressState));
+ compressState->ipd = NULL;
+ compressState->ntuples = 0;
+ compressState->itupprev = NULL;
+ compressState->maxitemsize = 0;
+ compressState->maxpostingsize = 0;
+
+ while ((itup = tuplesort_getindextuple(btspool->sortstate,
+ true)) != NULL)
+ {
+ /* When we see first tuple, create first index page */
+ if (state == NULL)
+ {
+ state = _bt_pagestate(wstate, 0);
+ compressState->maxitemsize = BTMaxItemSize(state->btps_page);
+ }
+
+ if (compressState->itupprev != NULL)
+ {
+ int n_equal_atts = _bt_keep_natts_fast(wstate->index,
+ compressState->itupprev, itup);
+
+ if (n_equal_atts > natts)
+ {
+ /*
+ * Tuples are equal. Create or update the posting list.
+ *
+ * If the posting list would become too big, insert it on
+ * the page and continue.
+ */
+ if ((compressState->ntuples + 1) * sizeof(ItemPointerData) <
+ compressState->maxpostingsize)
+ _bt_add_posting_item(compressState, itup);
+ else
+ _bt_buildadd_posting(wstate, state,
+ compressState);
+ }
+ else
+ {
+ /*
+ * Tuples are not equal. Insert itupprev into index.
+ * Save current tuple for the next iteration.
+ */
+ _bt_buildadd_posting(wstate, state, compressState);
+ }
+ }
+
+ /*
+ * Save the tuple to compare it with the next one and maybe
+ * unite them into a posting tuple.
+ */
+ if (compressState->itupprev)
+ pfree(compressState->itupprev);
+ compressState->itupprev = CopyIndexTuple(itup);
+
+ /* compute max size of posting list */
+ compressState->maxpostingsize = compressState->maxitemsize -
+ IndexInfoFindDataOffset(compressState->itupprev->t_info) -
+ MAXALIGN(IndexTupleSize(compressState->itupprev));
+ }
+
+ /* Handle the last item */
+ _bt_buildadd_posting(wstate, state, compressState);
}
}
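
To make the control flow of the new _bt_load() branch easier to review, here is a small self-contained sketch of the same idea: walk entries in sorted order and merge runs of equal keys into posting lists, starting a new tuple whenever the key changes or the pending list is full. Entry, Emitted and group_duplicates() are hypothetical names, the key is reduced to an int, and the byte-based maxpostingsize check is replaced by a simple TID count, so this is an illustration under those assumptions and not the patch's logic verbatim.

```c
#include <assert.h>

/* Hypothetical simplified input: an integer key plus one heap TID. */
typedef struct { int key; unsigned tid; } Entry;
/* Hypothetical output: one emitted index tuple with a key and a TID count. */
typedef struct { int key; int ntids; } Emitted;

/*
 * Walk entries sorted by (key, tid) and merge runs of equal keys into
 * posting lists of at most max_tids TIDs each. Returns the number of
 * tuples written to out[]; the caller must size out[] for n entries.
 */
static int
group_duplicates(const Entry *in, int n, int max_tids, Emitted *out)
{
    int nout = 0;

    assert(max_tids >= 1);
    for (int i = 0; i < n; i++)
    {
        if (nout > 0 && out[nout - 1].key == in[i].key &&
            out[nout - 1].ntids < max_tids)
        {
            /* Same key and room left: extend the pending posting list. */
            out[nout - 1].ntids++;
        }
        else
        {
            /* Key changed or list is full: start a new tuple. */
            out[nout].key = in[i].key;
            out[nout].ntids = 1;
            nout++;
        }
    }
    return nout;
}
```

With keys 1, 1, 1, 2, 3, 3 and max_tids = 2 this produces four tuples, which matches the remark in the nbtree.h comments that a page may hold several posting tuples, and a mix of posting and non-posting tuples, for the same key.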
diff --git a/src/backend/access/nbtree/nbtsplitloc.c b/src/backend/access/nbtree/nbtsplitloc.c
index a7882fd874..fbb12dbff1 100644
--- a/src/backend/access/nbtree/nbtsplitloc.c
+++ b/src/backend/access/nbtree/nbtsplitloc.c
@@ -492,6 +492,13 @@ _bt_recsplitloc(FindSplitData *state,
* adding a heap TID to the left half's new high key when splitting at the
* leaf level. In practice the new high key will often be smaller and
* will rarely be larger, but conservatively assume the worst case.
+ *
+ * FIXME: We can make better choices about split points by being clever
+ * about the BTreeTupleIsPosting() case here. All we need to do is
+ * subtract the whole size of the posting list, then add
+ * MAXALIGN(sizeof(ItemPointerData)), since we know for sure that
+ * _bt_truncate() won't make a final high key that is larger even in the
+ * worst case.
*/
if (state->is_leaf)
leftfree -= (int16) (firstrightitemsz +
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 93fab264ae..a6eee1bcd4 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -111,8 +111,21 @@ _bt_mkscankey(Relation rel, IndexTuple itup)
key->nextkey = false;
key->pivotsearch = false;
key->keysz = Min(indnkeyatts, tupnatts);
- key->scantid = key->heapkeyspace && itup ?
- BTreeTupleGetHeapTID(itup) : NULL;
+
+ /*
+ * XXX: Do we need to have both BTreeTupleGetHeapTID() and
+ * BTreeTupleGetMinTID()?
+ */
+ if (itup && key->heapkeyspace)
+ {
+ if (!BTreeTupleIsPivot(itup))
+ key->scantid = BTreeTupleGetMinTID(itup);
+ else
+ key->scantid = BTreeTupleGetHeapTID(itup);
+ }
+ else
+ key->scantid = NULL;
+
skey = key->scankeys;
for (i = 0; i < indnkeyatts; i++)
{
@@ -1787,7 +1800,9 @@ _bt_killitems(IndexScanDesc scan)
ItemId iid = PageGetItemId(page, offnum);
IndexTuple ituple = (IndexTuple) PageGetItem(page, iid);
- if (ItemPointerEquals(&ituple->t_tid, &kitem->heapTid))
+ /* No microvacuum for posting tuples */
+ if (!BTreeTupleIsPosting(ituple) &&
+ (ItemPointerEquals(&ituple->t_tid, &kitem->heapTid)))
{
/* found the item */
ItemIdMarkDead(iid);
@@ -2145,6 +2160,16 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
pivot = index_truncate_tuple(itupdesc, firstright, keepnatts);
+ if (BTreeTupleIsPosting(firstright))
+ {
+ BTreeTupleClearBtIsPosting(pivot);
+ BTreeTupleSetNAtts(pivot, keepnatts);
+ pivot->t_info &= ~INDEX_SIZE_MASK;
+ pivot->t_info |= BTreeTupleGetPostingOffset(firstright);
+ }
+
+ Assert(!BTreeTupleIsPosting(pivot));
+
/*
* If there is a distinguishing key attribute within new pivot tuple,
* there is no need to add an explicit heap TID attribute
@@ -2161,6 +2186,8 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
* attribute to the new pivot tuple.
*/
Assert(natts != nkeyatts);
+ Assert(!BTreeTupleIsPosting(lastleft));
+ Assert(!BTreeTupleIsPosting(firstright));
newsize = IndexTupleSize(pivot) + MAXALIGN(sizeof(ItemPointerData));
tidpivot = palloc0(newsize);
memcpy(tidpivot, pivot, IndexTupleSize(pivot));
@@ -2168,6 +2195,27 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
pfree(pivot);
pivot = tidpivot;
}
+ else if (BTreeTupleIsPosting(firstright))
+ {
+ /*
+ * No truncation was possible, since key attributes are all equal. But
+ * the tuple is a compressed tuple with a posting list, so we still
+ * must truncate it.
+ *
+ * It's necessary to add a heap TID attribute to the new pivot tuple.
+ */
+ newsize = BTreeTupleGetPostingOffset(firstright) +
+ MAXALIGN(sizeof(ItemPointerData));
+ pivot = palloc0(newsize);
+ memcpy(pivot, firstright, BTreeTupleGetPostingOffset(firstright));
+
+ pivot->t_info &= ~INDEX_SIZE_MASK;
+ pivot->t_info |= newsize;
+ BTreeTupleClearBtIsPosting(pivot);
+ BTreeTupleSetAltHeapTID(pivot);
+
+ Assert(!BTreeTupleIsPosting(pivot));
+ }
else
{
/*
@@ -2205,7 +2253,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
*/
pivotheaptid = (ItemPointer) ((char *) pivot + newsize -
sizeof(ItemPointerData));
- ItemPointerCopy(&lastleft->t_tid, pivotheaptid);
+ ItemPointerCopy(BTreeTupleGetMaxTID(lastleft), pivotheaptid);
/*
* Lehman and Yao require that the downlink to the right page, which is to
@@ -2216,9 +2264,12 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
* tiebreaker.
*/
#ifndef DEBUG_NO_TRUNCATE
- Assert(ItemPointerCompare(&lastleft->t_tid, &firstright->t_tid) < 0);
- Assert(ItemPointerCompare(pivotheaptid, &lastleft->t_tid) >= 0);
- Assert(ItemPointerCompare(pivotheaptid, &firstright->t_tid) < 0);
+ Assert(ItemPointerCompare(BTreeTupleGetMaxTID(lastleft),
+ BTreeTupleGetMinTID(firstright)) < 0);
+ Assert(ItemPointerCompare(pivotheaptid,
+ BTreeTupleGetMinTID(lastleft)) >= 0);
+ Assert(ItemPointerCompare(pivotheaptid,
+ BTreeTupleGetMinTID(firstright)) < 0);
#else
/*
@@ -2231,7 +2282,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
* attribute values along with lastleft's heap TID value when lastleft's
* TID happens to be greater than firstright's TID.
*/
- ItemPointerCopy(&firstright->t_tid, pivotheaptid);
+ ItemPointerCopy(BTreeTupleGetMinTID(firstright), pivotheaptid);
/*
* Pivot heap TID should never be fully equal to firstright. Note that
@@ -2240,7 +2291,8 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
*/
ItemPointerSetOffsetNumber(pivotheaptid,
OffsetNumberPrev(ItemPointerGetOffsetNumber(pivotheaptid)));
- Assert(ItemPointerCompare(pivotheaptid, &firstright->t_tid) < 0);
+ Assert(ItemPointerCompare(pivotheaptid,
+ BTreeTupleGetMinTID(firstright)) < 0);
#endif
BTreeTupleSetNAtts(pivot, nkeyatts);
@@ -2330,6 +2382,25 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
* leaving excessive amounts of free space on either side of page split.
* Callers can rely on the fact that attributes considered equal here are
* definitely also equal according to _bt_keep_natts.
+ *
+ * To build a posting tuple we need to ensure that all attributes
+ * of both tuples are equal. Use this function to compare them.
+ * TODO: maybe it's worth renaming the function.
+ *
+ * XXX: Obviously we need infrastructure for making sure it is okay to use
+ * this for posting list stuff. For example, non-deterministic collations
+ * cannot use compression, and will not work with what we have now.
+ *
+ * XXX: Even then, we probably also need to worry about TOAST as a special
+ * case. Don't repeat bugs like the amcheck bug that was fixed in commit
+ * eba775345d23d2c999bbb412ae658b6dab36e3e8. As the test case added in that
+ * commit shows, we need to worry about pg_attribute.attstorage changing in
+ * the underlying table due to an ALTER TABLE (and maybe a few other things
+ * like that). In general, the "TOAST input state" of a TOASTable datum isn't
+ * something that we make many guarantees about today, so even with C
+ * collation text we could in theory get different answers from
+ * _bt_keep_natts_fast() and _bt_keep_natts(). This needs to be nailed down
+ * in some way.
*/
int
_bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
@@ -2415,7 +2486,7 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
* Non-pivot tuples currently never use alternative heap TID
* representation -- even those within heapkeyspace indexes
*/
- if ((itup->t_info & INDEX_ALT_TID_MASK) != 0)
+ if (BTreeTupleIsPivot(itup))
return false;
/*
@@ -2470,7 +2541,7 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
* that to decide if the tuple is a pre-v11 tuple.
*/
return tupnatts == 0 ||
- ((itup->t_info & INDEX_ALT_TID_MASK) == 0 &&
+ (!BTreeTupleIsPivot(itup) &&
ItemPointerGetOffsetNumber(&(itup->t_tid)) == P_HIKEY);
}
else
@@ -2497,7 +2568,7 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
* heapkeyspace index pivot tuples, regardless of whether or not there are
* non-key attributes.
*/
- if ((itup->t_info & INDEX_ALT_TID_MASK) == 0)
+ if (!BTreeTupleIsPivot(itup))
return false;
/*
@@ -2549,6 +2620,8 @@ _bt_check_third_page(Relation rel, Relation heap, bool needheaptidspace,
if (!needheaptidspace && itemsz <= BTMaxItemSizeNoHeapTid(page))
return;
+ /* TODO correct error messages for posting tuples */
+
/*
* Internal page insertions cannot fail here, because that would mean that
* an earlier leaf level insertion that should have failed didn't
@@ -2575,3 +2648,79 @@ _bt_check_third_page(Relation rel, Relation heap, bool needheaptidspace,
"or use full text indexing."),
errtableconstraint(heap, RelationGetRelationName(rel))));
}
+
+/*
+ * Given a basic tuple that contains the key datum, and a posting list,
+ * build a posting tuple.
+ *
+ * The basic tuple can itself be a posting tuple, but we only use its key
+ * part; all ItemPointers must be passed via ipd.
+ *
+ * If nipd == 1, fall back to building a non-posting tuple.
+ * This is necessary to avoid storage overhead after a posting tuple was vacuumed.
+ */
+IndexTuple
+BTreeFormPostingTuple(IndexTuple tuple, ItemPointerData *ipd, int nipd)
+{
+ uint32 keysize,
+ newsize = 0;
+ IndexTuple itup;
+
+ /* We only need key part of the tuple */
+ if (BTreeTupleIsPosting(tuple))
+ keysize = BTreeTupleGetPostingOffset(tuple);
+ else
+ keysize = IndexTupleSize(tuple);
+
+ Assert(nipd > 0);
+
+ /* Add space needed for posting list */
+ if (nipd > 1)
+ newsize = SHORTALIGN(keysize) + sizeof(ItemPointerData) * nipd;
+ else
+ newsize = keysize;
+
+ newsize = MAXALIGN(newsize);
+ itup = palloc0(newsize);
+ memcpy(itup, tuple, keysize);
+ itup->t_info &= ~INDEX_SIZE_MASK;
+ itup->t_info |= newsize;
+
+ if (nipd > 1)
+ {
+ /* Form posting tuple, fill posting fields */
+
+ /* Set meta info about the posting list */
+ itup->t_info |= INDEX_ALT_TID_MASK;
+ BTreeSetPostingMeta(itup, nipd, SHORTALIGN(keysize));
+
+ /* sort the list to preserve TID order invariant */
+ qsort((void *) ipd, nipd, sizeof(ItemPointerData),
+ (int (*) (const void *, const void *)) ItemPointerCompare);
+
+ /* Copy posting list into the posting tuple */
+ memcpy(BTreeTupleGetPosting(itup), ipd,
+ sizeof(ItemPointerData) * nipd);
+ }
+ else
+ {
+ /* To finish building of a non-posting tuple, copy TID from ipd */
+ itup->t_info &= ~INDEX_ALT_TID_MASK;
+ ItemPointerCopy(ipd, &itup->t_tid);
+ }
+
+ return itup;
+}
+
+/*
+ * Opposite of BTreeFormPostingTuple.
+ * Returns a regular tuple that contains the key; the TID of the new tuple
+ * is the nth TID of the original tuple's posting list.
+ * The result tuple is palloc'd in the caller's context.
+ */
+IndexTuple
+BTreeGetNthTupleOfPosting(IndexTuple tuple, int n)
+{
+ Assert(BTreeTupleIsPosting(tuple));
+ return BTreeFormPostingTuple(tuple, BTreeTupleGetPostingN(tuple, n), 1);
+}
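
As a rough illustration of the space savings behind BTreeFormPostingTuple(), the sketch below reproduces only its size arithmetic: the key part is SHORTALIGNed, the TIDs are appended, and the total is MAXALIGNed, with a fallback to a plain tuple when there is a single TID. The SKETCH_* macros and posting_tuple_size() are hypothetical; the 8-byte MAXALIGN and the 6-byte TID size are assumptions that hold on typical 64-bit builds.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed alignment rules and TID size for a typical 64-bit build. */
#define SKETCH_SHORTALIGN(x) (((size_t) (x) + 1) & ~(size_t) 1)
#define SKETCH_MAXALIGN(x)   (((size_t) (x) + 7) & ~(size_t) 7)
#define SKETCH_TID_SIZE      ((size_t) 6)   /* sizeof(ItemPointerData) */

/*
 * Size of the tuple that would hold one key of keysize bytes (header
 * included) together with ntids heap TIDs.
 */
static size_t
posting_tuple_size(size_t keysize, int ntids)
{
    assert(ntids > 0);

    /* A single TID lives in t_tid, so no posting list is needed. */
    if (ntids == 1)
        return SKETCH_MAXALIGN(keysize);

    return SKETCH_MAXALIGN(SKETCH_SHORTALIGN(keysize) +
                           SKETCH_TID_SIZE * (size_t) ntids);
}
```

For a 16-byte tuple, 100 duplicates stored separately take 1600 bytes of tuple space plus 100 line pointers, while posting_tuple_size(16, 100) gives 616 bytes behind a single line pointer.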
diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index dd5315c1aa..5b30e36d27 100644
--- a/src/backend/access/nbtree/nbtxlog.c
+++ b/src/backend/access/nbtree/nbtxlog.c
@@ -386,8 +386,8 @@ btree_xlog_vacuum(XLogReaderState *record)
Buffer buffer;
Page page;
BTPageOpaque opaque;
-#ifdef UNUSED
xl_btree_vacuum *xlrec = (xl_btree_vacuum *) XLogRecGetData(record);
+#ifdef UNUSED
/*
* This section of code is thought to be no longer needed, after analysis
@@ -478,14 +478,35 @@ btree_xlog_vacuum(XLogReaderState *record)
if (len > 0)
{
- OffsetNumber *unused;
- OffsetNumber *unend;
+ if (xlrec->nremaining)
+ {
+ int i;
+ OffsetNumber *remainingoffset;
+ IndexTuple remaining;
+ Size itemsz;
- unused = (OffsetNumber *) ptr;
- unend = (OffsetNumber *) ((char *) ptr + len);
+ remainingoffset = (OffsetNumber *)
+ (ptr + xlrec->ndeleted * sizeof(OffsetNumber));
+ remaining = (IndexTuple) ((char *) remainingoffset +
+ xlrec->nremaining * sizeof(OffsetNumber));
- if ((unend - unused) > 0)
- PageIndexMultiDelete(page, unused, unend - unused);
+ /* Handle posting tuples */
+ for (i = 0; i < xlrec->nremaining; i++)
+ {
+ PageIndexTupleDelete(page, remainingoffset[i]);
+
+ itemsz = MAXALIGN(IndexTupleSize(remaining));
+
+ if (PageAddItem(page, (Item) remaining, itemsz, remainingoffset[i],
+ false, false) == InvalidOffsetNumber)
+ elog(PANIC, "btree_xlog_vacuum: failed to add remaining item");
+
+ remaining = (IndexTuple) ((char *) remaining + itemsz);
+ }
+ }
+
+ if (xlrec->ndeleted)
+ PageIndexMultiDelete(page, (OffsetNumber *) ptr, xlrec->ndeleted);
}
/*
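
The redo code above implies a particular layout for the block data of the vacuum record: first ndeleted offsets, then nremaining offsets, then the rewritten posting tuples back to back. The sketch below just spells out that offset arithmetic. VacuumPayloadLayout and vacuum_payload_layout() are hypothetical names, and OffsetNum is a stand-in for OffsetNumber assumed to be 16 bits wide.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint16_t OffsetNum;     /* hypothetical stand-in for OffsetNumber */

typedef struct
{
    size_t deleted_start;       /* byte offset of the deleted-offsets array */
    size_t remaining_start;     /* byte offset of the remaining-offsets array */
    size_t tuples_start;        /* byte offset of the first rewritten tuple */
} VacuumPayloadLayout;

/* Compute where each part of the payload begins, as the redo code reads it. */
static VacuumPayloadLayout
vacuum_payload_layout(unsigned ndeleted, unsigned nremaining)
{
    VacuumPayloadLayout l;

    l.deleted_start = 0;
    l.remaining_start = (size_t) ndeleted * sizeof(OffsetNum);
    l.tuples_start = l.remaining_start +
        (size_t) nremaining * sizeof(OffsetNum);
    return l;
}
```

One thing this arithmetic makes visible: with 3 deleted and 2 remaining offsets the first tuple starts at byte 10, so the tuples are not necessarily MAXALIGNed relative to the start of the payload. Whether that matters depends on how the buffer is accessed, which may be worth checking during review.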
diff --git a/src/backend/access/rmgrdesc/nbtdesc.c b/src/backend/access/rmgrdesc/nbtdesc.c
index a14eb792ec..e4fa99ad27 100644
--- a/src/backend/access/rmgrdesc/nbtdesc.c
+++ b/src/backend/access/rmgrdesc/nbtdesc.c
@@ -46,8 +46,10 @@ btree_desc(StringInfo buf, XLogReaderState *record)
{
xl_btree_vacuum *xlrec = (xl_btree_vacuum *) rec;
- appendStringInfo(buf, "lastBlockVacuumed %u",
- xlrec->lastBlockVacuumed);
+ appendStringInfo(buf, "lastBlockVacuumed %u; nremaining %u; ndeleted %u",
+ xlrec->lastBlockVacuumed,
+ xlrec->nremaining,
+ xlrec->ndeleted);
break;
}
case XLOG_BTREE_DELETE:
diff --git a/src/include/access/itup.h b/src/include/access/itup.h
index 744ffb6c61..85ee040428 100644
--- a/src/include/access/itup.h
+++ b/src/include/access/itup.h
@@ -141,6 +141,11 @@ typedef IndexAttributeBitMapData * IndexAttributeBitMap;
* On such a page, N tuples could take one MAXALIGN quantum less space than
* estimated here, seemingly allowing one more tuple than estimated here.
* But such a page always has at least MAXALIGN special space, so we're safe.
+ *
+ * Note: btree leaf pages may contain posting tuples, which store duplicates
+ * more efficiently, so such pages may hold more logical tuples than this.
+ * Use MaxPostingIndexTuplesPerPage instead.
+
*/
#define MaxIndexTuplesPerPage \
((int) ((BLCKSZ - SizeOfPageHeaderData) / \
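
For a sense of scale, the sketch below evaluates MaxIndexTuplesPerPage next to the MaxPostingIndexTuplesPerPage formula that this patch adds in nbtree.h. The constants are assumptions for a default build (8 kB pages, 8-byte MAXALIGN, 24-byte page header, 8-byte IndexTupleData, 4-byte ItemIdData, 6-byte ItemPointerData); the SK_* names and helper functions are hypothetical.

```c
#include <assert.h>

/* Assumed sizes for a default build; other builds will differ. */
enum
{
    SK_BLCKSZ = 8192,
    SK_PAGE_HEADER = 24,    /* SizeOfPageHeaderData */
    SK_INDEX_TUPLE = 8,     /* sizeof(IndexTupleData) */
    SK_ITEM_ID = 4,         /* sizeof(ItemIdData) */
    SK_TID = 6              /* sizeof(ItemPointerData) */
};

static int
sk_maxalign(int x)
{
    return (x + 7) & ~7;
}

/* Upper bound on ordinary index tuples: minimal tuple plus line pointer. */
static int
max_index_tuples_per_page(void)
{
    return (SK_BLCKSZ - SK_PAGE_HEADER) /
        (sk_maxalign(SK_INDEX_TUPLE + 1) + SK_ITEM_ID);
}

/* Upper bound on TIDs: three minimal tuples, the rest filled with TIDs. */
static int
max_posting_tids_per_page(void)
{
    int three_min_tuples = 3 * (sk_maxalign(SK_INDEX_TUPLE + 1) + SK_ITEM_ID);

    return (SK_BLCKSZ - SK_PAGE_HEADER - three_min_tuples) / SK_TID;
}
```

Under these assumptions the bounds come out as 408 and 1351, so arrays sized by the old macro would be too small by more than a factor of three for a page full of posting tuples.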
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 83e0e6c28e..d3e3cea60a 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -234,8 +234,7 @@ typedef struct BTMetaPageData
* t_tid | t_info | key values | INCLUDE columns, if any
*
* t_tid points to the heap TID, which is a tiebreaker key column as of
- * BTREE_VERSION 4. Currently, the INDEX_ALT_TID_MASK status bit is never
- * set for non-pivot tuples.
+ * BTREE_VERSION 4.
*
* All other types of index tuples ("pivot" tuples) only have key columns,
* since pivot tuples only exist to represent how the key space is
@@ -252,6 +251,39 @@ typedef struct BTMetaPageData
* omitted rather than truncated, since its representation is different to
* the non-pivot representation.)
*
+ * Non-pivot posting tuple format:
+ * t_tid | t_info | key values | INCLUDE columns, if any | posting_list[]
+ *
+ * In order to store duplicated keys more efficiently,
+ * we use a special tuple format: posting tuples.
+ * posting_list is an array of ItemPointerData.
+ *
+ * This type of compression never applies to system indexes, unique indexes
+ * or indexes with INCLUDEd columns.
+ *
+ * To distinguish posting tuples we use the INDEX_ALT_TID_MASK flag in t_info
+ * and the BT_IS_POSTING flag in t_tid.
+ * These flags redefine the content of the posting tuple's t_tid:
+ * - t_tid.ip_blkid contains the offset of the posting list.
+ * - t_tid offset field contains the number of posting items in this tuple.
+ *
+ * The 12 least significant offset bits from t_tid are used to represent
+ * the number of posting items in posting tuples, leaving 4 status
+ * bits (BT_RESERVED_OFFSET_MASK bits), 3 of which are reserved for
+ * future use.
+ * BT_N_POSTING_OFFSET_MASK is large enough to store any number of posting
+ * items, which is constrained by BTMaxItemSize.
+ *
+ * If a page contains so many duplicates that they do not fit into one posting
+ * tuple (bounded by BTMaxItemSize and ), page may contain several posting
+ * tuples with the same key.
+ * A page can also contain both posting and non-posting tuples with the same key.
+ * Currently, posting tuples always contain at least two TIDs in the posting
+ * list.
+ *
+ * Posting tuples always have the same number of attributes as the index has
+ * generally.
+ *
* Pivot tuple format:
*
* t_tid | t_info | key values | [heap TID]
@@ -281,23 +313,157 @@ typedef struct BTMetaPageData
* bits (BT_RESERVED_OFFSET_MASK bits), 3 of which that are reserved for
* future use. BT_N_KEYS_OFFSET_MASK should be large enough to store any
* number of columns/attributes <= INDEX_MAX_KEYS.
+ * BT_IS_POSTING bit must be unset for pivot tuples, since we use it
+ * to distinguish posting tuples from pivot tuples.
*
* Note well: The macros that deal with the number of attributes in tuples
- * assume that a tuple with INDEX_ALT_TID_MASK set must be a pivot tuple,
- * and that a tuple without INDEX_ALT_TID_MASK set must be a non-pivot
- * tuple (or must have the same number of attributes as the index has
- * generally in the case of !heapkeyspace indexes). They will need to be
- * updated if non-pivot tuples ever get taught to use INDEX_ALT_TID_MASK
- * for something else.
+ * assume that a tuple with INDEX_ALT_TID_MASK set must be a pivot tuple or
+ * non-pivot posting tuple, and that a tuple without INDEX_ALT_TID_MASK set
+ * must be a non-pivot tuple (or must have the same number of attributes as
+ * the index has generally in the case of !heapkeyspace indexes).
*/
#define INDEX_ALT_TID_MASK INDEX_AM_RESERVED_BIT
/* Item pointer offset bits */
#define BT_RESERVED_OFFSET_MASK 0xF000
#define BT_N_KEYS_OFFSET_MASK 0x0FFF
+#define BT_N_POSTING_OFFSET_MASK 0x0FFF
#define BT_HEAP_TID_ATTR 0x1000
+#define BT_IS_POSTING 0x2000
-/* Get/set downlink block number */
+#define BTreeTupleIsPosting(itup) \
+ ( \
+ ((itup)->t_info & INDEX_ALT_TID_MASK && \
+ ((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) != 0))\
+ )
+
+#define BTreeTupleIsPivot(itup) \
+ ( \
+ ((itup)->t_info & INDEX_ALT_TID_MASK && \
+ ((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) == 0))\
+ )
+
+/*
+ * MaxPostingIndexTuplesPerPage is an upper bound on the number of tuples
+ * that can fit on one btree leaf page.
+ *
+ * Btree leaf pages may contain posting tuples, which store duplicates
+ * in a more effective way, so MaxPostingIndexTuplesPerPage is larger than
+ * MaxIndexTuplesPerPage.
+ *
+ * Each leaf page must contain at least three items, so estimate it as
+ * if we have three posting tuples with minimal size keys.
+ */
+#define MaxPostingIndexTuplesPerPage \
+ ((int) ((BLCKSZ - SizeOfPageHeaderData - \
+ 3 * (MAXALIGN(sizeof(IndexTupleData) + 1) + sizeof(ItemIdData))) / \
+ sizeof(ItemPointerData)))
+
+/*
+ * Btree-private state needed to build posting tuples.
+ * ipd is a posting list - an array of ItemPointerData.
+ *
+ * Iterating over tuples during index build or applying compression to a
+ * single page, we remember a tuple in itupprev, then compare the next one
+ * with it. If tuples are equal, save their TIDs in the posting list.
+ * ntuples contains the size of the posting list.
+ *
+ * Use maxitemsize and maxpostingsize to ensure that resulting posting tuple
+ * will satisfy BTMaxItemSize.
+ */
+typedef struct BTCompressState
+{
+ Size maxitemsize;
+ Size maxpostingsize;
+ IndexTuple itupprev;
+ int ntuples;
+ ItemPointerData *ipd;
+} BTCompressState;
+
+/*
+ * For use in _bt_compress_one_page().
+ * If there are only a few uncompressed items on a page,
+ * it isn't worth applying compression.
+ * Currently it is just a magic number;
+ * proper benchmarking will probably help to choose a better value.
+ */
+#define BT_COMPRESS_THRESHOLD 10
+
+/* macros to work with posting tuples *BEGIN* */
+#define BTreeTupleSetBtIsPosting(itup) \
+ do { \
+ Assert((itup)->t_info & INDEX_ALT_TID_MASK); \
+ Assert(!((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) != 0)); \
+ ItemPointerSetOffsetNumber(&(itup)->t_tid, \
+ ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) | BT_IS_POSTING); \
+ } while(0)
+
+#define BTreeTupleClearBtIsPosting(itup) \
+ do { \
+ ItemPointerSetOffsetNumber(&(itup)->t_tid, \
+ ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & ~BT_IS_POSTING); \
+ } while(0)
+
+#define BTreeTupleGetNPosting(itup) \
+ ( \
+ ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_N_POSTING_OFFSET_MASK \
+ )
+
+#define BTreeTupleSetNPosting(itup, n) \
+ do { \
+ ItemPointerSetOffsetNumber(&(itup)->t_tid, (n) & BT_N_POSTING_OFFSET_MASK); \
+ BTreeTupleSetBtIsPosting(itup); \
+ } while(0)
+
+/*
+ * If the tuple is a posting tuple, t_tid.ip_blkid contains the offset of the
+ * posting list. The caller is responsible for checking BTreeTupleIsPosting
+ * to ensure that it gets what it expects.
+ */
+#define BTreeTupleGetPostingOffset(itup) \
+ ItemPointerGetBlockNumberNoCheck(&((itup)->t_tid))
+#define BTreeTupleSetPostingOffset(itup, offset) \
+ ItemPointerSetBlockNumber(&((itup)->t_tid), (offset))
+
+#define BTreeSetPostingMeta(itup, nposting, off) \
+ do { \
+ BTreeTupleSetNPosting(itup, nposting); \
+ BTreeTupleSetPostingOffset(itup, off); \
+ } while(0)
+
+#define BTreeTupleGetPosting(itup) \
+ (ItemPointerData*) ((char*)(itup) + BTreeTupleGetPostingOffset(itup))
+#define BTreeTupleGetPostingN(itup,n) \
+ (ItemPointerData*) (BTreeTupleGetPosting(itup) + (n))
+
+/*
+ * Posting tuples always contain several TIDs.
+ * Functions that use the TID as a tiebreaker can use the two macros below
+ * to ensure the correct order of TID keys:
+ */
+#define BTreeTupleGetMinTID(itup) \
+ ( \
+ ((itup)->t_info & INDEX_ALT_TID_MASK && \
+ ((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING))) ? \
+ ( \
+ (ItemPointer) BTreeTupleGetPosting(itup) \
+ ) \
+ : \
+ (ItemPointer) &((itup)->t_tid) \
+ )
+#define BTreeTupleGetMaxTID(itup) \
+ ( \
+ ((itup)->t_info & INDEX_ALT_TID_MASK && \
+ ((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING))) ? \
+ ( \
+ (ItemPointer) (BTreeTupleGetPosting(itup) + (BTreeTupleGetNPosting(itup)-1)) \
+ ) \
+ : \
+ (ItemPointer) &((itup)->t_tid) \
+ )
+/* macros to work with posting tuples *END* */
+
+/* Get/set downlink block number */
#define BTreeInnerTupleGetDownLink(itup) \
ItemPointerGetBlockNumberNoCheck(&((itup)->t_tid))
#define BTreeInnerTupleSetDownLink(itup, blkno) \
@@ -326,7 +492,8 @@ typedef struct BTMetaPageData
*/
#define BTreeTupleGetNAtts(itup, rel) \
( \
- (itup)->t_info & INDEX_ALT_TID_MASK ? \
+ ((itup)->t_info & INDEX_ALT_TID_MASK && \
+ ((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) == 0)) ? \
( \
ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_N_KEYS_OFFSET_MASK \
) \
@@ -335,6 +502,7 @@ typedef struct BTMetaPageData
)
#define BTreeTupleSetNAtts(itup, n) \
do { \
+ Assert(!BTreeTupleIsPosting(itup)); \
(itup)->t_info |= INDEX_ALT_TID_MASK; \
ItemPointerSetOffsetNumber(&(itup)->t_tid, (n) & BT_N_KEYS_OFFSET_MASK); \
} while(0)
@@ -342,6 +510,8 @@ typedef struct BTMetaPageData
/*
* Get tiebreaker heap TID attribute, if any. Macro works with both pivot
* and non-pivot tuples, despite differences in how heap TID is represented.
+ *
+ * For non-pivot posting tuple it returns the first tid from posting list.
*/
#define BTreeTupleGetHeapTID(itup) \
( \
@@ -351,7 +521,10 @@ typedef struct BTMetaPageData
(ItemPointer) (((char *) (itup) + IndexTupleSize(itup)) - \
sizeof(ItemPointerData)) \
) \
- : (itup)->t_info & INDEX_ALT_TID_MASK ? NULL : (ItemPointer) &((itup)->t_tid) \
+ : (itup)->t_info & INDEX_ALT_TID_MASK ? \
+ (((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) != 0) ? \
+ (ItemPointer) BTreeTupleGetPosting(itup) : NULL) \
+ : (ItemPointer) &((itup)->t_tid) \
)
/*
* Set the heap TID attribute for a tuple that uses the INDEX_ALT_TID_MASK
@@ -360,6 +533,7 @@ typedef struct BTMetaPageData
#define BTreeTupleSetAltHeapTID(itup) \
do { \
Assert((itup)->t_info & INDEX_ALT_TID_MASK); \
+ Assert(!((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) != 0)); \
ItemPointerSetOffsetNumber(&(itup)->t_tid, \
ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) | BT_HEAP_TID_ATTR); \
} while(0)
@@ -500,6 +674,12 @@ typedef struct BTInsertStateData
/* Buffer containing leaf page we're likely to insert itup on */
Buffer buf;
+ /*
+ * if _bt_binsrch_insert() found the location inside existing posting
+ * list, save the position inside the list.
+ */
+ int in_posting_offset;
+
/*
* Cache of bounds within the current buffer. Only used for insertions
* where _bt_check_unique is called. See _bt_binsrch_insert and
@@ -567,6 +747,8 @@ typedef struct BTScanPosData
* location in the associated tuple storage workspace.
*/
int nextTupleOffset;
+ /* prevTupleOffset is for posting list handling */
+ int prevTupleOffset;
/*
* The items array is always ordered in index order (ie, increasing
@@ -579,7 +761,7 @@ typedef struct BTScanPosData
int lastItem; /* last valid index in items[] */
int itemIndex; /* current index in items[] */
- BTScanPosItem items[MaxIndexTuplesPerPage]; /* MUST BE LAST */
+ BTScanPosItem items[MaxPostingIndexTuplesPerPage]; /* MUST BE LAST */
} BTScanPosData;
typedef BTScanPosData *BTScanPos;
@@ -763,6 +945,8 @@ extern void _bt_delitems_delete(Relation rel, Buffer buf,
OffsetNumber *itemnos, int nitems, Relation heapRel);
extern void _bt_delitems_vacuum(Relation rel, Buffer buf,
OffsetNumber *itemnos, int nitems,
+ OffsetNumber *remainingoffset,
+ IndexTuple *remaining, int nremaining,
BlockNumber lastBlockVacuumed);
extern int _bt_pagedel(Relation rel, Buffer buf);
@@ -775,6 +959,8 @@ extern Buffer _bt_moveright(Relation rel, BTScanInsert key, Buffer buf,
bool forupdate, BTStack stack, int access, Snapshot snapshot);
extern OffsetNumber _bt_binsrch_insert(Relation rel, BTInsertState insertstate);
extern int32 _bt_compare(Relation rel, BTScanInsert key, Page page, OffsetNumber offnum);
+extern int32 _bt_compare_posting(Relation rel, BTScanInsert key, Page page,
+ OffsetNumber offnum, int *in_posting_offset);
extern bool _bt_first(IndexScanDesc scan, ScanDirection dir);
extern bool _bt_next(IndexScanDesc scan, ScanDirection dir);
extern Buffer _bt_get_endpoint(Relation rel, uint32 level, bool rightmost,
@@ -813,6 +999,9 @@ extern bool _bt_check_natts(Relation rel, bool heapkeyspace, Page page,
OffsetNumber offnum);
extern void _bt_check_third_page(Relation rel, Relation heap,
bool needheaptidspace, Page page, IndexTuple newtup);
+extern IndexTuple BTreeFormPostingTuple(IndexTuple tuple, ItemPointerData *ipd,
+ int nipd);
+extern IndexTuple BTreeGetNthTupleOfPosting(IndexTuple tuple, int n);
/*
* prototypes for functions in nbtvalidate.c
@@ -825,5 +1014,7 @@ extern bool btvalidate(Oid opclassoid);
extern IndexBuildResult *btbuild(Relation heap, Relation index,
struct IndexInfo *indexInfo);
extern void _bt_parallel_build_main(dsm_segment *seg, shm_toc *toc);
+extern void _bt_add_posting_item(BTCompressState *compressState,
+ IndexTuple itup);
#endif /* NBTREE_H */
diff --git a/src/include/access/nbtxlog.h b/src/include/access/nbtxlog.h
index 9beccc86ea..6f60ca5f7b 100644
--- a/src/include/access/nbtxlog.h
+++ b/src/include/access/nbtxlog.h
@@ -173,10 +173,19 @@ typedef struct xl_btree_vacuum
{
BlockNumber lastBlockVacuumed;
- /* TARGET OFFSET NUMBERS FOLLOW */
+ /*
+ * This field helps us to find beginning of the remaining tuples from
+ * postings which follow array of offset numbers.
+ */
+ uint32 nremaining;
+ uint32 ndeleted;
+
+ /* REMAINING OFFSET NUMBERS FOLLOW (nremaining values) */
+ /* REMAINING TUPLES TO INSERT FOLLOW (if nremaining > 0) */
+ /* TARGET OFFSET NUMBERS FOLLOW (if any) */
} xl_btree_vacuum;
-#define SizeOfBtreeVacuum (offsetof(xl_btree_vacuum, lastBlockVacuumed) + sizeof(BlockNumber))
+#define SizeOfBtreeVacuum (offsetof(xl_btree_vacuum, ndeleted) + sizeof(BlockNumber))
/*
* This is what we need to know about marking an empty branch for deletion.
--
2.17.1
On Tue, Jul 23, 2019 at 6:22 PM Peter Geoghegan <pg@bowt.ie> wrote:
Attached is a revised version of your v2 that fixes this issue -- I'll
call this v3.
Remember that index that I said was 5.5x smaller with the patch
applied, following retail insertions (a single big INSERT ... SELECT
...)? Well, it's 6.5x smaller with this small additional patch applied
on top of the v3 I posted yesterday. Many of the indexes in my test
suite are ~20% smaller __in addition to__ very big size
reductions. Some are even ~30% smaller than they were with v3 of the
patch. For example, the fair use implementation of TPC-H that my test
data comes from has an index on the "orders" o_orderdate column, named
idx_orders_orderdate, which is made ~30% smaller by the addition of
this simple patch (once again, this is following a single big INSERT
... SELECT ...). This change makes idx_orders_orderdate ~3.3x smaller
than it is with master/Postgres 12, in case you were wondering.
This new patch teaches nbtsplitloc.c to subtract posting list overhead
when sizing the new high key for the left half of a candidate split
point, since we know for sure that _bt_truncate() will at least manage
to truncate away that much from the new high key, even in the worst
case. Since posting lists are often very large, this can make a big
difference. This is actually just a bugfix, not a new idea -- I merely
made nbtsplitloc.c understand how truncation works with posting lists.
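The sizing rule described above can be sketched in isolation. This is only an illustration under stated assumptions, not the patch's code: SKETCH_MAXALIGN, SKETCH_TID_SIZE and sketch_worst_case_hikey_size() are hypothetical stand-ins, an 8-byte maximum alignment is assumed, and a posting offset of 0 is used here to mean "not a posting tuple".

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins; the real definitions live in PostgreSQL headers. */
#define SKETCH_MAXALIGN(len)  (((size_t) (len) + 7) & ~(size_t) 7)
#define SKETCH_TID_SIZE       6		/* assumed sizeof(ItemPointerData) */

/*
 * Worst-case space charged to the left page for its new high key.
 *
 * firstrightitemsz is the full size of the first right tuple, and
 * postingoffset is the byte offset where its posting list starts
 * (0 when the tuple has no posting list).  The posting list is always
 * truncated away, so only one heap TID may have to be added back.
 */
static size_t
sketch_worst_case_hikey_size(size_t firstrightitemsz, size_t postingoffset)
{
	size_t		postingsubhikey = 0;

	if (postingoffset != 0)
	{
		assert(postingoffset <= firstrightitemsz);
		postingsubhikey = firstrightitemsz - postingoffset;
	}

	return (firstrightitemsz - postingsubhikey) +
		SKETCH_MAXALIGN(SKETCH_TID_SIZE);
}
```

With these assumptions, a 616-byte posting tuple whose posting list starts at offset 16 is charged the same 24 bytes as a plain 16-byte tuple, instead of 624 bytes.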
There seems to be a kind of "synergy" between the nbtsplitloc.c
handling of pages that have lots of duplicates and posting list
compression. It seems as if the former mechanism "sets up the bowling
pins", while the latter mechanism "knocks them down", which is really
cool. We should try to gain a better understanding of how that works,
because it's possible that it could be even more effective in some
cases.
--
Peter Geoghegan
Attachments:
0003-Account-for-posting-list-overhead-during-splits.patch (application/octet-stream)
From 36147525a12101d8bde6c00a238759cd371eefcc Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Wed, 24 Jul 2019 14:35:13 -0700
Subject: [PATCH 3/3] Account for posting list overhead during splits.
---
src/backend/access/nbtree/nbtsplitloc.c | 37 +++++++++++++++++++------
1 file changed, 29 insertions(+), 8 deletions(-)
diff --git a/src/backend/access/nbtree/nbtsplitloc.c b/src/backend/access/nbtree/nbtsplitloc.c
index fbb12dbff1..77e1d46672 100644
--- a/src/backend/access/nbtree/nbtsplitloc.c
+++ b/src/backend/access/nbtree/nbtsplitloc.c
@@ -459,6 +459,7 @@ _bt_recsplitloc(FindSplitData *state,
int16 leftfree,
rightfree;
Size firstrightitemsz;
+ Size postingsubhikey = 0;
bool newitemisfirstonright;
/* Is the new item going to be the first item on the right page? */
@@ -466,10 +467,33 @@ _bt_recsplitloc(FindSplitData *state,
&& !newitemonleft);
if (newitemisfirstonright)
+ {
firstrightitemsz = state->newitemsz;
+
+ /* Calculate posting list overhead, if any */
+ if (state->is_leaf && BTreeTupleIsPosting(state->newitem))
+ postingsubhikey = IndexTupleSize(state->newitem) -
+ BTreeTupleGetPostingOffset(state->newitem);
+ }
else
+ {
firstrightitemsz = firstoldonrightsz;
+ /* Calculate posting list overhead, if any */
+ if (state->is_leaf)
+ {
+ ItemId itemid;
+ IndexTuple newhighkey;
+
+ itemid = PageGetItemId(state->page, firstoldonright);
+ newhighkey = (IndexTuple) PageGetItem(state->page, itemid);
+
+ if (BTreeTupleIsPosting(newhighkey))
+ postingsubhikey = IndexTupleSize(newhighkey) -
+ BTreeTupleGetPostingOffset(newhighkey);
+ }
+ }
+
/* Account for all the old tuples */
leftfree = state->leftspace - olddataitemstoleft;
rightfree = state->rightspace -
@@ -492,16 +516,13 @@ _bt_recsplitloc(FindSplitData *state,
* adding a heap TID to the left half's new high key when splitting at the
* leaf level. In practice the new high key will often be smaller and
* will rarely be larger, but conservatively assume the worst case.
- *
- * FIXME: We can make better choices about split points by being clever
- * about the BTreeTupleIsPosting() case here. All we need to do is
- * subtract the whole size of the posting list, then add
- * MAXALIGN(sizeof(ItemPointerData)), since we know for sure that
- * _bt_truncate() won't make a final high key that is larger even in the
- * worst case.
+ * Truncation always truncates away any posting list that appears in the
+ * first right tuple, though, so it's safe to subtract that overhead
+ * (while still conservatively assuming that truncation might have to add
+ * back a single heap TID using the pivot tuple heap TID representation).
*/
if (state->is_leaf)
- leftfree -= (int16) (firstrightitemsz +
+ leftfree -= (int16) ((firstrightitemsz - postingsubhikey) +
MAXALIGN(sizeof(ItemPointerData)));
else
leftfree -= (int16) firstrightitemsz;
--
2.17.1
On Wed, Jul 24, 2019 at 3:06 PM Peter Geoghegan <pg@bowt.ie> wrote:
There seems to be a kind of "synergy" between the nbtsplitloc.c
handling of pages that have lots of duplicates and posting list
compression. It seems as if the former mechanism "sets up the bowling
pins", while the latter mechanism "knocks them down", which is really
cool. We should try to gain a better understanding of how that works,
because it's possible that it could be even more effective in some
cases.
I found another important way in which this synergy can fail to take
place, which I can fix.
By removing the BT_COMPRESS_THRESHOLD limit entirely, certain indexes
from my test suite become much smaller, while most are not affected.
These indexes were not helped too much by the patch before. For
example, the TPC-E i_t_st_id index is 50% smaller. It is entirely full
of duplicates of a single value (that's how it appears after an
initial TPC-E bulk load), as are a couple of other TPC-E indexes.
TPC-H's idx_partsupp_partkey index becomes ~18% smaller, while its
idx_lineitem_orderkey index becomes ~15% smaller.
I believe that this happened because rightmost page splits were an
inefficient case for compression. But rightmost page split heavy
indexes with lots of duplicates are not that uncommon. Think of any
index with many NULL values, for example.
I don't know for sure if BT_COMPRESS_THRESHOLD should be removed. I'm
not sure what the idea is behind it. My sense is that we're likely to
benefit by delaying page splits, no matter what. Though I am still
looking at it purely from a space utilization point of view, at least
for now.
--
Peter Geoghegan
On Thu, 25 Jul 2019 at 05:49, Peter Geoghegan <pg@bowt.ie> wrote:
On Wed, Jul 24, 2019 at 3:06 PM Peter Geoghegan <pg@bowt.ie> wrote:
There seems to be a kind of "synergy" between the nbtsplitloc.c
handling of pages that have lots of duplicates and posting list
compression. It seems as if the former mechanism "sets up the bowling
pins", while the latter mechanism "knocks them down", which is really
cool. We should try to gain a better understanding of how that works,
because it's possible that it could be even more effective in some
cases.
I found another important way in which this synergy can fail to take
place, which I can fix.
By removing the BT_COMPRESS_THRESHOLD limit entirely, certain indexes
from my test suite become much smaller, while most are not affected.
These indexes were not helped too much by the patch before. For
example, the TPC-E i_t_st_id index is 50% smaller. It is entirely full
of duplicates of a single value (that's how it appears after an
initial TPC-E bulk load), as are a couple of other TPC-E indexes.
TPC-H's idx_partsupp_partkey index becomes ~18% smaller, while its
idx_lineitem_orderkey index becomes ~15% smaller.
I believe that this happened because rightmost page splits were an
inefficient case for compression. But rightmost page split heavy
indexes with lots of duplicates are not that uncommon. Think of any
index with many NULL values, for example.
I don't know for sure if BT_COMPRESS_THRESHOLD should be removed. I'm
not sure what the idea is behind it. My sense is that we're likely to
benefit by delaying page splits, no matter what. Though I am still
looking at it purely from a space utilization point of view, at least
for now.
Minor comment fix: pointes --> pointers. Also, are we really splitting
into halves, or just splitting into two?
/*
+ * Split posting tuple into two halves.
+ *
+ * Left tuple contains all item pointes less than the new one and
+ * right tuple contains new item pointer and all to the right.
+ *
+ * TODO Probably we can come up with more clever algorithm.
+ */
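To make the question concrete, here is a minimal sketch of the split the comment describes, assuming the posting list is a sorted array of TIDs and the split point is where the new TID would be inserted. SplitTid, split_tid_compare() and split_sketch_position() are hypothetical names for illustration and do not come from the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical simplified TID: heap block number plus offset within it. */
typedef struct SplitTid
{
	uint32_t	block;
	uint16_t	offset;
} SplitTid;

static int
split_tid_compare(const SplitTid *a, const SplitTid *b)
{
	if (a->block != b->block)
		return (a->block < b->block) ? -1 : 1;
	if (a->offset != b->offset)
		return (a->offset < b->offset) ? -1 : 1;
	return 0;
}

/*
 * Return how many TIDs of a sorted posting list sort before newtid.
 * Those TIDs would stay in the left tuple; newtid and everything from
 * the returned index onward would go to the right tuple.
 */
static int
split_sketch_position(const SplitTid *tids, int ntids, const SplitTid *newtid)
{
	int			low = 0;
	int			high = ntids;

	assert(ntids >= 0);

	while (low < high)
	{
		int			mid = low + (high - low) / 2;

		if (split_tid_compare(&tids[mid], newtid) < 0)
			low = mid + 1;
		else
			high = mid;
	}

	return low;
}
```

Under this reading the two parts are equal in size only by coincidence, so "into two" would describe it more accurately than "into two halves".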
There are some remnants of 'he'.
+/*
+ * If tuple is posting, t_tid.ip_blkid contains offset of the posting list.
+ * Caller is responsible for checking BTreeTupleIsPosting to ensure that
+ * it will get what he expects
+ */
Everything reads just fine without 'us'.
/*
+ * This field helps us to find beginning of the remaining tuples from
+ * postings which follow array of offset numbers.
+ */
--
Regards,
Rafia Sabih
24.07.2019 4:22, Peter Geoghegan wrote:
Attached is a revised version of your v2 that fixes this issue -- I'll
call this v3. In general, my goal for the revision was to make sure
that all of my old tests from the v12 work passed, and to make sure
that amcheck can detect almost any possible problem. I tested the
amcheck changes by corrupting random state in a test index using
pg_hexedit, then making sure that amcheck actually complained in each
case.
I also fixed one or two bugs in passing, including the bug that caused
an assertion failure in _bt_truncate(). That was down to a subtle
off-by-one issue within _bt_insertonpg_in_posting(). Overall, I didn't
make that many changes to your v2. There are probably some things
about the patch that I still don't understand, or things that I have
misunderstood.
Thank you for this review and fixes.
* Changed the custom binary search code within _bt_compare_posting()
to look more like _bt_binsrch() and _bt_binsrch_insert(). Do you know
of any reason not to do it that way?
It's ok to update it. There was no particular reason, just my habit.
* Added quite a few "FIXME"/"XXX" comments at various points, to
indicate where I have general concerns that need more discussion.
+ * FIXME: The calls to BTreeGetNthTupleOfPosting() allocate memory,
If we only need to check TIDs, we don't need BTreeGetNthTupleOfPosting(),
we can use BTreeTupleGetPostingN() instead and iterate over TIDs, not
tuples.
Fixed in version 4.
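As a rough illustration of checking TIDs in place rather than materializing one tuple per TID, the loop could look like the sketch below. It uses a hypothetical simplified TID struct (TidSketch) and helper names instead of ItemPointerData and the BTreeTupleGetPostingN() macro, so it is only a model of the idea, not the v4 code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical simplified TID: heap block number plus offset within it. */
typedef struct TidSketch
{
	uint32_t	block;
	uint16_t	offset;
} TidSketch;

static int
tid_sketch_compare(const TidSketch *a, const TidSketch *b)
{
	if (a->block != b->block)
		return (a->block < b->block) ? -1 : 1;
	if (a->offset != b->offset)
		return (a->offset < b->offset) ? -1 : 1;
	return 0;
}

/*
 * Walk the posting list array directly and verify that the TIDs are in
 * strictly ascending order.  No per-TID tuple is allocated.
 */
static bool
tid_sketch_posting_is_sorted(const TidSketch *tids, int ntids)
{
	assert(ntids >= 0);

	for (int i = 1; i < ntids; i++)
	{
		if (tid_sketch_compare(&tids[i], &tids[i - 1]) <= 0)
			return false;
	}

	return true;
}
```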
* Included my own pageinspect hack to visualize the minimum TIDs in
posting lists. It's broken out into a separate patch file. The code is
very rough, but it might help someone else, so I thought I'd include
it.
Cool, I think we should add it to the final patchset, probably as a
separate function, by analogy with tuple_data_split.
I also have some new concerns about the code in the patch that I will
point out now (though only as something to think about a solution on
-- I am unsure myself):
* It's a bad sign that compression involves calls to PageAddItem()
that are allowed to fail (we just give up on compression when that
happens). For one thing, all existing calls to PageAddItem() in
Postgres are never expected to fail -- if they do fail we get a "can't
happen" error that suggests corruption. It was a good idea to take
this approach to get the patch to work, and to prove the general idea,
but we now need to fully work out all the details about the use of
space. This includes complicated new questions around how alignment is
supposed to work.
The main reason for this gentle error handling is that deduplication
can add storage overhead, which could lead to running out of space on
the page.
First of all, it is a legacy of previous versions, where
BTreeFormPostingTuple was not able to form a non-posting tuple even when
the number of posting items was 1.
Another case I had in mind is the situation where we have 2 tuples:
t_tid | t_info | key + t_tid | t_info | key
and the compressed result is:
t_tid | t_info | key | t_tid | t_tid
If sizeof(t_info) + sizeof(key) < sizeof(t_tid), the resulting posting
tuple can be larger. This may happen if the key size is <= 4 bytes.
In this situation the original tuples must have been aligned to 16 bytes
each, and the resulting tuple is at most 24 bytes (6+2+4+6+6). So this
case is also safe.
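The arithmetic above can be checked with a small sketch. It assumes an 8-byte maximum alignment, a 6-byte t_tid and a 2-byte t_info, and it follows the layout written out in this message (no padding between the key and the posting list), which may not match what BTreeFormPostingTuple actually produces. All names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins; 8-byte maximum alignment is an assumption. */
#define SKETCH_MAXALIGN(len)  (((size_t) (len) + 7) & ~(size_t) 7)
#define SKETCH_TID_SIZE       6		/* assumed size of t_tid */
#define SKETCH_INFO_SIZE      2		/* assumed size of t_info */

/* Total aligned size of n separate tuples, each with a keysize-byte key. */
static size_t
sketch_plain_tuples_size(size_t keysize, size_t n)
{
	assert(n > 0);
	return n * SKETCH_MAXALIGN(SKETCH_TID_SIZE + SKETCH_INFO_SIZE + keysize);
}

/*
 * Aligned size of one posting tuple holding n TIDs, laid out as
 * t_tid | t_info | key | t_tid * n, as in the message above.
 */
static size_t
sketch_posting_tuple_size(size_t keysize, size_t n)
{
	assert(n > 0);
	return SKETCH_MAXALIGN(SKETCH_TID_SIZE + SKETCH_INFO_SIZE + keysize +
						   n * SKETCH_TID_SIZE);
}
```

For a 4-byte key and two TIDs this gives 32 bytes for the two separate tuples and 24 bytes for the posting tuple, so under these assumptions alignment padding absorbs the extra TID. Line pointers are not counted here.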
I changed the DEBUG message to ERROR in v4 and it passes all regression tests.
I doubt that the regression tests cover all corner cases, so I'll try to add
more special tests.
Alignment in nbtree is already complicated today -- you're supposed to
MAXALIGN() everything in nbtree, so that the MAXALIGN() within
bufpage.c routines cannot be different to the lp_len/IndexTupleSize()
length (note that heapam can have tuples whose lp_len isn't aligned,
so nbtree could do it differently if it proved useful). Code within
nbtsplitloc.c fully understands the space requirements for the
bufpage.c routines, and is very careful about it. (The bufpage.c
details are supposed to be totally hidden from code like
nbtsplitloc.c, but I guess that that ideal isn't quite possible in
reality. Code comments don't really explain the situation today.)
I'm not sure what it would look like for this patch to be as precise
about free space as nbtsplitloc.c already is, even though that seems
desirable (I just know that it would mean you would trust
PageAddItem() to work in all cases). The patch is different to what we
already have today in that it tries to add *less than* a single
MAXALIGN() quantum at a time in some places (when a posting list needs
to grow by one item). The devil is in the details.
* As you know, the current approach to WAL logging is very
inefficient. It's okay for now, but we'll need a fine-grained approach
for the patch to be committable. I think that this is subtly related to
the last item (i.e. the one about alignment). I have done basic
performance tests using unlogged tables. The patch seems to either
make big INSERT queries run as fast or faster than before when
inserting into unlogged tables, which is a very good start.
* Since we can now split a posting list in two, we may also have to
reconsider BTMaxItemSize, or some similar mechanism that worries about
extreme cases where it becomes impossible to split because even two
pages are not enough to fit everything. Think of what happens when
there is a tuple with a single large datum, that gets split in two
(the tuple is split, not the page), with each half receiving its own
copy of the datum. I haven't proven to myself that this is broken, but
that may just be because I haven't spent any time on it. OTOH, maybe
you already have it right, in which case it seems like it should be
explained somewhere. Possibly in nbtree.h. This is tricky stuff.
Hmm, I don't see the problem.
In the current implementation each posting tuple is smaller than
BTMaxItemSize, so no split can produce a tuple of larger size.
* I agree with all of your existing TODO items -- most of them seem
very important to me.
* Do we really need to keep BTreeTupleGetHeapTID(), now that we have
BTreeTupleGetMinTID()? Can't we combine the two macros into one, so
that callers don't need to think about the pivot vs posting list thing
themselves? See the new code added to _bt_mkscankey() by v3, for
example. It now handles both cases/macros at once, in order to keep
its amcheck caller happy. amcheck's verify_nbtree.c received similar
ugly code in v3.
No, we don't need them both. I don't mind combining them into one macro.
Actually, we never needed BTreeTupleGetMinTID(),
since its functionality is covered by BTreeTupleGetHeapTID.
On the other hand, in some cases BTreeTupleGetMinTID() looks more readable.
For example here:
Assert(ItemPointerCompare(BTreeTupleGetMaxTID(lefttup),
BTreeTupleGetMinTID(righttup)) < 0);
* We should at least experiment with applying compression when
inserting into unique indexes. Like Alexander, I think that
compression in unique indexes might work well, given how they must
work in Postgres.
The main reason why I decided to avoid applying compression to unique
indexes is the performance of microvacuum. It is not applied to items
inside a posting tuple. And I expect it to be important for unique
indexes, which ideally contain only a few live values.
One more thing I want to discuss:
/*
* We do not expect to meet any DEAD items, since this function is
* called right after _bt_vacuum_one_page(). If for some reason we
* found dead item, don't compress it, to allow upcoming microvacuum
* or vacuum clean it up.
*/
if (ItemIdIsDead(itemId))
continue;
In the previous review Rafia asked about "some reason".
Trying to figure out whether this situation is possible, I changed this
line to Assert(!ItemIdIsDead(itemId)) in our test version. And it failed
in a performance test. Unfortunately, I was not able to reproduce it.
The explanation I see is that the page had DEAD items, but for some reason
BTP_HAS_GARBAGE was not set, so _bt_vacuum_one_page() was not called.
I find it difficult to understand what could lead to this situation,
so we probably need to look at it more closely to rule out the
possibility of a bug.
--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachments:
v4-0001-Compression-deduplication-in-nbtree.patch (text/x-patch)
diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index 55a3a4b..b8c1d03 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -889,6 +889,7 @@ bt_target_page_check(BtreeCheckState *state)
size_t tupsize;
BTScanInsert skey;
bool lowersizelimit;
+ ItemPointer scantid;
CHECK_FOR_INTERRUPTS();
@@ -959,29 +960,73 @@ bt_target_page_check(BtreeCheckState *state)
/*
* Readonly callers may optionally verify that non-pivot tuples can
- * each be found by an independent search that starts from the root
+ * each be found by an independent search that starts from the root.
+ * Note that we deliberately don't do individual searches for each
+ * "logical" posting list tuple, since the posting list itself is
+ * validated by other checks.
*/
if (state->rootdescend && P_ISLEAF(topaque) &&
!bt_rootdescend(state, itup))
{
char *itid,
*htid;
+ ItemPointer tid = BTreeTupleGetMinTID(itup);
itid = psprintf("(%u,%u)", state->targetblock, offset);
htid = psprintf("(%u,%u)",
- ItemPointerGetBlockNumber(&(itup->t_tid)),
- ItemPointerGetOffsetNumber(&(itup->t_tid)));
+ ItemPointerGetBlockNumber(tid),
+ ItemPointerGetOffsetNumber(tid));
ereport(ERROR,
(errcode(ERRCODE_INDEX_CORRUPTED),
errmsg("could not find tuple using search from root page in index \"%s\"",
RelationGetRelationName(state->rel)),
- errdetail_internal("Index tid=%s points to heap tid=%s page lsn=%X/%X.",
+ errdetail_internal("Index tid=%s min heap tid=%s page lsn=%X/%X.",
itid, htid,
(uint32) (state->targetlsn >> 32),
(uint32) state->targetlsn)));
}
+ /*
+ * If tuple is actually a posting list, make sure posting list TIDs
+ * are in order.
+ */
+ if (BTreeTupleIsPosting(itup))
+ {
+ ItemPointerData last;
+ ItemPointer current;
+
+ ItemPointerCopy(BTreeTupleGetMinTID(itup), &last);
+
+ for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
+ {
+
+ current = BTreeTupleGetPostingN(itup, i);
+
+ if (ItemPointerCompare(current, &last) <= 0)
+ {
+ char *itid,
+ *htid;
+
+ itid = psprintf("(%u,%u)", state->targetblock, offset);
+ htid = psprintf("(%u,%u)",
+ ItemPointerGetBlockNumberNoCheck(current),
+ ItemPointerGetOffsetNumberNoCheck(current));
+
+ ereport(ERROR,
+ (errcode(ERRCODE_INDEX_CORRUPTED),
+ errmsg("posting list heap TIDs out of order in index \"%s\"",
+ RelationGetRelationName(state->rel)),
+ errdetail_internal("Index tid=%s min heap tid=%s page lsn=%X/%X.",
+ itid, htid,
+ (uint32) (state->targetlsn >> 32),
+ (uint32) state->targetlsn)));
+ }
+
+ ItemPointerCopy(current, &last);
+ }
+ }
+
/* Build insertion scankey for current page offset */
skey = bt_mkscankey_pivotsearch(state->rel, itup);
@@ -1039,12 +1084,33 @@ bt_target_page_check(BtreeCheckState *state)
{
IndexTuple norm;
- norm = bt_normalize_tuple(state, itup);
- bloom_add_element(state->filter, (unsigned char *) norm,
- IndexTupleSize(norm));
- /* Be tidy */
- if (norm != itup)
- pfree(norm);
+ if (BTreeTupleIsPosting(itup))
+ {
+ IndexTuple onetup;
+
+ /* Fingerprint all elements of posting tuple one by one */
+ for (int i = 0; i < BTreeTupleGetNPosting(itup); i++)
+ {
+ onetup = BTreeGetNthTupleOfPosting(itup, i);
+
+ norm = bt_normalize_tuple(state, onetup);
+ bloom_add_element(state->filter, (unsigned char *) norm,
+ IndexTupleSize(norm));
+ /* Be tidy */
+ if (norm != onetup)
+ pfree(norm);
+ pfree(onetup);
+ }
+ }
+ else
+ {
+ norm = bt_normalize_tuple(state, itup);
+ bloom_add_element(state->filter, (unsigned char *) norm,
+ IndexTupleSize(norm));
+ /* Be tidy */
+ if (norm != itup)
+ pfree(norm);
+ }
}
/*
@@ -1052,7 +1118,8 @@ bt_target_page_check(BtreeCheckState *state)
*
* If there is a high key (if this is not the rightmost page on its
* entire level), check that high key actually is upper bound on all
- * page items.
+ * page items. If this is a posting list tuple, we'll need to set
+ * scantid to be highest TID in posting list.
*
* We prefer to check all items against high key rather than checking
* just the last and trusting that the operator class obeys the
@@ -1092,6 +1159,9 @@ bt_target_page_check(BtreeCheckState *state)
* tuple. (See also: "Notes About Data Representation" in the nbtree
* README.)
*/
+ scantid = skey->scantid;
+ if (!BTreeTupleIsPivot(itup))
+ skey->scantid = BTreeTupleGetMaxTID(itup);
if (!P_RIGHTMOST(topaque) &&
!(P_ISLEAF(topaque) ? invariant_leq_offset(state, skey, P_HIKEY) :
invariant_l_offset(state, skey, P_HIKEY)))
@@ -1115,6 +1185,7 @@ bt_target_page_check(BtreeCheckState *state)
(uint32) (state->targetlsn >> 32),
(uint32) state->targetlsn)));
}
+ skey->scantid = scantid;
/*
* * Item order check *
@@ -1129,11 +1200,16 @@ bt_target_page_check(BtreeCheckState *state)
*htid,
*nitid,
*nhtid;
+ ItemPointer tid;
itid = psprintf("(%u,%u)", state->targetblock, offset);
+ if (!BTreeTupleIsPivot(itup))
+ tid = BTreeTupleGetMinTID(itup);
+ else
+ tid = BTreeTupleGetHeapTID(itup);
htid = psprintf("(%u,%u)",
- ItemPointerGetBlockNumberNoCheck(&(itup->t_tid)),
- ItemPointerGetOffsetNumberNoCheck(&(itup->t_tid)));
+ ItemPointerGetBlockNumberNoCheck(tid),
+ ItemPointerGetOffsetNumberNoCheck(tid));
nitid = psprintf("(%u,%u)", state->targetblock,
OffsetNumberNext(offset));
@@ -1142,9 +1218,14 @@ bt_target_page_check(BtreeCheckState *state)
state->target,
OffsetNumberNext(offset));
itup = (IndexTuple) PageGetItem(state->target, itemid);
+
+ if (!BTreeTupleIsPivot(itup))
+ tid = BTreeTupleGetMinTID(itup);
+ else
+ tid = BTreeTupleGetHeapTID(itup);
nhtid = psprintf("(%u,%u)",
- ItemPointerGetBlockNumberNoCheck(&(itup->t_tid)),
- ItemPointerGetOffsetNumberNoCheck(&(itup->t_tid)));
+ ItemPointerGetBlockNumberNoCheck(tid),
+ ItemPointerGetOffsetNumberNoCheck(tid));
ereport(ERROR,
(errcode(ERRCODE_INDEX_CORRUPTED),
@@ -1154,10 +1235,10 @@ bt_target_page_check(BtreeCheckState *state)
"higher index tid=%s (points to %s tid=%s) "
"page lsn=%X/%X.",
itid,
- P_ISLEAF(topaque) ? "heap" : "index",
+ P_ISLEAF(topaque) ? "min heap" : "index",
htid,
nitid,
- P_ISLEAF(topaque) ? "heap" : "index",
+ P_ISLEAF(topaque) ? "min heap" : "index",
nhtid,
(uint32) (state->targetlsn >> 32),
(uint32) state->targetlsn)));
@@ -1918,10 +1999,11 @@ bt_tuple_present_callback(Relation index, HeapTuple htup, Datum *values,
* verification. In particular, it won't try to normalize opclass-equal
* datums with potentially distinct representations (e.g., btree/numeric_ops
* index datums will not get their display scale normalized-away here).
- * Normalization may need to be expanded to handle more cases in the future,
- * though. For example, it's possible that non-pivot tuples could in the
- * future have alternative logically equivalent representations due to using
- * the INDEX_ALT_TID_MASK bit to implement intelligent deduplication.
+ * Caller does normalization for non-pivot tuples that have their own posting
+ * list, since dummy CREATE INDEX callback code generates new tuples with the
+ * same normalized representation. Compression is performed
+ * opportunistically, and in general there is no guarantee about how or when
+ * compression will be applied.
*/
static IndexTuple
bt_normalize_tuple(BtreeCheckState *state, IndexTuple itup)
@@ -2525,14 +2607,20 @@ static inline ItemPointer
BTreeTupleGetHeapTIDCareful(BtreeCheckState *state, IndexTuple itup,
bool nonpivot)
{
- ItemPointer result = BTreeTupleGetHeapTID(itup);
+ ItemPointer result;
BlockNumber targetblock = state->targetblock;
- if (result == NULL && nonpivot)
+ if (BTreeTupleIsPivot(itup) == nonpivot)
ereport(ERROR,
(errcode(ERRCODE_INDEX_CORRUPTED),
errmsg("block %u or its right sibling block or child block in index \"%s\" contains non-pivot tuple that lacks a heap TID",
targetblock, RelationGetRelationName(state->rel))));
+ /* XXX: Again, I wonder if we need both of these macros... */
+ if (!BTreeTupleIsPivot(itup))
+ result = BTreeTupleGetMinTID(itup);
+ else
+ result = BTreeTupleGetHeapTID(itup);
+
return result;
}
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index 602f884..57b6bb5 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -41,6 +41,17 @@ static OffsetNumber _bt_findinsertloc(Relation rel,
BTStack stack,
Relation heapRel);
static void _bt_stepright(Relation rel, BTInsertState insertstate, BTStack stack);
+static void _bt_delete_and_insert(Relation rel,
+ Buffer buf,
+ IndexTuple newitup,
+ OffsetNumber newitemoff);
+static void _bt_insertonpg_in_posting(Relation rel, BTScanInsert itup_key,
+ Buffer buf,
+ Buffer cbuf,
+ BTStack stack,
+ IndexTuple itup,
+ OffsetNumber newitemoff,
+ bool split_only_page, int in_posting_offset);
static void _bt_insertonpg(Relation rel, BTScanInsert itup_key,
Buffer buf,
Buffer cbuf,
@@ -56,6 +67,8 @@ static void _bt_insert_parent(Relation rel, Buffer buf, Buffer rbuf,
static bool _bt_pgaddtup(Page page, Size itemsize, IndexTuple itup,
OffsetNumber itup_off);
static void _bt_vacuum_one_page(Relation rel, Buffer buffer, Relation heapRel);
+static void insert_itupprev_to_page(Page page, BTCompressState *compressState);
+static void _bt_compress_one_page(Relation rel, Buffer buffer, Relation heapRel);
/*
* _bt_doinsert() -- Handle insertion of a single index tuple in the tree.
@@ -297,10 +310,17 @@ top:
* search bounds established within _bt_check_unique when insertion is
* checkingunique.
*/
+ insertstate.in_posting_offset = 0;
newitemoff = _bt_findinsertloc(rel, &insertstate, checkingunique,
stack, heapRel);
- _bt_insertonpg(rel, itup_key, insertstate.buf, InvalidBuffer, stack,
- itup, newitemoff, false);
+
+ if (insertstate.in_posting_offset)
+ _bt_insertonpg_in_posting(rel, itup_key, insertstate.buf,
+ InvalidBuffer, stack, itup, newitemoff,
+ false, insertstate.in_posting_offset);
+ else
+ _bt_insertonpg(rel, itup_key, insertstate.buf, InvalidBuffer,
+ stack, itup, newitemoff, false);
}
else
{
@@ -759,6 +779,12 @@ _bt_findinsertloc(Relation rel,
_bt_vacuum_one_page(rel, insertstate->buf, heapRel);
insertstate->bounds_valid = false;
}
+
+ /*
+ * If the target page is full, try to compress the page
+ */
+ if (PageGetFreeSpace(page) < insertstate->itemsz)
+ _bt_compress_one_page(rel, insertstate->buf, heapRel);
}
else
{
@@ -900,6 +926,191 @@ _bt_stepright(Relation rel, BTInsertState insertstate, BTStack stack)
insertstate->bounds_valid = false;
}
+/*
+ * Delete the tuple at offset newitemoff and insert newitup at the same offset.
+ * The caller must already have checked that the page has enough free space.
+ *
+ * Used to update a posting tuple in place.
+ */
+static void
+_bt_delete_and_insert(Relation rel,
+ Buffer buf,
+ IndexTuple newitup,
+ OffsetNumber newitemoff)
+{
+ Page page = BufferGetPage(buf);
+ Size newitupsz = IndexTupleSize(newitup);
+
+ newitupsz = MAXALIGN(newitupsz);
+
+ START_CRIT_SECTION();
+
+ PageIndexTupleDelete(page, newitemoff);
+
+ if (!_bt_pgaddtup(page, newitupsz, newitup, newitemoff))
+ elog(ERROR, "failed to insert compressed item in index \"%s\"",
+ RelationGetRelationName(rel));
+
+ MarkBufferDirty(buf);
+
+ /* Xlog stuff */
+ if (RelationNeedsWAL(rel))
+ {
+ xl_btree_insert xlrec;
+ XLogRecPtr recptr;
+
+ xlrec.offnum = newitemoff;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, SizeOfBtreeInsert);
+
+ Assert(P_ISLEAF((BTPageOpaque) PageGetSpecialPointer(page)));
+
+ /*
+ * Force full page write to keep code simple
+ *
+ * TODO: think of using XLOG_BTREE_INSERT_LEAF with a new tuple's data
+ */
+ XLogRegisterBuffer(0, buf, REGBUF_STANDARD | REGBUF_FORCE_IMAGE);
+ recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_INSERT_LEAF);
+ PageSetLSN(page, recptr);
+ }
+
+ END_CRIT_SECTION();
+}
+
+/*
+ * _bt_insertonpg_in_posting() --
+ * Insert a tuple on a particular page in the index
+ * (compression aware version).
+ *
+ * If the new tuple's key is equal to the key of a posting tuple that already
+ * exists on the page and its TID falls inside the min/max range of the
+ * existing posting list, update the posting tuple.
+ *
+ * This can only happen on a leaf page.
+ *
+ * newitemoff - offset of the posting tuple we must update
+ * in_posting_offset - position of the new tuple's TID in posting list
+ *
+ * If necessary, split the page.
+ */
+static void
+_bt_insertonpg_in_posting(Relation rel,
+ BTScanInsert itup_key,
+ Buffer buf,
+ Buffer cbuf,
+ BTStack stack,
+ IndexTuple itup,
+ OffsetNumber newitemoff,
+ bool split_only_page,
+ int in_posting_offset)
+{
+ IndexTuple origtup;
+ IndexTuple lefttup;
+ IndexTuple righttup;
+ ItemPointerData *ipd;
+ IndexTuple newitup;
+ Page page;
+ int nipd,
+ nipd_right;
+
+ page = BufferGetPage(buf);
+ /* get old posting tuple */
+ origtup = (IndexTuple) PageGetItem(page, PageGetItemId(page, newitemoff));
+ Assert(BTreeTupleIsPosting(origtup));
+ nipd = BTreeTupleGetNPosting(origtup);
+ Assert(in_posting_offset < nipd);
+ Assert(itup_key->scantid != NULL);
+ Assert(itup_key->heapkeyspace);
+
+ elog(DEBUG4, "(%u,%u) is min, (%u,%u) is max, (%u,%u) is new",
+ ItemPointerGetBlockNumberNoCheck(BTreeTupleGetMinTID(origtup)),
+ ItemPointerGetOffsetNumberNoCheck(BTreeTupleGetMinTID(origtup)),
+ ItemPointerGetBlockNumberNoCheck(BTreeTupleGetMaxTID(origtup)),
+ ItemPointerGetOffsetNumberNoCheck(BTreeTupleGetMaxTID(origtup)),
+ ItemPointerGetBlockNumberNoCheck(BTreeTupleGetMaxTID(itup)),
+ ItemPointerGetOffsetNumberNoCheck(BTreeTupleGetMaxTID(itup)));
+
+ /*
+ * First, check whether the new item pointer fits into the tuple's
+ * posting list.
+ *
+ * Also check whether the new item pointer fits into the page.
+ *
+ * If either check fails, the posting tuple must be split.
+ *
+ * XXX: Think some more about alignment - pg
+ */
+ if (BTMaxItemSize(page) < MAXALIGN(IndexTupleSize(origtup)) + MAXALIGN(sizeof(ItemPointerData)) ||
+ PageGetFreeSpace(page) < MAXALIGN(IndexTupleSize(origtup)) + MAXALIGN(sizeof(ItemPointerData)))
+ {
+ /*
+ * Split posting tuple into two halves.
+ *
+ * Left tuple contains all item pointers less than the new one and
+ * right tuple contains new item pointer and all to the right.
+ *
+ * TODO: We can probably come up with a more clever algorithm.
+ */
+ lefttup = BTreeFormPostingTuple(origtup, BTreeTupleGetPosting(origtup),
+ in_posting_offset);
+
+ nipd_right = nipd - in_posting_offset + 1;
+ ipd = palloc0(sizeof(ItemPointerData) * nipd_right);
+ /* insert new item pointer */
+ memcpy(ipd, itup, sizeof(ItemPointerData));
+ /* copy item pointers from original tuple that belong on right */
+ memcpy(ipd + 1,
+ BTreeTupleGetPostingN(origtup, in_posting_offset),
+ sizeof(ItemPointerData) * (nipd - in_posting_offset));
+
+ righttup = BTreeFormPostingTuple(origtup, ipd, nipd_right);
+ elog(DEBUG4, "inserting inside posting list with split due to no space orig elements %d new off %d",
+ nipd, in_posting_offset);
+
+ Assert(ItemPointerCompare(BTreeTupleGetMaxTID(lefttup),
+ BTreeTupleGetMinTID(righttup)) < 0);
+
+ /*
+ * Replace the old tuple with the left tuple on the page.
+ *
+ * Then insert the right tuple using the ordinary _bt_insertonpg()
+ * function. If a page split is required, _bt_insertonpg() handles it.
+ */
+ _bt_delete_and_insert(rel, buf, lefttup, newitemoff);
+ _bt_insertonpg(rel, itup_key, buf, InvalidBuffer,
+ stack, righttup, newitemoff + 1, false);
+
+ pfree(ipd);
+ pfree(lefttup);
+ pfree(righttup);
+ }
+ else
+ {
+ ipd = palloc0(sizeof(ItemPointerData) * (nipd + 1));
+ elog(DEBUG4, "inserting inside posting list due to apparent overlap");
+
+ /* copy item pointers from original tuple into ipd */
+ memcpy(ipd, BTreeTupleGetPosting(origtup),
+ sizeof(ItemPointerData) * in_posting_offset);
+ /* add item pointer of the new tuple into ipd */
+ memcpy(ipd + in_posting_offset, itup, sizeof(ItemPointerData));
+ /* copy item pointers from old tuple into ipd */
+ memcpy(ipd + in_posting_offset + 1,
+ BTreeTupleGetPostingN(origtup, in_posting_offset),
+ sizeof(ItemPointerData) * (nipd - in_posting_offset));
+
+ newitup = BTreeFormPostingTuple(itup, ipd, nipd + 1);
+
+ _bt_delete_and_insert(rel, buf, newitup, newitemoff);
+
+ pfree(ipd);
+ pfree(newitup);
+ _bt_relbuf(rel, buf);
+ }
+}
+
/*----------
* _bt_insertonpg() -- Insert a tuple on a particular page in the index.
*
@@ -2286,3 +2497,221 @@ _bt_vacuum_one_page(Relation rel, Buffer buffer, Relation heapRel)
* the page.
*/
}
+
+/*
+ * Add the pending item (itupprev as is, or a posting tuple built from it)
+ * to the page that is being compressed.
+ * If the item cannot be added, raise an error; nothing is returned to
+ * the caller.
+ */
+static void
+insert_itupprev_to_page(Page page, BTCompressState *compressState)
+{
+ IndexTuple to_insert;
+ OffsetNumber offnum = PageGetMaxOffsetNumber(page);
+
+ if (compressState->ntuples == 0)
+ to_insert = compressState->itupprev;
+ else
+ {
+ IndexTuple postingtuple;
+
+ /* form a tuple with a posting list */
+ postingtuple = BTreeFormPostingTuple(compressState->itupprev,
+ compressState->ipd,
+ compressState->ntuples);
+ to_insert = postingtuple;
+ pfree(compressState->ipd);
+ }
+
+ /* Add the new item into the page */
+ offnum = OffsetNumberNext(offnum);
+
+ elog(DEBUG4, "insert_itupprev_to_page. compressState->ntuples %d IndexTupleSize %zu free %zu",
+ compressState->ntuples, IndexTupleSize(to_insert), PageGetFreeSpace(page));
+
+ if (PageAddItem(page, (Item) to_insert, IndexTupleSize(to_insert),
+ offnum, false, false) == InvalidOffsetNumber)
+ {
+ if (compressState->ntuples > 0)
+ pfree(to_insert);
+ elog(ERROR, "failed to add tuple to page while compressing it");
+ }
+
+ if (compressState->ntuples > 0)
+ pfree(to_insert);
+ compressState->ntuples = 0;
+}
+
+/*
+ * Before splitting the page, try to compress items to free some space.
+ * If compression didn't succeed, buffer will contain old state of the page.
+ * This function should be called after lp_dead items
+ * were removed by _bt_vacuum_one_page().
+ */
+static void
+_bt_compress_one_page(Relation rel, Buffer buffer, Relation heapRel)
+{
+ OffsetNumber offnum,
+ minoff,
+ maxoff;
+ Page page = BufferGetPage(buffer);
+ Page newpage;
+ BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+ bool use_compression = false;
+ BTCompressState *compressState = NULL;
+ int n_posting_on_page = 0;
+ int natts = IndexRelationGetNumberOfAttributes(rel);
+
+ /*
+ * Don't use compression for indexes with INCLUDEd columns and unique
+ * indexes.
+ */
+ use_compression = (IndexRelationGetNumberOfKeyAttributes(rel) ==
+ IndexRelationGetNumberOfAttributes(rel) &&
+ !rel->rd_index->indisunique);
+ if (!use_compression)
+ return;
+
+ /* init compress state needed to build posting tuples */
+ compressState = (BTCompressState *) palloc0(sizeof(BTCompressState));
+ compressState->ipd = NULL;
+ compressState->ntuples = 0;
+ compressState->itupprev = NULL;
+ compressState->maxitemsize = BTMaxItemSize(page);
+ compressState->maxpostingsize = 0;
+
+ /*
+ * Scan over all items to see which ones can be compressed
+ */
+ minoff = P_FIRSTDATAKEY(opaque);
+ maxoff = PageGetMaxOffsetNumber(page);
+
+ /*
+ * Heuristic to avoid trying to compress a page that already contains
+ * mostly compressed items
+ */
+ for (offnum = minoff;
+ offnum <= maxoff;
+ offnum = OffsetNumberNext(offnum))
+ {
+ ItemId itemid = PageGetItemId(page, offnum);
+ IndexTuple item = (IndexTuple) PageGetItem(page, itemid);
+
+ if (BTreeTupleIsPosting(item))
+ n_posting_on_page++;
+ }
+
+ /*
+ * If we have only a few uncompressed items on the full page, it isn't
+ * worth compressing them
+ */
+ if (maxoff - n_posting_on_page < BT_COMPRESS_THRESHOLD)
+ return;
+
+ newpage = PageGetTempPageCopySpecial(page);
+ elog(DEBUG4, "_bt_compress_one_page rel: %s,blkno: %u",
+ RelationGetRelationName(rel), BufferGetBlockNumber(buffer));
+
+ /* Copy High Key if any */
+ if (!P_RIGHTMOST(opaque))
+ {
+ ItemId itemid = PageGetItemId(page, P_HIKEY);
+ Size itemsz = ItemIdGetLength(itemid);
+ IndexTuple item = (IndexTuple) PageGetItem(page, itemid);
+
+ if (PageAddItem(newpage, (Item) item, itemsz, P_HIKEY,
+ false, false) == InvalidOffsetNumber)
+ {
+ /*
+ * Should never happen. If it does, fall back gracefully: treat the
+ * page as incompressible and just return.
+ */
+ elog(DEBUG4, "_bt_compress_one_page. failed to insert highkey to newpage");
+ return;
+ }
+ }
+
+ /*
+ * Iterate over tuples on the page, try to compress them into posting
+ * lists and insert into new page.
+ */
+ for (offnum = minoff;
+ offnum <= maxoff;
+ offnum = OffsetNumberNext(offnum))
+ {
+ ItemId itemId = PageGetItemId(page, offnum);
+ IndexTuple itup = (IndexTuple) PageGetItem(page, itemId);
+
+ /*
+ * We do not expect to meet any DEAD items, since this function is
+ * called right after _bt_vacuum_one_page(). If for some reason we
+ * find a dead item, don't compress it, so that an upcoming
+ * microvacuum or vacuum can clean it up.
+ */
+ if (ItemIdIsDead(itemId))
+ continue;
+
+ if (compressState->itupprev != NULL)
+ {
+ int n_equal_atts =
+ _bt_keep_natts_fast(rel, compressState->itupprev, itup);
+ int itup_ntuples = BTreeTupleIsPosting(itup) ?
+ BTreeTupleGetNPosting(itup) : 1;
+
+ if (n_equal_atts > natts)
+ {
+ /*
+ * When tuples are equal, create or update posting.
+ *
+ * If posting is too big, insert it on page and continue.
+ */
+ if (compressState->maxitemsize >
+ MAXALIGN(((IndexTupleSize(compressState->itupprev)
+ + (compressState->ntuples + itup_ntuples + 1) * sizeof(ItemPointerData)))))
+ {
+ _bt_add_posting_item(compressState, itup);
+ }
+ else
+ {
+ insert_itupprev_to_page(newpage, compressState);
+ }
+ }
+ else
+ {
+ insert_itupprev_to_page(newpage, compressState);
+ }
+ }
+
+ /*
+ * Copy the tuple into temp variable itupprev to compare it with the
+ * following tuple and maybe unite them into a posting tuple
+ */
+ if (compressState->itupprev)
+ pfree(compressState->itupprev);
+ compressState->itupprev = CopyIndexTuple(itup);
+
+ Assert(IndexTupleSize(compressState->itupprev) <= compressState->maxitemsize);
+ }
+
+ /* Handle the last item. */
+ insert_itupprev_to_page(newpage, compressState);
+
+ START_CRIT_SECTION();
+
+ PageRestoreTempPage(newpage, page);
+ MarkBufferDirty(buffer);
+
+ /* Log full page write */
+ if (RelationNeedsWAL(rel))
+ {
+ XLogRecPtr recptr;
+
+ recptr = log_newpage_buffer(buffer, true);
+ PageSetLSN(page, recptr);
+ }
+
+ END_CRIT_SECTION();
+
+ elog(DEBUG4, "_bt_compress_one_page. success");
+}
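To make the posting-list arithmetic in _bt_insertonpg_in_posting() easier to review, here is a minimal standalone sketch of the two paths (update in place vs. split around the new TID). It is not code from the patch: Tid, tid_cmp(), posting_lower_bound(), posting_splice() and posting_split_right() are hypothetical names, Tid only stands in for ItemPointerData, and malloc() stands in for palloc0(). The lower-bound search mirrors what _bt_compare_posting() does to compute in_posting_offset.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for ItemPointerData: a (block, offset) pair. */
typedef struct Tid
{
	unsigned		block;
	unsigned short	offset;
} Tid;

/* Three-way comparison, analogous in spirit to ItemPointerCompare(). */
static int
tid_cmp(const Tid *a, const Tid *b)
{
	if (a->block != b->block)
		return (a->block < b->block) ? -1 : 1;
	if (a->offset != b->offset)
		return (a->offset < b->offset) ? -1 : 1;
	return 0;
}

/* Return the first position whose TID is >= key (n if there is none). */
static int
posting_lower_bound(const Tid *list, int n, const Tid *key)
{
	int		low = 0;
	int		high = n;		/* one past the end, for the loop invariant */

	while (high > low)
	{
		int		mid = low + (high - low) / 2;

		if (tid_cmp(key, &list[mid]) > 0)
			low = mid + 1;
		else
			high = mid;
	}
	return high;
}

/* In-place update path: new array of n + 1 TIDs with newtid at pos. */
static Tid *
posting_splice(const Tid *list, int n, int pos, Tid newtid)
{
	Tid	   *out;

	assert(pos >= 0 && pos <= n);
	out = malloc(sizeof(Tid) * (size_t) (n + 1));
	if (out == NULL)
		abort();
	memcpy(out, list, sizeof(Tid) * (size_t) pos);
	out[pos] = newtid;
	memcpy(out + pos + 1, list + pos, sizeof(Tid) * (size_t) (n - pos));
	return out;
}

/*
 * Split path: the left half is simply list[0 .. pos); the right half is
 * newtid followed by list[pos .. n), i.e. n - pos + 1 TIDs in total.
 */
static Tid *
posting_split_right(const Tid *list, int n, int pos, Tid newtid, int *nright)
{
	Tid	   *right;

	assert(pos >= 0 && pos <= n);
	*nright = n - pos + 1;
	right = malloc(sizeof(Tid) * (size_t) *nright);
	if (right == NULL)
		abort();
	right[0] = newtid;
	memcpy(right + 1, list + pos, sizeof(Tid) * (size_t) (n - pos));
	return right;
}
```

In this sketch the caller would pick posting_splice() when the enlarged tuple still fits on the page, and otherwise keep list[0 .. pos) as the left tuple and insert the result of posting_split_right() as a separate tuple immediately after it.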
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 5962126..707a5d0 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -983,14 +983,52 @@ _bt_page_recyclable(Page page)
void
_bt_delitems_vacuum(Relation rel, Buffer buf,
OffsetNumber *itemnos, int nitems,
+ OffsetNumber *remainingoffset,
+ IndexTuple *remaining, int nremaining,
BlockNumber lastBlockVacuumed)
{
Page page = BufferGetPage(buf);
BTPageOpaque opaque;
+ Size itemsz;
+ Size remaining_sz = 0;
+ char *remaining_buf = NULL;
+
+ /* XLOG stuff: build a buffer holding the remaining tuples */
+ if (nremaining && RelationNeedsWAL(rel))
+ {
+ Size offset = 0;
+
+ for (int i = 0; i < nremaining; i++)
+ remaining_sz += MAXALIGN(IndexTupleSize(remaining[i]));
+
+ remaining_buf = palloc0(remaining_sz);
+ for (int i = 0; i < nremaining; i++)
+ {
+ itemsz = IndexTupleSize(remaining[i]);
+ memcpy(remaining_buf + offset, (char *) remaining[i], itemsz);
+ offset += MAXALIGN(itemsz);
+ }
+ Assert(offset == remaining_sz);
+ }
/* No ereport(ERROR) until changes are logged */
START_CRIT_SECTION();
+ /* Handle posting tuples here */
+ for (int i = 0; i < nremaining; i++)
+ {
+ /* First, delete the old tuple. */
+ PageIndexTupleDelete(page, remainingoffset[i]);
+
+ itemsz = IndexTupleSize(remaining[i]);
+ itemsz = MAXALIGN(itemsz);
+
+ /* Add tuple with remaining ItemPointers to the page. */
+ if (PageAddItem(page, (Item) remaining[i], itemsz, remainingoffset[i],
+ false, false) == InvalidOffsetNumber)
+ elog(ERROR, "failed to rewrite compressed item in index while doing vacuum");
+ }
+
/* Fix the page */
if (nitems > 0)
PageIndexMultiDelete(page, itemnos, nitems);
@@ -1020,6 +1058,8 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
xl_btree_vacuum xlrec_vacuum;
xlrec_vacuum.lastBlockVacuumed = lastBlockVacuumed;
+ xlrec_vacuum.nremaining = nremaining;
+ xlrec_vacuum.ndeleted = nitems;
XLogBeginInsert();
XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
@@ -1033,6 +1073,19 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
if (nitems > 0)
XLogRegisterBufData(0, (char *) itemnos, nitems * sizeof(OffsetNumber));
+ /*
+ * Here we save the offsets and the remaining tuples themselves. It's
+ * important to restore them in the correct order: first handle the
+ * remaining tuples, and only after that the other deleted items.
+ */
+ if (nremaining > 0)
+ {
+ Assert(remaining_buf != NULL);
+ XLogRegisterBufData(0, (char *) remainingoffset,
+ nremaining * sizeof(OffsetNumber));
+ XLogRegisterBufData(0, remaining_buf, remaining_sz);
+ }
+
recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_VACUUM);
PageSetLSN(page, recptr);
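For reviewers who want to check the buffer layout that _bt_delitems_vacuum() builds for the remaining tuples, the packing step can be sketched in isolation. This is only an illustration under assumptions: Blob, SKETCH_MAXALIGN and pack_aligned() are hypothetical names, the alignment is fixed at 8 bytes here (the real MAXALIGN is platform dependent), and calloc() stands in for palloc0().

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Assumed alignment for this sketch only. */
#define SKETCH_ALIGNOF 8
#define SKETCH_MAXALIGN(len) \
	(((size_t) (len) + (SKETCH_ALIGNOF - 1)) & ~((size_t) (SKETCH_ALIGNOF - 1)))

/* Hypothetical stand-in for an index tuple: raw bytes plus a length. */
typedef struct Blob
{
	const char *data;
	size_t		len;
} Blob;

/*
 * Pack n items back to back, each starting on an aligned boundary.
 * Padding bytes are zero. Returns NULL and sets *total to 0 if there is
 * nothing to pack.
 */
static char *
pack_aligned(const Blob *items, int n, size_t *total)
{
	size_t	sz = 0;
	size_t	off = 0;
	char   *buf;

	/* First pass: compute the total aligned size. */
	for (int i = 0; i < n; i++)
		sz += SKETCH_MAXALIGN(items[i].len);

	*total = sz;
	if (sz == 0)
		return NULL;

	buf = calloc(1, sz);
	if (buf == NULL)
		abort();

	/* Second pass: copy each item, advancing by its aligned size. */
	for (int i = 0; i < n; i++)
	{
		memcpy(buf + off, items[i].data, items[i].len);
		off += SKETCH_MAXALIGN(items[i].len);
	}
	assert(off == sz);
	return buf;
}
```

Because every item starts on an aligned offset, a reader of the buffer (such as a redo routine) can walk it by reading each tuple's size and advancing by the aligned size, without a separate length table.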
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 4cfd528..22fb228 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -97,6 +97,8 @@ static void btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
BTCycleId cycleid, TransactionId *oldestBtpoXact);
static void btvacuumpage(BTVacState *vstate, BlockNumber blkno,
BlockNumber orig_blkno);
+static ItemPointer btreevacuumPosting(BTVacState *vstate, IndexTuple itup,
+ int *nremaining);
/*
@@ -1069,7 +1071,8 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
RBM_NORMAL, info->strategy);
LockBufferForCleanup(buf);
_bt_checkpage(rel, buf);
- _bt_delitems_vacuum(rel, buf, NULL, 0, vstate.lastBlockVacuumed);
+ _bt_delitems_vacuum(rel, buf, NULL, 0, NULL, NULL, 0,
+ vstate.lastBlockVacuumed);
_bt_relbuf(rel, buf);
}
@@ -1193,6 +1196,9 @@ restart:
OffsetNumber offnum,
minoff,
maxoff;
+ IndexTuple remaining[MaxOffsetNumber];
+ OffsetNumber remainingoffset[MaxOffsetNumber];
+ int nremaining;
/*
* Trade in the initial read lock for a super-exclusive write lock on
@@ -1229,6 +1235,7 @@ restart:
* callback function.
*/
ndeletable = 0;
+ nremaining = 0;
minoff = P_FIRSTDATAKEY(opaque);
maxoff = PageGetMaxOffsetNumber(page);
if (callback)
@@ -1242,31 +1249,78 @@ restart:
itup = (IndexTuple) PageGetItem(page,
PageGetItemId(page, offnum));
- htup = &(itup->t_tid);
- /*
- * During Hot Standby we currently assume that
- * XLOG_BTREE_VACUUM records do not produce conflicts. That is
- * only true as long as the callback function depends only
- * upon whether the index tuple refers to heap tuples removed
- * in the initial heap scan. When vacuum starts it derives a
- * value of OldestXmin. Backends taking later snapshots could
- * have a RecentGlobalXmin with a later xid than the vacuum's
- * OldestXmin, so it is possible that row versions deleted
- * after OldestXmin could be marked as killed by other
- * backends. The callback function *could* look at the index
- * tuple state in isolation and decide to delete the index
- * tuple, though currently it does not. If it ever did, we
- * would need to reconsider whether XLOG_BTREE_VACUUM records
- * should cause conflicts. If they did cause conflicts they
- * would be fairly harsh conflicts, since we haven't yet
- * worked out a way to pass a useful value for
- * latestRemovedXid on the XLOG_BTREE_VACUUM records. This
- * applies to *any* type of index that marks index tuples as
- * killed.
- */
- if (callback(htup, callback_state))
- deletable[ndeletable++] = offnum;
+ if (BTreeTupleIsPosting(itup))
+ {
+ int nnewipd = 0;
+ ItemPointer newipd = NULL;
+
+ newipd = btreevacuumPosting(vstate, itup, &nnewipd);
+
+ if (nnewipd == 0)
+ {
+ /*
+ * All TIDs from the posting list must be deleted, so we
+ * can delete the whole tuple in the regular way.
+ */
+ deletable[ndeletable++] = offnum;
+ }
+ else if (nnewipd == BTreeTupleGetNPosting(itup))
+ {
+ /*
+ * All TIDs from the posting tuple must remain. Do
+ * nothing except clean up.
+ */
+ pfree(newipd);
+ }
+ else if (nnewipd < BTreeTupleGetNPosting(itup))
+ {
+ /* Some TIDs from posting tuple must remain. */
+ Assert(nnewipd > 0);
+ Assert(newipd != NULL);
+
+ /*
+ * Form new tuple that contains only remaining TIDs.
+ * Remember this tuple and the offset of the old tuple
+ * to update it in place.
+ */
+ remainingoffset[nremaining] = offnum;
+ remaining[nremaining] = BTreeFormPostingTuple(itup, newipd, nnewipd);
+ nremaining++;
+ pfree(newipd);
+
+ Assert(IndexTupleSize(itup) <= BTMaxItemSize(page));
+ }
+ }
+ else
+ {
+ htup = &(itup->t_tid);
+
+ /*
+ * During Hot Standby we currently assume that
+ * XLOG_BTREE_VACUUM records do not produce conflicts.
+ * That is only true as long as the callback function
+ * depends only upon whether the index tuple refers to
+ * heap tuples removed in the initial heap scan. When
+ * vacuum starts it derives a value of OldestXmin.
+ * Backends taking later snapshots could have a
+ * RecentGlobalXmin with a later xid than the vacuum's
+ * OldestXmin, so it is possible that row versions deleted
+ * after OldestXmin could be marked as killed by other
+ * backends. The callback function *could* look at the
+ * index tuple state in isolation and decide to delete the
+ * index tuple, though currently it does not. If it ever
+ * did, we would need to reconsider whether
+ * XLOG_BTREE_VACUUM records should cause conflicts. If
+ * they did cause conflicts they would be fairly harsh
+ * conflicts, since we haven't yet worked out a way to
+ * pass a useful value for latestRemovedXid on the
+ * XLOG_BTREE_VACUUM records. This applies to *any* type
+ * of index that marks index tuples as killed.
+ */
+ if (callback(htup, callback_state))
+ deletable[ndeletable++] = offnum;
+ }
}
}
@@ -1274,7 +1328,7 @@ restart:
* Apply any needed deletes. We issue just one _bt_delitems_vacuum()
* call per page, so as to minimize WAL traffic.
*/
- if (ndeletable > 0)
+ if (ndeletable > 0 || nremaining > 0)
{
/*
* Notice that the issued XLOG_BTREE_VACUUM WAL record includes
@@ -1291,6 +1345,7 @@ restart:
* that.
*/
_bt_delitems_vacuum(rel, buf, deletable, ndeletable,
+ remainingoffset, remaining, nremaining,
vstate->lastBlockVacuumed);
/*
@@ -1376,6 +1431,41 @@ restart:
}
/*
+ * btreevacuumPosting() -- vacuums a posting tuple.
+ *
+ * Returns new palloc'd posting list with remaining items.
+ * Posting list size is returned via nremaining.
+ *
+ * If all items are dead,
+ * nremaining is 0 and resulting posting list is NULL.
+ */
+static ItemPointer
+btreevacuumPosting(BTVacState *vstate, IndexTuple itup, int *nremaining)
+{
+ int remaining = 0;
+ int nitem = BTreeTupleGetNPosting(itup);
+ ItemPointer tmpitems = NULL,
+ items = BTreeTupleGetPosting(itup);
+
+ /*
+ * Check each TID in the posting list, save the live ones into tmpitems
+ */
+ for (int i = 0; i < nitem; i++)
+ {
+ if (vstate->callback(items + i, vstate->callback_state))
+ continue;
+
+ if (tmpitems == NULL)
+ tmpitems = palloc(sizeof(ItemPointerData) * nitem);
+
+ tmpitems[remaining++] = items[i];
+ }
+
+ *nremaining = remaining;
+ return tmpitems;
+}
+
+/*
* btcanreturn() -- Check whether btree indexes support index-only scans.
*
* btrees always do, so this is trivial.
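The three-way decision that btvacuumpage() makes for a posting tuple (delete it, leave it alone, or rewrite it with the surviving TIDs) can be sketched outside the backend roughly as follows. This is a sketch under assumptions, not the patch itself: Tid, DeadCallback, posting_vacuum(), posting_action() and dead_on_block() are hypothetical names, and malloc() replaces palloc().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for ItemPointerData. */
typedef struct Tid
{
	unsigned		block;
	unsigned short	offset;
} Tid;

/* Returns true if the heap tuple behind this TID is dead. */
typedef bool (*DeadCallback) (const Tid *tid, void *state);

typedef enum PostingAction
{
	POSTING_DELETE_ALL,		/* no TIDs survive: delete the whole tuple */
	POSTING_KEEP_ALL,		/* every TID survives: nothing to do */
	POSTING_REWRITE			/* some survive: rebuild the posting tuple */
} PostingAction;

/*
 * Collect the surviving TIDs. The result array is allocated lazily, so it
 * stays NULL (and *nremaining is 0) when every TID is dead.
 */
static Tid *
posting_vacuum(const Tid *items, int nitem, DeadCallback is_dead,
			   void *state, int *nremaining)
{
	Tid	   *kept = NULL;
	int		remaining = 0;

	for (int i = 0; i < nitem; i++)
	{
		if (is_dead(&items[i], state))
			continue;

		if (kept == NULL)
		{
			kept = malloc(sizeof(Tid) * (size_t) nitem);
			if (kept == NULL)
				abort();
		}
		kept[remaining++] = items[i];
	}

	*nremaining = remaining;
	return kept;
}

/* Map the survivor count onto the caller's three cases. */
static PostingAction
posting_action(int nitem, int nremaining)
{
	if (nremaining == 0)
		return POSTING_DELETE_ALL;
	if (nremaining == nitem)
		return POSTING_KEEP_ALL;
	return POSTING_REWRITE;
}

/* Example callback: every TID on the block given in *state is dead. */
static bool
dead_on_block(const Tid *tid, void *state)
{
	return tid->block == *(const unsigned *) state;
}
```

Note that in the keep-all case the array returned by posting_vacuum() is still allocated and must be freed by the caller, which matches the pfree(newipd) in the corresponding branch of the patch.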
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index c655dad..3e53675 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -30,6 +30,9 @@ static bool _bt_readpage(IndexScanDesc scan, ScanDirection dir,
OffsetNumber offnum);
static void _bt_saveitem(BTScanOpaque so, int itemIndex,
OffsetNumber offnum, IndexTuple itup);
+static void _bt_savePostingitem(BTScanOpaque so, int itemIndex,
+ OffsetNumber offnum, ItemPointer iptr,
+ IndexTuple itup, int i);
static bool _bt_steppage(IndexScanDesc scan, ScanDirection dir);
static bool _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir);
static bool _bt_parallel_readpage(IndexScanDesc scan, BlockNumber blkno,
@@ -504,7 +507,8 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
/* We have low <= mid < high, so mid points at a real slot */
- result = _bt_compare(rel, key, page, mid);
+ result = _bt_compare_posting(rel, key, page, mid,
+ &(insertstate->in_posting_offset));
if (result >= cmpval)
low = mid + 1;
@@ -533,6 +537,55 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
return low;
}
+/*
+ * Compare insertion-type scankey to tuple on a page,
+ * taking into account posting tuples.
+ * If the key of the posting tuple is equal to scankey,
+ * find exact position inside the posting list,
+ * using TID as extra attribute.
+ */
+int32
+_bt_compare_posting(Relation rel,
+ BTScanInsert key,
+ Page page,
+ OffsetNumber offnum,
+ int *in_posting_offset)
+{
+ IndexTuple itup;
+ int result;
+
+ itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
+ result = _bt_compare(rel, key, page, offnum);
+
+ if (BTreeTupleIsPosting(itup) && result == 0)
+ {
+ int low,
+ high,
+ mid,
+ res;
+
+ low = 0;
+ /* "high" is past end of posting list for loop invariant */
+ high = BTreeTupleGetNPosting(itup);
+
+ while (high > low)
+ {
+ mid = low + ((high - low) / 2);
+ res = ItemPointerCompare(key->scantid,
+ BTreeTupleGetPostingN(itup, mid));
+
+ if (res >= 1)
+ low = mid + 1;
+ else
+ high = mid;
+ }
+
+ *in_posting_offset = high;
+ }
+
+ return result;
+}
+
/*----------
* _bt_compare() -- Compare insertion-type scankey to tuple on a page.
*
@@ -665,61 +718,120 @@ _bt_compare(Relation rel,
* Use the heap TID attribute and scantid to try to break the tie. The
* rules are the same as any other key attribute -- only the
* representation differs.
+ *
+ * When itup is a posting tuple, the check becomes more complex. It is
+ * possible that the scankey belongs to the tuple's posting list TID
+ * range.
+ *
+ * _bt_compare() is multipurpose, so it just returns 0 to indicate that the
+ * key matches the tuple at this offset.
+ *
+ * Use the special _bt_compare_posting() wrapper function to handle this
+ * case: it rechecks the posting tuple and finds the exact position of the
+ * scankey within the posting list.
*/
- heapTid = BTreeTupleGetHeapTID(itup);
- if (key->scantid == NULL)
+ if (!BTreeTupleIsPosting(itup))
{
+ heapTid = BTreeTupleGetHeapTID(itup);
+ if (key->scantid == NULL)
+ {
+ /*
+ * Most searches have a scankey that is considered greater than a
+ * truncated pivot tuple if and when the scankey has equal values
+ * for attributes up to and including the least significant
+ * untruncated attribute in tuple.
+ *
+ * For example, if an index has the minimum two attributes (single
+ * user key attribute, plus heap TID attribute), and a page's high
+ * key is ('foo', -inf), and scankey is ('foo', <omitted>), the
+ * search will not descend to the page to the left. The search
+ * will descend right instead. The truncated attribute in pivot
+ * tuple means that all non-pivot tuples on the page to the left
+ * are strictly < 'foo', so it isn't necessary to descend left. In
+ * other words, search doesn't have to descend left because it
+ * isn't interested in a match that has a heap TID value of -inf.
+ *
+ * However, some searches (pivotsearch searches) actually require
+ * that we descend left when this happens. -inf is treated as a
+ * possible match for omitted scankey attribute(s). This is
+ * needed by page deletion, which must re-find leaf pages that are
+ * targets for deletion using their high keys.
+ *
+ * Note: the heap TID part of the test ensures that scankey is
+ * being compared to a pivot tuple with one or more truncated key
+ * attributes.
+ *
+ * Note: pg_upgrade'd !heapkeyspace indexes must always descend to
+ * the left here, since they have no heap TID attribute (and
+ * cannot have any -inf key values in any case, since truncation
+ * can only remove non-key attributes). !heapkeyspace searches
+ * must always be prepared to deal with matches on both sides of
+ * the pivot once the leaf level is reached.
+ */
+ if (key->heapkeyspace && !key->pivotsearch &&
+ key->keysz == ntupatts && heapTid == NULL)
+ return 1;
+
+ /* All provided scankey arguments found to be equal */
+ return 0;
+ }
+
/*
- * Most searches have a scankey that is considered greater than a
- * truncated pivot tuple if and when the scankey has equal values for
- * attributes up to and including the least significant untruncated
- * attribute in tuple.
- *
- * For example, if an index has the minimum two attributes (single
- * user key attribute, plus heap TID attribute), and a page's high key
- * is ('foo', -inf), and scankey is ('foo', <omitted>), the search
- * will not descend to the page to the left. The search will descend
- * right instead. The truncated attribute in pivot tuple means that
- * all non-pivot tuples on the page to the left are strictly < 'foo',
- * so it isn't necessary to descend left. In other words, search
- * doesn't have to descend left because it isn't interested in a match
- * that has a heap TID value of -inf.
- *
- * However, some searches (pivotsearch searches) actually require that
- * we descend left when this happens. -inf is treated as a possible
- * match for omitted scankey attribute(s). This is needed by page
- * deletion, which must re-find leaf pages that are targets for
- * deletion using their high keys.
- *
- * Note: the heap TID part of the test ensures that scankey is being
- * compared to a pivot tuple with one or more truncated key
- * attributes.
- *
- * Note: pg_upgrade'd !heapkeyspace indexes must always descend to the
- * left here, since they have no heap TID attribute (and cannot have
- * any -inf key values in any case, since truncation can only remove
- * non-key attributes). !heapkeyspace searches must always be
- * prepared to deal with matches on both sides of the pivot once the
- * leaf level is reached.
+ * Treat truncated heap TID as minus infinity, since scankey has a key
+ * attribute value (scantid) that would otherwise be compared directly
*/
- if (key->heapkeyspace && !key->pivotsearch &&
- key->keysz == ntupatts && heapTid == NULL)
+ Assert(key->keysz == IndexRelationGetNumberOfKeyAttributes(rel));
+ if (heapTid == NULL)
return 1;
- /* All provided scankey arguments found to be equal */
- return 0;
+ Assert(ntupatts >= IndexRelationGetNumberOfKeyAttributes(rel));
+ return ItemPointerCompare(key->scantid, heapTid);
}
+ else
+ {
+ heapTid = BTreeTupleGetMinTID(itup);
+ if (key->scantid != NULL && heapTid != NULL)
+ {
+ int cmp = ItemPointerCompare(key->scantid, heapTid);
- /*
- * Treat truncated heap TID as minus infinity, since scankey has a key
- * attribute value (scantid) that would otherwise be compared directly
- */
- Assert(key->keysz == IndexRelationGetNumberOfKeyAttributes(rel));
- if (heapTid == NULL)
- return 1;
+ if (cmp == -1 || cmp == 0)
+ {
+ elog(DEBUG4, "offnum %d Scankey (%u,%u) is less than or equal to posting tuple (%u,%u)",
+ offnum, ItemPointerGetBlockNumberNoCheck(key->scantid),
+ ItemPointerGetOffsetNumberNoCheck(key->scantid),
+ ItemPointerGetBlockNumberNoCheck(heapTid),
+ ItemPointerGetOffsetNumberNoCheck(heapTid));
+ return cmp;
+ }
- Assert(ntupatts >= IndexRelationGetNumberOfKeyAttributes(rel));
- return ItemPointerCompare(key->scantid, heapTid);
+ heapTid = BTreeTupleGetMaxTID(itup);
+ cmp = ItemPointerCompare(key->scantid, heapTid);
+ if (cmp == 1)
+ {
+ elog(DEBUG4, "offnum %d Scankey (%u,%u) is greater than posting tuple (%u,%u)",
+ offnum, ItemPointerGetBlockNumberNoCheck(key->scantid),
+ ItemPointerGetOffsetNumberNoCheck(key->scantid),
+ ItemPointerGetBlockNumberNoCheck(heapTid),
+ ItemPointerGetOffsetNumberNoCheck(heapTid));
+ return cmp;
+ }
+
+ /*
+ * If we got here, scantid falls in between the posting items of
+ * the tuple
+ */
+ elog(DEBUG4, "offnum %d Scankey (%u,%u) is between posting items (%u,%u) and (%u,%u)",
+ offnum, ItemPointerGetBlockNumberNoCheck(key->scantid),
+ ItemPointerGetOffsetNumberNoCheck(key->scantid),
+ ItemPointerGetBlockNumberNoCheck(BTreeTupleGetMinTID(itup)),
+ ItemPointerGetOffsetNumberNoCheck(BTreeTupleGetMinTID(itup)),
+ ItemPointerGetBlockNumberNoCheck(heapTid),
+ ItemPointerGetOffsetNumberNoCheck(heapTid));
+ return 0;
+ }
+ }
+
+ return 0;
}
/*
@@ -1456,6 +1568,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
/* initialize tuple workspace to empty */
so->currPos.nextTupleOffset = 0;
+ so->currPos.prevTupleOffset = 0;
/*
* Now that the current page has been made consistent, the macro should be
@@ -1490,8 +1603,22 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
if (_bt_checkkeys(scan, itup, indnatts, dir, &continuescan))
{
/* tuple passes all scan key conditions, so remember it */
- _bt_saveitem(so, itemIndex, offnum, itup);
- itemIndex++;
+ if (!BTreeTupleIsPosting(itup))
+ {
+ _bt_saveitem(so, itemIndex, offnum, itup);
+ itemIndex++;
+ }
+ else
+ {
+ /* Return posting list "logical" tuples */
+ for (int i = 0; i < BTreeTupleGetNPosting(itup); i++)
+ {
+ _bt_savePostingitem(so, itemIndex, offnum,
+ BTreeTupleGetPostingN(itup, i),
+ itup, i);
+ itemIndex++;
+ }
+ }
}
/* When !continuescan, there can't be any more matches, so stop */
if (!continuescan)
@@ -1524,7 +1651,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
if (!continuescan)
so->currPos.moreRight = false;
- Assert(itemIndex <= MaxIndexTuplesPerPage);
+ Assert(itemIndex <= MaxPostingIndexTuplesPerPage);
so->currPos.firstItem = 0;
so->currPos.lastItem = itemIndex - 1;
so->currPos.itemIndex = 0;
@@ -1532,7 +1659,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
else
{
/* load items[] in descending order */
- itemIndex = MaxIndexTuplesPerPage;
+ itemIndex = MaxPostingIndexTuplesPerPage;
offnum = Min(offnum, maxoff);
@@ -1574,8 +1701,23 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
if (passes_quals && tuple_alive)
{
/* tuple passes all scan key conditions, so remember it */
- itemIndex--;
- _bt_saveitem(so, itemIndex, offnum, itup);
+ if (!BTreeTupleIsPosting(itup))
+ {
+ itemIndex--;
+ _bt_saveitem(so, itemIndex, offnum, itup);
+ }
+ else
+ {
+ /* Return posting list "logical" tuples */
+ /* XXX: Maybe this loop should be backwards? */
+ for (int i = 0; i < BTreeTupleGetNPosting(itup); i++)
+ {
+ itemIndex--;
+ _bt_savePostingitem(so, itemIndex, offnum,
+ BTreeTupleGetPostingN(itup, i),
+ itup, i);
+ }
+ }
}
if (!continuescan)
{
@@ -1589,8 +1731,8 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
Assert(itemIndex >= 0);
so->currPos.firstItem = itemIndex;
- so->currPos.lastItem = MaxIndexTuplesPerPage - 1;
- so->currPos.itemIndex = MaxIndexTuplesPerPage - 1;
+ so->currPos.lastItem = MaxPostingIndexTuplesPerPage - 1;
+ so->currPos.itemIndex = MaxPostingIndexTuplesPerPage - 1;
}
return (so->currPos.firstItem <= so->currPos.lastItem);
@@ -1603,6 +1745,8 @@ _bt_saveitem(BTScanOpaque so, int itemIndex,
{
BTScanPosItem *currItem = &so->currPos.items[itemIndex];
+ Assert(!BTreeTupleIsPosting(itup));
+
currItem->heapTid = itup->t_tid;
currItem->indexOffset = offnum;
if (so->currTuples)
@@ -1615,6 +1759,33 @@ _bt_saveitem(BTScanOpaque so, int itemIndex,
}
}
+/* Save an index item into so->currPos.items[itemIndex] for posting tuples. */
+static void
+_bt_savePostingitem(BTScanOpaque so, int itemIndex, OffsetNumber offnum,
+ ItemPointer iptr, IndexTuple itup, int i)
+{
+ BTScanPosItem *currItem = &so->currPos.items[itemIndex];
+
+ currItem->heapTid = *iptr;
+ currItem->indexOffset = offnum;
+
+ if (so->currTuples)
+ {
+ if (i == 0)
+ {
+ /* Save the key; it is the same for all TIDs in the posting list */
+ Size itupsz = BTreeTupleGetPostingOffset(itup);
+
+ currItem->tupleOffset = so->currPos.nextTupleOffset;
+ memcpy(so->currTuples + so->currPos.nextTupleOffset, itup, itupsz);
+ so->currPos.nextTupleOffset += MAXALIGN(itupsz);
+ so->currPos.prevTupleOffset = currItem->tupleOffset;
+ }
+ else
+ currItem->tupleOffset = so->currPos.prevTupleOffset;
+ }
+}
+
/*
* _bt_steppage() -- Step to next page containing valid data for scan
*
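To illustrate the range comparison that the scantid hunk above performs for
posting tuples, here is a minimal standalone sketch. It assumes a simplified
TID struct and hypothetical helper names (Tid, tid_compare,
compare_scantid_to_posting) that are not part of the patch; the patch itself
works on ItemPointerData through ItemPointerCompare(), BTreeTupleGetMinTID()
and BTreeTupleGetMaxTID().

```c
#include <assert.h>
#include <stdint.h>

/* Simplified stand-in for ItemPointerData (hypothetical, illustration only) */
typedef struct Tid
{
    uint32_t block;
    uint16_t offset;
} Tid;

/* Three-way comparison by block number, then offset number */
static int
tid_compare(Tid a, Tid b)
{
    if (a.block != b.block)
        return (a.block < b.block) ? -1 : 1;
    if (a.offset != b.offset)
        return (a.offset < b.offset) ? -1 : 1;
    return 0;
}

/*
 * Compare scantid with a sorted posting list of nposting TIDs.
 *
 * Returns -1 when scantid is below the minimum TID, 1 when it is above the
 * maximum TID, and 0 when it falls anywhere inside [min, max].
 */
static int
compare_scantid_to_posting(Tid scantid, const Tid *posting, int nposting)
{
    int cmp;

    assert(nposting > 0);

    /* First check against the smallest TID in the posting list */
    cmp = tid_compare(scantid, posting[0]);
    if (cmp <= 0)
        return cmp;

    /* Then check against the largest TID */
    cmp = tid_compare(scantid, posting[nposting - 1]);
    if (cmp > 0)
        return cmp;

    /* scantid lies between the posting items */
    return 0;
}
```

A return value of 0 here only means "inside the range covered by this posting
tuple"; it does not say that scantid is actually present in the list.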
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index d0b9013..5545465f9 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -288,6 +288,8 @@ static void _bt_sortaddtup(Page page, Size itemsize,
static void _bt_buildadd(BTWriteState *wstate, BTPageState *state,
IndexTuple itup);
static void _bt_uppershutdown(BTWriteState *wstate, BTPageState *state);
+static void _bt_buildadd_posting(BTWriteState *wstate, BTPageState *state,
+ BTCompressState *compressState);
static void _bt_load(BTWriteState *wstate,
BTSpool *btspool, BTSpool *btspool2);
static void _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent,
@@ -972,6 +974,11 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
* only shift the line pointer array back and forth, and overwrite
* the tuple space previously occupied by oitup. This is fairly
* cheap.
+ *
+ * If lastleft tuple was a posting tuple, we'll truncate its
+ * posting list in _bt_truncate as well. Note that this applies
+ * only to leaf pages, since internal pages never
+ * contain posting tuples.
*/
ii = PageGetItemId(opage, OffsetNumberPrev(last_off));
lastleft = (IndexTuple) PageGetItem(opage, ii);
@@ -1011,6 +1018,7 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
* the minimum key for the new page.
*/
state->btps_minkey = CopyIndexTuple(oitup);
+ Assert(BTreeTupleIsPivot(state->btps_minkey));
/*
* Set the sibling links for both pages.
@@ -1052,6 +1060,7 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
Assert(state->btps_minkey == NULL);
state->btps_minkey = CopyIndexTuple(itup);
/* _bt_sortaddtup() will perform full truncation later */
+ BTreeTupleClearBtIsPosting(state->btps_minkey);
BTreeTupleSetNAtts(state->btps_minkey, 0);
}
@@ -1137,6 +1146,91 @@ _bt_uppershutdown(BTWriteState *wstate, BTPageState *state)
}
/*
+ * Add new tuple (posting or non-posting) to the page while building index.
+ */
+static void
+_bt_buildadd_posting(BTWriteState *wstate, BTPageState *state,
+ BTCompressState *compressState)
+{
+ IndexTuple to_insert;
+
+ /* Return if there is no tuple to insert */
+ if (state == NULL)
+ return;
+
+ if (compressState->ntuples == 0)
+ to_insert = compressState->itupprev;
+ else
+ {
+ IndexTuple postingtuple;
+
+ /* form a tuple with a posting list */
+ postingtuple = BTreeFormPostingTuple(compressState->itupprev,
+ compressState->ipd,
+ compressState->ntuples);
+ to_insert = postingtuple;
+ pfree(compressState->ipd);
+ }
+
+ _bt_buildadd(wstate, state, to_insert);
+
+ if (compressState->ntuples > 0)
+ pfree(to_insert);
+ compressState->ntuples = 0;
+}
+
+/*
+ * Save item pointer(s) of itup to the posting list in compressState.
+ *
+ * Helper function for _bt_load() and _bt_compress_one_page().
+ *
+ * Note: caller is responsible for size check to ensure that resulting tuple
+ * won't exceed BTMaxItemSize.
+ */
+void
+_bt_add_posting_item(BTCompressState *compressState, IndexTuple itup)
+{
+ int nposting = 0;
+
+ if (compressState->ntuples == 0)
+ {
+ compressState->ipd = palloc0(compressState->maxitemsize);
+
+ if (BTreeTupleIsPosting(compressState->itupprev))
+ {
+ /* if itupprev is posting, add all its TIDs to the posting list */
+ nposting = BTreeTupleGetNPosting(compressState->itupprev);
+ memcpy(compressState->ipd,
+ BTreeTupleGetPosting(compressState->itupprev),
+ sizeof(ItemPointerData) * nposting);
+ compressState->ntuples += nposting;
+ }
+ else
+ {
+ memcpy(compressState->ipd, compressState->itupprev,
+ sizeof(ItemPointerData));
+ compressState->ntuples++;
+ }
+ }
+
+ if (BTreeTupleIsPosting(itup))
+ {
+ /* if tuple is posting, add all its TIDs to the posting list */
+ nposting = BTreeTupleGetNPosting(itup);
+ memcpy(compressState->ipd + compressState->ntuples,
+ BTreeTupleGetPosting(itup),
+ sizeof(ItemPointerData) * nposting);
+ compressState->ntuples += nposting;
+ }
+ else
+ {
+ memcpy(compressState->ipd + compressState->ntuples, itup,
+ sizeof(ItemPointerData));
+ compressState->ntuples++;
+ }
+}
+
+/*
* Read tuples in correct sort order from tuplesort, and load them into
* btree leaves.
*/
@@ -1150,9 +1244,20 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
bool load1;
TupleDesc tupdes = RelationGetDescr(wstate->index);
int i,
- keysz = IndexRelationGetNumberOfKeyAttributes(wstate->index);
+ keysz = IndexRelationGetNumberOfKeyAttributes(wstate->index),
+ natts = IndexRelationGetNumberOfAttributes(wstate->index);
SortSupport sortKeys;
int64 tuples_done = 0;
+ bool use_compression = false;
+ BTCompressState *compressState = NULL;
+
+ /*
+ * Don't use compression for indexes with INCLUDEd columns and unique
+ * indexes.
+ */
+ use_compression = (IndexRelationGetNumberOfKeyAttributes(wstate->index) ==
+ IndexRelationGetNumberOfAttributes(wstate->index) &&
+ !wstate->index->rd_index->indisunique);
if (merge)
{
@@ -1266,19 +1371,89 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
}
else
{
- /* merge is unnecessary */
- while ((itup = tuplesort_getindextuple(btspool->sortstate,
- true)) != NULL)
+ if (!use_compression)
{
- /* When we see first tuple, create first index page */
- if (state == NULL)
- state = _bt_pagestate(wstate, 0);
+ /* merge is unnecessary */
+ while ((itup = tuplesort_getindextuple(btspool->sortstate,
+ true)) != NULL)
+ {
+ /* When we see first tuple, create first index page */
+ if (state == NULL)
+ state = _bt_pagestate(wstate, 0);
- _bt_buildadd(wstate, state, itup);
+ _bt_buildadd(wstate, state, itup);
- /* Report progress */
- pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
- ++tuples_done);
+ /* Report progress */
+ pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
+ ++tuples_done);
+ }
+ }
+ else
+ {
+ /* init compress state needed to build posting tuples */
+ compressState = (BTCompressState *) palloc0(sizeof(BTCompressState));
+ compressState->ipd = NULL;
+ compressState->ntuples = 0;
+ compressState->itupprev = NULL;
+ compressState->maxitemsize = 0;
+ compressState->maxpostingsize = 0;
+
+ while ((itup = tuplesort_getindextuple(btspool->sortstate,
+ true)) != NULL)
+ {
+ /* When we see first tuple, create first index page */
+ if (state == NULL)
+ {
+ state = _bt_pagestate(wstate, 0);
+ compressState->maxitemsize = BTMaxItemSize(state->btps_page);
+ }
+
+ if (compressState->itupprev != NULL)
+ {
+ int n_equal_atts = _bt_keep_natts_fast(wstate->index,
+ compressState->itupprev, itup);
+
+ if (n_equal_atts > natts)
+ {
+ /*
+ * Tuples are equal. Create or update posting.
+ *
+ * If the posting list would grow too big, insert it on the
+ * page and continue.
+ */
+ if ((compressState->ntuples + 1) * sizeof(ItemPointerData) <
+ compressState->maxpostingsize)
+ _bt_add_posting_item(compressState, itup);
+ else
+ _bt_buildadd_posting(wstate, state,
+ compressState);
+ }
+ else
+ {
+ /*
+ * Tuples are not equal. Insert itupprev into index.
+ * Save current tuple for the next iteration.
+ */
+ _bt_buildadd_posting(wstate, state, compressState);
+ }
+ }
+
+ /*
+ * Save the tuple to compare it with the next one and maybe
+ * unite them into a posting tuple.
+ */
+ if (compressState->itupprev)
+ pfree(compressState->itupprev);
+ compressState->itupprev = CopyIndexTuple(itup);
+
+ /* compute max size of posting list */
+ compressState->maxpostingsize = compressState->maxitemsize -
+ IndexInfoFindDataOffset(compressState->itupprev->t_info) -
+ MAXALIGN(IndexTupleSize(compressState->itupprev));
+ }
+
+ /* Handle the last item */
+ _bt_buildadd_posting(wstate, state, compressState);
}
}
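The build loop in the _bt_load() hunk above compares each sorted tuple with
the previous one and either extends the pending posting list or flushes it.
A minimal standalone sketch of that idea follows. The Entry and Group types
and the group_duplicates() helper are hypothetical, and the cap is expressed
as a number of TIDs rather than as the byte limit (maxpostingsize) that the
patch uses.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified (key, TID) pair as it would come out of the sort */
typedef struct Entry
{
    int key;
    int tid;
} Entry;

/* One output tuple: a key plus a run of TIDs in[first .. first + ntids - 1] */
typedef struct Group
{
    int    key;
    size_t first;
    size_t ntids;
} Group;

/*
 * Walk entries sorted by (key, tid) and emit one group per run of equal
 * keys, starting a new group whenever the current one already holds
 * max_tids TIDs.  Returns the number of groups written; out must have
 * room for n groups.
 */
static size_t
group_duplicates(const Entry *in, size_t n, size_t max_tids, Group *out)
{
    size_t ngroups = 0;

    assert(max_tids > 0);

    for (size_t i = 0; i < n; i++)
    {
        if (ngroups > 0 &&
            out[ngroups - 1].key == in[i].key &&
            out[ngroups - 1].ntids < max_tids)
        {
            /* Same key and still room: extend the pending posting list */
            out[ngroups - 1].ntids++;
        }
        else
        {
            /* Different key, or posting list is full: start a new group */
            out[ngroups].key = in[i].key;
            out[ngroups].first = i;
            out[ngroups].ntids = 1;
            ngroups++;
        }
    }

    return ngroups;
}
```

A group with ntids == 1 corresponds to an ordinary non-posting tuple; only
groups with two or more TIDs would become posting tuples.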
diff --git a/src/backend/access/nbtree/nbtsplitloc.c b/src/backend/access/nbtree/nbtsplitloc.c
index a7882fd..fbb12db 100644
--- a/src/backend/access/nbtree/nbtsplitloc.c
+++ b/src/backend/access/nbtree/nbtsplitloc.c
@@ -492,6 +492,13 @@ _bt_recsplitloc(FindSplitData *state,
* adding a heap TID to the left half's new high key when splitting at the
* leaf level. In practice the new high key will often be smaller and
* will rarely be larger, but conservatively assume the worst case.
+ *
+ * FIXME: We can make better choices about split points by being clever
+ * about the BTreeTupleIsPosting() case here. All we need to do is
+ * subtract the whole size of the posting list, then add
+ * MAXALIGN(sizeof(ItemPointerData)), since we know for sure that
+ * _bt_truncate() won't make a final high key that is larger even in the
+ * worst case.
*/
if (state->is_leaf)
leftfree -= (int16) (firstrightitemsz +
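The FIXME above can be made concrete with a little arithmetic. The sketch
below is only an illustration of the proposed bound, not code from the patch:
the helper names are hypothetical, and it assumes 8-byte MAXALIGN and a
6-byte ItemPointerData.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed alignment and TID size (typical 64-bit build) */
#define SKETCH_MAXALIGN(len) (((size_t) (len) + 7) & ~(size_t) 7)
#define SKETCH_TID_SIZE 6

/*
 * Current conservative bound on the new high key: the whole first-right
 * tuple plus one aligned heap TID.
 */
static size_t
highkey_bound_current(size_t firstright_size)
{
    return firstright_size + SKETCH_MAXALIGN(SKETCH_TID_SIZE);
}

/*
 * Bound suggested by the FIXME for a posting tuple: only the key part (up
 * to the posting list offset) survives truncation, plus one aligned heap
 * TID.
 */
static size_t
highkey_bound_posting(size_t posting_offset)
{
    return posting_offset + SKETCH_MAXALIGN(SKETCH_TID_SIZE);
}
```

For example, a posting tuple with a 16-byte key part and 100 TIDs occupies
616 bytes; the current bound is 624 bytes, while the tighter one is 24.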
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 93fab26..a6eee1b 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -111,8 +111,21 @@ _bt_mkscankey(Relation rel, IndexTuple itup)
key->nextkey = false;
key->pivotsearch = false;
key->keysz = Min(indnkeyatts, tupnatts);
- key->scantid = key->heapkeyspace && itup ?
- BTreeTupleGetHeapTID(itup) : NULL;
+
+ /*
+ * XXX: Do we need to have both BTreeTupleGetHeapTID() and
+ * BTreeTupleGetMinTID()?
+ */
+ if (itup && key->heapkeyspace)
+ {
+ if (!BTreeTupleIsPivot(itup))
+ key->scantid = BTreeTupleGetMinTID(itup);
+ else
+ key->scantid = BTreeTupleGetHeapTID(itup);
+ }
+ else
+ key->scantid = NULL;
+
skey = key->scankeys;
for (i = 0; i < indnkeyatts; i++)
{
@@ -1787,7 +1800,9 @@ _bt_killitems(IndexScanDesc scan)
ItemId iid = PageGetItemId(page, offnum);
IndexTuple ituple = (IndexTuple) PageGetItem(page, iid);
- if (ItemPointerEquals(&ituple->t_tid, &kitem->heapTid))
+ /* No microvacuum for posting tuples */
+ if (!BTreeTupleIsPosting(ituple) &&
+ (ItemPointerEquals(&ituple->t_tid, &kitem->heapTid)))
{
/* found the item */
ItemIdMarkDead(iid);
@@ -2145,6 +2160,16 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
pivot = index_truncate_tuple(itupdesc, firstright, keepnatts);
+ if (BTreeTupleIsPosting(firstright))
+ {
+ BTreeTupleClearBtIsPosting(pivot);
+ BTreeTupleSetNAtts(pivot, keepnatts);
+ pivot->t_info &= ~INDEX_SIZE_MASK;
+ pivot->t_info |= BTreeTupleGetPostingOffset(firstright);
+ }
+
+ Assert(!BTreeTupleIsPosting(pivot));
+
/*
* If there is a distinguishing key attribute within new pivot tuple,
* there is no need to add an explicit heap TID attribute
@@ -2161,6 +2186,8 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
* attribute to the new pivot tuple.
*/
Assert(natts != nkeyatts);
+ Assert(!BTreeTupleIsPosting(lastleft));
+ Assert(!BTreeTupleIsPosting(firstright));
newsize = IndexTupleSize(pivot) + MAXALIGN(sizeof(ItemPointerData));
tidpivot = palloc0(newsize);
memcpy(tidpivot, pivot, IndexTupleSize(pivot));
@@ -2168,6 +2195,27 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
pfree(pivot);
pivot = tidpivot;
}
+ else if (BTreeTupleIsPosting(firstright))
+ {
+ /*
+ * No truncation was possible, since key attributes are all equal. But
+ * the tuple is a compressed tuple with a posting list, so we still
+ * must truncate it.
+ *
+ * It's necessary to add a heap TID attribute to the new pivot tuple.
+ */
+ newsize = BTreeTupleGetPostingOffset(firstright) +
+ MAXALIGN(sizeof(ItemPointerData));
+ pivot = palloc0(newsize);
+ memcpy(pivot, firstright, BTreeTupleGetPostingOffset(firstright));
+
+ pivot->t_info &= ~INDEX_SIZE_MASK;
+ pivot->t_info |= newsize;
+ BTreeTupleClearBtIsPosting(pivot);
+ BTreeTupleSetAltHeapTID(pivot);
+
+ Assert(!BTreeTupleIsPosting(pivot));
+ }
else
{
/*
@@ -2205,7 +2253,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
*/
pivotheaptid = (ItemPointer) ((char *) pivot + newsize -
sizeof(ItemPointerData));
- ItemPointerCopy(&lastleft->t_tid, pivotheaptid);
+ ItemPointerCopy(BTreeTupleGetMaxTID(lastleft), pivotheaptid);
/*
* Lehman and Yao require that the downlink to the right page, which is to
@@ -2216,9 +2264,12 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
* tiebreaker.
*/
#ifndef DEBUG_NO_TRUNCATE
- Assert(ItemPointerCompare(&lastleft->t_tid, &firstright->t_tid) < 0);
- Assert(ItemPointerCompare(pivotheaptid, &lastleft->t_tid) >= 0);
- Assert(ItemPointerCompare(pivotheaptid, &firstright->t_tid) < 0);
+ Assert(ItemPointerCompare(BTreeTupleGetMaxTID(lastleft),
+ BTreeTupleGetMinTID(firstright)) < 0);
+ Assert(ItemPointerCompare(pivotheaptid,
+ BTreeTupleGetMinTID(lastleft)) >= 0);
+ Assert(ItemPointerCompare(pivotheaptid,
+ BTreeTupleGetMinTID(firstright)) < 0);
#else
/*
@@ -2231,7 +2282,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
* attribute values along with lastleft's heap TID value when lastleft's
* TID happens to be greater than firstright's TID.
*/
- ItemPointerCopy(&firstright->t_tid, pivotheaptid);
+ ItemPointerCopy(BTreeTupleGetMinTID(firstright), pivotheaptid);
/*
* Pivot heap TID should never be fully equal to firstright. Note that
@@ -2240,7 +2291,8 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
*/
ItemPointerSetOffsetNumber(pivotheaptid,
OffsetNumberPrev(ItemPointerGetOffsetNumber(pivotheaptid)));
- Assert(ItemPointerCompare(pivotheaptid, &firstright->t_tid) < 0);
+ Assert(ItemPointerCompare(pivotheaptid,
+ BTreeTupleGetMinTID(firstright)) < 0);
#endif
BTreeTupleSetNAtts(pivot, nkeyatts);
@@ -2330,6 +2382,25 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
* leaving excessive amounts of free space on either side of page split.
* Callers can rely on the fact that attributes considered equal here are
* definitely also equal according to _bt_keep_natts.
+ *
+ * To build a posting tuple we need to ensure that all attributes
+ * of both tuples are equal. Use this function to compare them.
+ * TODO: maybe it's worth renaming the function.
+ *
+ * XXX: Obviously we need infrastructure for making sure it is okay to use
+ * this for posting list stuff. For example, non-deterministic collations
+ * cannot use compression, and will not work with what we have now.
+ *
+ * XXX: Even then, we probably also need to worry about TOAST as a special
+ * case. Don't repeat bugs like the amcheck bug that was fixed in commit
+ * eba775345d23d2c999bbb412ae658b6dab36e3e8. As the test case added in that
+ * commit shows, we need to worry about pg_attribute.attstorage changing in
+ * the underlying table due to an ALTER TABLE (and maybe a few other things
+ * like that). In general, the "TOAST input state" of a TOASTable datum isn't
+ * something that we make many guarantees about today, so even with C
+ * collation text we could in theory get different answers from
+ * _bt_keep_natts_fast() and _bt_keep_natts(). This needs to be nailed down
+ * in some way.
*/
int
_bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
@@ -2415,7 +2486,7 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
* Non-pivot tuples currently never use alternative heap TID
* representation -- even those within heapkeyspace indexes
*/
- if ((itup->t_info & INDEX_ALT_TID_MASK) != 0)
+ if (BTreeTupleIsPivot(itup))
return false;
/*
@@ -2470,7 +2541,7 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
* that to decide if the tuple is a pre-v11 tuple.
*/
return tupnatts == 0 ||
- ((itup->t_info & INDEX_ALT_TID_MASK) == 0 &&
+ (!BTreeTupleIsPivot(itup) &&
ItemPointerGetOffsetNumber(&(itup->t_tid)) == P_HIKEY);
}
else
@@ -2497,7 +2568,7 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
* heapkeyspace index pivot tuples, regardless of whether or not there are
* non-key attributes.
*/
- if ((itup->t_info & INDEX_ALT_TID_MASK) == 0)
+ if (!BTreeTupleIsPivot(itup))
return false;
/*
@@ -2549,6 +2620,8 @@ _bt_check_third_page(Relation rel, Relation heap, bool needheaptidspace,
if (!needheaptidspace && itemsz <= BTMaxItemSizeNoHeapTid(page))
return;
+ /* TODO correct error messages for posting tuples */
+
/*
* Internal page insertions cannot fail here, because that would mean that
* an earlier leaf level insertion that should have failed didn't
@@ -2575,3 +2648,79 @@ _bt_check_third_page(Relation rel, Relation heap, bool needheaptidspace,
"or use full text indexing."),
errtableconstraint(heap, RelationGetRelationName(rel))));
}
+
+/*
+ * Given a basic tuple that contains key datum and posting list,
+ * build a posting tuple.
+ *
+ * Basic tuple can be a posting tuple, but we only use key part of it,
+ * all ItemPointers must be passed via ipd.
+ *
+ * If nipd == 1, fall back to building a non-posting tuple.
+ * This is necessary to avoid storage overhead after a posting tuple was vacuumed.
+ */
+IndexTuple
+BTreeFormPostingTuple(IndexTuple tuple, ItemPointerData *ipd, int nipd)
+{
+ uint32 keysize,
+ newsize = 0;
+ IndexTuple itup;
+
+ /* We only need key part of the tuple */
+ if (BTreeTupleIsPosting(tuple))
+ keysize = BTreeTupleGetPostingOffset(tuple);
+ else
+ keysize = IndexTupleSize(tuple);
+
+ Assert(nipd > 0);
+
+ /* Add space needed for posting list */
+ if (nipd > 1)
+ newsize = SHORTALIGN(keysize) + sizeof(ItemPointerData) * nipd;
+ else
+ newsize = keysize;
+
+ newsize = MAXALIGN(newsize);
+ itup = palloc0(newsize);
+ memcpy(itup, tuple, keysize);
+ itup->t_info &= ~INDEX_SIZE_MASK;
+ itup->t_info |= newsize;
+
+ if (nipd > 1)
+ {
+ /* Form posting tuple, fill posting fields */
+
+ /* Set meta info about the posting list */
+ itup->t_info |= INDEX_ALT_TID_MASK;
+ BTreeSetPostingMeta(itup, nipd, SHORTALIGN(keysize));
+
+ /* sort the list to preserve TID order invariant */
+ qsort((void *) ipd, nipd, sizeof(ItemPointerData),
+ (int (*) (const void *, const void *)) ItemPointerCompare);
+
+ /* Copy posting list into the posting tuple */
+ memcpy(BTreeTupleGetPosting(itup), ipd,
+ sizeof(ItemPointerData) * nipd);
+ }
+ else
+ {
+ /* To finish building of a non-posting tuple, copy TID from ipd */
+ itup->t_info &= ~INDEX_ALT_TID_MASK;
+ ItemPointerCopy(ipd, &itup->t_tid);
+ }
+
+ return itup;
+}
+
+/*
+ * Opposite of BTreeFormPostingTuple.
+ * Returns a regular tuple that contains the key.
+ * The TID of the new tuple is the nth TID of the original tuple's posting list.
+ * The result tuple is palloc'd in the caller's context.
+ */
+IndexTuple
+BTreeGetNthTupleOfPosting(IndexTuple tuple, int n)
+{
+ Assert(BTreeTupleIsPosting(tuple));
+ return BTreeFormPostingTuple(tuple, BTreeTupleGetPostingN(tuple, n), 1);
+}
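As a rough illustration of what BTreeFormPostingTuple() computes, the sketch
below reproduces the size calculation and the way the posting metadata is
packed into the offset field of t_tid. The SKETCH_* macros and helper names
are hypothetical; the alignment values (2-byte SHORTALIGN, 8-byte MAXALIGN)
and the 6-byte TID size are assumptions for a typical build, while the mask
values are taken from the nbtree.h part of the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_SHORTALIGN(len) (((size_t) (len) + 1) & ~(size_t) 1)
#define SKETCH_MAXALIGN(len)   (((size_t) (len) + 7) & ~(size_t) 7)
#define SKETCH_TID_SIZE        6        /* assumed sizeof(ItemPointerData) */
#define SKETCH_N_POSTING_MASK  0x0FFF   /* BT_N_POSTING_OFFSET_MASK */
#define SKETCH_IS_POSTING      0x2000   /* BT_IS_POSTING */

/* Total tuple size for a key part of keysize bytes and nipd TIDs */
static size_t
posting_tuple_size(size_t keysize, int nipd)
{
    assert(nipd > 0);

    if (nipd > 1)
        return SKETCH_MAXALIGN(SKETCH_SHORTALIGN(keysize) +
                               SKETCH_TID_SIZE * (size_t) nipd);

    /* A single TID lives in t_tid, so no posting list is appended */
    return SKETCH_MAXALIGN(keysize);
}

/* Pack the TID count and the posting flag into the 16-bit offset field */
static uint16_t
encode_posting_offset_field(int nipd)
{
    assert(nipd > 1 && nipd <= SKETCH_N_POSTING_MASK);
    return (uint16_t) ((nipd & SKETCH_N_POSTING_MASK) | SKETCH_IS_POSTING);
}

/* Extract the TID count from the offset field */
static int
decode_nposting(uint16_t field)
{
    return field & SKETCH_N_POSTING_MASK;
}

/* Test the posting flag in the offset field */
static int
is_posting_field(uint16_t field)
{
    return (field & SKETCH_IS_POSTING) != 0;
}
```

Note that in the patch BTreeTupleIsPosting() also requires
INDEX_ALT_TID_MASK to be set in t_info; the sketch only models the offset
field.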
diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index dd5315c..5b30e36 100644
--- a/src/backend/access/nbtree/nbtxlog.c
+++ b/src/backend/access/nbtree/nbtxlog.c
@@ -386,8 +386,8 @@ btree_xlog_vacuum(XLogReaderState *record)
Buffer buffer;
Page page;
BTPageOpaque opaque;
-#ifdef UNUSED
xl_btree_vacuum *xlrec = (xl_btree_vacuum *) XLogRecGetData(record);
+#ifdef UNUSED
/*
* This section of code is thought to be no longer needed, after analysis
@@ -478,14 +478,35 @@ btree_xlog_vacuum(XLogReaderState *record)
if (len > 0)
{
- OffsetNumber *unused;
- OffsetNumber *unend;
+ if (xlrec->nremaining)
+ {
+ int i;
+ OffsetNumber *remainingoffset;
+ IndexTuple remaining;
+ Size itemsz;
+
+ remainingoffset = (OffsetNumber *)
+ (ptr + xlrec->ndeleted * sizeof(OffsetNumber));
+ remaining = (IndexTuple) ((char *) remainingoffset +
+ xlrec->nremaining * sizeof(OffsetNumber));
- unused = (OffsetNumber *) ptr;
- unend = (OffsetNumber *) ((char *) ptr + len);
+ /* Handle posting tuples */
+ for (i = 0; i < xlrec->nremaining; i++)
+ {
+ PageIndexTupleDelete(page, remainingoffset[i]);
+
+ itemsz = MAXALIGN(IndexTupleSize(remaining));
+
+ if (PageAddItem(page, (Item) remaining, itemsz, remainingoffset[i],
+ false, false) == InvalidOffsetNumber)
+ elog(PANIC, "btree_xlog_vacuum: failed to add remaining item");
+
+ remaining = (IndexTuple) ((char *) remaining + itemsz);
+ }
+ }
- if ((unend - unused) > 0)
- PageIndexMultiDelete(page, unused, unend - unused);
+ if (xlrec->ndeleted)
+ PageIndexMultiDelete(page, (OffsetNumber *) ptr, xlrec->ndeleted);
}
/*
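The redo code above implies a particular layout for the record payload: the
offsets of deleted items come first, then the offsets of the rewritten
("remaining") posting tuples, then the tuples themselves back to back. A
small sketch of the offset arithmetic, with hypothetical type and function
names and a 2-byte offset number assumed:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for OffsetNumber (assumed to be 16 bits wide) */
typedef uint16_t SketchOffsetNumber;

/* Byte positions of the three payload areas, relative to the payload start */
typedef struct VacuumPayloadLayout
{
    size_t deleted_offsets_at;
    size_t remaining_offsets_at;
    size_t remaining_tuples_at;
} VacuumPayloadLayout;

static VacuumPayloadLayout
vacuum_payload_layout(unsigned ndeleted, unsigned nremaining)
{
    VacuumPayloadLayout layout;

    /* Offsets of the items to delete start the payload */
    layout.deleted_offsets_at = 0;

    /* Offsets of the tuples to replace follow immediately */
    layout.remaining_offsets_at = ndeleted * sizeof(SketchOffsetNumber);

    /* The replacement tuples come after both offset arrays */
    layout.remaining_tuples_at = layout.remaining_offsets_at +
        nremaining * sizeof(SketchOffsetNumber);

    return layout;
}
```

Each replacement tuple is then stepped over by its MAXALIGNed size, as the
loop in the hunk does.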
diff --git a/src/backend/access/rmgrdesc/nbtdesc.c b/src/backend/access/rmgrdesc/nbtdesc.c
index a14eb79..e4fa99a 100644
--- a/src/backend/access/rmgrdesc/nbtdesc.c
+++ b/src/backend/access/rmgrdesc/nbtdesc.c
@@ -46,8 +46,10 @@ btree_desc(StringInfo buf, XLogReaderState *record)
{
xl_btree_vacuum *xlrec = (xl_btree_vacuum *) rec;
- appendStringInfo(buf, "lastBlockVacuumed %u",
- xlrec->lastBlockVacuumed);
+ appendStringInfo(buf, "lastBlockVacuumed %u; nremaining %u; ndeleted %u",
+ xlrec->lastBlockVacuumed,
+ xlrec->nremaining,
+ xlrec->ndeleted);
break;
}
case XLOG_BTREE_DELETE:
diff --git a/src/include/access/itup.h b/src/include/access/itup.h
index 744ffb6..85ee040 100644
--- a/src/include/access/itup.h
+++ b/src/include/access/itup.h
@@ -141,6 +141,11 @@ typedef IndexAttributeBitMapData * IndexAttributeBitMap;
* On such a page, N tuples could take one MAXALIGN quantum less space than
* estimated here, seemingly allowing one more tuple than estimated here.
* But such a page always has at least MAXALIGN special space, so we're safe.
+ *
+ * Note: btree leaf pages may contain posting tuples, which store duplicates
+ * more compactly, so such pages may hold more logical tuples than estimated
+ * here. Use MaxPostingIndexTuplesPerPage instead.
+ *
*/
#define MaxIndexTuplesPerPage \
((int) ((BLCKSZ - SizeOfPageHeaderData) / \
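To get a feel for the difference between MaxIndexTuplesPerPage and the
MaxPostingIndexTuplesPerPage bound that the nbtree.h part of this patch
defines, the arithmetic can be worked through with concrete numbers. The
constants below are assumptions for a typical build (8 kB block, 24-byte page
header, 8-byte index tuple header, 4-byte line pointer, 6-byte TID, 8-byte
MAXALIGN); the function names are hypothetical.

```c
#include <assert.h>

/* Assumed sizes for a typical 64-bit build with default block size */
enum
{
    SKETCH_BLCKSZ = 8192,
    SKETCH_PAGE_HEADER = 24,
    SKETCH_INDEX_TUPLE_HDR = 8,
    SKETCH_ITEM_ID = 4,
    SKETCH_TID = 6
};

#define SKETCH_MAXALIGN(len) (((len) + 7) & ~7)

/* One minimal tuple plus its line pointer per item */
static int
max_index_tuples_per_page(void)
{
    return (SKETCH_BLCKSZ - SKETCH_PAGE_HEADER) /
        (SKETCH_MAXALIGN(SKETCH_INDEX_TUPLE_HDR + 1) + SKETCH_ITEM_ID);
}

/*
 * Three minimal posting tuples (with line pointers), and the rest of the
 * page filled with bare TIDs.
 */
static int
max_posting_index_tuples_per_page(void)
{
    return (SKETCH_BLCKSZ - SKETCH_PAGE_HEADER -
            3 * (SKETCH_MAXALIGN(SKETCH_INDEX_TUPLE_HDR + 1) + SKETCH_ITEM_ID)) /
        SKETCH_TID;
}
```

Under these assumptions the first bound is 408 and the second is 1351, so
arrays sized by the old limit would be too small for a page full of posting
tuples.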
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 83e0e6c..3127c41 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -234,8 +234,7 @@ typedef struct BTMetaPageData
* t_tid | t_info | key values | INCLUDE columns, if any
*
* t_tid points to the heap TID, which is a tiebreaker key column as of
- * BTREE_VERSION 4. Currently, the INDEX_ALT_TID_MASK status bit is never
- * set for non-pivot tuples.
+ * BTREE_VERSION 4.
*
* All other types of index tuples ("pivot" tuples) only have key columns,
* since pivot tuples only exist to represent how the key space is
@@ -252,6 +251,39 @@ typedef struct BTMetaPageData
* omitted rather than truncated, since its representation is different to
* the non-pivot representation.)
*
+ * Non-pivot posting tuple format:
+ * t_tid | t_info | key values | INCLUDE columns, if any | posting_list[]
+ *
+ * To store duplicate keys more efficiently, we use a special tuple
+ * format: posting tuples.
+ * posting_list is an array of ItemPointerData.
+ *
+ * This type of compression never applies to system indexes, unique indexes
+ * or indexes with INCLUDEd columns.
+ *
+ * To distinguish posting tuples we use the INDEX_ALT_TID_MASK flag in t_info
+ * and the BT_IS_POSTING flag in t_tid.
+ * These flags redefine the content of the posting tuple's t_tid:
+ * - t_tid.ip_blkid contains the offset of the posting list.
+ * - t_tid offset field contains the number of posting items in this tuple.
+ *
+ * The 12 least significant offset bits from t_tid are used to represent
+ * the number of posting items in posting tuples, leaving 4 status
+ * bits (BT_RESERVED_OFFSET_MASK bits), 3 of which are reserved for
+ * future use.
+ * BT_N_POSTING_OFFSET_MASK is large enough to store any number of posting
+ * items, which is constrained by BTMaxItemSize.
+ *
+ * If a page contains so many duplicates that they do not fit into one
+ * posting tuple (bounded by BTMaxItemSize), the page may contain several
+ * posting tuples with the same key.
+ * A page can also contain both posting and non-posting tuples with the same
+ * key. Currently, posting tuples always contain at least two TIDs in the
+ * posting list.
+ *
+ * Posting tuples always have the same number of attributes as the index has
+ * generally.
+ *
* Pivot tuple format:
*
* t_tid | t_info | key values | [heap TID]
@@ -281,23 +313,157 @@ typedef struct BTMetaPageData
* bits (BT_RESERVED_OFFSET_MASK bits), 3 of which that are reserved for
* future use. BT_N_KEYS_OFFSET_MASK should be large enough to store any
* number of columns/attributes <= INDEX_MAX_KEYS.
+ * BT_IS_POSTING bit must be unset for pivot tuples, since we use it
+ * to distinguish posting tuples from pivot tuples.
*
* Note well: The macros that deal with the number of attributes in tuples
- * assume that a tuple with INDEX_ALT_TID_MASK set must be a pivot tuple,
- * and that a tuple without INDEX_ALT_TID_MASK set must be a non-pivot
- * tuple (or must have the same number of attributes as the index has
- * generally in the case of !heapkeyspace indexes). They will need to be
- * updated if non-pivot tuples ever get taught to use INDEX_ALT_TID_MASK
- * for something else.
+ * assume that a tuple with INDEX_ALT_TID_MASK set must be a pivot tuple or
+ * non-pivot posting tuple, and that a tuple without INDEX_ALT_TID_MASK set
+ * must be a non-pivot tuple (or must have the same number of attributes as
+ * the index has generally in the case of !heapkeyspace indexes).
*/
#define INDEX_ALT_TID_MASK INDEX_AM_RESERVED_BIT
/* Item pointer offset bits */
#define BT_RESERVED_OFFSET_MASK 0xF000
#define BT_N_KEYS_OFFSET_MASK 0x0FFF
+#define BT_N_POSTING_OFFSET_MASK 0x0FFF
#define BT_HEAP_TID_ATTR 0x1000
+#define BT_IS_POSTING 0x2000
+
+#define BTreeTupleIsPosting(itup) \
+ ( \
+ ((itup)->t_info & INDEX_ALT_TID_MASK && \
+ ((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) != 0))\
+ )
+
+#define BTreeTupleIsPivot(itup) \
+ ( \
+ ((itup)->t_info & INDEX_ALT_TID_MASK && \
+ ((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) == 0))\
+ )
+
+/*
+ * MaxPostingIndexTuplesPerPage is an upper bound on the number of tuples
+ * that can fit on one btree leaf page.
+ *
+ * Btree leaf pages may contain posting tuples, which store duplicates
+ * more compactly, so MaxPostingIndexTuplesPerPage is larger than
+ * MaxIndexTuplesPerPage.
+ *
+ * Each leaf page must contain at least three items, so estimate it as
+ * if we have three posting tuples with minimal size keys.
+ */
+#define MaxPostingIndexTuplesPerPage \
+ ((int) ((BLCKSZ - SizeOfPageHeaderData - \
+ 3 * (MAXALIGN(sizeof(IndexTupleData) + 1) + sizeof(ItemIdData))) / \
+ sizeof(ItemPointerData)))
+
+/*
+ * Btree-private state needed to build posting tuples.
+ * ipd is a posting list - an array of ItemPointerData.
+ *
+ * Iterating over tuples during index build or applying compression to a
+ * single page, we remember a tuple in itupprev, then compare the next one
+ * with it. If tuples are equal, save their TIDs in the posting list.
+ * ntuples contains the size of the posting list.
+ *
+ * Use maxitemsize and maxpostingsize to ensure that resulting posting tuple
+ * will satisfy BTMaxItemSize.
+ */
+typedef struct BTCompressState
+{
+ Size maxitemsize;
+ Size maxpostingsize;
+ IndexTuple itupprev;
+ int ntuples;
+ ItemPointerData *ipd;
+} BTCompressState;
+
+/*
+ * For use in _bt_compress_one_page().
+ * If there are only a few uncompressed items on a page,
+ * it isn't worth applying compression.
+ * Currently it is just a magic number,
+ * proper benchmarking will probably help to choose better value.
+ */
+#define BT_COMPRESS_THRESHOLD 10
+
+/* macros to work with posting tuples *BEGIN* */
+#define BTreeTupleSetBtIsPosting(itup) \
+ do { \
+ Assert((itup)->t_info & INDEX_ALT_TID_MASK); \
+ Assert(!((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) != 0)); \
+ ItemPointerSetOffsetNumber(&(itup)->t_tid, \
+ ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) | BT_IS_POSTING); \
+ } while(0)
-/* Get/set downlink block number */
+#define BTreeTupleClearBtIsPosting(itup) \
+ do { \
+ ItemPointerSetOffsetNumber(&(itup)->t_tid, \
+ ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & ~BT_IS_POSTING); \
+ } while(0)
+
+#define BTreeTupleGetNPosting(itup) \
+ ( \
+ ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_N_POSTING_OFFSET_MASK \
+ )
+
+#define BTreeTupleSetNPosting(itup, n) \
+ do { \
+ ItemPointerSetOffsetNumber(&(itup)->t_tid, (n) & BT_N_POSTING_OFFSET_MASK); \
+ BTreeTupleSetBtIsPosting(itup); \
+ } while(0)
+
+/*
+ * If tuple is posting, t_tid.ip_blkid contains offset of the posting list.
+ * Caller is responsible for checking BTreeTupleIsPosting to ensure that
+ * it gets what it expects
+ */
+#define BTreeTupleGetPostingOffset(itup) \
+ ItemPointerGetBlockNumberNoCheck(&((itup)->t_tid))
+#define BTreeTupleSetPostingOffset(itup, offset) \
+ ItemPointerSetBlockNumber(&((itup)->t_tid), (offset))
+
+#define BTreeSetPostingMeta(itup, nposting, off) \
+ do { \
+ BTreeTupleSetNPosting(itup, nposting); \
+ BTreeTupleSetPostingOffset(itup, off); \
+ } while(0)
+
+#define BTreeTupleGetPosting(itup) \
+ ((ItemPointerData *) ((char *) (itup) + BTreeTupleGetPostingOffset(itup)))
+#define BTreeTupleGetPostingN(itup,n) \
+ ((ItemPointerData *) (BTreeTupleGetPosting(itup) + (n)))
+
+/*
+ * Posting tuples always contain several TIDs.
+ * Functions that use a TID as a tiebreaker can use the two macros below
+ * to ensure correct ordering of TID keys:
+ */
+#define BTreeTupleGetMinTID(itup) \
+ ( \
+ ((itup)->t_info & INDEX_ALT_TID_MASK && \
+ ((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING))) ? \
+ ( \
+ (ItemPointer) BTreeTupleGetPosting(itup) \
+ ) \
+ : \
+ (ItemPointer) &((itup)->t_tid) \
+ )
+#define BTreeTupleGetMaxTID(itup) \
+ ( \
+ ((itup)->t_info & INDEX_ALT_TID_MASK && \
+ ((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING))) ? \
+ ( \
+ (ItemPointer) (BTreeTupleGetPosting(itup) + (BTreeTupleGetNPosting(itup)-1)) \
+ ) \
+ : \
+ (ItemPointer) &((itup)->t_tid) \
+ )
+/* macros to work with posting tuples *END* */
+
+/* Get/set downlink block number */
#define BTreeInnerTupleGetDownLink(itup) \
ItemPointerGetBlockNumberNoCheck(&((itup)->t_tid))
#define BTreeInnerTupleSetDownLink(itup, blkno) \
@@ -326,7 +492,8 @@ typedef struct BTMetaPageData
*/
#define BTreeTupleGetNAtts(itup, rel) \
( \
- (itup)->t_info & INDEX_ALT_TID_MASK ? \
+ ((itup)->t_info & INDEX_ALT_TID_MASK && \
+ ((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) == 0)) ? \
( \
ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_N_KEYS_OFFSET_MASK \
) \
@@ -335,6 +502,7 @@ typedef struct BTMetaPageData
)
#define BTreeTupleSetNAtts(itup, n) \
do { \
+ Assert(!BTreeTupleIsPosting(itup)); \
(itup)->t_info |= INDEX_ALT_TID_MASK; \
ItemPointerSetOffsetNumber(&(itup)->t_tid, (n) & BT_N_KEYS_OFFSET_MASK); \
} while(0)
@@ -342,6 +510,8 @@ typedef struct BTMetaPageData
/*
* Get tiebreaker heap TID attribute, if any. Macro works with both pivot
* and non-pivot tuples, despite differences in how heap TID is represented.
+ *
+ * For non-pivot posting tuple it returns the first tid from posting list.
*/
#define BTreeTupleGetHeapTID(itup) \
( \
@@ -351,7 +521,10 @@ typedef struct BTMetaPageData
(ItemPointer) (((char *) (itup) + IndexTupleSize(itup)) - \
sizeof(ItemPointerData)) \
) \
- : (itup)->t_info & INDEX_ALT_TID_MASK ? NULL : (ItemPointer) &((itup)->t_tid) \
+ : (itup)->t_info & INDEX_ALT_TID_MASK ? \
+ (((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) != 0) ? \
+ (ItemPointer) BTreeTupleGetPosting(itup) : NULL) \
+ : (ItemPointer) &((itup)->t_tid) \
)
/*
* Set the heap TID attribute for a tuple that uses the INDEX_ALT_TID_MASK
@@ -360,6 +533,7 @@ typedef struct BTMetaPageData
#define BTreeTupleSetAltHeapTID(itup) \
do { \
Assert((itup)->t_info & INDEX_ALT_TID_MASK); \
+ Assert(!((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) != 0)); \
ItemPointerSetOffsetNumber(&(itup)->t_tid, \
ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) | BT_HEAP_TID_ATTR); \
} while(0)
@@ -501,6 +675,12 @@ typedef struct BTInsertStateData
Buffer buf;
/*
+ * if _bt_binsrch_insert() found the location inside existing posting
+ * list, save the position inside the list.
+ */
+ int in_posting_offset;
+
+ /*
* Cache of bounds within the current buffer. Only used for insertions
* where _bt_check_unique is called. See _bt_binsrch_insert and
* _bt_findinsertloc for details.
@@ -567,6 +747,8 @@ typedef struct BTScanPosData
* location in the associated tuple storage workspace.
*/
int nextTupleOffset;
+ /* prevTupleOffset is for posting list handling */
+ int prevTupleOffset;
/*
* The items array is always ordered in index order (ie, increasing
@@ -579,7 +761,7 @@ typedef struct BTScanPosData
int lastItem; /* last valid index in items[] */
int itemIndex; /* current index in items[] */
- BTScanPosItem items[MaxIndexTuplesPerPage]; /* MUST BE LAST */
+ BTScanPosItem items[MaxPostingIndexTuplesPerPage]; /* MUST BE LAST */
} BTScanPosData;
typedef BTScanPosData *BTScanPos;
@@ -763,6 +945,8 @@ extern void _bt_delitems_delete(Relation rel, Buffer buf,
OffsetNumber *itemnos, int nitems, Relation heapRel);
extern void _bt_delitems_vacuum(Relation rel, Buffer buf,
OffsetNumber *itemnos, int nitems,
+ OffsetNumber *remainingoffset,
+ IndexTuple *remaining, int nremaining,
BlockNumber lastBlockVacuumed);
extern int _bt_pagedel(Relation rel, Buffer buf);
@@ -775,6 +959,8 @@ extern Buffer _bt_moveright(Relation rel, BTScanInsert key, Buffer buf,
bool forupdate, BTStack stack, int access, Snapshot snapshot);
extern OffsetNumber _bt_binsrch_insert(Relation rel, BTInsertState insertstate);
extern int32 _bt_compare(Relation rel, BTScanInsert key, Page page, OffsetNumber offnum);
+extern int32 _bt_compare_posting(Relation rel, BTScanInsert key, Page page,
+ OffsetNumber offnum, int *in_posting_offset);
extern bool _bt_first(IndexScanDesc scan, ScanDirection dir);
extern bool _bt_next(IndexScanDesc scan, ScanDirection dir);
extern Buffer _bt_get_endpoint(Relation rel, uint32 level, bool rightmost,
@@ -813,6 +999,9 @@ extern bool _bt_check_natts(Relation rel, bool heapkeyspace, Page page,
OffsetNumber offnum);
extern void _bt_check_third_page(Relation rel, Relation heap,
bool needheaptidspace, Page page, IndexTuple newtup);
+extern IndexTuple BTreeFormPostingTuple(IndexTuple tuple, ItemPointerData *ipd,
+ int nipd);
+extern IndexTuple BTreeGetNthTupleOfPosting(IndexTuple tuple, int n);
/*
* prototypes for functions in nbtvalidate.c
@@ -825,5 +1014,7 @@ extern bool btvalidate(Oid opclassoid);
extern IndexBuildResult *btbuild(Relation heap, Relation index,
struct IndexInfo *indexInfo);
extern void _bt_parallel_build_main(dsm_segment *seg, shm_toc *toc);
+extern void _bt_add_posting_item(BTCompressState *compressState,
+ IndexTuple itup);
#endif /* NBTREE_H */
diff --git a/src/include/access/nbtxlog.h b/src/include/access/nbtxlog.h
index 9beccc8..6f60ca5 100644
--- a/src/include/access/nbtxlog.h
+++ b/src/include/access/nbtxlog.h
@@ -173,10 +173,19 @@ typedef struct xl_btree_vacuum
{
BlockNumber lastBlockVacuumed;
- /* TARGET OFFSET NUMBERS FOLLOW */
+ /*
+ * This field helps us to find beginning of the remaining tuples from
+ * postings which follow array of offset numbers.
+ */
+ uint32 nremaining;
+ uint32 ndeleted;
+
+ /* REMAINING OFFSET NUMBERS FOLLOW (nremaining values) */
+ /* REMAINING TUPLES TO INSERT FOLLOW (if nremaining > 0) */
+ /* TARGET OFFSET NUMBERS FOLLOW (if any) */
} xl_btree_vacuum;
-#define SizeOfBtreeVacuum (offsetof(xl_btree_vacuum, lastBlockVacuumed) + sizeof(BlockNumber))
+#define SizeOfBtreeVacuum (offsetof(xl_btree_vacuum, ndeleted) + sizeof(uint32))
/*
* This is what we need to know about marking an empty branch for deletion.
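The posting-tuple macros in the nbtree.h hunk above pack two things into the offset-number half of t_tid: a flag bit saying "this tuple carries a posting list" and the number of TIDs in that list. A minimal standalone sketch of that packing follows. The mask values and every sketch_* name are assumptions made for illustration; the patch defines its own BT_IS_POSTING and BT_N_POSTING_OFFSET_MASK.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical values for illustration; the patch defines its own masks. */
#define SKETCH_N_POSTING_MASK 0x0FFFu	/* low 12 bits: number of TIDs */
#define SKETCH_IS_POSTING     0x2000u	/* flag bit: tuple has a posting list */

/* Pack a posting-list length plus the "is posting" flag into one 16-bit field. */
uint16_t
sketch_pack_posting(unsigned nposting)
{
	assert(nposting <= SKETCH_N_POSTING_MASK);
	return (uint16_t) ((nposting & SKETCH_N_POSTING_MASK) | SKETCH_IS_POSTING);
}

/* True when the flag bit is set. */
bool
sketch_is_posting(uint16_t field)
{
	return (field & SKETCH_IS_POSTING) != 0;
}

/* Extract the TID count, ignoring the flag bits. */
unsigned
sketch_get_nposting(uint16_t field)
{
	return field & SKETCH_N_POSTING_MASK;
}

/* Clear only the flag bit, leaving the count untouched. */
uint16_t
sketch_clear_posting(uint16_t field)
{
	return (uint16_t) (field & ~SKETCH_IS_POSTING);
}
```

With these assumed masks, a count of up to 4095 TIDs fits beside the flag, and clearing the flag leaves the count readable.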
On Wed, Jul 31, 2019 at 9:23 AM Anastasia Lubennikova
<a.lubennikova@postgrespro.ru> wrote:
* Included my own pageinspect hack to visualize the minimum TIDs in
posting lists. It's broken out into a separate patch file. The code is
very rough, but it might help someone else, so I thought I'd include
it.
Cool, I think we should add it to the final patchset,
probably, as separate function by analogy with tuple_data_split.
Good idea.
Attached is v5, which is based on your v4. The three main differences
between this and v4 are:
* Removed BT_COMPRESS_THRESHOLD stuff, for the reasons explained in my
July 24 e-mail. We can always add something like this back during
performance validation of the patch. Right now, having no
BT_COMPRESS_THRESHOLD limit definitely improves space utilization for
certain important cases, which seems more important than the
uncertain/speculative downside.
* We now have experimental support for unique indexes. This is broken
out into its own patch.
* We now handle LP_DEAD items in a special way within
_bt_insertonpg_in_posting().
As you pointed out already, we do need to think about LP_DEAD items
directly, rather than assuming that they cannot be on the page that
_bt_insertonpg_in_posting() must process. More on that later.
If sizeof(t_info) + sizeof(key) < sizeof(t_tid), the resulting posting tuple
can be larger. This may happen if keysize <= 4 bytes.
In this situation the original tuples must have been aligned to 16 bytes each,
and the resulting tuple is at most 24 bytes (6+2+4+6+6). So this case is
also safe.
I still need to think about the exact details of alignment within
_bt_insertonpg_in_posting(). I'm worried about boundary cases there. I
could be wrong.
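The size argument quoted above can be checked with a few lines of arithmetic. This is only a sketch: it assumes 8-byte maximum alignment, a 6-byte TID and a 2-byte t_info, and it packs the posting list directly after the key exactly as in the quoted (6+2+4+6+6) sum. The real BTreeFormPostingTuple() may pad differently, and all sketch_* names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed platform alignment; PostgreSQL uses MAXIMUM_ALIGNOF for this. */
#define SKETCH_ALIGNOF 8
#define SKETCH_MAXALIGN(len) \
	(((size_t) (len) + (SKETCH_ALIGNOF - 1)) & ~((size_t) (SKETCH_ALIGNOF - 1)))

enum
{
	SKETCH_TID_SIZE = 6,		/* assumed sizeof(ItemPointerData) */
	SKETCH_INFO_SIZE = 2		/* assumed sizeof(t_info) */
};

/* Space used by ntuples separate tuples, each aligned on its own. */
size_t
sketch_plain_tuples_size(size_t keysize, size_t ntuples)
{
	return ntuples *
		SKETCH_MAXALIGN(SKETCH_TID_SIZE + SKETCH_INFO_SIZE + keysize);
}

/* Space used by one posting tuple holding ntids heap TIDs after the key. */
size_t
sketch_posting_tuple_size(size_t keysize, size_t ntids)
{
	return SKETCH_MAXALIGN(SKETCH_TID_SIZE + SKETCH_INFO_SIZE + keysize +
						   ntids * SKETCH_TID_SIZE);
}
```

Under these assumptions, two plain tuples with a 4-byte key occupy 32 bytes, while the merged posting tuple occupies 24, which matches the claim that the merged form is no larger than the originals.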
I changed DEBUG message to ERROR in v4 and it passes all regression tests.
I doubt that it covers all corner cases, so I'll try to add more special
tests.
It also passes my tests, FWIW.
Hmm, I can't get the problem.
In current implementation each posting tuple is smaller than BTMaxItemSize,
so no split can lead to having tuple of larger size.
That sounds correct, then.
No, we don't need them both. I don't mind combining them into one macro.
Actually, we never needed BTreeTupleGetMinTID(),
since its functionality is covered by BTreeTupleGetHeapTID.
I've removed BTreeTupleGetMinTID() in v5. I think it's fine to just
have a comment next to BTreeTupleGetHeapTID(), and another comment
next to BTreeTupleGetMaxTID().
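To make the min/max TID relationship concrete, here is a small sketch over a simplified posting list. It assumes the list is a plain sorted array, so the minimum TID is the first element (what BTreeTupleGetHeapTID() returns for a posting tuple) and the maximum is the last. The SketchTid type and the sketch_* functions are hypothetical stand-ins, not the patch's types. The ordering check mirrors the kind of test amcheck performs in the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-in for ItemPointerData: heap block plus line offset. */
typedef struct SketchTid
{
	uint32_t	block;
	uint16_t	offset;
} SketchTid;

/* Three-way comparison: block number first, then offset. */
int
sketch_tid_compare(const SketchTid *a, const SketchTid *b)
{
	if (a->block != b->block)
		return a->block < b->block ? -1 : 1;
	if (a->offset != b->offset)
		return a->offset < b->offset ? -1 : 1;
	return 0;
}

/* In a sorted posting list the minimum TID is simply the first entry. */
const SketchTid *
sketch_min_tid(const SketchTid *tids, int ntids)
{
	assert(ntids > 0);
	return &tids[0];
}

/* ... and the maximum TID is the last entry. */
const SketchTid *
sketch_max_tid(const SketchTid *tids, int ntids)
{
	assert(ntids > 0);
	return &tids[ntids - 1];
}

/* Verify strictly ascending order, rejecting duplicates as well. */
bool
sketch_posting_in_order(const SketchTid *tids, int ntids)
{
	for (int i = 1; i < ntids; i++)
	{
		if (sketch_tid_compare(&tids[i], &tids[i - 1]) <= 0)
			return false;
	}
	return true;
}
```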
The main reason why I decided to avoid applying compression to unique indexes
is the performance of microvacuum. It is not applied to items inside a posting
tuple. And I expect it to be important for unique indexes, which ideally
contain only a few live values.
I found that the performance of my experimental patch with unique
index was significantly worse. It looks like this is a bad idea, as
you predicted, though we may still want to do
deduplication/compression with NULL values in unique indexes. I did
learn a few things from implementing unique index support, though.
BTW, there is a subtle bug in how my unique index patch does
WAL-logging -- see my comments within
index_compute_xid_horizon_for_tuples(). The bug shouldn't matter if
replication isn't used. I don't think that we're going to use this
experimental patch at all, so I didn't bother fixing the bug.
if (ItemIdIsDead(itemId))
continue;
In the previous review Rafia asked about "some reason".
Trying to figure out whether this situation is possible, I changed this line to
Assert(!ItemIdIsDead(itemId)) in our test version, and it failed in a
performance test. Unfortunately, I was not able to reproduce it.
I found it easy enough to see LP_DEAD items within
_bt_insertonpg_in_posting() when running pgbench with the extra unique
index patch. To give you a simple example of how this can happen,
consider the comments about BTP_HAS_GARBAGE within
_bt_delitems_vacuum(). That probably isn't the only way it can happen,
either. ISTM that we need to be prepared for LP_DEAD items during
deduplication, rather than trying to prevent deduplication from ever
having to see an LP_DEAD item.
v5 makes _bt_insertonpg_in_posting() prepared to overwrite an
existing item if it's an LP_DEAD item that falls in the same TID range
(that's _bt_compare()-wise "equal" to an existing tuple, which may or
may not be a posting list tuple already). I haven't made this code do
something like call index_compute_xid_horizon_for_tuples(), even
though that's needed for correctness (i.e. this new code is currently
broken in the same way that I mentioned unique index support is
broken). I also added a nearby FIXME comment to
_bt_insertonpg_in_posting() -- I don't think that the code for
splitting a posting list in two is currently crash-safe.
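For readers following along, the posting list split can be pictured on a plain array: everything before the insert position stays in the left list, and the new TID heads the right list followed by the rest. The sketch below covers only that in-memory step, under the assumption that TIDs are held in a sorted array. PostingTid and sketch_split_posting are hypothetical names, and nothing here addresses the WAL-logging or crash-safety concern raised above.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Simplified stand-in for ItemPointerData. */
typedef struct PostingTid
{
	uint32_t	block;
	uint16_t	offset;
} PostingTid;

/*
 * Split orig[0..norig) at split_at.  The left list receives the first
 * split_at entries; the right list receives newtid followed by the
 * remaining entries.  Caller must provide room for split_at entries in
 * left and norig - split_at + 1 entries in right.
 */
void
sketch_split_posting(const PostingTid *orig, int norig, int split_at,
					 PostingTid newtid,
					 PostingTid *left, int *nleft,
					 PostingTid *right, int *nright)
{
	/* The new TID falls strictly inside the existing list. */
	assert(split_at > 0 && split_at < norig);

	memcpy(left, orig, sizeof(PostingTid) * (size_t) split_at);
	*nleft = split_at;

	right[0] = newtid;
	memcpy(right + 1, orig + split_at,
		   sizeof(PostingTid) * (size_t) (norig - split_at));
	*nright = norig - split_at + 1;
}
```

Because the new TID sorts after every entry kept on the left and before every entry moved to the right, both halves remain sorted without any further work.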
How do you feel about officially calling this deduplication, not
compression? I think that it's a more accurate name for the technique.
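As a rough illustration of why "deduplication" describes the technique well, the sketch below counts how many index tuples remain after runs of equal keys on a sorted page are merged into posting lists. It is a simplification: keys are plain ints, and max_per_posting is a hypothetical cap standing in for the real size limit (BTMaxItemSize) that the patch checks. sketch_dedup_count is not a function from the patch.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Count the tuples left after merging each run of equal keys into one
 * posting tuple.  A run is cut short once it holds max_per_posting TIDs,
 * at which point a new tuple is started for the same key.
 */
size_t
sketch_dedup_count(const int *keys, size_t nkeys, size_t max_per_posting)
{
	size_t		ntuples = 0;
	size_t		run = 0;		/* TIDs in the tuple being built */

	assert(max_per_posting > 0);

	for (size_t i = 0; i < nkeys; i++)
	{
		if (i > 0 && keys[i] == keys[i - 1] && run < max_per_posting)
		{
			/* Same key and still room: extend the current posting list. */
			run++;
		}
		else
		{
			/* New key, or current posting list is full: start a new tuple. */
			ntuples++;
			run = 1;
		}
	}
	return ntuples;
}
```

No key bytes are encoded more compactly here; the saving comes purely from storing each distinct key once per posting tuple.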
--
Peter Geoghegan
Attachments:
v5-0001-Compression-deduplication-in-nbtree.patch (application/octet-stream)
From 1df33bd12aaf21179da6d3aedaa7a2084e577d25 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Fri, 19 Jul 2019 18:57:31 -0700
Subject: [PATCH v5 1/3] Compression/deduplication in nbtree.
Version with some revisions by me.
---
contrib/amcheck/verify_nbtree.c | 124 +++++--
src/backend/access/nbtree/nbtinsert.c | 430 +++++++++++++++++++++++-
src/backend/access/nbtree/nbtpage.c | 53 +++
src/backend/access/nbtree/nbtree.c | 142 ++++++--
src/backend/access/nbtree/nbtsearch.c | 283 +++++++++++++---
src/backend/access/nbtree/nbtsort.c | 197 ++++++++++-
src/backend/access/nbtree/nbtsplitloc.c | 30 +-
src/backend/access/nbtree/nbtutils.c | 164 ++++++++-
src/backend/access/nbtree/nbtxlog.c | 34 +-
src/backend/access/rmgrdesc/nbtdesc.c | 6 +-
src/include/access/itup.h | 4 +
src/include/access/nbtree.h | 202 ++++++++++-
src/include/access/nbtxlog.h | 13 +-
13 files changed, 1528 insertions(+), 154 deletions(-)
diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index 55a3a4bbe0..da79dd1b62 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -889,6 +889,7 @@ bt_target_page_check(BtreeCheckState *state)
size_t tupsize;
BTScanInsert skey;
bool lowersizelimit;
+ ItemPointer scantid;
CHECK_FOR_INTERRUPTS();
@@ -959,29 +960,73 @@ bt_target_page_check(BtreeCheckState *state)
/*
* Readonly callers may optionally verify that non-pivot tuples can
- * each be found by an independent search that starts from the root
+ * each be found by an independent search that starts from the root.
+ * Note that we deliberately don't do individual searches for each
+ * "logical" posting list tuple, since the posting list itself is
+ * validated by other checks.
*/
if (state->rootdescend && P_ISLEAF(topaque) &&
!bt_rootdescend(state, itup))
{
char *itid,
*htid;
+ ItemPointer tid = BTreeTupleGetHeapTID(itup);
itid = psprintf("(%u,%u)", state->targetblock, offset);
htid = psprintf("(%u,%u)",
- ItemPointerGetBlockNumber(&(itup->t_tid)),
- ItemPointerGetOffsetNumber(&(itup->t_tid)));
+ ItemPointerGetBlockNumber(tid),
+ ItemPointerGetOffsetNumber(tid));
ereport(ERROR,
(errcode(ERRCODE_INDEX_CORRUPTED),
errmsg("could not find tuple using search from root page in index \"%s\"",
RelationGetRelationName(state->rel)),
- errdetail_internal("Index tid=%s points to heap tid=%s page lsn=%X/%X.",
+ errdetail_internal("Index tid=%s min heap tid=%s page lsn=%X/%X.",
itid, htid,
(uint32) (state->targetlsn >> 32),
(uint32) state->targetlsn)));
}
+ /*
+ * If tuple is actually a posting list, make sure posting list TIDs
+ * are in order.
+ */
+ if (BTreeTupleIsPosting(itup))
+ {
+ ItemPointerData last;
+ ItemPointer current;
+
+ ItemPointerCopy(BTreeTupleGetHeapTID(itup), &last);
+
+ for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
+ {
+
+ current = BTreeTupleGetPostingN(itup, i);
+
+ if (ItemPointerCompare(current, &last) <= 0)
+ {
+ char *itid,
+ *htid;
+
+ itid = psprintf("(%u,%u)", state->targetblock, offset);
+ htid = psprintf("(%u,%u)",
+ ItemPointerGetBlockNumberNoCheck(current),
+ ItemPointerGetOffsetNumberNoCheck(current));
+
+ ereport(ERROR,
+ (errcode(ERRCODE_INDEX_CORRUPTED),
+ errmsg("posting list heap TIDs out of order in index \"%s\"",
+ RelationGetRelationName(state->rel)),
+ errdetail_internal("Index tid=%s min heap tid=%s page lsn=%X/%X.",
+ itid, htid,
+ (uint32) (state->targetlsn >> 32),
+ (uint32) state->targetlsn)));
+ }
+
+ ItemPointerCopy(current, &last);
+ }
+ }
+
/* Build insertion scankey for current page offset */
skey = bt_mkscankey_pivotsearch(state->rel, itup);
@@ -1039,12 +1084,33 @@ bt_target_page_check(BtreeCheckState *state)
{
IndexTuple norm;
- norm = bt_normalize_tuple(state, itup);
- bloom_add_element(state->filter, (unsigned char *) norm,
- IndexTupleSize(norm));
- /* Be tidy */
- if (norm != itup)
- pfree(norm);
+ if (BTreeTupleIsPosting(itup))
+ {
+ IndexTuple onetup;
+
+ /* Fingerprint all elements of posting tuple one by one */
+ for (int i = 0; i < BTreeTupleGetNPosting(itup); i++)
+ {
+ onetup = BTreeGetNthTupleOfPosting(itup, i);
+
+ norm = bt_normalize_tuple(state, onetup);
+ bloom_add_element(state->filter, (unsigned char *) norm,
+ IndexTupleSize(norm));
+ /* Be tidy */
+ if (norm != onetup)
+ pfree(norm);
+ pfree(onetup);
+ }
+ }
+ else
+ {
+ norm = bt_normalize_tuple(state, itup);
+ bloom_add_element(state->filter, (unsigned char *) norm,
+ IndexTupleSize(norm));
+ /* Be tidy */
+ if (norm != itup)
+ pfree(norm);
+ }
}
/*
@@ -1052,7 +1118,8 @@ bt_target_page_check(BtreeCheckState *state)
*
* If there is a high key (if this is not the rightmost page on its
* entire level), check that high key actually is upper bound on all
- * page items.
+ * page items. If this is a posting list tuple, we'll need to set
+ * scantid to be highest TID in posting list.
*
* We prefer to check all items against high key rather than checking
* just the last and trusting that the operator class obeys the
@@ -1092,6 +1159,9 @@ bt_target_page_check(BtreeCheckState *state)
* tuple. (See also: "Notes About Data Representation" in the nbtree
* README.)
*/
+ scantid = skey->scantid;
+ if (!BTreeTupleIsPivot(itup))
+ skey->scantid = BTreeTupleGetMaxTID(itup);
if (!P_RIGHTMOST(topaque) &&
!(P_ISLEAF(topaque) ? invariant_leq_offset(state, skey, P_HIKEY) :
invariant_l_offset(state, skey, P_HIKEY)))
@@ -1115,6 +1185,7 @@ bt_target_page_check(BtreeCheckState *state)
(uint32) (state->targetlsn >> 32),
(uint32) state->targetlsn)));
}
+ skey->scantid = scantid;
/*
* * Item order check *
@@ -1129,11 +1200,13 @@ bt_target_page_check(BtreeCheckState *state)
*htid,
*nitid,
*nhtid;
+ ItemPointer tid;
itid = psprintf("(%u,%u)", state->targetblock, offset);
+ tid = BTreeTupleGetHeapTID(itup);
htid = psprintf("(%u,%u)",
- ItemPointerGetBlockNumberNoCheck(&(itup->t_tid)),
- ItemPointerGetOffsetNumberNoCheck(&(itup->t_tid)));
+ ItemPointerGetBlockNumberNoCheck(tid),
+ ItemPointerGetOffsetNumberNoCheck(tid));
nitid = psprintf("(%u,%u)", state->targetblock,
OffsetNumberNext(offset));
@@ -1142,9 +1215,11 @@ bt_target_page_check(BtreeCheckState *state)
state->target,
OffsetNumberNext(offset));
itup = (IndexTuple) PageGetItem(state->target, itemid);
+
+ tid = BTreeTupleGetHeapTID(itup);
nhtid = psprintf("(%u,%u)",
- ItemPointerGetBlockNumberNoCheck(&(itup->t_tid)),
- ItemPointerGetOffsetNumberNoCheck(&(itup->t_tid)));
+ ItemPointerGetBlockNumberNoCheck(tid),
+ ItemPointerGetOffsetNumberNoCheck(tid));
ereport(ERROR,
(errcode(ERRCODE_INDEX_CORRUPTED),
@@ -1154,10 +1229,10 @@ bt_target_page_check(BtreeCheckState *state)
"higher index tid=%s (points to %s tid=%s) "
"page lsn=%X/%X.",
itid,
- P_ISLEAF(topaque) ? "heap" : "index",
+ P_ISLEAF(topaque) ? "min heap" : "index",
htid,
nitid,
- P_ISLEAF(topaque) ? "heap" : "index",
+ P_ISLEAF(topaque) ? "min heap" : "index",
nhtid,
(uint32) (state->targetlsn >> 32),
(uint32) state->targetlsn)));
@@ -1918,10 +1993,11 @@ bt_tuple_present_callback(Relation index, HeapTuple htup, Datum *values,
* verification. In particular, it won't try to normalize opclass-equal
* datums with potentially distinct representations (e.g., btree/numeric_ops
* index datums will not get their display scale normalized-away here).
- * Normalization may need to be expanded to handle more cases in the future,
- * though. For example, it's possible that non-pivot tuples could in the
- * future have alternative logically equivalent representations due to using
- * the INDEX_ALT_TID_MASK bit to implement intelligent deduplication.
+ * Caller does normalization for non-pivot tuples that have their own posting
+ * list, since dummy CREATE INDEX callback code generates new tuples with the
+ * same normalized representation. Compression is performed
+ * opportunistically, and in general there is no guarantee about how or when
+ * compression will be applied.
*/
static IndexTuple
bt_normalize_tuple(BtreeCheckState *state, IndexTuple itup)
@@ -2525,14 +2601,16 @@ static inline ItemPointer
BTreeTupleGetHeapTIDCareful(BtreeCheckState *state, IndexTuple itup,
bool nonpivot)
{
- ItemPointer result = BTreeTupleGetHeapTID(itup);
+ ItemPointer result;
BlockNumber targetblock = state->targetblock;
- if (result == NULL && nonpivot)
+ if (BTreeTupleIsPivot(itup) == nonpivot)
ereport(ERROR,
(errcode(ERRCODE_INDEX_CORRUPTED),
errmsg("block %u or its right sibling block or child block in index \"%s\" contains non-pivot tuple that lacks a heap TID",
targetblock, RelationGetRelationName(state->rel))));
+ result = BTreeTupleGetHeapTID(itup);
+
return result;
}
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index 5890f393f6..f0c1174e2a 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -41,6 +41,17 @@ static OffsetNumber _bt_findinsertloc(Relation rel,
BTStack stack,
Relation heapRel);
static void _bt_stepright(Relation rel, BTInsertState insertstate, BTStack stack);
+static void _bt_delete_and_insert(Relation rel,
+ Buffer buf,
+ IndexTuple newitup,
+ OffsetNumber newitemoff);
+static void _bt_insertonpg_in_posting(Relation rel, BTScanInsert itup_key,
+ Buffer buf,
+ Buffer cbuf,
+ BTStack stack,
+ IndexTuple itup,
+ OffsetNumber newitemoff,
+ bool split_only_page, int in_posting_offset);
static void _bt_insertonpg(Relation rel, BTScanInsert itup_key,
Buffer buf,
Buffer cbuf,
@@ -56,6 +67,8 @@ static void _bt_insert_parent(Relation rel, Buffer buf, Buffer rbuf,
static bool _bt_pgaddtup(Page page, Size itemsize, IndexTuple itup,
OffsetNumber itup_off);
static void _bt_vacuum_one_page(Relation rel, Buffer buffer, Relation heapRel);
+static void insert_itupprev_to_page(Page page, BTCompressState *compressState);
+static void _bt_compress_one_page(Relation rel, Buffer buffer, Relation heapRel);
/*
* _bt_doinsert() -- Handle insertion of a single index tuple in the tree.
@@ -297,10 +310,17 @@ top:
* search bounds established within _bt_check_unique when insertion is
* checkingunique.
*/
+ insertstate.in_posting_offset = 0;
newitemoff = _bt_findinsertloc(rel, &insertstate, checkingunique,
stack, heapRel);
- _bt_insertonpg(rel, itup_key, insertstate.buf, InvalidBuffer, stack,
- itup, newitemoff, false);
+
+ if (insertstate.in_posting_offset)
+ _bt_insertonpg_in_posting(rel, itup_key, insertstate.buf,
+ InvalidBuffer, stack, itup, newitemoff,
+ false, insertstate.in_posting_offset);
+ else
+ _bt_insertonpg(rel, itup_key, insertstate.buf, InvalidBuffer,
+ stack, itup, newitemoff, false);
}
else
{
@@ -412,6 +432,7 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
}
curitemid = PageGetItemId(page, offset);
+ Assert(!BTreeTupleIsPosting(curitup));
/*
* We can skip items that are marked killed.
@@ -759,6 +780,26 @@ _bt_findinsertloc(Relation rel,
_bt_vacuum_one_page(rel, insertstate->buf, heapRel);
insertstate->bounds_valid = false;
}
+
+ /*
+ * If the target page is full, try to compress the page
+ */
+ if (PageGetFreeSpace(page) < insertstate->itemsz && !checkingunique)
+ {
+ _bt_compress_one_page(rel, insertstate->buf, heapRel);
+ insertstate->bounds_valid = false; /* paranoia */
+
+ /*
+ * FIXME: _bt_vacuum_one_page() won't have cleared the
+ * BTP_HAS_GARBAGE flag when it didn't kill items. Maybe we
+ * should clear the BTP_HAS_GARBAGE flag bit from the page when
+ * compression avoids a page split -- _bt_vacuum_one_page() is
+ * expecting a page split that takes care of it.
+ *
+ * (On the other hand, maybe it doesn't matter very much. A
+ * comment update seems like the bare minimum we should do.)
+ */
+ }
}
else
{
@@ -900,6 +941,208 @@ _bt_stepright(Relation rel, BTInsertState insertstate, BTStack stack)
insertstate->bounds_valid = false;
}
+/*
+ * Delete tuple on newitemoff offset and insert newitup at the same offset.
+ * All checks of free space must have been done before calling this function.
+ *
+ * For use in posting tuple's update.
+ */
+static void
+_bt_delete_and_insert(Relation rel,
+ Buffer buf,
+ IndexTuple newitup,
+ OffsetNumber newitemoff)
+{
+ Page page = BufferGetPage(buf);
+ Size newitupsz = IndexTupleSize(newitup);
+
+ newitupsz = MAXALIGN(newitupsz);
+
+ START_CRIT_SECTION();
+
+ PageIndexTupleDelete(page, newitemoff);
+
+ if (!_bt_pgaddtup(page, newitupsz, newitup, newitemoff))
+ elog(ERROR, "failed to insert compressed item in index \"%s\"",
+ RelationGetRelationName(rel));
+
+ MarkBufferDirty(buf);
+
+ /* Xlog stuff */
+ if (RelationNeedsWAL(rel))
+ {
+ xl_btree_insert xlrec;
+ XLogRecPtr recptr;
+
+ xlrec.offnum = newitemoff;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, SizeOfBtreeInsert);
+
+ Assert(P_ISLEAF((BTPageOpaque) PageGetSpecialPointer(page)));
+
+ /*
+ * Force full page write to keep code simple
+ *
+ * TODO: think of using XLOG_BTREE_INSERT_LEAF with a new tuple's data
+ */
+ XLogRegisterBuffer(0, buf, REGBUF_STANDARD | REGBUF_FORCE_IMAGE);
+ recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_INSERT_LEAF);
+ PageSetLSN(page, recptr);
+ }
+
+ END_CRIT_SECTION();
+}
+
+/*
+ * _bt_insertonpg_in_posting() --
+ * Insert a tuple on a particular page in the index
+ * (compression aware version).
+ *
+ * If new tuple's key is equal to the key of a posting tuple that already
+ * exists on the page and its TID falls inside the min/max range of
+ * existing posting list, update the posting tuple.
+ *
+ * This can only happen on a leaf page.
+ *
+ * newitemoff - offset of the posting tuple we must update
+ * in_posting_offset - position of the new tuple's TID in posting list
+ *
+ * If necessary, split the page.
+ */
+static void
+_bt_insertonpg_in_posting(Relation rel,
+ BTScanInsert itup_key,
+ Buffer buf,
+ Buffer cbuf,
+ BTStack stack,
+ IndexTuple itup,
+ OffsetNumber newitemoff,
+ bool split_only_page,
+ int in_posting_offset)
+{
+ IndexTuple origtup;
+ IndexTuple lefttup;
+ IndexTuple righttup;
+ ItemPointerData *ipd;
+ IndexTuple newitup;
+ ItemId itemid;
+ Page page;
+ int nipd,
+ nipd_right;
+
+ page = BufferGetPage(buf);
+ /* get old posting tuple */
+ itemid = PageGetItemId(page, newitemoff);
+ origtup = (IndexTuple) PageGetItem(page, itemid);
+ Assert(BTreeTupleIsPosting(origtup));
+ nipd = BTreeTupleGetNPosting(origtup);
+ Assert(in_posting_offset < nipd);
+ Assert(itup_key->scantid != NULL);
+ Assert(itup_key->heapkeyspace);
+
+ elog(DEBUG4, "(%u,%u) is min, (%u,%u) is max, (%u,%u) is new",
+ ItemPointerGetBlockNumberNoCheck(BTreeTupleGetHeapTID(origtup)),
+ ItemPointerGetOffsetNumberNoCheck(BTreeTupleGetHeapTID(origtup)),
+ ItemPointerGetBlockNumberNoCheck(BTreeTupleGetMaxTID(origtup)),
+ ItemPointerGetOffsetNumberNoCheck(BTreeTupleGetMaxTID(origtup)),
+ ItemPointerGetBlockNumberNoCheck(BTreeTupleGetMaxTID(itup)),
+ ItemPointerGetOffsetNumberNoCheck(BTreeTupleGetMaxTID(itup)));
+
+ /*
+ * First check if existing item is dead.
+ *
+ * Then check if the new itempointer fits into the tuple's posting list.
+ *
+ * Also check if new itempointer fits into the page.
+ *
+ * If not, posting tuple's split is required in both cases.
+ *
+ * XXX: Think some more about alignment - pg
+ */
+ if (ItemIdIsDead(itemid))
+ {
+ /* FIXME: We need to call index_compute_xid_horizon_for_tuples() */
+ elog(DEBUG4, "replacing LP_DEAD posting list item, new off %d",
+ newitemoff);
+ _bt_delete_and_insert(rel, buf, itup, newitemoff);
+ _bt_relbuf(rel, buf);
+ }
+ else if (BTMaxItemSize(page) < MAXALIGN(IndexTupleSize(origtup)) + MAXALIGN(sizeof(ItemPointerData)) ||
+ PageGetFreeSpace(page) < MAXALIGN(IndexTupleSize(origtup)) + MAXALIGN(sizeof(ItemPointerData)))
+ {
+ /*
+ * Split posting tuple into two halves.
+ *
+ * Left tuple contains all item pointers less than the new one and
+ * right tuple contains new item pointer and all to the right.
+ *
+ * TODO Probably we can come up with more clever algorithm.
+ */
+ lefttup = BTreeFormPostingTuple(origtup, BTreeTupleGetPosting(origtup),
+ in_posting_offset);
+
+ nipd_right = nipd - in_posting_offset + 1;
+ ipd = palloc0(sizeof(ItemPointerData) * nipd_right);
+ /* insert new item pointer */
+ memcpy(ipd, itup, sizeof(ItemPointerData));
+ /* copy item pointers from original tuple that belong on right */
+ memcpy(ipd + 1,
+ BTreeTupleGetPostingN(origtup, in_posting_offset),
+ sizeof(ItemPointerData) * (nipd - in_posting_offset));
+
+ righttup = BTreeFormPostingTuple(origtup, ipd, nipd_right);
+ elog(DEBUG4, "inserting inside posting list with split due to no space orig elements %d new off %d",
+ nipd, in_posting_offset);
+
+ Assert(ItemPointerCompare(BTreeTupleGetMaxTID(lefttup),
+ BTreeTupleGetHeapTID(righttup)) < 0);
+
+ /*
+ * Replace old tuple with a left tuple on a page.
+ *
+ * And insert righttup using the ordinary _bt_insertonpg() function. If
+ * a split is required, _bt_insertonpg will handle it.
+ *
+ * FIXME: This doesn't seem very crash safe -- what if we fail after
+ * _bt_delete_and_insert() but before _bt_insertonpg()? We could
+ * crash and then lose some of the logical tuples that used to be
+ * contained within original posting list, but will now go into new
+ * righttup posting list.
+ */
+ _bt_delete_and_insert(rel, buf, lefttup, newitemoff);
+ _bt_insertonpg(rel, itup_key, buf, InvalidBuffer,
+ stack, righttup, newitemoff + 1, false);
+
+ pfree(ipd);
+ pfree(lefttup);
+ pfree(righttup);
+ }
+ else
+ {
+ ipd = palloc0(sizeof(ItemPointerData) * (nipd + 1));
+ elog(DEBUG4, "inserting inside posting list due to apparent overlap");
+
+ /* copy item pointers from original tuple into ipd */
+ memcpy(ipd, BTreeTupleGetPosting(origtup),
+ sizeof(ItemPointerData) * in_posting_offset);
+ /* add item pointer of the new tuple into ipd */
+ memcpy(ipd + in_posting_offset, itup, sizeof(ItemPointerData));
+ /* copy item pointers from old tuple into ipd */
+ memcpy(ipd + in_posting_offset + 1,
+ BTreeTupleGetPostingN(origtup, in_posting_offset),
+ sizeof(ItemPointerData) * (nipd - in_posting_offset));
+
+ newitup = BTreeFormPostingTuple(itup, ipd, nipd + 1);
+
+ _bt_delete_and_insert(rel, buf, newitup, newitemoff);
+
+ pfree(ipd);
+ pfree(newitup);
+ _bt_relbuf(rel, buf);
+ }
+}
+
/*----------
* _bt_insertonpg() -- Insert a tuple on a particular page in the index.
*
@@ -2290,3 +2533,186 @@ _bt_vacuum_one_page(Relation rel, Buffer buffer, Relation heapRel)
* the page.
*/
}
+
+/*
+ * Add new item (compressed or not) to the page, while compressing it.
+ * If insertion fails, an ERROR is raised; there is no return value
+ * to check.
+ */
+static void
+insert_itupprev_to_page(Page page, BTCompressState *compressState)
+{
+ IndexTuple to_insert;
+ OffsetNumber offnum = PageGetMaxOffsetNumber(page);
+
+ if (compressState->ntuples == 0)
+ to_insert = compressState->itupprev;
+ else
+ {
+ IndexTuple postingtuple;
+
+ /* form a tuple with a posting list */
+ postingtuple = BTreeFormPostingTuple(compressState->itupprev,
+ compressState->ipd,
+ compressState->ntuples);
+ to_insert = postingtuple;
+ pfree(compressState->ipd);
+ }
+
+ /* Add the new item into the page */
+ offnum = OffsetNumberNext(offnum);
+
+ elog(DEBUG4, "insert_itupprev_to_page. compressState->ntuples %d IndexTupleSize %zu free %zu",
+ compressState->ntuples, IndexTupleSize(to_insert), PageGetFreeSpace(page));
+
+ if (PageAddItem(page, (Item) to_insert, IndexTupleSize(to_insert),
+ offnum, false, false) == InvalidOffsetNumber)
+ elog(ERROR, "failed to add tuple to page while compressing it");
+
+ if (compressState->ntuples > 0)
+ pfree(to_insert);
+ compressState->ntuples = 0;
+}
+
+/*
+ * Before splitting the page, try to compress items to free some space.
+ * If compression didn't succeed, buffer will contain old state of the page.
+ * This function should be called after lp_dead items
+ * were removed by _bt_vacuum_one_page().
+ */
+static void
+_bt_compress_one_page(Relation rel, Buffer buffer, Relation heapRel)
+{
+ OffsetNumber offnum,
+ minoff,
+ maxoff;
+ Page page = BufferGetPage(buffer);
+ Page newpage;
+ BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+ bool use_compression = false;
+ BTCompressState *compressState = NULL;
+ int natts = IndexRelationGetNumberOfAttributes(rel);
+
+ /*
+ * Don't use compression for indexes with INCLUDEd columns and unique
+ * indexes.
+ */
+ use_compression = (IndexRelationGetNumberOfKeyAttributes(rel) ==
+ IndexRelationGetNumberOfAttributes(rel) &&
+ !rel->rd_index->indisunique);
+ if (!use_compression)
+ return;
+
+ /* init compress state needed to build posting tuples */
+ compressState = (BTCompressState *) palloc0(sizeof(BTCompressState));
+ compressState->ipd = NULL;
+ compressState->ntuples = 0;
+ compressState->itupprev = NULL;
+ compressState->maxitemsize = BTMaxItemSize(page);
+ compressState->maxpostingsize = 0;
+
+ /*
+ * Scan over all items to see which ones can be compressed
+ */
+ minoff = P_FIRSTDATAKEY(opaque);
+ maxoff = PageGetMaxOffsetNumber(page);
+ newpage = PageGetTempPageCopySpecial(page);
+ elog(DEBUG4, "_bt_compress_one_page rel: %s,blkno: %u",
+ RelationGetRelationName(rel), BufferGetBlockNumber(buffer));
+
+ /* Copy High Key if any */
+ if (!P_RIGHTMOST(opaque))
+ {
+ ItemId itemid = PageGetItemId(page, P_HIKEY);
+ Size itemsz = ItemIdGetLength(itemid);
+ IndexTuple item = (IndexTuple) PageGetItem(page, itemid);
+
+ if (PageAddItem(newpage, (Item) item, itemsz, P_HIKEY,
+ false, false) == InvalidOffsetNumber)
+ elog(ERROR, "failed to add highkey during compression");
+ }
+
+ /*
+ * Iterate over tuples on the page, try to compress them into posting
+ * lists and insert into new page.
+ */
+ for (offnum = minoff;
+ offnum <= maxoff;
+ offnum = OffsetNumberNext(offnum))
+ {
+ ItemId itemId = PageGetItemId(page, offnum);
+ IndexTuple itup = (IndexTuple) PageGetItem(page, itemId);
+
+ /*
+ * We do not expect to see any LP_DEAD items here, since this function
+ * is called right after _bt_vacuum_one_page(). If we do find one, skip
+ * it rather than merging it into a posting list. Note that a skipped
+ * item is not copied to the new page.
+ */
+ if (ItemIdIsDead(itemId))
+ continue;
+
+ if (compressState->itupprev != NULL)
+ {
+ int n_equal_atts =
+ _bt_keep_natts_fast(rel, compressState->itupprev, itup);
+ int itup_ntuples = BTreeTupleIsPosting(itup) ?
+ BTreeTupleGetNPosting(itup) : 1;
+
+ if (n_equal_atts > natts)
+ {
+ /*
+ * Tuples are equal: add itup's TID(s) to the pending posting list.
+ *
+ * If the posting list would get too big, flush it to the page first.
+ */
+ if (compressState->maxitemsize >
+ MAXALIGN(((IndexTupleSize(compressState->itupprev)
+ + (compressState->ntuples + itup_ntuples + 1) * sizeof(ItemPointerData)))))
+ {
+ _bt_add_posting_item(compressState, itup);
+ }
+ else
+ {
+ insert_itupprev_to_page(newpage, compressState);
+ }
+ }
+ else
+ {
+ insert_itupprev_to_page(newpage, compressState);
+ }
+ }
+
+ /*
+ * Copy the tuple into temp variable itupprev to compare it with the
+ * following tuple and maybe unite them into a posting tuple
+ */
+ if (compressState->itupprev)
+ pfree(compressState->itupprev);
+ compressState->itupprev = CopyIndexTuple(itup);
+
+ Assert(IndexTupleSize(compressState->itupprev) <= compressState->maxitemsize);
+ }
+
+ /* Handle the last item. */
+ insert_itupprev_to_page(newpage, compressState);
+
+ START_CRIT_SECTION();
+
+ PageRestoreTempPage(newpage, page);
+ MarkBufferDirty(buffer);
+
+ /* Log full page write */
+ if (RelationNeedsWAL(rel))
+ {
+ XLogRecPtr recptr;
+
+ recptr = log_newpage_buffer(buffer, true);
+ PageSetLSN(page, recptr);
+ }
+
+ END_CRIT_SECTION();
+
+ elog(DEBUG4, "_bt_compress_one_page. success");
+}
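The merging loop in _bt_compress_one_page() above can be hard to follow in diff form, so here is a minimal standalone sketch of the same idea: walk a sorted run of items, extend the pending group while the key repeats and the group has room, and otherwise start a new group. Entry, Group, group_duplicates() and the max_tids cap are hypothetical stand-ins invented for this illustration (the patch itself limits by byte size via BTMaxItemSize, not by a TID count); this is not the patch's code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical simplified stand-ins; not PostgreSQL's real types. */
typedef struct { int key; unsigned tid; } Entry;
typedef struct { int key; size_t first; size_t ntids; } Group;

/*
 * Merge runs of equal keys in a sorted array into groups, each holding at
 * most max_tids TIDs (max_tids must be >= 1). A full group is closed and
 * the current entry starts a new one, mirroring "flush the pending posting
 * tuple, then continue with the current tuple".
 *
 * Returns the number of groups written to out[], which must have room for
 * n groups.
 */
static size_t
group_duplicates(const Entry *in, size_t n, size_t max_tids, Group *out)
{
    size_t ngroups = 0;

    for (size_t i = 0; i < n; i++)
    {
        if (ngroups > 0 &&
            out[ngroups - 1].key == in[i].key &&
            out[ngroups - 1].ntids < max_tids)
        {
            /* same key and still room: extend the pending group */
            out[ngroups - 1].ntids++;
        }
        else
        {
            /* new key, or pending group is full: start a new group */
            out[ngroups].key = in[i].key;
            out[ngroups].first = i;
            out[ngroups].ntids = 1;
            ngroups++;
        }
    }
    return ngroups;
}
```

With a cap of 2, three duplicates of one key end up as two groups, which is the analogue of a posting list being split once it reaches the maximum item size.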
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 9c1f7de60f..86c662d4e6 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -983,14 +983,52 @@ _bt_page_recyclable(Page page)
void
_bt_delitems_vacuum(Relation rel, Buffer buf,
OffsetNumber *itemnos, int nitems,
+ OffsetNumber *remainingoffset,
+ IndexTuple *remaining, int nremaining,
BlockNumber lastBlockVacuumed)
{
Page page = BufferGetPage(buf);
BTPageOpaque opaque;
+ Size itemsz;
+ Size remaining_sz = 0;
+ char *remaining_buf = NULL;
+
+ /* XLOG stuff: pack the remaining tuples into one buffer */
+ if (nremaining && RelationNeedsWAL(rel))
+ {
+ Size offset = 0;
+
+ for (int i = 0; i < nremaining; i++)
+ remaining_sz += MAXALIGN(IndexTupleSize(remaining[i]));
+
+ remaining_buf = palloc0(remaining_sz);
+ for (int i = 0; i < nremaining; i++)
+ {
+ itemsz = IndexTupleSize(remaining[i]);
+ memcpy(remaining_buf + offset, (char *) remaining[i], itemsz);
+ offset += MAXALIGN(itemsz);
+ }
+ Assert(offset == remaining_sz);
+ }
/* No ereport(ERROR) until changes are logged */
START_CRIT_SECTION();
+ /* Handle posting tuples here */
+ for (int i = 0; i < nremaining; i++)
+ {
+ /* First, delete the old tuple. */
+ PageIndexTupleDelete(page, remainingoffset[i]);
+
+ itemsz = IndexTupleSize(remaining[i]);
+ itemsz = MAXALIGN(itemsz);
+
+ /* Add tuple with remaining ItemPointers to the page. */
+ if (PageAddItem(page, (Item) remaining[i], itemsz, remainingoffset[i],
+ false, false) == InvalidOffsetNumber)
+ elog(ERROR, "failed to rewrite compressed item in index while doing vacuum");
+ }
+
/* Fix the page */
if (nitems > 0)
PageIndexMultiDelete(page, itemnos, nitems);
@@ -1020,6 +1058,8 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
xl_btree_vacuum xlrec_vacuum;
xlrec_vacuum.lastBlockVacuumed = lastBlockVacuumed;
+ xlrec_vacuum.nremaining = nremaining;
+ xlrec_vacuum.ndeleted = nitems;
XLogBeginInsert();
XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
@@ -1033,6 +1073,19 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
if (nitems > 0)
XLogRegisterBufData(0, (char *) itemnos, nitems * sizeof(OffsetNumber));
+ /*
+ * Also log the offsets of the rewritten posting tuples and the
+ * tuples themselves. Order matters during replay: the remaining
+ * tuples must be restored first, and only then the deleted items.
+ */
+ if (nremaining > 0)
+ {
+ Assert(remaining_buf != NULL);
+ XLogRegisterBufData(0, (char *) remainingoffset,
+ nremaining * sizeof(OffsetNumber));
+ XLogRegisterBufData(0, remaining_buf, remaining_sz);
+ }
+
recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_VACUUM);
PageSetLSN(page, recptr);
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 4cfd5289ad..22fb228b81 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -97,6 +97,8 @@ static void btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
BTCycleId cycleid, TransactionId *oldestBtpoXact);
static void btvacuumpage(BTVacState *vstate, BlockNumber blkno,
BlockNumber orig_blkno);
+static ItemPointer btreevacuumPosting(BTVacState *vstate, IndexTuple itup,
+ int *nremaining);
/*
@@ -1069,7 +1071,8 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
RBM_NORMAL, info->strategy);
LockBufferForCleanup(buf);
_bt_checkpage(rel, buf);
- _bt_delitems_vacuum(rel, buf, NULL, 0, vstate.lastBlockVacuumed);
+ _bt_delitems_vacuum(rel, buf, NULL, 0, NULL, NULL, 0,
+ vstate.lastBlockVacuumed);
_bt_relbuf(rel, buf);
}
@@ -1193,6 +1196,9 @@ restart:
OffsetNumber offnum,
minoff,
maxoff;
+ IndexTuple remaining[MaxOffsetNumber];
+ OffsetNumber remainingoffset[MaxOffsetNumber];
+ int nremaining;
/*
* Trade in the initial read lock for a super-exclusive write lock on
@@ -1229,6 +1235,7 @@ restart:
* callback function.
*/
ndeletable = 0;
+ nremaining = 0;
minoff = P_FIRSTDATAKEY(opaque);
maxoff = PageGetMaxOffsetNumber(page);
if (callback)
@@ -1242,31 +1249,78 @@ restart:
itup = (IndexTuple) PageGetItem(page,
PageGetItemId(page, offnum));
- htup = &(itup->t_tid);
- /*
- * During Hot Standby we currently assume that
- * XLOG_BTREE_VACUUM records do not produce conflicts. That is
- * only true as long as the callback function depends only
- * upon whether the index tuple refers to heap tuples removed
- * in the initial heap scan. When vacuum starts it derives a
- * value of OldestXmin. Backends taking later snapshots could
- * have a RecentGlobalXmin with a later xid than the vacuum's
- * OldestXmin, so it is possible that row versions deleted
- * after OldestXmin could be marked as killed by other
- * backends. The callback function *could* look at the index
- * tuple state in isolation and decide to delete the index
- * tuple, though currently it does not. If it ever did, we
- * would need to reconsider whether XLOG_BTREE_VACUUM records
- * should cause conflicts. If they did cause conflicts they
- * would be fairly harsh conflicts, since we haven't yet
- * worked out a way to pass a useful value for
- * latestRemovedXid on the XLOG_BTREE_VACUUM records. This
- * applies to *any* type of index that marks index tuples as
- * killed.
- */
- if (callback(htup, callback_state))
- deletable[ndeletable++] = offnum;
+ if (BTreeTupleIsPosting(itup))
+ {
+ int nnewipd = 0;
+ ItemPointer newipd = NULL;
+
+ newipd = btreevacuumPosting(vstate, itup, &nnewipd);
+
+ if (nnewipd == 0)
+ {
+ /*
+ * All TIDs in the posting list must be deleted, so we
+ * can delete the whole tuple in the regular way.
+ */
+ deletable[ndeletable++] = offnum;
+ }
+ else if (nnewipd == BTreeTupleGetNPosting(itup))
+ {
+ /*
+ * All TIDs in the posting tuple must remain. Nothing
+ * to do except free the copied list.
+ */
+ pfree(newipd);
+ }
+ else if (nnewipd < BTreeTupleGetNPosting(itup))
+ {
+ /* Some TIDs from posting tuple must remain. */
+ Assert(nnewipd > 0);
+ Assert(newipd != NULL);
+
+ /*
+ * Form new tuple that contains only remaining TIDs.
+ * Remember this tuple and the offset of the old tuple
+ * to update it in place.
+ */
+ remainingoffset[nremaining] = offnum;
+ remaining[nremaining] = BTreeFormPostingTuple(itup, newipd, nnewipd);
+ nremaining++;
+ pfree(newipd);
+
+ Assert(IndexTupleSize(remaining[nremaining - 1]) <= BTMaxItemSize(page));
+ }
+ }
+ else
+ {
+ htup = &(itup->t_tid);
+
+ /*
+ * During Hot Standby we currently assume that
+ * XLOG_BTREE_VACUUM records do not produce conflicts.
+ * That is only true as long as the callback function
+ * depends only upon whether the index tuple refers to
+ * heap tuples removed in the initial heap scan. When
+ * vacuum starts it derives a value of OldestXmin.
+ * Backends taking later snapshots could have a
+ * RecentGlobalXmin with a later xid than the vacuum's
+ * OldestXmin, so it is possible that row versions deleted
+ * after OldestXmin could be marked as killed by other
+ * backends. The callback function *could* look at the
+ * index tuple state in isolation and decide to delete the
+ * index tuple, though currently it does not. If it ever
+ * did, we would need to reconsider whether
+ * XLOG_BTREE_VACUUM records should cause conflicts. If
+ * they did cause conflicts they would be fairly harsh
+ * conflicts, since we haven't yet worked out a way to
+ * pass a useful value for latestRemovedXid on the
+ * XLOG_BTREE_VACUUM records. This applies to *any* type
+ * of index that marks index tuples as killed.
+ */
+ if (callback(htup, callback_state))
+ deletable[ndeletable++] = offnum;
+ }
}
}
@@ -1274,7 +1328,7 @@ restart:
* Apply any needed deletes. We issue just one _bt_delitems_vacuum()
* call per page, so as to minimize WAL traffic.
*/
- if (ndeletable > 0)
+ if (ndeletable > 0 || nremaining > 0)
{
/*
* Notice that the issued XLOG_BTREE_VACUUM WAL record includes
@@ -1291,6 +1345,7 @@ restart:
* that.
*/
_bt_delitems_vacuum(rel, buf, deletable, ndeletable,
+ remainingoffset, remaining, nremaining,
vstate->lastBlockVacuumed);
/*
@@ -1375,6 +1430,41 @@ restart:
}
}
+/*
+ * btreevacuumPosting() -- vacuums a posting tuple.
+ *
+ * Returns new palloc'd posting list with remaining items.
+ * Posting list size is returned via nremaining.
+ *
+ * If all items are dead,
+ * nremaining is 0 and resulting posting list is NULL.
+ */
+static ItemPointer
+btreevacuumPosting(BTVacState *vstate, IndexTuple itup, int *nremaining)
+{
+ int remaining = 0;
+ int nitem = BTreeTupleGetNPosting(itup);
+ ItemPointer tmpitems = NULL,
+ items = BTreeTupleGetPosting(itup);
+
+ /*
+ * Check each TID in the posting list; save live TIDs into tmpitems.
+ */
+ for (int i = 0; i < nitem; i++)
+ {
+ if (vstate->callback(items + i, vstate->callback_state))
+ continue;
+
+ if (tmpitems == NULL)
+ tmpitems = palloc(sizeof(ItemPointerData) * nitem);
+
+ tmpitems[remaining++] = items[i];
+ }
+
+ *nremaining = remaining;
+ return tmpitems;
+}
+
/*
* btcanreturn() -- Check whether btree indexes support index-only scans.
*
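For readers skimming the nbtree.c changes above, the three outcomes that btvacuumpage() distinguishes for a posting tuple (delete the whole tuple, leave it alone, or rewrite it with the surviving TIDs) can be sketched in isolation as follows. Tid, DeadCallback, PostingAction and vacuum_posting() are hypothetical names made up for this illustration, with a plain integer standing in for ItemPointerData; the real code works through the vacuum callback and BTreeFormPostingTuple().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for ItemPointerData. */
typedef unsigned Tid;

/* Returns true if the heap tuple behind this TID is dead. */
typedef bool (*DeadCallback)(Tid tid, void *state);

typedef enum
{
    POSTING_DELETE_ALL,     /* no TID survives: delete the index tuple */
    POSTING_KEEP_ALL,       /* every TID survives: leave the tuple as is */
    POSTING_REWRITE         /* some survive: rewrite with the live TIDs */
} PostingAction;

/*
 * Copy the live TIDs of a posting list into live[] (room for ntids
 * entries), store their count in *nlive, and report what the caller
 * should do with the tuple.
 */
static PostingAction
vacuum_posting(const Tid *tids, size_t ntids,
               DeadCallback is_dead, void *state,
               Tid *live, size_t *nlive)
{
    size_t kept = 0;

    for (size_t i = 0; i < ntids; i++)
    {
        if (!is_dead(tids[i], state))
            live[kept++] = tids[i];
    }

    *nlive = kept;
    if (kept == 0)
        return POSTING_DELETE_ALL;
    if (kept == ntids)
        return POSTING_KEEP_ALL;
    return POSTING_REWRITE;
}

/* Example callback for experimentation: treats odd TIDs as dead. */
static bool
odd_is_dead(Tid tid, void *state)
{
    (void) state;
    return (tid % 2) != 0;
}
```

Only the POSTING_REWRITE case needs the extra "remaining" arrays that the patch passes to _bt_delitems_vacuum(); the other two cases reduce to the existing behaviour.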
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 19735bf733..20975970d6 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -30,6 +30,9 @@ static bool _bt_readpage(IndexScanDesc scan, ScanDirection dir,
OffsetNumber offnum);
static void _bt_saveitem(BTScanOpaque so, int itemIndex,
OffsetNumber offnum, IndexTuple itup);
+static void _bt_savepostingitem(BTScanOpaque so, int itemIndex,
+ OffsetNumber offnum, ItemPointer iptr,
+ IndexTuple itup, int i);
static bool _bt_steppage(IndexScanDesc scan, ScanDirection dir);
static bool _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir);
static bool _bt_parallel_readpage(IndexScanDesc scan, BlockNumber blkno,
@@ -504,7 +507,8 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
/* We have low <= mid < high, so mid points at a real slot */
- result = _bt_compare(rel, key, page, mid);
+ result = _bt_compare_posting(rel, key, page, mid,
+ &(insertstate->in_posting_offset));
if (result >= cmpval)
low = mid + 1;
@@ -533,6 +537,55 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
return low;
}
+/*
+ * Compare insertion-type scankey to tuple on a page,
+ * taking into account posting tuples.
+ * If the key of the posting tuple is equal to scankey,
+ * find exact position inside the posting list,
+ * using TID as extra attribute.
+ */
+int32
+_bt_compare_posting(Relation rel,
+ BTScanInsert key,
+ Page page,
+ OffsetNumber offnum,
+ int *in_posting_offset)
+{
+ IndexTuple itup;
+ int result;
+
+ itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
+ result = _bt_compare(rel, key, page, offnum);
+
+ if (BTreeTupleIsPosting(itup) && result == 0)
+ {
+ int low,
+ high,
+ mid,
+ res;
+
+ low = 0;
+ /* "high" is past end of posting list for loop invariant */
+ high = BTreeTupleGetNPosting(itup);
+
+ while (high > low)
+ {
+ mid = low + ((high - low) / 2);
+ res = ItemPointerCompare(key->scantid,
+ BTreeTupleGetPostingN(itup, mid));
+
+ if (res > 0)
+ low = mid + 1;
+ else
+ high = mid;
+ }
+
+ *in_posting_offset = high;
+ }
+
+ return result;
+}
+
/*----------
* _bt_compare() -- Compare insertion-type scankey to tuple on a page.
*
@@ -665,61 +718,120 @@ _bt_compare(Relation rel,
* Use the heap TID attribute and scantid to try to break the tie. The
* rules are the same as any other key attribute -- only the
* representation differs.
+ *
+ * When itup is a posting tuple, the check is more complex, because
+ * scantid may fall inside the TID range covered by the tuple's posting
+ * list.
+ *
+ * _bt_compare() is multipurpose, so in that case it just returns 0 to
+ * report that the key matches the tuple at this offset.
+ *
+ * Callers that need the exact position of the scankey inside the posting
+ * list should use the _bt_compare_posting() wrapper, which rechecks the
+ * posting tuple.
*/
- heapTid = BTreeTupleGetHeapTID(itup);
- if (key->scantid == NULL)
+ if (!BTreeTupleIsPosting(itup))
{
+ heapTid = BTreeTupleGetHeapTID(itup);
+ if (key->scantid == NULL)
+ {
+ /*
+ * Most searches have a scankey that is considered greater than a
+ * truncated pivot tuple if and when the scankey has equal values
+ * for attributes up to and including the least significant
+ * untruncated attribute in tuple.
+ *
+ * For example, if an index has the minimum two attributes (single
+ * user key attribute, plus heap TID attribute), and a page's high
+ * key is ('foo', -inf), and scankey is ('foo', <omitted>), the
+ * search will not descend to the page to the left. The search
+ * will descend right instead. The truncated attribute in pivot
+ * tuple means that all non-pivot tuples on the page to the left
+ * are strictly < 'foo', so it isn't necessary to descend left. In
+ * other words, search doesn't have to descend left because it
+ * isn't interested in a match that has a heap TID value of -inf.
+ *
+ * However, some searches (pivotsearch searches) actually require
+ * that we descend left when this happens. -inf is treated as a
+ * possible match for omitted scankey attribute(s). This is
+ * needed by page deletion, which must re-find leaf pages that are
+ * targets for deletion using their high keys.
+ *
+ * Note: the heap TID part of the test ensures that scankey is
+ * being compared to a pivot tuple with one or more truncated key
+ * attributes.
+ *
+ * Note: pg_upgrade'd !heapkeyspace indexes must always descend to
+ * the left here, since they have no heap TID attribute (and
+ * cannot have any -inf key values in any case, since truncation
+ * can only remove non-key attributes). !heapkeyspace searches
+ * must always be prepared to deal with matches on both sides of
+ * the pivot once the leaf level is reached.
+ */
+ if (key->heapkeyspace && !key->pivotsearch &&
+ key->keysz == ntupatts && heapTid == NULL)
+ return 1;
+
+ /* All provided scankey arguments found to be equal */
+ return 0;
+ }
+
/*
- * Most searches have a scankey that is considered greater than a
- * truncated pivot tuple if and when the scankey has equal values for
- * attributes up to and including the least significant untruncated
- * attribute in tuple.
- *
- * For example, if an index has the minimum two attributes (single
- * user key attribute, plus heap TID attribute), and a page's high key
- * is ('foo', -inf), and scankey is ('foo', <omitted>), the search
- * will not descend to the page to the left. The search will descend
- * right instead. The truncated attribute in pivot tuple means that
- * all non-pivot tuples on the page to the left are strictly < 'foo',
- * so it isn't necessary to descend left. In other words, search
- * doesn't have to descend left because it isn't interested in a match
- * that has a heap TID value of -inf.
- *
- * However, some searches (pivotsearch searches) actually require that
- * we descend left when this happens. -inf is treated as a possible
- * match for omitted scankey attribute(s). This is needed by page
- * deletion, which must re-find leaf pages that are targets for
- * deletion using their high keys.
- *
- * Note: the heap TID part of the test ensures that scankey is being
- * compared to a pivot tuple with one or more truncated key
- * attributes.
- *
- * Note: pg_upgrade'd !heapkeyspace indexes must always descend to the
- * left here, since they have no heap TID attribute (and cannot have
- * any -inf key values in any case, since truncation can only remove
- * non-key attributes). !heapkeyspace searches must always be
- * prepared to deal with matches on both sides of the pivot once the
- * leaf level is reached.
+ * Treat truncated heap TID as minus infinity, since scankey has a key
+ * attribute value (scantid) that would otherwise be compared directly
*/
- if (key->heapkeyspace && !key->pivotsearch &&
- key->keysz == ntupatts && heapTid == NULL)
+ Assert(key->keysz == IndexRelationGetNumberOfKeyAttributes(rel));
+ if (heapTid == NULL)
return 1;
- /* All provided scankey arguments found to be equal */
- return 0;
+ Assert(ntupatts >= IndexRelationGetNumberOfKeyAttributes(rel));
+ return ItemPointerCompare(key->scantid, heapTid);
+ }
+ else
+ {
+ heapTid = BTreeTupleGetHeapTID(itup);
+ if (key->scantid != NULL && heapTid != NULL)
+ {
+ int cmp = ItemPointerCompare(key->scantid, heapTid);
+
+ if (cmp <= 0)
+ {
+ elog(DEBUG4, "offnum %d Scankey (%u,%u) is less than or equal to posting tuple (%u,%u)",
+ offnum, ItemPointerGetBlockNumberNoCheck(key->scantid),
+ ItemPointerGetOffsetNumberNoCheck(key->scantid),
+ ItemPointerGetBlockNumberNoCheck(heapTid),
+ ItemPointerGetOffsetNumberNoCheck(heapTid));
+ return cmp;
+ }
+
+ heapTid = BTreeTupleGetMaxTID(itup);
+ cmp = ItemPointerCompare(key->scantid, heapTid);
+ if (cmp > 0)
+ {
+ elog(DEBUG4, "offnum %d Scankey (%u,%u) is greater than posting tuple (%u,%u)",
+ offnum, ItemPointerGetBlockNumberNoCheck(key->scantid),
+ ItemPointerGetOffsetNumberNoCheck(key->scantid),
+ ItemPointerGetBlockNumberNoCheck(heapTid),
+ ItemPointerGetOffsetNumberNoCheck(heapTid));
+ return cmp;
+ }
+
+ /*
+ * If we got here, scantid falls between the first and last TIDs
+ * of the posting tuple.
+ */
+ elog(DEBUG4, "offnum %d Scankey (%u,%u) is between posting items (%u,%u) and (%u,%u)",
+ offnum, ItemPointerGetBlockNumberNoCheck(key->scantid),
+ ItemPointerGetOffsetNumberNoCheck(key->scantid),
+ ItemPointerGetBlockNumberNoCheck(BTreeTupleGetHeapTID(itup)),
+ ItemPointerGetOffsetNumberNoCheck(BTreeTupleGetHeapTID(itup)),
+ ItemPointerGetBlockNumberNoCheck(heapTid),
+ ItemPointerGetOffsetNumberNoCheck(heapTid));
+ return 0;
+ }
}
- /*
- * Treat truncated heap TID as minus infinity, since scankey has a key
- * attribute value (scantid) that would otherwise be compared directly
- */
- Assert(key->keysz == IndexRelationGetNumberOfKeyAttributes(rel));
- if (heapTid == NULL)
- return 1;
-
- Assert(ntupatts >= IndexRelationGetNumberOfKeyAttributes(rel));
- return ItemPointerCompare(key->scantid, heapTid);
+ return 0;
}
/*
@@ -1456,6 +1568,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
/* initialize tuple workspace to empty */
so->currPos.nextTupleOffset = 0;
+ so->currPos.prevTupleOffset = 0;
/*
* Now that the current page has been made consistent, the macro should be
@@ -1490,8 +1603,22 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
if (_bt_checkkeys(scan, itup, indnatts, dir, &continuescan))
{
/* tuple passes all scan key conditions, so remember it */
- _bt_saveitem(so, itemIndex, offnum, itup);
- itemIndex++;
+ if (!BTreeTupleIsPosting(itup))
+ {
+ _bt_saveitem(so, itemIndex, offnum, itup);
+ itemIndex++;
+ }
+ else
+ {
+ /* Return posting list "logical" tuples */
+ for (int i = 0; i < BTreeTupleGetNPosting(itup); i++)
+ {
+ _bt_savepostingitem(so, itemIndex, offnum,
+ BTreeTupleGetPostingN(itup, i),
+ itup, i);
+ itemIndex++;
+ }
+ }
}
/* When !continuescan, there can't be any more matches, so stop */
if (!continuescan)
@@ -1524,7 +1651,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
if (!continuescan)
so->currPos.moreRight = false;
- Assert(itemIndex <= MaxIndexTuplesPerPage);
+ Assert(itemIndex <= MaxPostingIndexTuplesPerPage);
so->currPos.firstItem = 0;
so->currPos.lastItem = itemIndex - 1;
so->currPos.itemIndex = 0;
@@ -1532,7 +1659,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
else
{
/* load items[] in descending order */
- itemIndex = MaxIndexTuplesPerPage;
+ itemIndex = MaxPostingIndexTuplesPerPage;
offnum = Min(offnum, maxoff);
@@ -1574,8 +1701,23 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
if (passes_quals && tuple_alive)
{
/* tuple passes all scan key conditions, so remember it */
- itemIndex--;
- _bt_saveitem(so, itemIndex, offnum, itup);
+ if (!BTreeTupleIsPosting(itup))
+ {
+ itemIndex--;
+ _bt_saveitem(so, itemIndex, offnum, itup);
+ }
+ else
+ {
+ /* Return posting list "logical" tuples */
+ /* XXX: Maybe this loop should be backwards? */
+ for (int i = 0; i < BTreeTupleGetNPosting(itup); i++)
+ {
+ itemIndex--;
+ _bt_savepostingitem(so, itemIndex, offnum,
+ BTreeTupleGetPostingN(itup, i),
+ itup, i);
+ }
+ }
}
if (!continuescan)
{
@@ -1589,8 +1731,8 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
Assert(itemIndex >= 0);
so->currPos.firstItem = itemIndex;
- so->currPos.lastItem = MaxIndexTuplesPerPage - 1;
- so->currPos.itemIndex = MaxIndexTuplesPerPage - 1;
+ so->currPos.lastItem = MaxPostingIndexTuplesPerPage - 1;
+ so->currPos.itemIndex = MaxPostingIndexTuplesPerPage - 1;
}
return (so->currPos.firstItem <= so->currPos.lastItem);
@@ -1603,6 +1745,8 @@ _bt_saveitem(BTScanOpaque so, int itemIndex,
{
BTScanPosItem *currItem = &so->currPos.items[itemIndex];
+ Assert(!BTreeTupleIsPosting(itup));
+
currItem->heapTid = itup->t_tid;
currItem->indexOffset = offnum;
if (so->currTuples)
@@ -1615,6 +1759,33 @@ _bt_saveitem(BTScanOpaque so, int itemIndex,
}
}
+/* Save an index item into so->currPos.items[itemIndex] for posting tuples. */
+static void
+_bt_savepostingitem(BTScanOpaque so, int itemIndex, OffsetNumber offnum,
+ ItemPointer iptr, IndexTuple itup, int i)
+{
+ BTScanPosItem *currItem = &so->currPos.items[itemIndex];
+
+ currItem->heapTid = *iptr;
+ currItem->indexOffset = offnum;
+
+ if (so->currTuples)
+ {
+ if (i == 0)
+ {
+ /* Save the key once; it is shared by all TIDs in the posting list */
+ Size itupsz = BTreeTupleGetPostingOffset(itup);
+
+ currItem->tupleOffset = so->currPos.nextTupleOffset;
+ memcpy(so->currTuples + so->currPos.nextTupleOffset, itup, itupsz);
+ so->currPos.nextTupleOffset += MAXALIGN(itupsz);
+ so->currPos.prevTupleOffset = currItem->tupleOffset;
+ }
+ else
+ currItem->tupleOffset = so->currPos.prevTupleOffset;
+ }
+}
+
/*
* _bt_steppage() -- Step to next page containing valid data for scan
*
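The inner loop of _bt_compare_posting() in the nbtsearch.c changes above is a lower-bound binary search over the posting list, keyed by heap TID. A minimal standalone sketch of that search follows, assuming the posting list is sorted by TID. Tid, tid_compare() and posting_lower_bound() are hypothetical names used only for this illustration; the patch itself uses ItemPointerCompare() and BTreeTupleGetPostingN().

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for ItemPointerData: (block, offset) pair. */
typedef struct { unsigned block; unsigned short offset; } Tid;

/* Three-way comparison: block number first, then offset. */
static int
tid_compare(const Tid *a, const Tid *b)
{
    if (a->block != b->block)
        return a->block < b->block ? -1 : 1;
    if (a->offset != b->offset)
        return a->offset < b->offset ? -1 : 1;
    return 0;
}

/*
 * Return the index of the first entry in the sorted posting list that is
 * >= *scantid, or ntids if every entry is smaller. "high" starts one past
 * the end so that the loop invariant low <= answer <= high holds.
 */
static size_t
posting_lower_bound(const Tid *posting, size_t ntids, const Tid *scantid)
{
    size_t low = 0;
    size_t high = ntids;

    while (low < high)
    {
        size_t mid = low + (high - low) / 2;

        if (tid_compare(scantid, &posting[mid]) > 0)
            low = mid + 1;      /* scantid is after posting[mid] */
        else
            high = mid;         /* posting[mid] is a candidate */
    }
    return high;
}
```

The returned index plays the role of in_posting_offset: it is where a new TID would have to be inserted to keep the posting list sorted.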
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index b30cf9e989..b058599aa4 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -288,6 +288,8 @@ static void _bt_sortaddtup(Page page, Size itemsize,
static void _bt_buildadd(BTWriteState *wstate, BTPageState *state,
IndexTuple itup);
static void _bt_uppershutdown(BTWriteState *wstate, BTPageState *state);
+static void _bt_buildadd_posting(BTWriteState *wstate, BTPageState *state,
+ BTCompressState *compressState);
static void _bt_load(BTWriteState *wstate,
BTSpool *btspool, BTSpool *btspool2);
static void _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent,
@@ -972,6 +974,11 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
* only shift the line pointer array back and forth, and overwrite
* the tuple space previously occupied by oitup. This is fairly
* cheap.
+ *
+ * If the lastleft tuple is a posting tuple, _bt_truncate will
+ * truncate its posting list as well. Note that this applies
+ * only to leaf pages, since internal pages never contain
+ * posting tuples.
*/
ii = PageGetItemId(opage, OffsetNumberPrev(last_off));
lastleft = (IndexTuple) PageGetItem(opage, ii);
@@ -1011,6 +1018,7 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
* the minimum key for the new page.
*/
state->btps_minkey = CopyIndexTuple(oitup);
+ Assert(BTreeTupleIsPivot(state->btps_minkey));
/*
* Set the sibling links for both pages.
@@ -1052,6 +1060,7 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
Assert(state->btps_minkey == NULL);
state->btps_minkey = CopyIndexTuple(itup);
/* _bt_sortaddtup() will perform full truncation later */
+ BTreeTupleClearBtIsPosting(state->btps_minkey);
BTreeTupleSetNAtts(state->btps_minkey, 0);
}
@@ -1136,6 +1145,91 @@ _bt_uppershutdown(BTWriteState *wstate, BTPageState *state)
_bt_blwritepage(wstate, metapage, BTREE_METAPAGE);
}
+/*
+ * Add new tuple (posting or non-posting) to the page while building index.
+ */
+static void
+_bt_buildadd_posting(BTWriteState *wstate, BTPageState *state,
+ BTCompressState *compressState)
+{
+ IndexTuple to_insert;
+
+ /* Return if there is no tuple to insert */
+ if (state == NULL)
+ return;
+
+ if (compressState->ntuples == 0)
+ to_insert = compressState->itupprev;
+ else
+ {
+ IndexTuple postingtuple;
+
+ /* form a tuple with a posting list */
+ postingtuple = BTreeFormPostingTuple(compressState->itupprev,
+ compressState->ipd,
+ compressState->ntuples);
+ to_insert = postingtuple;
+ pfree(compressState->ipd);
+ }
+
+ _bt_buildadd(wstate, state, to_insert);
+
+ if (compressState->ntuples > 0)
+ pfree(to_insert);
+ compressState->ntuples = 0;
+}
+
+/*
+ * Save item pointer(s) of itup to the posting list in compressState.
+ *
+ * Helper function for _bt_load() and _bt_compress_one_page().
+ *
+ * Note: caller is responsible for size check to ensure that resulting tuple
+ * won't exceed BTMaxItemSize.
+ */
+void
+_bt_add_posting_item(BTCompressState *compressState, IndexTuple itup)
+{
+ int nposting = 0;
+
+ if (compressState->ntuples == 0)
+ {
+ compressState->ipd = palloc0(compressState->maxitemsize);
+
+ if (BTreeTupleIsPosting(compressState->itupprev))
+ {
+ /* if itupprev is posting, add all its TIDs to the posting list */
+ nposting = BTreeTupleGetNPosting(compressState->itupprev);
+ memcpy(compressState->ipd,
+ BTreeTupleGetPosting(compressState->itupprev),
+ sizeof(ItemPointerData) * nposting);
+ compressState->ntuples += nposting;
+ }
+ else
+ {
+ memcpy(compressState->ipd, compressState->itupprev,
+ sizeof(ItemPointerData));
+ compressState->ntuples++;
+ }
+ }
+
+ if (BTreeTupleIsPosting(itup))
+ {
+ /* if tuple is posting, add all its TIDs to the posting list */
+ nposting = BTreeTupleGetNPosting(itup);
+ memcpy(compressState->ipd + compressState->ntuples,
+ BTreeTupleGetPosting(itup),
+ sizeof(ItemPointerData) * nposting);
+ compressState->ntuples += nposting;
+ }
+ else
+ {
+ memcpy(compressState->ipd + compressState->ntuples, itup,
+ sizeof(ItemPointerData));
+ compressState->ntuples++;
+ }
+}
+
/*
* Read tuples in correct sort order from tuplesort, and load them into
* btree leaves.
@@ -1150,9 +1244,20 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
bool load1;
TupleDesc tupdes = RelationGetDescr(wstate->index);
int i,
- keysz = IndexRelationGetNumberOfKeyAttributes(wstate->index);
+ keysz = IndexRelationGetNumberOfKeyAttributes(wstate->index),
+ natts = IndexRelationGetNumberOfAttributes(wstate->index);
SortSupport sortKeys;
int64 tuples_done = 0;
+ bool use_compression = false;
+ BTCompressState *compressState = NULL;
+
+ /*
+ * Don't use compression for indexes with INCLUDEd columns and unique
+ * indexes.
+ */
+ use_compression = (IndexRelationGetNumberOfKeyAttributes(wstate->index) ==
+ IndexRelationGetNumberOfAttributes(wstate->index) &&
+ !wstate->index->rd_index->indisunique);
if (merge)
{
@@ -1266,19 +1371,89 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
}
else
{
- /* merge is unnecessary */
- while ((itup = tuplesort_getindextuple(btspool->sortstate,
- true)) != NULL)
+ if (!use_compression)
{
- /* When we see first tuple, create first index page */
- if (state == NULL)
- state = _bt_pagestate(wstate, 0);
+ /* merge is unnecessary */
+ while ((itup = tuplesort_getindextuple(btspool->sortstate,
+ true)) != NULL)
+ {
+ /* When we see first tuple, create first index page */
+ if (state == NULL)
+ state = _bt_pagestate(wstate, 0);
- _bt_buildadd(wstate, state, itup);
+ _bt_buildadd(wstate, state, itup);
- /* Report progress */
- pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
- ++tuples_done);
+ /* Report progress */
+ pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
+ ++tuples_done);
+ }
+ }
+ else
+ {
+ /* init compress state needed to build posting tuples */
+ compressState = (BTCompressState *) palloc0(sizeof(BTCompressState));
+ compressState->ipd = NULL;
+ compressState->ntuples = 0;
+ compressState->itupprev = NULL;
+ compressState->maxitemsize = 0;
+ compressState->maxpostingsize = 0;
+
+ while ((itup = tuplesort_getindextuple(btspool->sortstate,
+ true)) != NULL)
+ {
+ /* When we see first tuple, create first index page */
+ if (state == NULL)
+ {
+ state = _bt_pagestate(wstate, 0);
+ compressState->maxitemsize = BTMaxItemSize(state->btps_page);
+ }
+
+ if (compressState->itupprev != NULL)
+ {
+ int n_equal_atts = _bt_keep_natts_fast(wstate->index,
+ compressState->itupprev, itup);
+
+ if (n_equal_atts > natts)
+ {
+ /*
+ * Tuples are equal. Create or update the posting list.
+ *
+ * If the posting list would become too big, insert it on
+ * the page and continue.
+ */
+ if ((compressState->ntuples + 1) * sizeof(ItemPointerData) <
+ compressState->maxpostingsize)
+ _bt_add_posting_item(compressState, itup);
+ else
+ _bt_buildadd_posting(wstate, state,
+ compressState);
+ }
+ else
+ {
+ /*
+ * Tuples are not equal. Insert itupprev into index.
+ * Save current tuple for the next iteration.
+ */
+ _bt_buildadd_posting(wstate, state, compressState);
+ }
+ }
+
+ /*
+ * Save the tuple to compare it with the next one and maybe
+ * unite them into a posting tuple.
+ */
+ if (compressState->itupprev)
+ pfree(compressState->itupprev);
+ compressState->itupprev = CopyIndexTuple(itup);
+
+ /* compute max size of posting list */
+ compressState->maxpostingsize = compressState->maxitemsize -
+ IndexInfoFindDataOffset(compressState->itupprev->t_info) -
+ MAXALIGN(IndexTupleSize(compressState->itupprev));
+ }
+
+ /* Handle the last item */
+ _bt_buildadd_posting(wstate, state, compressState);
}
}
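The build loop above can be pictured with a small standalone sketch. It is not the patch's code: DemoTuple, demo_count_items and the max_tids cap are hypothetical stand-ins for IndexTuple, _bt_buildadd_posting and maxpostingsize. It only counts how many items a pass over sorted input would write when runs of equal keys are merged into bounded posting lists.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a sorted index tuple: one int key plus a heap TID. */
typedef struct DemoTuple
{
    int         key;
    unsigned    tid;
} DemoTuple;

/*
 * Count the items written when runs of equal keys are merged into posting
 * lists holding at most max_tids TIDs each (max_tids must be >= 1).
 * Input must already be sorted by key, as it is after tuplesort.
 */
static size_t
demo_count_items(const DemoTuple *tuples, size_t ntuples, size_t max_tids)
{
    size_t      nitems = 0;
    size_t      pending = 0;    /* TIDs collected in the pending item */

    assert(max_tids >= 1);

    for (size_t i = 0; i < ntuples; i++)
    {
        /* Same key as the previous tuple and still room: extend the list. */
        if (pending > 0 && tuples[i].key == tuples[i - 1].key &&
            pending < max_tids)
        {
            pending++;
            continue;
        }

        /* Different key or full list: flush the pending item, start anew. */
        if (pending > 0)
            nitems++;
        pending = 1;
    }

    /* Handle the last pending item, like the final call in the loop above. */
    if (pending > 0)
        nitems++;

    return nitems;
}
```

With keys {1, 1, 1, 2, 3, 3}, a cap of 8 TIDs gives three items, while a cap of 2 forces the run of 1s to be split and gives four.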
diff --git a/src/backend/access/nbtree/nbtsplitloc.c b/src/backend/access/nbtree/nbtsplitloc.c
index a7882fd874..77e1d46672 100644
--- a/src/backend/access/nbtree/nbtsplitloc.c
+++ b/src/backend/access/nbtree/nbtsplitloc.c
@@ -459,6 +459,7 @@ _bt_recsplitloc(FindSplitData *state,
int16 leftfree,
rightfree;
Size firstrightitemsz;
+ Size postingsubhikey = 0;
bool newitemisfirstonright;
/* Is the new item going to be the first item on the right page? */
@@ -466,10 +467,33 @@ _bt_recsplitloc(FindSplitData *state,
&& !newitemonleft);
if (newitemisfirstonright)
+ {
firstrightitemsz = state->newitemsz;
+
+ /* Calculate posting list overhead, if any */
+ if (state->is_leaf && BTreeTupleIsPosting(state->newitem))
+ postingsubhikey = IndexTupleSize(state->newitem) -
+ BTreeTupleGetPostingOffset(state->newitem);
+ }
else
+ {
firstrightitemsz = firstoldonrightsz;
+ /* Calculate posting list overhead, if any */
+ if (state->is_leaf)
+ {
+ ItemId itemid;
+ IndexTuple newhighkey;
+
+ itemid = PageGetItemId(state->page, firstoldonright);
+ newhighkey = (IndexTuple) PageGetItem(state->page, itemid);
+
+ if (BTreeTupleIsPosting(newhighkey))
+ postingsubhikey = IndexTupleSize(newhighkey) -
+ BTreeTupleGetPostingOffset(newhighkey);
+ }
+ }
+
/* Account for all the old tuples */
leftfree = state->leftspace - olddataitemstoleft;
rightfree = state->rightspace -
@@ -492,9 +516,13 @@ _bt_recsplitloc(FindSplitData *state,
* adding a heap TID to the left half's new high key when splitting at the
* leaf level. In practice the new high key will often be smaller and
* will rarely be larger, but conservatively assume the worst case.
+ * Truncation always truncates away any posting list that appears in the
+ * first right tuple, though, so it's safe to subtract that overhead
+ * (while still conservatively assuming that truncation might have to add
+ * back a single heap TID using the pivot tuple heap TID representation).
*/
if (state->is_leaf)
- leftfree -= (int16) (firstrightitemsz +
+ leftfree -= (int16) ((firstrightitemsz - postingsubhikey) +
MAXALIGN(sizeof(ItemPointerData)));
else
leftfree -= (int16) firstrightitemsz;
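The adjusted leaf-level accounting in _bt_recsplitloc can be checked with a little arithmetic. The sketch below is only an illustration under stated assumptions: 8-byte MAXALIGN and a 6-byte ItemPointerData, with the hypothetical helper demo_leaf_hikey_space standing in for the expression subtracted from leftfree.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed values: 8-byte maximum alignment, 6-byte heap TID. */
#define DEMO_MAXALIGN(len)  (((size_t) (len) + 7) & ~(size_t) 7)
#define DEMO_TID_SIZE       ((size_t) 6)

/*
 * Worst-case space charged to the left page for its new high key.
 * posting_offset is only consulted when is_posting is nonzero; it plays
 * the role of BTreeTupleGetPostingOffset() for the first right tuple.
 */
static size_t
demo_leaf_hikey_space(size_t firstright_sz, int is_posting,
                      size_t posting_offset)
{
    size_t      postingsubhikey = 0;

    if (is_posting)
    {
        assert(posting_offset <= firstright_sz);
        /* Truncation always removes the posting list from the high key. */
        postingsubhikey = firstright_sz - posting_offset;
    }

    /* Still assume truncation may have to add back one heap TID. */
    return (firstright_sz - postingsubhikey) + DEMO_MAXALIGN(DEMO_TID_SIZE);
}
```

For a 64-byte posting tuple whose posting list starts at byte 16, the charge is 16 + 8 = 24 bytes, the same as for a plain 16-byte tuple, instead of 72 bytes without the subtraction.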
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 93fab264ae..75ba61c0c9 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -111,8 +111,12 @@ _bt_mkscankey(Relation rel, IndexTuple itup)
key->nextkey = false;
key->pivotsearch = false;
key->keysz = Min(indnkeyatts, tupnatts);
- key->scantid = key->heapkeyspace && itup ?
- BTreeTupleGetHeapTID(itup) : NULL;
+
+ if (itup && key->heapkeyspace)
+ key->scantid = BTreeTupleGetHeapTID(itup);
+ else
+ key->scantid = NULL;
+
skey = key->scankeys;
for (i = 0; i < indnkeyatts; i++)
{
@@ -1787,7 +1791,9 @@ _bt_killitems(IndexScanDesc scan)
ItemId iid = PageGetItemId(page, offnum);
IndexTuple ituple = (IndexTuple) PageGetItem(page, iid);
- if (ItemPointerEquals(&ituple->t_tid, &kitem->heapTid))
+ /* No microvacuum for posting tuples */
+ if (!BTreeTupleIsPosting(ituple) &&
+ (ItemPointerEquals(&ituple->t_tid, &kitem->heapTid)))
{
/* found the item */
ItemIdMarkDead(iid);
@@ -2145,6 +2151,16 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
pivot = index_truncate_tuple(itupdesc, firstright, keepnatts);
+ if (BTreeTupleIsPosting(firstright))
+ {
+ BTreeTupleClearBtIsPosting(pivot);
+ BTreeTupleSetNAtts(pivot, keepnatts);
+ pivot->t_info &= ~INDEX_SIZE_MASK;
+ pivot->t_info |= BTreeTupleGetPostingOffset(firstright);
+ }
+
+ Assert(!BTreeTupleIsPosting(pivot));
+
/*
* If there is a distinguishing key attribute within new pivot tuple,
* there is no need to add an explicit heap TID attribute
@@ -2161,6 +2177,8 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
* attribute to the new pivot tuple.
*/
Assert(natts != nkeyatts);
+ Assert(!BTreeTupleIsPosting(lastleft));
+ Assert(!BTreeTupleIsPosting(firstright));
newsize = IndexTupleSize(pivot) + MAXALIGN(sizeof(ItemPointerData));
tidpivot = palloc0(newsize);
memcpy(tidpivot, pivot, IndexTupleSize(pivot));
@@ -2168,6 +2186,27 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
pfree(pivot);
pivot = tidpivot;
}
+ else if (BTreeTupleIsPosting(firstright))
+ {
+ /*
+ * No truncation was possible, since key attributes are all equal. But
+ * the tuple is a compressed tuple with a posting list, so we still
+ * must truncate it.
+ *
+ * It's necessary to add a heap TID attribute to the new pivot tuple.
+ */
+ newsize = BTreeTupleGetPostingOffset(firstright) +
+ MAXALIGN(sizeof(ItemPointerData));
+ pivot = palloc0(newsize);
+ memcpy(pivot, firstright, BTreeTupleGetPostingOffset(firstright));
+
+ pivot->t_info &= ~INDEX_SIZE_MASK;
+ pivot->t_info |= newsize;
+ BTreeTupleClearBtIsPosting(pivot);
+ BTreeTupleSetAltHeapTID(pivot);
+
+ Assert(!BTreeTupleIsPosting(pivot));
+ }
else
{
/*
@@ -2205,7 +2244,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
*/
pivotheaptid = (ItemPointer) ((char *) pivot + newsize -
sizeof(ItemPointerData));
- ItemPointerCopy(&lastleft->t_tid, pivotheaptid);
+ ItemPointerCopy(BTreeTupleGetMaxTID(lastleft), pivotheaptid);
/*
* Lehman and Yao require that the downlink to the right page, which is to
@@ -2216,9 +2255,12 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
* tiebreaker.
*/
#ifndef DEBUG_NO_TRUNCATE
- Assert(ItemPointerCompare(&lastleft->t_tid, &firstright->t_tid) < 0);
- Assert(ItemPointerCompare(pivotheaptid, &lastleft->t_tid) >= 0);
- Assert(ItemPointerCompare(pivotheaptid, &firstright->t_tid) < 0);
+ Assert(ItemPointerCompare(BTreeTupleGetMaxTID(lastleft),
+ BTreeTupleGetHeapTID(firstright)) < 0);
+ Assert(ItemPointerCompare(pivotheaptid,
+ BTreeTupleGetHeapTID(lastleft)) >= 0);
+ Assert(ItemPointerCompare(pivotheaptid,
+ BTreeTupleGetHeapTID(firstright)) < 0);
#else
/*
@@ -2231,7 +2273,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
* attribute values along with lastleft's heap TID value when lastleft's
* TID happens to be greater than firstright's TID.
*/
- ItemPointerCopy(&firstright->t_tid, pivotheaptid);
+ ItemPointerCopy(BTreeTupleGetHeapTID(firstright), pivotheaptid);
/*
* Pivot heap TID should never be fully equal to firstright. Note that
@@ -2240,7 +2282,8 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
*/
ItemPointerSetOffsetNumber(pivotheaptid,
OffsetNumberPrev(ItemPointerGetOffsetNumber(pivotheaptid)));
- Assert(ItemPointerCompare(pivotheaptid, &firstright->t_tid) < 0);
+ Assert(ItemPointerCompare(pivotheaptid,
+ BTreeTupleGetHeapTID(firstright)) < 0);
#endif
BTreeTupleSetNAtts(pivot, nkeyatts);
@@ -2330,6 +2373,25 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
* leaving excessive amounts of free space on either side of page split.
* Callers can rely on the fact that attributes considered equal here are
* definitely also equal according to _bt_keep_natts.
+ *
+ * To build a posting tuple we need to ensure that all attributes
+ * of both tuples are equal. Use this function to compare them.
+ * TODO: maybe it's worth renaming the function.
+ *
+ * XXX: Obviously we need infrastructure for making sure it is okay to use
+ * this for posting list stuff. For example, non-deterministic collations
+ * cannot use compression, and will not work with what we have now.
+ *
+ * XXX: Even then, we probably also need to worry about TOAST as a special
+ * case. Don't repeat bugs like the amcheck bug that was fixed in commit
+ * eba775345d23d2c999bbb412ae658b6dab36e3e8. As the test case added in that
+ * commit shows, we need to worry about pg_attribute.attstorage changing in
+ * the underlying table due to an ALTER TABLE (and maybe a few other things
+ * like that). In general, the "TOAST input state" of a TOASTable datum isn't
+ * something that we make many guarantees about today, so even with C
+ * collation text we could in theory get different answers from
+ * _bt_keep_natts_fast() and _bt_keep_natts(). This needs to be nailed down
+ * in some way.
*/
int
_bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
@@ -2415,7 +2477,7 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
* Non-pivot tuples currently never use alternative heap TID
* representation -- even those within heapkeyspace indexes
*/
- if ((itup->t_info & INDEX_ALT_TID_MASK) != 0)
+ if (BTreeTupleIsPivot(itup))
return false;
/*
@@ -2470,7 +2532,7 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
* that to decide if the tuple is a pre-v11 tuple.
*/
return tupnatts == 0 ||
- ((itup->t_info & INDEX_ALT_TID_MASK) == 0 &&
+ (!BTreeTupleIsPivot(itup) &&
ItemPointerGetOffsetNumber(&(itup->t_tid)) == P_HIKEY);
}
else
@@ -2497,7 +2559,7 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
* heapkeyspace index pivot tuples, regardless of whether or not there are
* non-key attributes.
*/
- if ((itup->t_info & INDEX_ALT_TID_MASK) == 0)
+ if (!BTreeTupleIsPivot(itup))
return false;
/*
@@ -2549,6 +2611,8 @@ _bt_check_third_page(Relation rel, Relation heap, bool needheaptidspace,
if (!needheaptidspace && itemsz <= BTMaxItemSizeNoHeapTid(page))
return;
+ /* TODO correct error messages for posting tuples */
+
/*
* Internal page insertions cannot fail here, because that would mean that
* an earlier leaf level insertion that should have failed didn't
@@ -2575,3 +2639,79 @@ _bt_check_third_page(Relation rel, Relation heap, bool needheaptidspace,
"or use full text indexing."),
errtableconstraint(heap, RelationGetRelationName(rel))));
}
+
+/*
+ * Given a basic tuple that contains the key datum, and a posting list,
+ * build a posting tuple.
+ *
+ * The basic tuple can itself be a posting tuple, but only its key part is
+ * used; all ItemPointers must be passed via ipd.
+ *
+ * If nipd == 1, fall back to building a non-posting tuple. This avoids
+ * storage overhead after a posting tuple has been vacuumed down to one TID.
+ */
+IndexTuple
+BTreeFormPostingTuple(IndexTuple tuple, ItemPointerData *ipd, int nipd)
+{
+ uint32 keysize,
+ newsize = 0;
+ IndexTuple itup;
+
+ /* We only need key part of the tuple */
+ if (BTreeTupleIsPosting(tuple))
+ keysize = BTreeTupleGetPostingOffset(tuple);
+ else
+ keysize = IndexTupleSize(tuple);
+
+ Assert(nipd > 0);
+
+ /* Add space needed for posting list */
+ if (nipd > 1)
+ newsize = SHORTALIGN(keysize) + sizeof(ItemPointerData) * nipd;
+ else
+ newsize = keysize;
+
+ newsize = MAXALIGN(newsize);
+ itup = palloc0(newsize);
+ memcpy(itup, tuple, keysize);
+ itup->t_info &= ~INDEX_SIZE_MASK;
+ itup->t_info |= newsize;
+
+ if (nipd > 1)
+ {
+ /* Form posting tuple, fill posting fields */
+
+ /* Set meta info about the posting list */
+ itup->t_info |= INDEX_ALT_TID_MASK;
+ BTreeSetPostingMeta(itup, nipd, SHORTALIGN(keysize));
+
+ /* sort the list to preserve TID order invariant */
+ qsort((void *) ipd, nipd, sizeof(ItemPointerData),
+ (int (*) (const void *, const void *)) ItemPointerCompare);
+
+ /* Copy posting list into the posting tuple */
+ memcpy(BTreeTupleGetPosting(itup), ipd,
+ sizeof(ItemPointerData) * nipd);
+ }
+ else
+ {
+ /* To finish building of a non-posting tuple, copy TID from ipd */
+ itup->t_info &= ~INDEX_ALT_TID_MASK;
+ ItemPointerCopy(ipd, &itup->t_tid);
+ }
+
+ return itup;
+}
+
+/*
+ * Opposite of BTreeFormPostingTuple.
+ * Returns a regular (non-posting) tuple that contains the key; its TID is
+ * the nth TID of the original tuple's posting list.
+ * The result is palloc'd in the caller's memory context.
+ */
+IndexTuple
+BTreeGetNthTupleOfPosting(IndexTuple tuple, int n)
+{
+ Assert(BTreeTupleIsPosting(tuple));
+ return BTreeFormPostingTuple(tuple, BTreeTupleGetPostingN(tuple, n), 1);
+}
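The size computation in BTreeFormPostingTuple above can be reproduced in isolation. This is a rough sketch, assuming 2-byte SHORTALIGN, 8-byte MAXALIGN and a 6-byte ItemPointerData; demo_posting_tuple_size is a hypothetical name, not part of the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed alignment rules and TID size; the real ones live in c.h and itemptr.h. */
#define DEMO_SHORTALIGN(len) (((size_t) (len) + 1) & ~(size_t) 1)
#define DEMO_MAXALIGN(len)   (((size_t) (len) + 7) & ~(size_t) 7)
#define DEMO_TID_SIZE        ((size_t) 6)

/*
 * Size of the tuple formed for a key part of keysize bytes and nipd TIDs.
 * With a single TID the result is a plain tuple: the TID lives in t_tid,
 * so no posting list space is added.
 */
static size_t
demo_posting_tuple_size(size_t keysize, size_t nipd)
{
    assert(nipd > 0);

    if (nipd > 1)
        return DEMO_MAXALIGN(DEMO_SHORTALIGN(keysize) + DEMO_TID_SIZE * nipd);

    return DEMO_MAXALIGN(keysize);
}
```

Under these assumptions, three duplicates of a 16-byte tuple take 40 bytes as one posting tuple, compared with 48 bytes (plus two extra line pointers) as three separate tuples.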
diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index dd5315c1aa..538a6bc8a7 100644
--- a/src/backend/access/nbtree/nbtxlog.c
+++ b/src/backend/access/nbtree/nbtxlog.c
@@ -386,8 +386,8 @@ btree_xlog_vacuum(XLogReaderState *record)
Buffer buffer;
Page page;
BTPageOpaque opaque;
-#ifdef UNUSED
xl_btree_vacuum *xlrec = (xl_btree_vacuum *) XLogRecGetData(record);
+#ifdef UNUSED
/*
* This section of code is thought to be no longer needed, after analysis
@@ -478,14 +478,34 @@ btree_xlog_vacuum(XLogReaderState *record)
if (len > 0)
{
- OffsetNumber *unused;
- OffsetNumber *unend;
+ if (xlrec->nremaining)
+ {
+ OffsetNumber *remainingoffset;
+ IndexTuple remaining;
+ Size itemsz;
- unused = (OffsetNumber *) ptr;
- unend = (OffsetNumber *) ((char *) ptr + len);
+ remainingoffset = (OffsetNumber *)
+ (ptr + xlrec->ndeleted * sizeof(OffsetNumber));
+ remaining = (IndexTuple) ((char *) remainingoffset +
+ xlrec->nremaining * sizeof(OffsetNumber));
- if ((unend - unused) > 0)
- PageIndexMultiDelete(page, unused, unend - unused);
+ /* Handle posting tuples */
+ for (int i = 0; i < xlrec->nremaining; i++)
+ {
+ PageIndexTupleDelete(page, remainingoffset[i]);
+
+ itemsz = MAXALIGN(IndexTupleSize(remaining));
+
+ if (PageAddItem(page, (Item) remaining, itemsz, remainingoffset[i],
+ false, false) == InvalidOffsetNumber)
+ elog(PANIC, "btree_xlog_vacuum: failed to add remaining item");
+
+ remaining = (IndexTuple) ((char *) remaining + itemsz);
+ }
+ }
+
+ if (xlrec->ndeleted)
+ PageIndexMultiDelete(page, (OffsetNumber *) ptr, xlrec->ndeleted);
}
/*
diff --git a/src/backend/access/rmgrdesc/nbtdesc.c b/src/backend/access/rmgrdesc/nbtdesc.c
index a14eb792ec..e4fa99ad27 100644
--- a/src/backend/access/rmgrdesc/nbtdesc.c
+++ b/src/backend/access/rmgrdesc/nbtdesc.c
@@ -46,8 +46,10 @@ btree_desc(StringInfo buf, XLogReaderState *record)
{
xl_btree_vacuum *xlrec = (xl_btree_vacuum *) rec;
- appendStringInfo(buf, "lastBlockVacuumed %u",
- xlrec->lastBlockVacuumed);
+ appendStringInfo(buf, "lastBlockVacuumed %u; nremaining %u; ndeleted %u",
+ xlrec->lastBlockVacuumed,
+ xlrec->nremaining,
+ xlrec->ndeleted);
break;
}
case XLOG_BTREE_DELETE:
diff --git a/src/include/access/itup.h b/src/include/access/itup.h
index 744ffb6c61..b10c0d5255 100644
--- a/src/include/access/itup.h
+++ b/src/include/access/itup.h
@@ -141,6 +141,10 @@ typedef IndexAttributeBitMapData * IndexAttributeBitMap;
* On such a page, N tuples could take one MAXALIGN quantum less space than
* estimated here, seemingly allowing one more tuple than estimated here.
* But such a page always has at least MAXALIGN special space, so we're safe.
+ *
+ * Note: btree leaf pages may contain posting tuples, which store duplicates
+ * more compactly, so such pages may reference more heap tuples than estimated
+ * here. Use MaxPostingIndexTuplesPerPage for them instead.
*/
#define MaxIndexTuplesPerPage \
((int) ((BLCKSZ - SizeOfPageHeaderData) / \
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 83e0e6c28e..bacc77b258 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -234,8 +234,7 @@ typedef struct BTMetaPageData
* t_tid | t_info | key values | INCLUDE columns, if any
*
* t_tid points to the heap TID, which is a tiebreaker key column as of
- * BTREE_VERSION 4. Currently, the INDEX_ALT_TID_MASK status bit is never
- * set for non-pivot tuples.
+ * BTREE_VERSION 4.
*
* All other types of index tuples ("pivot" tuples) only have key columns,
* since pivot tuples only exist to represent how the key space is
@@ -252,6 +251,39 @@ typedef struct BTMetaPageData
* omitted rather than truncated, since its representation is different to
* the non-pivot representation.)
*
+ * Non-pivot posting tuple format:
+ * t_tid | t_info | key values | INCLUDE columns, if any | posting_list[]
+ *
+ * In order to store duplicate keys more efficiently,
+ * we use a special tuple format: posting tuples.
+ * posting_list is an array of ItemPointerData.
+ *
+ * This type of compression never applies to system indexes, unique indexes
+ * or indexes with INCLUDEd columns.
+ *
+ * To distinguish posting tuples we use the INDEX_ALT_TID_MASK flag in t_info
+ * and the BT_IS_POSTING flag in t_tid.
+ * These flags redefine the content of the posting tuple's t_tid:
+ * - t_tid.ip_blkid contains the offset of the posting list.
+ * - t_tid offset field contains the number of posting items this tuple contains.
+ *
+ * The 12 least significant offset bits from t_tid are used to represent
+ * the number of posting items in posting tuples, leaving 4 status
+ * bits (BT_RESERVED_OFFSET_MASK bits), 3 of which that are reserved for
+ * future use.
+ * BT_N_POSTING_OFFSET_MASK is large enough to store any number of posting
+ * items, which is constrained by BTMaxItemSize.
+ *
+ * If a page contains so many duplicates that they do not fit into one posting
+ * tuple (bounded by BTMaxItemSize), the page may contain several posting
+ * tuples with the same key.
+ * A page can also contain both posting and non-posting tuples with the same key.
+ * Currently, posting tuples always contain at least two TIDs in the posting
+ * list.
+ *
+ * Posting tuples always have the same number of attributes as the index has
+ * generally.
+ *
* Pivot tuple format:
*
* t_tid | t_info | key values | [heap TID]
@@ -281,23 +313,144 @@ typedef struct BTMetaPageData
* bits (BT_RESERVED_OFFSET_MASK bits), 3 of which that are reserved for
* future use. BT_N_KEYS_OFFSET_MASK should be large enough to store any
* number of columns/attributes <= INDEX_MAX_KEYS.
+ * The BT_IS_POSTING bit must be unset for pivot tuples, since we use it
+ * to distinguish posting tuples from pivot tuples.
*
* Note well: The macros that deal with the number of attributes in tuples
- * assume that a tuple with INDEX_ALT_TID_MASK set must be a pivot tuple,
- * and that a tuple without INDEX_ALT_TID_MASK set must be a non-pivot
- * tuple (or must have the same number of attributes as the index has
- * generally in the case of !heapkeyspace indexes). They will need to be
- * updated if non-pivot tuples ever get taught to use INDEX_ALT_TID_MASK
- * for something else.
+ * assume that a tuple with INDEX_ALT_TID_MASK set must be a pivot tuple or
+ * non-pivot posting tuple, and that a tuple without INDEX_ALT_TID_MASK set
+ * must be a non-pivot tuple (or must have the same number of attributes as
+ * the index has generally in the case of !heapkeyspace indexes).
*/
#define INDEX_ALT_TID_MASK INDEX_AM_RESERVED_BIT
/* Item pointer offset bits */
#define BT_RESERVED_OFFSET_MASK 0xF000
#define BT_N_KEYS_OFFSET_MASK 0x0FFF
+#define BT_N_POSTING_OFFSET_MASK 0x0FFF
#define BT_HEAP_TID_ATTR 0x1000
+#define BT_IS_POSTING 0x2000
-/* Get/set downlink block number */
+#define BTreeTupleIsPosting(itup) \
+ ( \
+ ((itup)->t_info & INDEX_ALT_TID_MASK && \
+ ((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) != 0))\
+ )
+
+#define BTreeTupleIsPivot(itup) \
+ ( \
+ ((itup)->t_info & INDEX_ALT_TID_MASK && \
+ ((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) == 0))\
+ )
+
+/*
+ * MaxPostingIndexTuplesPerPage is an upper bound on the number of tuples
+ * that can fit on one btree leaf page.
+ *
+ * Btree leaf pages may contain posting tuples, which store duplicates
+ * more compactly, so MaxPostingIndexTuplesPerPage is larger than
+ * MaxIndexTuplesPerPage.
+ *
+ * Each leaf page must contain at least three items, so estimate it as
+ * if we have three posting tuples with minimal size keys.
+ */
+#define MaxPostingIndexTuplesPerPage \
+ ((int) ((BLCKSZ - SizeOfPageHeaderData - \
+ 3*((MAXALIGN(sizeof(IndexTupleData) + 1) + sizeof(ItemIdData))) )) / \
+ (sizeof(ItemPointerData)))
+
+/*
+ * Btree-private state needed to build posting tuples.
+ * ipd is a posting list - an array of ItemPointerData.
+ *
+ * While iterating over tuples during an index build, or while compressing a
+ * single page, we remember a tuple in itupprev and compare the next one
+ * with it. If the tuples are equal, their TIDs are saved in the posting list.
+ * ntuples contains the number of TIDs in the posting list.
+ *
+ * Use maxitemsize and maxpostingsize to ensure that resulting posting tuple
+ * will satisfy BTMaxItemSize.
+ */
+typedef struct BTCompressState
+{
+ Size maxitemsize;
+ Size maxpostingsize;
+ IndexTuple itupprev;
+ int ntuples;
+ ItemPointerData *ipd;
+} BTCompressState;
+
+/* macros to work with posting tuples *BEGIN* */
+#define BTreeTupleSetBtIsPosting(itup) \
+ do { \
+ Assert((itup)->t_info & INDEX_ALT_TID_MASK); \
+ Assert(!((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) != 0)); \
+ ItemPointerSetOffsetNumber(&(itup)->t_tid, \
+ ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) | BT_IS_POSTING); \
+ } while(0)
+
+#define BTreeTupleClearBtIsPosting(itup) \
+ do { \
+ ItemPointerSetOffsetNumber(&(itup)->t_tid, \
+ ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & ~BT_IS_POSTING); \
+ } while(0)
+
+#define BTreeTupleGetNPosting(itup) \
+ ( \
+ AssertMacro(BTreeTupleIsPosting(itup)), \
+ ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_N_POSTING_OFFSET_MASK \
+ )
+
+#define BTreeTupleSetNPosting(itup, n) \
+ do { \
+ ItemPointerSetOffsetNumber(&(itup)->t_tid, (n) & BT_N_POSTING_OFFSET_MASK); \
+ BTreeTupleSetBtIsPosting(itup); \
+ } while(0)
+
+/*
+ * If tuple is posting, t_tid.ip_blkid contains offset of the posting list.
+ * Caller is responsible for checking BTreeTupleIsPosting to ensure that it
+ * will get what is expected.
+ */
+#define BTreeTupleGetPostingOffset(itup) \
+ ( \
+ AssertMacro(BTreeTupleIsPosting(itup)), \
+ ItemPointerGetBlockNumberNoCheck(&((itup)->t_tid)) \
+ )
+#define BTreeTupleSetPostingOffset(itup, offset) \
+ ( \
+ AssertMacro(BTreeTupleIsPosting(itup)), \
+ ItemPointerSetBlockNumber(&((itup)->t_tid), (offset)) \
+ )
+#define BTreeSetPostingMeta(itup, nposting, off) \
+ do { \
+ BTreeTupleSetNPosting(itup, nposting); \
+ BTreeTupleSetPostingOffset(itup, off); \
+ } while(0)
+
+#define BTreeTupleGetPosting(itup) \
+ ((ItemPointerData*) ((char*)(itup) + BTreeTupleGetPostingOffset(itup)))
+#define BTreeTupleGetPostingN(itup,n) \
+ ((ItemPointerData*) (BTreeTupleGetPosting(itup) + (n)))
+
+/*
+ * Posting tuples always contain more than one TID. The minimum TID can be
+ * accessed using BTreeTupleGetHeapTID(). The maximum is accessed using
+ * BTreeTupleGetMaxTID().
+ */
+#define BTreeTupleGetMaxTID(itup) \
+ ( \
+ ((itup)->t_info & INDEX_ALT_TID_MASK && \
+ ((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING))) ? \
+ ( \
+ (ItemPointer) (BTreeTupleGetPosting(itup) + (BTreeTupleGetNPosting(itup)-1)) \
+ ) \
+ : \
+ (ItemPointer) &((itup)->t_tid) \
+ )
+/* macros to work with posting tuples *END* */
+
+/* Get/set downlink block number */
#define BTreeInnerTupleGetDownLink(itup) \
ItemPointerGetBlockNumberNoCheck(&((itup)->t_tid))
#define BTreeInnerTupleSetDownLink(itup, blkno) \
@@ -326,7 +479,8 @@ typedef struct BTMetaPageData
*/
#define BTreeTupleGetNAtts(itup, rel) \
( \
- (itup)->t_info & INDEX_ALT_TID_MASK ? \
+ ((itup)->t_info & INDEX_ALT_TID_MASK && \
+ ((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) == 0)) ? \
( \
ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_N_KEYS_OFFSET_MASK \
) \
@@ -335,6 +489,7 @@ typedef struct BTMetaPageData
)
#define BTreeTupleSetNAtts(itup, n) \
do { \
+ Assert(!BTreeTupleIsPosting(itup)); \
(itup)->t_info |= INDEX_ALT_TID_MASK; \
ItemPointerSetOffsetNumber(&(itup)->t_tid, (n) & BT_N_KEYS_OFFSET_MASK); \
} while(0)
@@ -342,6 +497,8 @@ typedef struct BTMetaPageData
/*
* Get tiebreaker heap TID attribute, if any. Macro works with both pivot
* and non-pivot tuples, despite differences in how heap TID is represented.
+ *
+ * For non-pivot posting tuples this returns the first tid from posting list.
*/
#define BTreeTupleGetHeapTID(itup) \
( \
@@ -351,7 +508,10 @@ typedef struct BTMetaPageData
(ItemPointer) (((char *) (itup) + IndexTupleSize(itup)) - \
sizeof(ItemPointerData)) \
) \
- : (itup)->t_info & INDEX_ALT_TID_MASK ? NULL : (ItemPointer) &((itup)->t_tid) \
+ : (itup)->t_info & INDEX_ALT_TID_MASK ? \
+ (((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) != 0) ? \
+ (ItemPointer) BTreeTupleGetPosting(itup) : NULL) \
+ : (ItemPointer) &((itup)->t_tid) \
)
/*
* Set the heap TID attribute for a tuple that uses the INDEX_ALT_TID_MASK
@@ -360,6 +520,7 @@ typedef struct BTMetaPageData
#define BTreeTupleSetAltHeapTID(itup) \
do { \
Assert((itup)->t_info & INDEX_ALT_TID_MASK); \
+ Assert(!((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) != 0)); \
ItemPointerSetOffsetNumber(&(itup)->t_tid, \
ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) | BT_HEAP_TID_ATTR); \
} while(0)
@@ -500,6 +661,12 @@ typedef struct BTInsertStateData
/* Buffer containing leaf page we're likely to insert itup on */
Buffer buf;
+ /*
+ * If _bt_binsrch_insert() found the insert location inside an existing
+ * posting list, save the position inside the list.
+ */
+ int in_posting_offset;
+
/*
* Cache of bounds within the current buffer. Only used for insertions
* where _bt_check_unique is called. See _bt_binsrch_insert and
@@ -567,6 +734,8 @@ typedef struct BTScanPosData
* location in the associated tuple storage workspace.
*/
int nextTupleOffset;
+ /* prevTupleOffset is for posting list handling */
+ int prevTupleOffset;
/*
* The items array is always ordered in index order (ie, increasing
@@ -579,7 +748,7 @@ typedef struct BTScanPosData
int lastItem; /* last valid index in items[] */
int itemIndex; /* current index in items[] */
- BTScanPosItem items[MaxIndexTuplesPerPage]; /* MUST BE LAST */
+ BTScanPosItem items[MaxPostingIndexTuplesPerPage]; /* MUST BE LAST */
} BTScanPosData;
typedef BTScanPosData *BTScanPos;
@@ -763,6 +932,8 @@ extern void _bt_delitems_delete(Relation rel, Buffer buf,
OffsetNumber *itemnos, int nitems, Relation heapRel);
extern void _bt_delitems_vacuum(Relation rel, Buffer buf,
OffsetNumber *itemnos, int nitems,
+ OffsetNumber *remainingoffset,
+ IndexTuple *remaining, int nremaining,
BlockNumber lastBlockVacuumed);
extern int _bt_pagedel(Relation rel, Buffer buf);
@@ -775,6 +946,8 @@ extern Buffer _bt_moveright(Relation rel, BTScanInsert key, Buffer buf,
bool forupdate, BTStack stack, int access, Snapshot snapshot);
extern OffsetNumber _bt_binsrch_insert(Relation rel, BTInsertState insertstate);
extern int32 _bt_compare(Relation rel, BTScanInsert key, Page page, OffsetNumber offnum);
+extern int32 _bt_compare_posting(Relation rel, BTScanInsert key, Page page,
+ OffsetNumber offnum, int *in_posting_offset);
extern bool _bt_first(IndexScanDesc scan, ScanDirection dir);
extern bool _bt_next(IndexScanDesc scan, ScanDirection dir);
extern Buffer _bt_get_endpoint(Relation rel, uint32 level, bool rightmost,
@@ -813,6 +986,9 @@ extern bool _bt_check_natts(Relation rel, bool heapkeyspace, Page page,
OffsetNumber offnum);
extern void _bt_check_third_page(Relation rel, Relation heap,
bool needheaptidspace, Page page, IndexTuple newtup);
+extern IndexTuple BTreeFormPostingTuple(IndexTuple tuple, ItemPointerData *ipd,
+ int nipd);
+extern IndexTuple BTreeGetNthTupleOfPosting(IndexTuple tuple, int n);
/*
* prototypes for functions in nbtvalidate.c
@@ -825,5 +1001,7 @@ extern bool btvalidate(Oid opclassoid);
extern IndexBuildResult *btbuild(Relation heap, Relation index,
struct IndexInfo *indexInfo);
extern void _bt_parallel_build_main(dsm_segment *seg, shm_toc *toc);
+extern void _bt_add_posting_item(BTCompressState *compressState,
+ IndexTuple itup);
#endif /* NBTREE_H */
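The way the nbtree.h changes above reuse t_tid for posting tuples can be sketched with a simplified model. The three mask values are copied from the patch; everything else (DemoTuple, the alt_tid field standing in for INDEX_ALT_TID_MASK in t_info, and the demo_* helpers) is hypothetical and only mirrors the logic of BTreeSetPostingMeta, BTreeTupleIsPosting and BTreeTupleIsPivot.

```c
#include <assert.h>
#include <stdint.h>

/* Mask values as defined in the patch. */
#define DEMO_BT_N_POSTING_OFFSET_MASK 0x0FFF
#define DEMO_BT_HEAP_TID_ATTR         0x1000
#define DEMO_BT_IS_POSTING            0x2000

/* Hypothetical, flattened view of the fields the macros look at. */
typedef struct DemoTuple
{
    int         alt_tid;    /* stands in for INDEX_ALT_TID_MASK in t_info */
    uint32_t    blkno;      /* t_tid block number: posting list offset */
    uint16_t    offset;     /* t_tid offset: status bits + item count */
} DemoTuple;

/* Mark a tuple as posting and record list length and list offset. */
static void
demo_set_posting_meta(DemoTuple *tup, unsigned nposting,
                      uint32_t posting_offset)
{
    tup->alt_tid = 1;
    tup->offset = (uint16_t) ((nposting & DEMO_BT_N_POSTING_OFFSET_MASK) |
                              DEMO_BT_IS_POSTING);
    tup->blkno = posting_offset;
}

/* Posting tuple: alternative TID representation with BT_IS_POSTING set. */
static int
demo_is_posting(const DemoTuple *tup)
{
    return tup->alt_tid && (tup->offset & DEMO_BT_IS_POSTING) != 0;
}

/* Pivot tuple: alternative TID representation with BT_IS_POSTING unset. */
static int
demo_is_pivot(const DemoTuple *tup)
{
    return tup->alt_tid && (tup->offset & DEMO_BT_IS_POSTING) == 0;
}

/* Number of TIDs in the posting list (low 12 bits of the offset field). */
static unsigned
demo_n_posting(const DemoTuple *tup)
{
    assert(demo_is_posting(tup));
    return tup->offset & DEMO_BT_N_POSTING_OFFSET_MASK;
}
```

Since the count occupies 12 bits, this encoding can describe at most 4095 TIDs per posting tuple; a tuple without the alternative-TID flag is neither pivot nor posting in this model.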
diff --git a/src/include/access/nbtxlog.h b/src/include/access/nbtxlog.h
index afa614da25..4b615e0d36 100644
--- a/src/include/access/nbtxlog.h
+++ b/src/include/access/nbtxlog.h
@@ -173,10 +173,19 @@ typedef struct xl_btree_vacuum
{
BlockNumber lastBlockVacuumed;
- /* TARGET OFFSET NUMBERS FOLLOW */
+ /*
+ * These counts let redo locate the remaining (rewritten) posting tuples,
+ * which follow the two arrays of offset numbers.
+ */
+ uint32 nremaining;
+ uint32 ndeleted;
+
+ /* TARGET OFFSET NUMBERS FOLLOW (ndeleted values, if any) */
+ /* REMAINING OFFSET NUMBERS FOLLOW (nremaining values) */
+ /* REMAINING TUPLES TO INSERT FOLLOW (if nremaining > 0) */
} xl_btree_vacuum;
-#define SizeOfBtreeVacuum (offsetof(xl_btree_vacuum, lastBlockVacuumed) + sizeof(BlockNumber))
+#define SizeOfBtreeVacuum (offsetof(xl_btree_vacuum, ndeleted) + sizeof(uint32))
/*
* This is what we need to know about marking an empty branch for deletion.
--
2.17.1
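As a reading aid for the xl_btree_vacuum changes in this patch, the payload layout that btree_xlog_vacuum walks can be sketched as follows. This is inferred from the pointer arithmetic in the redo routine only (the code that assembles the record is not shown here), and DemoVacuumLayout and demo_vacuum_layout are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* OffsetNumber is a 16-bit value in PostgreSQL; assumed here. */
typedef uint16_t DemoOffsetNumber;

/* Byte offsets of the three payload sections, relative to the payload start. */
typedef struct DemoVacuumLayout
{
    size_t      deleted_offsets;    /* ndeleted offset numbers to delete */
    size_t      remaining_offsets;  /* nremaining offsets of rewritten tuples */
    size_t      remaining_tuples;   /* the rewritten posting tuples themselves */
} DemoVacuumLayout;

/* Compute where each section starts, mirroring the redo pointer arithmetic. */
static DemoVacuumLayout
demo_vacuum_layout(uint32_t ndeleted, uint32_t nremaining)
{
    DemoVacuumLayout layout;

    layout.deleted_offsets = 0;
    layout.remaining_offsets = (size_t) ndeleted * sizeof(DemoOffsetNumber);
    layout.remaining_tuples = layout.remaining_offsets +
        (size_t) nremaining * sizeof(DemoOffsetNumber);

    return layout;
}
```

For example, with ndeleted = 3 and nremaining = 2 the rewritten tuples start at byte 10 of the payload, so under this reading the tuple data is not necessarily MAXALIGNed within the record.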
v5-0003-DEBUG-Add-pageinspect-instrumentation.patch
From 20f251e1c3fb9da636f0844f3db9406a2090d548 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Mon, 10 Sep 2018 19:53:51 -0700
Subject: [PATCH v5 3/3] DEBUG: Add pageinspect instrumentation.
Have pageinspect display user-visible attribute values.
This patch is not proposed for inclusion in PostgreSQL; it's included
for the convenience of reviewers.
The following query can be used with this hacked pageinspect, which
visualizes the internal pages:
"""
with recursive index_details as (
select
'my_test_index'::text idx
),
size_in_pages_index as (
select
(pg_relation_size(idx::regclass) / (2^13))::int4 size_pages
from
index_details
),
page_stats as (
select
index_details.*,
stats.*
from
index_details,
size_in_pages_index,
lateral (select i from generate_series(1, size_pages - 1) i) series,
lateral (select * from bt_page_stats(idx, i)) stats),
internal_page_stats as (
select
*
from
page_stats
where
type != 'l'),
meta_stats as (
select
*
from
index_details s,
lateral (select * from bt_metap(s.idx)) meta),
internal_items as (
select
*
from
internal_page_stats
order by
btpo desc),
-- XXX: Note ordering dependency within this CTE, on internal_items
ordered_internal_items(item, blk, level) as (
select
1,
blkno,
btpo
from
internal_items
where
btpo_prev = 0
and btpo = (select level from meta_stats)
union
select
case when level = btpo then o.item + 1 else 1 end,
blkno,
btpo
from
internal_items i,
ordered_internal_items o
where
i.btpo_prev = o.blk or (btpo_prev = 0 and btpo = o.level - 1)
)
select
--idx,
btpo as level,
item as l_item,
blkno,
--btpo_prev,
--btpo_next,
btpo_flags,
type,
live_items,
dead_items,
avg_item_size,
page_size,
free_size,
-- Only non-rightmost pages have high key. Show heap TID for both pivot and non-pivot tuples here.
case when btpo_next != 0 then (select data || coalesce(', (htid)=(''' || htid || ''')', '')
from bt_page_items(idx, blkno) where itemoffset = 1) end as highkey
from
ordered_internal_items o
join internal_items i on o.blk = i.blkno
order by btpo desc, item;
"""
---
contrib/pageinspect/btreefuncs.c | 63 +++++++++++++++----
contrib/pageinspect/expected/btree.out | 3 +-
contrib/pageinspect/pageinspect--1.6--1.7.sql | 22 +++++++
3 files changed, 74 insertions(+), 14 deletions(-)
diff --git a/contrib/pageinspect/btreefuncs.c b/contrib/pageinspect/btreefuncs.c
index 8d27c9b0f6..64423283a6 100644
--- a/contrib/pageinspect/btreefuncs.c
+++ b/contrib/pageinspect/btreefuncs.c
@@ -29,6 +29,7 @@
#include "pageinspect.h"
+#include "access/genam.h"
#include "access/nbtree.h"
#include "access/relation.h"
#include "catalog/namespace.h"
@@ -243,6 +244,7 @@ bt_page_stats(PG_FUNCTION_ARGS)
*/
struct user_args
{
+ Relation rel;
Page page;
OffsetNumber offset;
};
@@ -254,9 +256,9 @@ struct user_args
* ------------------------------------------------------
*/
static Datum
-bt_page_print_tuples(FuncCallContext *fctx, Page page, OffsetNumber offset)
+bt_page_print_tuples(FuncCallContext *fctx, Page page, OffsetNumber offset, Relation rel)
{
- char *values[6];
+ char *values[7];
HeapTuple tuple;
ItemId id;
IndexTuple itup;
@@ -265,6 +267,7 @@ bt_page_print_tuples(FuncCallContext *fctx, Page page, OffsetNumber offset)
int dlen;
char *dump;
char *ptr;
+ ItemPointer htid;
id = PageGetItemId(page, offset);
@@ -283,16 +286,49 @@ bt_page_print_tuples(FuncCallContext *fctx, Page page, OffsetNumber offset)
values[j++] = psprintf("%c", IndexTupleHasVarwidths(itup) ? 't' : 'f');
ptr = (char *) itup + IndexInfoFindDataOffset(itup->t_info);
- dlen = IndexTupleSize(itup) - IndexInfoFindDataOffset(itup->t_info);
- dump = palloc0(dlen * 3 + 1);
- values[j] = dump;
- for (off = 0; off < dlen; off++)
+ if (rel)
{
- if (off > 0)
- *dump++ = ' ';
- sprintf(dump, "%02x", *(ptr + off) & 0xff);
- dump += 2;
+ TupleDesc itupdesc = RelationGetDescr(rel);
+ Datum datvalues[INDEX_MAX_KEYS];
+ bool isnull[INDEX_MAX_KEYS];
+ int natts;
+ int indnkeyatts = rel->rd_index->indnkeyatts;
+
+ natts = BTreeTupleGetNAtts(itup, rel);
+
+ itupdesc->natts = Min(indnkeyatts, natts);
+ memset(&isnull, 0xFF, sizeof(isnull));
+ index_deform_tuple(itup, itupdesc, datvalues, isnull);
+ rel->rd_index->indnkeyatts = natts;
+ values[j++] = BuildIndexValueDescription(rel, datvalues, isnull);
+ itupdesc->natts = IndexRelationGetNumberOfAttributes(rel);
+ rel->rd_index->indnkeyatts = indnkeyatts;
}
+ else
+ {
+ dlen = IndexTupleSize(itup) - IndexInfoFindDataOffset(itup->t_info);
+ dump = palloc0(dlen * 3 + 1);
+ values[j++] = dump;
+ for (off = 0; off < dlen; off++)
+ {
+ if (off > 0)
+ *dump++ = ' ';
+ sprintf(dump, "%02x", *(ptr + off) & 0xff);
+ dump += 2;
+ }
+ }
+
+ if (!rel || !_bt_heapkeyspace(rel))
+ htid = NULL;
+ else
+ htid = BTreeTupleGetHeapTID(itup);
+
+ if (htid)
+ values[j] = psprintf("(%u,%u)",
+ ItemPointerGetBlockNumberNoCheck(htid),
+ ItemPointerGetOffsetNumberNoCheck(htid));
+ else
+ values[j] = NULL;
tuple = BuildTupleFromCStrings(fctx->attinmeta, values);
@@ -366,11 +402,11 @@ bt_page_items(PG_FUNCTION_ARGS)
uargs = palloc(sizeof(struct user_args));
+ uargs->rel = rel;
uargs->page = palloc(BLCKSZ);
memcpy(uargs->page, BufferGetPage(buffer), BLCKSZ);
UnlockReleaseBuffer(buffer);
- relation_close(rel, AccessShareLock);
uargs->offset = FirstOffsetNumber;
@@ -397,12 +433,13 @@ bt_page_items(PG_FUNCTION_ARGS)
if (fctx->call_cntr < fctx->max_calls)
{
- result = bt_page_print_tuples(fctx, uargs->page, uargs->offset);
+ result = bt_page_print_tuples(fctx, uargs->page, uargs->offset, uargs->rel);
uargs->offset++;
SRF_RETURN_NEXT(fctx, result);
}
else
{
+ relation_close(uargs->rel, AccessShareLock);
pfree(uargs->page);
pfree(uargs);
SRF_RETURN_DONE(fctx);
@@ -482,7 +519,7 @@ bt_page_items_bytea(PG_FUNCTION_ARGS)
if (fctx->call_cntr < fctx->max_calls)
{
- result = bt_page_print_tuples(fctx, uargs->page, uargs->offset);
+ result = bt_page_print_tuples(fctx, uargs->page, uargs->offset, NULL);
uargs->offset++;
SRF_RETURN_NEXT(fctx, result);
}
diff --git a/contrib/pageinspect/expected/btree.out b/contrib/pageinspect/expected/btree.out
index 07c2dcd771..067e73f21a 100644
--- a/contrib/pageinspect/expected/btree.out
+++ b/contrib/pageinspect/expected/btree.out
@@ -40,7 +40,8 @@ ctid | (0,1)
itemlen | 16
nulls | f
vars | f
-data | 01 00 00 00 00 00 00 01
+data | (a)=(72057594037927937)
+htid | (0,1)
SELECT * FROM bt_page_items('test1_a_idx', 2);
ERROR: block number out of range
diff --git a/contrib/pageinspect/pageinspect--1.6--1.7.sql b/contrib/pageinspect/pageinspect--1.6--1.7.sql
index 2433a21af2..9acbad1589 100644
--- a/contrib/pageinspect/pageinspect--1.6--1.7.sql
+++ b/contrib/pageinspect/pageinspect--1.6--1.7.sql
@@ -24,3 +24,25 @@ CREATE FUNCTION bt_metap(IN relname text,
OUT last_cleanup_num_tuples real)
AS 'MODULE_PATHNAME', 'bt_metap'
LANGUAGE C STRICT PARALLEL SAFE;
+
+--
+-- bt_page_items()
+--
+DROP FUNCTION bt_page_items(IN relname text, IN blkno int4,
+ OUT itemoffset smallint,
+ OUT ctid tid,
+ OUT itemlen smallint,
+ OUT nulls bool,
+ OUT vars bool,
+ OUT data text);
+CREATE FUNCTION bt_page_items(IN relname text, IN blkno int4,
+ OUT itemoffset smallint,
+ OUT ctid tid,
+ OUT itemlen smallint,
+ OUT nulls bool,
+ OUT vars bool,
+ OUT data text,
+ OUT htid tid)
+RETURNS SETOF record
+AS 'MODULE_PATHNAME', 'bt_page_items'
+LANGUAGE C STRICT PARALLEL SAFE;
--
2.17.1
v5-0002-Experimental-support-for-unique-indexes.patch (application/octet-stream)
From 7fd5af8b4767516c905d7cbbdc942e8d4643d025 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Sat, 27 Jul 2019 16:34:31 -0700
Subject: [PATCH v5 2/3] Experimental support for unique indexes.
I have written a pretty sloppy implementation of unique index support
for posting list compression, just to give us an idea of how it could be
done. This seems to be a loss for performance, so it's unlikely to go
much further than this.
---
src/backend/access/gist/gist.c | 3 +-
src/backend/access/hash/hashinsert.c | 4 +-
src/backend/access/index/genam.c | 26 +++++++++-
src/backend/access/nbtree/nbtinsert.c | 71 +++++++++++++++++++++++----
src/backend/access/nbtree/nbtpage.c | 2 +-
src/backend/access/nbtree/nbtsearch.c | 2 +-
src/backend/access/nbtree/nbtsort.c | 3 +-
src/include/access/genam.h | 3 +-
8 files changed, 96 insertions(+), 18 deletions(-)
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index e9ca4b8252..cfdea23cec 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -1650,7 +1650,8 @@ gistprunepage(Relation rel, Page page, Buffer buffer, Relation heapRel)
if (XLogStandbyInfoActive() && RelationNeedsWAL(rel))
latestRemovedXid =
index_compute_xid_horizon_for_tuples(rel, heapRel, buffer,
- deletable, ndeletable);
+ deletable, ndeletable,
+ false);
if (ndeletable > 0)
{
diff --git a/src/backend/access/hash/hashinsert.c b/src/backend/access/hash/hashinsert.c
index 89876d2ccd..807e0ecd84 100644
--- a/src/backend/access/hash/hashinsert.c
+++ b/src/backend/access/hash/hashinsert.c
@@ -362,8 +362,8 @@ _hash_vacuum_one_page(Relation rel, Relation hrel, Buffer metabuf, Buffer buf)
TransactionId latestRemovedXid;
latestRemovedXid =
- index_compute_xid_horizon_for_tuples(rel, hrel, buf,
- deletable, ndeletable);
+ index_compute_xid_horizon_for_tuples(rel, hrel, buf, deletable,
+ ndeletable, false);
/*
* Write-lock the meta page so that we can decrement tuple count.
diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
index 2599b5d342..c075e6c7c7 100644
--- a/src/backend/access/index/genam.c
+++ b/src/backend/access/index/genam.c
@@ -273,6 +273,8 @@ BuildIndexValueDescription(Relation indexRelation,
return buf.data;
}
+#include "access/nbtree.h"
+
/*
* Get the latestRemovedXid from the table entries pointed at by the index
* tuples being deleted.
@@ -282,7 +284,8 @@ index_compute_xid_horizon_for_tuples(Relation irel,
Relation hrel,
Buffer ibuf,
OffsetNumber *itemnos,
- int nitems)
+ int nitems,
+ bool btree)
{
ItemPointerData *ttids =
(ItemPointerData *) palloc(sizeof(ItemPointerData) * nitems);
@@ -298,6 +301,27 @@ index_compute_xid_horizon_for_tuples(Relation irel,
iitemid = PageGetItemId(ipage, itemnos[i]);
itup = (IndexTuple) PageGetItem(ipage, iitemid);
+ if (btree)
+ {
+ /*
+ * FIXME: This is a gross modularity violation. Clearly B-Tree
+ * ought to pass us heap TIDs, and not require that we figure it
+ * out on its behalf. Also, this is just wrong, since we're
+ * assuming that the oldest xmin is available from the lowest heap
+ * TID.
+ *
+ * I haven't bothered to fix this because unique index support is
+ * just a PoC, and will probably stay that way. Also, since
+ * WAL-logging is currently very inefficient, it doesn't seem very
+ * likely that anybody will get an overly-optimistic view of the
+ * cost of WAL logging just because we were sloppy here.
+ */
+ if (BTreeTupleIsPosting(itup))
+ {
+ ItemPointerCopy(BTreeTupleGetHeapTID(itup), &ttids[i]);
+ continue;
+ }
+ }
ItemPointerCopy(&itup->t_tid, &ttids[i]);
}
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index f0c1174e2a..4da28d9518 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -432,7 +432,6 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
}
curitemid = PageGetItemId(page, offset);
- Assert(!BTreeTupleIsPosting(curitup));
/*
* We can skip items that are marked killed.
@@ -449,14 +448,34 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
if (!ItemIdIsDead(curitemid))
{
ItemPointerData htid;
+ bool posting;
bool all_dead;
+ bool posting_all_dead;
+ int npost;
if (_bt_compare(rel, itup_key, page, offset) != 0)
break; /* we're past all the equal tuples */
/* okay, we gotta fetch the heap tuple ... */
curitup = (IndexTuple) PageGetItem(page, curitemid);
- htid = curitup->t_tid;
+
+ if (!BTreeTupleIsPosting(curitup))
+ {
+ htid = curitup->t_tid;
+ posting = false;
+ posting_all_dead = true;
+ }
+ else
+ {
+ posting = true;
+ /* Initial assumption */
+ posting_all_dead = true;
+ }
+
+ npost = 0;
+doposttup:
+ if (posting)
+ htid = *BTreeTupleGetPostingN(curitup, npost);
/*
* If we are doing a recheck, we expect to find the tuple we
@@ -467,6 +486,9 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
ItemPointerCompare(&htid, &itup->t_tid) == 0)
{
found = true;
+ posting_all_dead = false;
+ if (posting)
+ goto nextpost;
}
/*
@@ -532,8 +554,7 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
* not part of this chain because it had a different index
* entry.
*/
- htid = itup->t_tid;
- if (table_index_fetch_tuple_check(heapRel, &htid,
+ if (table_index_fetch_tuple_check(heapRel, &itup->t_tid,
SnapshotSelf, NULL))
{
/* Normal case --- it's still live */
@@ -591,7 +612,7 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
RelationGetRelationName(rel))));
}
}
- else if (all_dead)
+ else if (all_dead && !posting)
{
/*
* The conflicting tuple (or whole HOT chain) is dead to
@@ -610,6 +631,35 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
else
MarkBufferDirtyHint(insertstate->buf, true);
}
+ else if (posting)
+ {
+nextpost:
+ if (!all_dead)
+ posting_all_dead = false;
+
+ /* Iterate over single posting list tuple */
+ npost++;
+ if (npost < BTreeTupleGetNPosting(curitup))
+ goto doposttup;
+
+ /*
+ * Mark posting tuple dead if all hot chains whose root is
+ * contained in posting tuple have tuples that are all
+ * dead
+ */
+ if (posting_all_dead)
+ {
+ ItemIdMarkDead(curitemid);
+ opaque->btpo_flags |= BTP_HAS_GARBAGE;
+
+ if (nbuf != InvalidBuffer)
+ MarkBufferDirtyHint(nbuf, true);
+ else
+ MarkBufferDirtyHint(insertstate->buf, true);
+ }
+
+ /* Move on to next index tuple */
+ }
}
}
@@ -784,7 +834,7 @@ _bt_findinsertloc(Relation rel,
/*
* If the target page is full, try to compress the page
*/
- if (PageGetFreeSpace(page) < insertstate->itemsz && !checkingunique)
+ if (PageGetFreeSpace(page) < insertstate->itemsz)
{
_bt_compress_one_page(rel, insertstate->buf, heapRel);
insertstate->bounds_valid = false; /* paranoia */
@@ -2595,12 +2645,13 @@ _bt_compress_one_page(Relation rel, Buffer buffer, Relation heapRel)
int natts = IndexRelationGetNumberOfAttributes(rel);
/*
- * Don't use compression for indexes with INCLUDEd columns and unique
- * indexes.
+ * Don't use compression for indexes with INCLUDEd columns.
+ *
+ * Unique indexes can benefit from ad-hoc compression, though we don't do
+ * this during CREATE INDEX.
*/
use_compression = (IndexRelationGetNumberOfKeyAttributes(rel) ==
- IndexRelationGetNumberOfAttributes(rel) &&
- !rel->rd_index->indisunique);
+ IndexRelationGetNumberOfAttributes(rel));
if (!use_compression)
return;
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 86c662d4e6..985418065b 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -1121,7 +1121,7 @@ _bt_delitems_delete(Relation rel, Buffer buf,
if (XLogStandbyInfoActive() && RelationNeedsWAL(rel))
latestRemovedXid =
index_compute_xid_horizon_for_tuples(rel, heapRel, buf,
- itemnos, nitems);
+ itemnos, nitems, true);
/* No ereport(ERROR) until changes are logged */
START_CRIT_SECTION();
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 20975970d6..ffcfd21593 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -557,7 +557,7 @@ _bt_compare_posting(Relation rel,
itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
result = _bt_compare(rel, key, page, offnum);
- if (BTreeTupleIsPosting(itup) && result == 0)
+ if (BTreeTupleIsPosting(itup) && result == 0 && key->scantid)
{
int low,
high,
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index b058599aa4..846e60a452 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -1253,7 +1253,8 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
/*
* Don't use compression for indexes with INCLUDEd columns and unique
- * indexes.
+ * indexes. Note that unique indexes are supported with retail
+ * insertions.
*/
use_compression = (IndexRelationGetNumberOfKeyAttributes(wstate->index) ==
IndexRelationGetNumberOfAttributes(wstate->index) &&
diff --git a/src/include/access/genam.h b/src/include/access/genam.h
index 8c053be2ca..f9866ce7f9 100644
--- a/src/include/access/genam.h
+++ b/src/include/access/genam.h
@@ -193,7 +193,8 @@ extern TransactionId index_compute_xid_horizon_for_tuples(Relation irel,
Relation hrel,
Buffer ibuf,
OffsetNumber *itemnos,
- int nitems);
+ int nitems,
+ bool btree);
/*
* heap-or-index access to system catalogs (in genam.c)
--
2.17.1
06.08.2019 4:28, Peter Geoghegan wrote:
Attached is v5, which is based on your v4. The three main differences
between this and v4 are:
* Removed BT_COMPRESS_THRESHOLD stuff, for the reasons explained in my
July 24 e-mail. We can always add something like this back during
performance validation of the patch. Right now, having no
BT_COMPRESS_THRESHOLD limit definitely improves space utilization for
certain important cases, which seems more important than the
uncertain/speculative downside.
Fair enough.
I think we can measure performance and make a decision once the patch
stabilizes.
* We now have experimental support for unique indexes. This is broken
out into its own patch.
* We now handle LP_DEAD items in a special way within
_bt_insertonpg_in_posting().
As you pointed out already, we do need to think about LP_DEAD items
directly, rather than assuming that they cannot be on the page that
_bt_insertonpg_in_posting() must process. More on that later.
If sizeof(t_info) + sizeof(key) < sizeof(t_tid), the resulting posting tuple
can be larger. This can happen if keysize <= 4 bytes.
In this situation the original tuples must each have been aligned to 16 bytes,
and the resulting tuple is at most 24 bytes (6+2+4+6+6). So this case is
also safe.
I still need to think about the exact details of alignment within
_bt_insertonpg_in_posting(). I'm worried about boundary cases there. I
could be wrong.
Could you explain more about these cases?
Right now I don't understand the problem.
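As an illustration of the size argument above (not code from the patch), the following sketch recomputes the arithmetic. It assumes 8-byte maximum alignment, a 6-byte TID and a 2-byte t_info, and a simplified posting tuple layout (header, key, then the TIDs, with no extra padding before the posting list). All names here are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define TID_SIZE       6   /* size of a heap TID, as in the "6" terms above */
#define TINFO_SIZE     2   /* size of t_info */
#define MAX_ALIGNMENT  8   /* assumption: typical 64-bit maximum alignment */

/* Round len up to the next multiple of MAX_ALIGNMENT. */
static size_t
max_align(size_t len)
{
    return (len + MAX_ALIGNMENT - 1) & ~((size_t) (MAX_ALIGNMENT - 1));
}

/* Space used by ntuples separate tuples that each store one TID and the key. */
static size_t
plain_tuples_size(size_t keysize, size_t ntuples)
{
    return ntuples * max_align(TID_SIZE + TINFO_SIZE + keysize);
}

/* Space used by one posting tuple: header, key, then ntids heap TIDs. */
static size_t
posting_tuple_size(size_t keysize, size_t ntids)
{
    assert(ntids >= 2);     /* a posting list replaces at least two tuples */
    return max_align(TID_SIZE + TINFO_SIZE + keysize + ntids * TID_SIZE);
}
```

With a 4-byte key, two separate tuples take 2 * 16 = 32 bytes, while a single posting tuple holding both TIDs takes 24 bytes under these assumptions, which matches the (6+2+4+6+6) figure.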
The main reason why I decided to avoid applying compression to unique
indexes is the performance of microvacuum. It is not applied to items
inside a posting tuple. And I expect it to be important for unique
indexes, which ideally contain only a few live values.
I found that the performance of my experimental patch with unique
index was significantly worse. It looks like this is a bad idea, as
you predicted, though we may still want to do
deduplication/compression with NULL values in unique indexes. I did
learn a few things from implementing unique index support, though.
BTW, there is a subtle bug in how my unique index patch does
WAL-logging -- see my comments within
index_compute_xid_horizon_for_tuples(). The bug shouldn't matter if
replication isn't used. I don't think that we're going to use this
experimental patch at all, so I didn't bother fixing the bug.
Thank you for the patch.
Still, I'd suggest leaving it as a possible future improvement, so that
it doesn't distract us from the original feature.
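To illustrate why microvacuum is less effective for posting tuples: one line pointer covers many heap TIDs, so the item can be marked LP_DEAD only when every TID in the posting list is dead. This is what the posting_all_dead flag in the experimental unique index patch tracks. A minimal sketch of that rule, using a hypothetical helper and a plain array of per-TID results instead of real heap fetches:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical helper: tid_dead[i] is true when the HOT chain rooted at the
 * i-th TID of the posting list is dead to all transactions.
 *
 * Returns true only if the whole posting tuple may be marked LP_DEAD.
 */
static bool
posting_tuple_all_dead(const bool *tid_dead, size_t ntids)
{
    assert(ntids > 0);

    for (size_t i = 0; i < ntids; i++)
    {
        /* A single live TID keeps the whole index item alive. */
        if (!tid_dead[i])
            return false;
    }
    return true;
}
```

A posting tuple with one live TID among many dead ones therefore cannot be killed, whereas the equivalent separate tuples could be marked dead one by one.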
if (ItemIdIsDead(itemId))
continue;
In the previous review Rafia asked about "some reason".
Trying to figure out whether this situation is possible, I changed this line to
Assert(!ItemIdIsDead(itemId)) in our test version. And it failed in a
performance test. Unfortunately, I was not able to reproduce it.
I found it easy enough to see LP_DEAD items within
_bt_insertonpg_in_posting() when running pgbench with the extra unique
index patch. To give you a simple example of how this can happen,
consider the comments about BTP_HAS_GARBAGE within
_bt_delitems_vacuum(). That probably isn't the only way it can happen,
either. ISTM that we need to be prepared for LP_DEAD items during
deduplication, rather than trying to prevent deduplication from ever
having to see an LP_DEAD item.
I added another related fix for _bt_compress_one_page() to v6.
The previous code implicitly deleted DEAD items without
calling index_compute_xid_horizon_for_tuples().
The new code checks whether DEAD items exist on the page and removes
them if any.
Another possible solution is to copy dead items as is from the old page to
the new one,
but I think it's good to remove dead tuples as fast as possible.
v5 makes _bt_insertonpg_in_posting() prepared to overwrite an
existing item if it's an LP_DEAD item that falls in the same TID range
(that's _bt_compare()-wise "equal" to an existing tuple, which may or
may not be a posting list tuple already). I haven't made this code do
something like call index_compute_xid_horizon_for_tuples(), even
though that's needed for correctness (i.e. this new code is currently
broken in the same way that I mentioned unique index support is
broken).
Is it possible that the DEAD tuple to be deleted is smaller than itup?
I also added a nearby FIXME comment to
_bt_insertonpg_in_posting() -- I don't think that the code for
splitting a posting list in two is currently crash-safe.
Good catch. It seems that I need to rearrange the code.
I'll send an updated patch this week.
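For reference, the posting list split discussed here can be sketched outside of nbtree as follows. This is only an illustration under simplifying assumptions: Tid is a stand-in for ItemPointerData, the caller provides the output buffers, and split_posting() and tid_cmp() are hypothetical names. It shows only the array manipulation (left half keeps the TIDs below the new one, right half starts with the new TID) and says nothing about WAL-logging or crash safety.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Simplified stand-in for ItemPointerData (block number, offset number). */
typedef struct
{
    unsigned       block;
    unsigned short offset;
} Tid;

/* Three-way comparison of two TIDs: block number first, then offset. */
static int
tid_cmp(const Tid *a, const Tid *b)
{
    if (a->block != b->block)
        return a->block < b->block ? -1 : 1;
    if (a->offset != b->offset)
        return a->offset < b->offset ? -1 : 1;
    return 0;
}

/*
 * Split the sorted list tids[0..n) at position pos, where newtid belongs.
 * left receives tids[0..pos); right receives newtid followed by tids[pos..n).
 * left must have room for pos entries and right for n - pos + 1 entries.
 * Returns the number of entries written to right.
 */
static size_t
split_posting(const Tid *tids, size_t n, size_t pos, Tid newtid,
              Tid *left, Tid *right)
{
    /* newtid must fall strictly inside the existing min/max range */
    assert(pos > 0 && pos < n);
    assert(tid_cmp(&tids[pos - 1], &newtid) < 0);
    assert(tid_cmp(&newtid, &tids[pos]) < 0);

    memcpy(left, tids, pos * sizeof(Tid));
    right[0] = newtid;
    memcpy(right + 1, tids + pos, (n - pos) * sizeof(Tid));

    return n - pos + 1;
}
```

Both halves stay sorted, and every TID in the left half is smaller than every TID in the right half, so the two resulting tuples can sit next to each other on the page.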
How do you feel about officially calling this deduplication, not
compression? I think that it's a more accurate name for the technique.
I agree.
Should I rename all the related functions and variables in the patch?
--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachments:
v6-0001-Compression-deduplication-in-nbtree.patch (text/x-patch)
commit 9ac37503c71f7623413a2e406d81f5c9a4b02742
Author: Anastasia <a.lubennikova@postgrespro.ru>
Date: Tue Aug 13 17:00:41 2019 +0300
v6-0001-Compression-deduplication-in-nbtree.patch
diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index 05e7d67..504bca2 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -924,6 +924,7 @@ bt_target_page_check(BtreeCheckState *state)
size_t tupsize;
BTScanInsert skey;
bool lowersizelimit;
+ ItemPointer scantid;
CHECK_FOR_INTERRUPTS();
@@ -994,29 +995,73 @@ bt_target_page_check(BtreeCheckState *state)
/*
* Readonly callers may optionally verify that non-pivot tuples can
- * each be found by an independent search that starts from the root
+ * each be found by an independent search that starts from the root.
+ * Note that we deliberately don't do individual searches for each
+ * "logical" posting list tuple, since the posting list itself is
+ * validated by other checks.
*/
if (state->rootdescend && P_ISLEAF(topaque) &&
!bt_rootdescend(state, itup))
{
char *itid,
*htid;
+ ItemPointer tid = BTreeTupleGetHeapTID(itup);
itid = psprintf("(%u,%u)", state->targetblock, offset);
htid = psprintf("(%u,%u)",
- ItemPointerGetBlockNumber(&(itup->t_tid)),
- ItemPointerGetOffsetNumber(&(itup->t_tid)));
+ ItemPointerGetBlockNumber(tid),
+ ItemPointerGetOffsetNumber(tid));
ereport(ERROR,
(errcode(ERRCODE_INDEX_CORRUPTED),
errmsg("could not find tuple using search from root page in index \"%s\"",
RelationGetRelationName(state->rel)),
- errdetail_internal("Index tid=%s points to heap tid=%s page lsn=%X/%X.",
+ errdetail_internal("Index tid=%s min heap tid=%s page lsn=%X/%X.",
itid, htid,
(uint32) (state->targetlsn >> 32),
(uint32) state->targetlsn)));
}
+ /*
+ * If tuple is actually a posting list, make sure posting list TIDs
+ * are in order.
+ */
+ if (BTreeTupleIsPosting(itup))
+ {
+ ItemPointerData last;
+ ItemPointer current;
+
+ ItemPointerCopy(BTreeTupleGetHeapTID(itup), &last);
+
+ for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
+ {
+
+ current = BTreeTupleGetPostingN(itup, i);
+
+ if (ItemPointerCompare(current, &last) <= 0)
+ {
+ char *itid,
+ *htid;
+
+ itid = psprintf("(%u,%u)", state->targetblock, offset);
+ htid = psprintf("(%u,%u)",
+ ItemPointerGetBlockNumberNoCheck(current),
+ ItemPointerGetOffsetNumberNoCheck(current));
+
+ ereport(ERROR,
+ (errcode(ERRCODE_INDEX_CORRUPTED),
+ errmsg("posting list heap TIDs out of order in index \"%s\"",
+ RelationGetRelationName(state->rel)),
+ errdetail_internal("Index tid=%s min heap tid=%s page lsn=%X/%X.",
+ itid, htid,
+ (uint32) (state->targetlsn >> 32),
+ (uint32) state->targetlsn)));
+ }
+
+ ItemPointerCopy(current, &last);
+ }
+ }
+
/* Build insertion scankey for current page offset */
skey = bt_mkscankey_pivotsearch(state->rel, itup);
@@ -1074,12 +1119,33 @@ bt_target_page_check(BtreeCheckState *state)
{
IndexTuple norm;
- norm = bt_normalize_tuple(state, itup);
- bloom_add_element(state->filter, (unsigned char *) norm,
- IndexTupleSize(norm));
- /* Be tidy */
- if (norm != itup)
- pfree(norm);
+ if (BTreeTupleIsPosting(itup))
+ {
+ IndexTuple onetup;
+
+ /* Fingerprint all elements of posting tuple one by one */
+ for (int i = 0; i < BTreeTupleGetNPosting(itup); i++)
+ {
+ onetup = BTreeGetNthTupleOfPosting(itup, i);
+
+ norm = bt_normalize_tuple(state, onetup);
+ bloom_add_element(state->filter, (unsigned char *) norm,
+ IndexTupleSize(norm));
+ /* Be tidy */
+ if (norm != onetup)
+ pfree(norm);
+ pfree(onetup);
+ }
+ }
+ else
+ {
+ norm = bt_normalize_tuple(state, itup);
+ bloom_add_element(state->filter, (unsigned char *) norm,
+ IndexTupleSize(norm));
+ /* Be tidy */
+ if (norm != itup)
+ pfree(norm);
+ }
}
/*
@@ -1087,7 +1153,8 @@ bt_target_page_check(BtreeCheckState *state)
*
* If there is a high key (if this is not the rightmost page on its
* entire level), check that high key actually is upper bound on all
- * page items.
+ * page items. If this is a posting list tuple, we'll need to set
+ * scantid to be highest TID in posting list.
*
* We prefer to check all items against high key rather than checking
* just the last and trusting that the operator class obeys the
@@ -1127,6 +1194,9 @@ bt_target_page_check(BtreeCheckState *state)
* tuple. (See also: "Notes About Data Representation" in the nbtree
* README.)
*/
+ scantid = skey->scantid;
+ if (!BTreeTupleIsPivot(itup))
+ skey->scantid = BTreeTupleGetMaxTID(itup);
if (!P_RIGHTMOST(topaque) &&
!(P_ISLEAF(topaque) ? invariant_leq_offset(state, skey, P_HIKEY) :
invariant_l_offset(state, skey, P_HIKEY)))
@@ -1150,6 +1220,7 @@ bt_target_page_check(BtreeCheckState *state)
(uint32) (state->targetlsn >> 32),
(uint32) state->targetlsn)));
}
+ skey->scantid = scantid;
/*
* * Item order check *
@@ -1164,11 +1235,13 @@ bt_target_page_check(BtreeCheckState *state)
*htid,
*nitid,
*nhtid;
+ ItemPointer tid;
itid = psprintf("(%u,%u)", state->targetblock, offset);
+ tid = BTreeTupleGetHeapTID(itup);
htid = psprintf("(%u,%u)",
- ItemPointerGetBlockNumberNoCheck(&(itup->t_tid)),
- ItemPointerGetOffsetNumberNoCheck(&(itup->t_tid)));
+ ItemPointerGetBlockNumberNoCheck(tid),
+ ItemPointerGetOffsetNumberNoCheck(tid));
nitid = psprintf("(%u,%u)", state->targetblock,
OffsetNumberNext(offset));
@@ -1177,9 +1250,11 @@ bt_target_page_check(BtreeCheckState *state)
state->target,
OffsetNumberNext(offset));
itup = (IndexTuple) PageGetItem(state->target, itemid);
+
+ tid = BTreeTupleGetHeapTID(itup);
nhtid = psprintf("(%u,%u)",
- ItemPointerGetBlockNumberNoCheck(&(itup->t_tid)),
- ItemPointerGetOffsetNumberNoCheck(&(itup->t_tid)));
+ ItemPointerGetBlockNumberNoCheck(tid),
+ ItemPointerGetOffsetNumberNoCheck(tid));
ereport(ERROR,
(errcode(ERRCODE_INDEX_CORRUPTED),
@@ -1189,10 +1264,10 @@ bt_target_page_check(BtreeCheckState *state)
"higher index tid=%s (points to %s tid=%s) "
"page lsn=%X/%X.",
itid,
- P_ISLEAF(topaque) ? "heap" : "index",
+ P_ISLEAF(topaque) ? "min heap" : "index",
htid,
nitid,
- P_ISLEAF(topaque) ? "heap" : "index",
+ P_ISLEAF(topaque) ? "min heap" : "index",
nhtid,
(uint32) (state->targetlsn >> 32),
(uint32) state->targetlsn)));
@@ -1953,10 +2028,11 @@ bt_tuple_present_callback(Relation index, HeapTuple htup, Datum *values,
* verification. In particular, it won't try to normalize opclass-equal
* datums with potentially distinct representations (e.g., btree/numeric_ops
* index datums will not get their display scale normalized-away here).
- * Normalization may need to be expanded to handle more cases in the future,
- * though. For example, it's possible that non-pivot tuples could in the
- * future have alternative logically equivalent representations due to using
- * the INDEX_ALT_TID_MASK bit to implement intelligent deduplication.
+ * Caller does normalization for non-pivot tuples that have their own posting
+ * list, since dummy CREATE INDEX callback code generates new tuples with the
+ * same normalized representation. Compression is performed
+ * opportunistically, and in general there is no guarantee about how or when
+ * compression will be applied.
*/
static IndexTuple
bt_normalize_tuple(BtreeCheckState *state, IndexTuple itup)
@@ -2560,14 +2636,16 @@ static inline ItemPointer
BTreeTupleGetHeapTIDCareful(BtreeCheckState *state, IndexTuple itup,
bool nonpivot)
{
- ItemPointer result = BTreeTupleGetHeapTID(itup);
+ ItemPointer result;
BlockNumber targetblock = state->targetblock;
- if (result == NULL && nonpivot)
+ if (BTreeTupleIsPivot(itup) == nonpivot)
ereport(ERROR,
(errcode(ERRCODE_INDEX_CORRUPTED),
errmsg("block %u or its right sibling block or child block in index \"%s\" contains non-pivot tuple that lacks a heap TID",
targetblock, RelationGetRelationName(state->rel))));
+ result = BTreeTupleGetHeapTID(itup);
+
return result;
}
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index 5890f39..e96f5ec 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -41,6 +41,17 @@ static OffsetNumber _bt_findinsertloc(Relation rel,
BTStack stack,
Relation heapRel);
static void _bt_stepright(Relation rel, BTInsertState insertstate, BTStack stack);
+static void _bt_delete_and_insert(Relation rel,
+ Buffer buf,
+ IndexTuple newitup,
+ OffsetNumber newitemoff);
+static void _bt_insertonpg_in_posting(Relation rel, BTScanInsert itup_key,
+ Buffer buf,
+ Buffer cbuf,
+ BTStack stack,
+ IndexTuple itup,
+ OffsetNumber newitemoff,
+ bool split_only_page, int in_posting_offset);
static void _bt_insertonpg(Relation rel, BTScanInsert itup_key,
Buffer buf,
Buffer cbuf,
@@ -56,6 +67,8 @@ static void _bt_insert_parent(Relation rel, Buffer buf, Buffer rbuf,
static bool _bt_pgaddtup(Page page, Size itemsize, IndexTuple itup,
OffsetNumber itup_off);
static void _bt_vacuum_one_page(Relation rel, Buffer buffer, Relation heapRel);
+static void insert_itupprev_to_page(Page page, BTCompressState *compressState);
+static void _bt_compress_one_page(Relation rel, Buffer buffer, Relation heapRel);
/*
* _bt_doinsert() -- Handle insertion of a single index tuple in the tree.
@@ -297,10 +310,17 @@ top:
* search bounds established within _bt_check_unique when insertion is
* checkingunique.
*/
+ insertstate.in_posting_offset = 0;
newitemoff = _bt_findinsertloc(rel, &insertstate, checkingunique,
stack, heapRel);
- _bt_insertonpg(rel, itup_key, insertstate.buf, InvalidBuffer, stack,
- itup, newitemoff, false);
+
+ if (insertstate.in_posting_offset)
+ _bt_insertonpg_in_posting(rel, itup_key, insertstate.buf,
+ InvalidBuffer, stack, itup, newitemoff,
+ false, insertstate.in_posting_offset);
+ else
+ _bt_insertonpg(rel, itup_key, insertstate.buf, InvalidBuffer,
+ stack, itup, newitemoff, false);
}
else
{
@@ -435,6 +455,7 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
/* okay, we gotta fetch the heap tuple ... */
curitup = (IndexTuple) PageGetItem(page, curitemid);
+ Assert(!BTreeTupleIsPosting(curitup));
htid = curitup->t_tid;
/*
@@ -759,6 +780,26 @@ _bt_findinsertloc(Relation rel,
_bt_vacuum_one_page(rel, insertstate->buf, heapRel);
insertstate->bounds_valid = false;
}
+
+ /*
+ * If the target page is full, try to compress the page
+ */
+ if (PageGetFreeSpace(page) < insertstate->itemsz && !checkingunique)
+ {
+ _bt_compress_one_page(rel, insertstate->buf, heapRel);
+ insertstate->bounds_valid = false; /* paranoia */
+
+ /*
+ * FIXME: _bt_vacuum_one_page() won't have cleared the
+ * BTP_HAS_GARBAGE flag when it didn't kill items. Maybe we
+ * should clear the BTP_HAS_GARBAGE flag bit from the page when
+ * compression avoids a page split -- _bt_vacuum_one_page() is
+ * expecting a page split that takes care of it.
+ *
+ * (On the other hand, maybe it doesn't matter very much. A
+ * comment update seems like the bare minimum we should do.)
+ */
+ }
}
else
{
@@ -900,6 +941,208 @@ _bt_stepright(Relation rel, BTInsertState insertstate, BTStack stack)
insertstate->bounds_valid = false;
}
+/*
+ * Delete tuple on newitemoff offset and insert newitup at the same offset.
+ * All checks of free space must have been done before calling this function.
+ *
+ * For use in posting tuple's update.
+ */
+static void
+_bt_delete_and_insert(Relation rel,
+ Buffer buf,
+ IndexTuple newitup,
+ OffsetNumber newitemoff)
+{
+ Page page = BufferGetPage(buf);
+ Size newitupsz = IndexTupleSize(newitup);
+
+ newitupsz = MAXALIGN(newitupsz);
+
+ START_CRIT_SECTION();
+
+ PageIndexTupleDelete(page, newitemoff);
+
+ if (!_bt_pgaddtup(page, newitupsz, newitup, newitemoff))
+ elog(ERROR, "failed to insert compressed item in index \"%s\"",
+ RelationGetRelationName(rel));
+
+ MarkBufferDirty(buf);
+
+ /* Xlog stuff */
+ if (RelationNeedsWAL(rel))
+ {
+ xl_btree_insert xlrec;
+ XLogRecPtr recptr;
+
+ xlrec.offnum = newitemoff;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, SizeOfBtreeInsert);
+
+ Assert(P_ISLEAF((BTPageOpaque) PageGetSpecialPointer(page)));
+
+ /*
+ * Force full page write to keep code simple
+ *
+ * TODO: think of using XLOG_BTREE_INSERT_LEAF with a new tuple's data
+ */
+ XLogRegisterBuffer(0, buf, REGBUF_STANDARD | REGBUF_FORCE_IMAGE);
+ recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_INSERT_LEAF);
+ PageSetLSN(page, recptr);
+ }
+
+ END_CRIT_SECTION();
+}
+
+/*
+ * _bt_insertonpg_in_posting() --
+ * Insert a tuple on a particular page in the index
+ * (compression aware version).
+ *
+ * If new tuple's key is equal to the key of a posting tuple that already
+ * exists on the page and its TID falls inside the min/max range of
+ * existing posting list, update the posting tuple.
+ *
+ * It only can happen on leaf page.
+ *
+ * newitemoff - offset of the posting tuple we must update
+ * in_posting_offset - position of the new tuple's TID in posting list
+ *
+ * If necessary, split the page.
+ */
+static void
+_bt_insertonpg_in_posting(Relation rel,
+ BTScanInsert itup_key,
+ Buffer buf,
+ Buffer cbuf,
+ BTStack stack,
+ IndexTuple itup,
+ OffsetNumber newitemoff,
+ bool split_only_page,
+ int in_posting_offset)
+{
+ IndexTuple origtup;
+ IndexTuple lefttup;
+ IndexTuple righttup;
+ ItemPointerData *ipd;
+ IndexTuple newitup;
+ ItemId itemid;
+ Page page;
+ int nipd,
+ nipd_right;
+
+ page = BufferGetPage(buf);
+ /* get old posting tuple */
+ itemid = PageGetItemId(page, newitemoff);
+ origtup = (IndexTuple) PageGetItem(page, itemid);
+ Assert(BTreeTupleIsPosting(origtup));
+ nipd = BTreeTupleGetNPosting(origtup);
+ Assert(in_posting_offset < nipd);
+ Assert(itup_key->scantid != NULL);
+ Assert(itup_key->heapkeyspace);
+
+ elog(DEBUG4, "(%u,%u) is min, (%u,%u) is max, (%u,%u) is new",
+ ItemPointerGetBlockNumberNoCheck(BTreeTupleGetHeapTID(origtup)),
+ ItemPointerGetOffsetNumberNoCheck(BTreeTupleGetHeapTID(origtup)),
+ ItemPointerGetBlockNumberNoCheck(BTreeTupleGetMaxTID(origtup)),
+ ItemPointerGetOffsetNumberNoCheck(BTreeTupleGetMaxTID(origtup)),
+ ItemPointerGetBlockNumberNoCheck(BTreeTupleGetMaxTID(itup)),
+ ItemPointerGetOffsetNumberNoCheck(BTreeTupleGetMaxTID(itup)));
+
+ /*
+ * First check if existing item is dead.
+ *
+ * Then check if the new itempointer fits into the tuple's posting list.
+ *
+ * Also check if new itempointer fits into the page.
+ *
+ * If not, posting tuple's split is required in both cases.
+ *
+ * XXX: Think some more about alignment - pg
+ */
+ if (ItemIdIsDead(itemid))
+ {
+ /* FIXME: We need to call index_compute_xid_horizon_for_tuples() */
+ elog(DEBUG4, "replacing LP_DEAD posting list item, new off %d",
+ newitemoff);
+ _bt_delete_and_insert(rel, buf, itup, newitemoff);
+ _bt_relbuf(rel, buf);
+ }
+ else if (BTMaxItemSize(page) < MAXALIGN(IndexTupleSize(origtup)) + MAXALIGN(sizeof(ItemPointerData)) ||
+ PageGetFreeSpace(page) < MAXALIGN(IndexTupleSize(origtup)) + MAXALIGN(sizeof(ItemPointerData)))
+ {
+ /*
+ * Split posting tuple into two halves.
+ *
+ * Left tuple contains all item pointers less than the new one and
+ * right tuple contains new item pointer and all to the right.
+ *
+ * TODO Probably we can come up with more clever algorithm.
+ */
+ lefttup = BTreeFormPostingTuple(origtup, BTreeTupleGetPosting(origtup),
+ in_posting_offset);
+
+ nipd_right = nipd - in_posting_offset + 1;
+ ipd = palloc0(sizeof(ItemPointerData) * nipd_right);
+ /* insert new item pointer */
+ memcpy(ipd, itup, sizeof(ItemPointerData));
+ /* copy item pointers from original tuple that belong on right */
+ memcpy(ipd + 1,
+ BTreeTupleGetPostingN(origtup, in_posting_offset),
+ sizeof(ItemPointerData) * (nipd - in_posting_offset));
+
+ righttup = BTreeFormPostingTuple(origtup, ipd, nipd_right);
+ elog(DEBUG4, "inserting inside posting list with split due to no space orig elements %d new off %d",
+ nipd, in_posting_offset);
+
+ Assert(ItemPointerCompare(BTreeTupleGetMaxTID(lefttup),
+ BTreeTupleGetHeapTID(righttup)) < 0);
+
+ /*
+ * Replace old tuple with a left tuple on a page.
+ *
+ * Then insert righttup using the ordinary _bt_insertonpg() function.
+ * If a page split is required, _bt_insertonpg() will handle it.
+ *
+ * FIXME: This doesn't seem very crash safe -- what if we fail after
+ * _bt_delete_and_insert() but before _bt_insertonpg()? We could
+ * crash and then lose some of the logical tuples that used to be
+ * contained within original posting list, but will now go into new
+ * righttup posting list.
+ */
+ _bt_delete_and_insert(rel, buf, lefttup, newitemoff);
+ _bt_insertonpg(rel, itup_key, buf, InvalidBuffer,
+ stack, righttup, newitemoff + 1, false);
+
+ pfree(ipd);
+ pfree(lefttup);
+ pfree(righttup);
+ }
+ else
+ {
+ ipd = palloc0(sizeof(ItemPointerData) * (nipd + 1));
+ elog(DEBUG4, "inserting inside posting list due to apparent overlap");
+
+ /* copy item pointers from original tuple into ipd */
+ memcpy(ipd, BTreeTupleGetPosting(origtup),
+ sizeof(ItemPointerData) * in_posting_offset);
+ /* add item pointer of the new tuple into ipd */
+ memcpy(ipd + in_posting_offset, itup, sizeof(ItemPointerData));
+ /* copy item pointers from old tuple into ipd */
+ memcpy(ipd + in_posting_offset + 1,
+ BTreeTupleGetPostingN(origtup, in_posting_offset),
+ sizeof(ItemPointerData) * (nipd - in_posting_offset));
+
+ newitup = BTreeFormPostingTuple(itup, ipd, nipd + 1);
+
+ _bt_delete_and_insert(rel, buf, newitup, newitemoff);
+
+ pfree(ipd);
+ pfree(newitup);
+ _bt_relbuf(rel, buf);
+ }
+}
+
/*----------
* _bt_insertonpg() -- Insert a tuple on a particular page in the index.
*
@@ -2290,3 +2533,206 @@ _bt_vacuum_one_page(Relation rel, Buffer buffer, Relation heapRel)
* the page.
*/
}
+
+/*
+ * Add pending item (posting tuple or plain tuple) to the page while
+ * compressing it. This function returns nothing: if the item cannot be
+ * added, it raises an ERROR rather than reporting failure to the caller,
+ * so the caller cannot fall back to leaving the page uncompressed.
+ */
+static void
+insert_itupprev_to_page(Page page, BTCompressState *compressState)
+{
+ IndexTuple to_insert;
+ OffsetNumber offnum = PageGetMaxOffsetNumber(page);
+
+ if (compressState->ntuples == 0)
+ to_insert = compressState->itupprev;
+ else
+ {
+ IndexTuple postingtuple;
+
+ /* form a tuple with a posting list */
+ postingtuple = BTreeFormPostingTuple(compressState->itupprev,
+ compressState->ipd,
+ compressState->ntuples);
+ to_insert = postingtuple;
+ pfree(compressState->ipd);
+ }
+
+ /* Add the new item into the page */
+ offnum = OffsetNumberNext(offnum);
+
+ elog(DEBUG4, "insert_itupprev_to_page. compressState->ntuples %d IndexTupleSize %zu free %zu",
+ compressState->ntuples, IndexTupleSize(to_insert), PageGetFreeSpace(page));
+
+ if (PageAddItem(page, (Item) to_insert, IndexTupleSize(to_insert),
+ offnum, false, false) == InvalidOffsetNumber)
+ elog(ERROR, "failed to add tuple to page while compressing it");
+
+ if (compressState->ntuples > 0)
+ pfree(to_insert);
+ compressState->ntuples = 0;
+}
+
+/*
+ * Before splitting the page, try to compress items to free some space.
+ * If compression didn't succeed, buffer will contain old state of the page.
+ * This function should be called after lp_dead items
+ * were removed by _bt_vacuum_one_page().
+ */
+static void
+_bt_compress_one_page(Relation rel, Buffer buffer, Relation heapRel)
+{
+ OffsetNumber offnum,
+ minoff,
+ maxoff;
+ Page page = BufferGetPage(buffer);
+ Page newpage;
+ BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+ bool use_compression = false;
+ BTCompressState *compressState = NULL;
+ int natts = IndexRelationGetNumberOfAttributes(rel);
+ OffsetNumber deletable[MaxOffsetNumber];
+ int ndeletable = 0;
+
+ /*
+ * Don't use compression for indexes with INCLUDEd columns and unique
+ * indexes.
+ */
+ use_compression = (IndexRelationGetNumberOfKeyAttributes(rel) ==
+ IndexRelationGetNumberOfAttributes(rel) &&
+ !rel->rd_index->indisunique);
+ if (!use_compression)
+ return;
+
+ /* init compress state needed to build posting tuples */
+ compressState = (BTCompressState *) palloc0(sizeof(BTCompressState));
+ compressState->ipd = NULL;
+ compressState->ntuples = 0;
+ compressState->itupprev = NULL;
+ compressState->maxitemsize = BTMaxItemSize(page);
+ compressState->maxpostingsize = 0;
+
+ minoff = P_FIRSTDATAKEY(opaque);
+ maxoff = PageGetMaxOffsetNumber(page);
+
+
+ /*
+ * Delete dead tuples if any.
+ * We cannot simply skip them in the cycle below, because it's necessary
+ * to generate special Xlog record containing such tuples to compute
+ * latestRemovedXid on a standby server later.
+ *
+ * This should not affect performance, since it only can happen in a rare
+ * situation when BTP_HAS_GARBAGE flag was not set and _bt_vacuum_one_page
+ * was not called, or _bt_vacuum_one_page didn't remove all dead items.
+ */
+ for (offnum = minoff;
+ offnum <= maxoff;
+ offnum = OffsetNumberNext(offnum))
+ {
+ ItemId itemid = PageGetItemId(page, offnum);
+
+ if (ItemIdIsDead(itemid))
+ deletable[ndeletable++] = offnum;
+ }
+
+ if (ndeletable > 0)
+ _bt_delitems_delete(rel, buffer, deletable, ndeletable, heapRel);
+
+ /*
+ * Scan over all items to see which ones can be compressed
+ */
+ minoff = P_FIRSTDATAKEY(opaque);
+ maxoff = PageGetMaxOffsetNumber(page);
+ newpage = PageGetTempPageCopySpecial(page);
+ elog(DEBUG4, "_bt_compress_one_page rel: %s,blkno: %u",
+ RelationGetRelationName(rel), BufferGetBlockNumber(buffer));
+
+ /* Copy High Key if any */
+ if (!P_RIGHTMOST(opaque))
+ {
+ ItemId itemid = PageGetItemId(page, P_HIKEY);
+ Size itemsz = ItemIdGetLength(itemid);
+ IndexTuple item = (IndexTuple) PageGetItem(page, itemid);
+
+ if (PageAddItem(newpage, (Item) item, itemsz, P_HIKEY,
+ false, false) == InvalidOffsetNumber)
+ elog(ERROR, "failed to add highkey during compression");
+ }
+
+ /*
+ * Iterate over tuples on the page, try to compress them into posting
+ * lists and insert into new page.
+ */
+ for (offnum = minoff;
+ offnum <= maxoff;
+ offnum = OffsetNumberNext(offnum))
+ {
+ ItemId itemId = PageGetItemId(page, offnum);
+ IndexTuple itup = (IndexTuple) PageGetItem(page, itemId);
+
+ if (compressState->itupprev != NULL)
+ {
+ int n_equal_atts =
+ _bt_keep_natts_fast(rel, compressState->itupprev, itup);
+ int itup_ntuples = BTreeTupleIsPosting(itup) ?
+ BTreeTupleGetNPosting(itup) : 1;
+
+ if (n_equal_atts > natts)
+ {
+ /*
+ * When tuples are equal, create or update posting.
+ *
+ * If posting is too big, insert it on page and continue.
+ */
+ if (compressState->maxitemsize >
+ MAXALIGN(((IndexTupleSize(compressState->itupprev)
+ + (compressState->ntuples + itup_ntuples + 1) * sizeof(ItemPointerData)))))
+ {
+ _bt_add_posting_item(compressState, itup);
+ }
+ else
+ {
+ insert_itupprev_to_page(newpage, compressState);
+ }
+ }
+ else
+ {
+ insert_itupprev_to_page(newpage, compressState);
+ }
+ }
+
+ /*
+ * Copy the tuple into temp variable itupprev to compare it with the
+ * following tuple and maybe unite them into a posting tuple
+ */
+ if (compressState->itupprev)
+ pfree(compressState->itupprev);
+ compressState->itupprev = CopyIndexTuple(itup);
+
+ Assert(IndexTupleSize(compressState->itupprev) <= compressState->maxitemsize);
+ }
+
+ /* Handle the last item. */
+ insert_itupprev_to_page(newpage, compressState);
+
+ START_CRIT_SECTION();
+
+ PageRestoreTempPage(newpage, page);
+ MarkBufferDirty(buffer);
+
+ /* Log full page write */
+ if (RelationNeedsWAL(rel))
+ {
+ XLogRecPtr recptr;
+
+ recptr = log_newpage_buffer(buffer, true);
+ PageSetLSN(page, recptr);
+ }
+
+ END_CRIT_SECTION();
+
+ elog(DEBUG4, "_bt_compress_one_page. success");
+}
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 9c1f7de..86c662d 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -983,14 +983,52 @@ _bt_page_recyclable(Page page)
void
_bt_delitems_vacuum(Relation rel, Buffer buf,
OffsetNumber *itemnos, int nitems,
+ OffsetNumber *remainingoffset,
+ IndexTuple *remaining, int nremaining,
BlockNumber lastBlockVacuumed)
{
Page page = BufferGetPage(buf);
BTPageOpaque opaque;
+ Size itemsz;
+ Size remaining_sz = 0;
+ char *remaining_buf = NULL;
+
+ /* XLOG stuff, buffer for remainings */
+ if (nremaining && RelationNeedsWAL(rel))
+ {
+ Size offset = 0;
+
+ for (int i = 0; i < nremaining; i++)
+ remaining_sz += MAXALIGN(IndexTupleSize(remaining[i]));
+
+ remaining_buf = palloc0(remaining_sz);
+ for (int i = 0; i < nremaining; i++)
+ {
+ itemsz = IndexTupleSize(remaining[i]);
+ memcpy(remaining_buf + offset, (char *) remaining[i], itemsz);
+ offset += MAXALIGN(itemsz);
+ }
+ Assert(offset == remaining_sz);
+ }
/* No ereport(ERROR) until changes are logged */
START_CRIT_SECTION();
+ /* Handle posting tuples here */
+ for (int i = 0; i < nremaining; i++)
+ {
+ /* At first, delete the old tuple. */
+ PageIndexTupleDelete(page, remainingoffset[i]);
+
+ itemsz = IndexTupleSize(remaining[i]);
+ itemsz = MAXALIGN(itemsz);
+
+ /* Add tuple with remaining ItemPointers to the page. */
+ if (PageAddItem(page, (Item) remaining[i], itemsz, remainingoffset[i],
+ false, false) == InvalidOffsetNumber)
+ elog(ERROR, "failed to rewrite compressed item in index while doing vacuum");
+ }
+
/* Fix the page */
if (nitems > 0)
PageIndexMultiDelete(page, itemnos, nitems);
@@ -1020,6 +1058,8 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
xl_btree_vacuum xlrec_vacuum;
xlrec_vacuum.lastBlockVacuumed = lastBlockVacuumed;
+ xlrec_vacuum.nremaining = nremaining;
+ xlrec_vacuum.ndeleted = nitems;
XLogBeginInsert();
XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
@@ -1033,6 +1073,19 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
if (nitems > 0)
XLogRegisterBufData(0, (char *) itemnos, nitems * sizeof(OffsetNumber));
+ /*
+ * Here we should save offnums and remaining tuples themselves. It's
+ * important to restore them in correct order. At first, we must
+ * handle remaining tuples and only after that other deleted items.
+ */
+ if (nremaining > 0)
+ {
+ Assert(remaining_buf != NULL);
+ XLogRegisterBufData(0, (char *) remainingoffset,
+ nremaining * sizeof(OffsetNumber));
+ XLogRegisterBufData(0, remaining_buf, remaining_sz);
+ }
+
recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_VACUUM);
PageSetLSN(page, recptr);
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 4cfd528..22fb228 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -97,6 +97,8 @@ static void btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
BTCycleId cycleid, TransactionId *oldestBtpoXact);
static void btvacuumpage(BTVacState *vstate, BlockNumber blkno,
BlockNumber orig_blkno);
+static ItemPointer btreevacuumPosting(BTVacState *vstate, IndexTuple itup,
+ int *nremaining);
/*
@@ -1069,7 +1071,8 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
RBM_NORMAL, info->strategy);
LockBufferForCleanup(buf);
_bt_checkpage(rel, buf);
- _bt_delitems_vacuum(rel, buf, NULL, 0, vstate.lastBlockVacuumed);
+ _bt_delitems_vacuum(rel, buf, NULL, 0, NULL, NULL, 0,
+ vstate.lastBlockVacuumed);
_bt_relbuf(rel, buf);
}
@@ -1193,6 +1196,9 @@ restart:
OffsetNumber offnum,
minoff,
maxoff;
+ IndexTuple remaining[MaxOffsetNumber];
+ OffsetNumber remainingoffset[MaxOffsetNumber];
+ int nremaining;
/*
* Trade in the initial read lock for a super-exclusive write lock on
@@ -1229,6 +1235,7 @@ restart:
* callback function.
*/
ndeletable = 0;
+ nremaining = 0;
minoff = P_FIRSTDATAKEY(opaque);
maxoff = PageGetMaxOffsetNumber(page);
if (callback)
@@ -1242,31 +1249,78 @@ restart:
itup = (IndexTuple) PageGetItem(page,
PageGetItemId(page, offnum));
- htup = &(itup->t_tid);
- /*
- * During Hot Standby we currently assume that
- * XLOG_BTREE_VACUUM records do not produce conflicts. That is
- * only true as long as the callback function depends only
- * upon whether the index tuple refers to heap tuples removed
- * in the initial heap scan. When vacuum starts it derives a
- * value of OldestXmin. Backends taking later snapshots could
- * have a RecentGlobalXmin with a later xid than the vacuum's
- * OldestXmin, so it is possible that row versions deleted
- * after OldestXmin could be marked as killed by other
- * backends. The callback function *could* look at the index
- * tuple state in isolation and decide to delete the index
- * tuple, though currently it does not. If it ever did, we
- * would need to reconsider whether XLOG_BTREE_VACUUM records
- * should cause conflicts. If they did cause conflicts they
- * would be fairly harsh conflicts, since we haven't yet
- * worked out a way to pass a useful value for
- * latestRemovedXid on the XLOG_BTREE_VACUUM records. This
- * applies to *any* type of index that marks index tuples as
- * killed.
- */
- if (callback(htup, callback_state))
- deletable[ndeletable++] = offnum;
+ if (BTreeTupleIsPosting(itup))
+ {
+ int nnewipd = 0;
+ ItemPointer newipd = NULL;
+
+ newipd = btreevacuumPosting(vstate, itup, &nnewipd);
+
+ if (nnewipd == 0)
+ {
+ /*
+ * All TIDs from posting list must be deleted, so we can
+ * delete the whole tuple in the regular way.
+ */
+ deletable[ndeletable++] = offnum;
+ }
+ else if (nnewipd == BTreeTupleGetNPosting(itup))
+ {
+ /*
+ * All TIDs from posting tuple must remain. Do
+ * nothing, just cleanup.
+ */
+ pfree(newipd);
+ }
+ else if (nnewipd < BTreeTupleGetNPosting(itup))
+ {
+ /* Some TIDs from posting tuple must remain. */
+ Assert(nnewipd > 0);
+ Assert(newipd != NULL);
+
+ /*
+ * Form new tuple that contains only remaining TIDs.
+ * Remember this tuple and the offset of the old tuple
+ * to update it in place.
+ */
+ remainingoffset[nremaining] = offnum;
+ remaining[nremaining] = BTreeFormPostingTuple(itup, newipd, nnewipd);
+ nremaining++;
+ pfree(newipd);
+
+ Assert(IndexTupleSize(itup) <= BTMaxItemSize(page));
+ }
+ }
+ else
+ {
+ htup = &(itup->t_tid);
+
+ /*
+ * During Hot Standby we currently assume that
+ * XLOG_BTREE_VACUUM records do not produce conflicts.
+ * That is only true as long as the callback function
+ * depends only upon whether the index tuple refers to
+ * heap tuples removed in the initial heap scan. When
+ * vacuum starts it derives a value of OldestXmin.
+ * Backends taking later snapshots could have a
+ * RecentGlobalXmin with a later xid than the vacuum's
+ * OldestXmin, so it is possible that row versions deleted
+ * after OldestXmin could be marked as killed by other
+ * backends. The callback function *could* look at the
+ * index tuple state in isolation and decide to delete the
+ * index tuple, though currently it does not. If it ever
+ * did, we would need to reconsider whether
+ * XLOG_BTREE_VACUUM records should cause conflicts. If
+ * they did cause conflicts they would be fairly harsh
+ * conflicts, since we haven't yet worked out a way to
+ * pass a useful value for latestRemovedXid on the
+ * XLOG_BTREE_VACUUM records. This applies to *any* type
+ * of index that marks index tuples as killed.
+ */
+ if (callback(htup, callback_state))
+ deletable[ndeletable++] = offnum;
+ }
}
}
@@ -1274,7 +1328,7 @@ restart:
* Apply any needed deletes. We issue just one _bt_delitems_vacuum()
* call per page, so as to minimize WAL traffic.
*/
- if (ndeletable > 0)
+ if (ndeletable > 0 || nremaining > 0)
{
/*
* Notice that the issued XLOG_BTREE_VACUUM WAL record includes
@@ -1291,6 +1345,7 @@ restart:
* that.
*/
_bt_delitems_vacuum(rel, buf, deletable, ndeletable,
+ remainingoffset, remaining, nremaining,
vstate->lastBlockVacuumed);
/*
@@ -1376,6 +1431,41 @@ restart:
}
/*
+ * btreevacuumPosting() -- vacuums a posting tuple.
+ *
+ * Returns new palloc'd posting list with remaining items.
+ * Posting list size is returned via nremaining.
+ *
+ * If all items are dead,
+ * nremaining is 0 and resulting posting list is NULL.
+ */
+static ItemPointer
+btreevacuumPosting(BTVacState *vstate, IndexTuple itup, int *nremaining)
+{
+ int remaining = 0;
+ int nitem = BTreeTupleGetNPosting(itup);
+ ItemPointer tmpitems = NULL,
+ items = BTreeTupleGetPosting(itup);
+
+ /*
+ * Check each TID in the posting list, save live TIDs into tmpitems
+ */
+ for (int i = 0; i < nitem; i++)
+ {
+ if (vstate->callback(items + i, vstate->callback_state))
+ continue;
+
+ if (tmpitems == NULL)
+ tmpitems = palloc(sizeof(ItemPointerData) * nitem);
+
+ tmpitems[remaining++] = items[i];
+ }
+
+ *nremaining = remaining;
+ return tmpitems;
+}
+
+/*
* btcanreturn() -- Check whether btree indexes support index-only scans.
*
* btrees always do, so this is trivial.
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 19735bf..2097597 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -30,6 +30,9 @@ static bool _bt_readpage(IndexScanDesc scan, ScanDirection dir,
OffsetNumber offnum);
static void _bt_saveitem(BTScanOpaque so, int itemIndex,
OffsetNumber offnum, IndexTuple itup);
+static void _bt_savepostingitem(BTScanOpaque so, int itemIndex,
+ OffsetNumber offnum, ItemPointer iptr,
+ IndexTuple itup, int i);
static bool _bt_steppage(IndexScanDesc scan, ScanDirection dir);
static bool _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir);
static bool _bt_parallel_readpage(IndexScanDesc scan, BlockNumber blkno,
@@ -504,7 +507,8 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
/* We have low <= mid < high, so mid points at a real slot */
- result = _bt_compare(rel, key, page, mid);
+ result = _bt_compare_posting(rel, key, page, mid,
+ &(insertstate->in_posting_offset));
if (result >= cmpval)
low = mid + 1;
@@ -533,6 +537,55 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
return low;
}
+/*
+ * Compare insertion-type scankey to tuple on a page,
+ * taking into account posting tuples.
+ * If the key of the posting tuple is equal to scankey,
+ * find exact position inside the posting list,
+ * using TID as extra attribute.
+ */
+int32
+_bt_compare_posting(Relation rel,
+ BTScanInsert key,
+ Page page,
+ OffsetNumber offnum,
+ int *in_posting_offset)
+{
+ IndexTuple itup;
+ int result;
+
+ itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
+ result = _bt_compare(rel, key, page, offnum);
+
+ if (BTreeTupleIsPosting(itup) && result == 0)
+ {
+ int low,
+ high,
+ mid,
+ res;
+
+ low = 0;
+ /* "high" is past end of posting list for loop invariant */
+ high = BTreeTupleGetNPosting(itup);
+
+ while (high > low)
+ {
+ mid = low + ((high - low) / 2);
+ res = ItemPointerCompare(key->scantid,
+ BTreeTupleGetPostingN(itup, mid));
+
+ if (res >= 1)
+ low = mid + 1;
+ else
+ high = mid;
+ }
+
+ *in_posting_offset = high;
+ }
+
+ return result;
+}
+
/*----------
* _bt_compare() -- Compare insertion-type scankey to tuple on a page.
*
@@ -665,61 +718,120 @@ _bt_compare(Relation rel,
* Use the heap TID attribute and scantid to try to break the tie. The
* rules are the same as any other key attribute -- only the
* representation differs.
+ *
+ * When itup is a posting tuple, the check becomes more complex. It is
+ * possible that the scankey belongs to the tuple's posting list TID
+ * range.
+ *
+ * _bt_compare() is multipurpose, so it just returns 0 to indicate that the
+ * key matches the tuple at this offset.
+ *
+ * Use special _bt_compare_posting() wrapper function to handle this case
+ * and perform recheck for posting tuple, finding exact position of the
+ * scankey.
*/
- heapTid = BTreeTupleGetHeapTID(itup);
- if (key->scantid == NULL)
+ if (!BTreeTupleIsPosting(itup))
{
+ heapTid = BTreeTupleGetHeapTID(itup);
+ if (key->scantid == NULL)
+ {
+ /*
+ * Most searches have a scankey that is considered greater than a
+ * truncated pivot tuple if and when the scankey has equal values
+ * for attributes up to and including the least significant
+ * untruncated attribute in tuple.
+ *
+ * For example, if an index has the minimum two attributes (single
+ * user key attribute, plus heap TID attribute), and a page's high
+ * key is ('foo', -inf), and scankey is ('foo', <omitted>), the
+ * search will not descend to the page to the left. The search
+ * will descend right instead. The truncated attribute in pivot
+ * tuple means that all non-pivot tuples on the page to the left
+ * are strictly < 'foo', so it isn't necessary to descend left. In
+ * other words, search doesn't have to descend left because it
+ * isn't interested in a match that has a heap TID value of -inf.
+ *
+ * However, some searches (pivotsearch searches) actually require
+ * that we descend left when this happens. -inf is treated as a
+ * possible match for omitted scankey attribute(s). This is
+ * needed by page deletion, which must re-find leaf pages that are
+ * targets for deletion using their high keys.
+ *
+ * Note: the heap TID part of the test ensures that scankey is
+ * being compared to a pivot tuple with one or more truncated key
+ * attributes.
+ *
+ * Note: pg_upgrade'd !heapkeyspace indexes must always descend to
+ * the left here, since they have no heap TID attribute (and
+ * cannot have any -inf key values in any case, since truncation
+ * can only remove non-key attributes). !heapkeyspace searches
+ * must always be prepared to deal with matches on both sides of
+ * the pivot once the leaf level is reached.
+ */
+ if (key->heapkeyspace && !key->pivotsearch &&
+ key->keysz == ntupatts && heapTid == NULL)
+ return 1;
+
+ /* All provided scankey arguments found to be equal */
+ return 0;
+ }
+
/*
- * Most searches have a scankey that is considered greater than a
- * truncated pivot tuple if and when the scankey has equal values for
- * attributes up to and including the least significant untruncated
- * attribute in tuple.
- *
- * For example, if an index has the minimum two attributes (single
- * user key attribute, plus heap TID attribute), and a page's high key
- * is ('foo', -inf), and scankey is ('foo', <omitted>), the search
- * will not descend to the page to the left. The search will descend
- * right instead. The truncated attribute in pivot tuple means that
- * all non-pivot tuples on the page to the left are strictly < 'foo',
- * so it isn't necessary to descend left. In other words, search
- * doesn't have to descend left because it isn't interested in a match
- * that has a heap TID value of -inf.
- *
- * However, some searches (pivotsearch searches) actually require that
- * we descend left when this happens. -inf is treated as a possible
- * match for omitted scankey attribute(s). This is needed by page
- * deletion, which must re-find leaf pages that are targets for
- * deletion using their high keys.
- *
- * Note: the heap TID part of the test ensures that scankey is being
- * compared to a pivot tuple with one or more truncated key
- * attributes.
- *
- * Note: pg_upgrade'd !heapkeyspace indexes must always descend to the
- * left here, since they have no heap TID attribute (and cannot have
- * any -inf key values in any case, since truncation can only remove
- * non-key attributes). !heapkeyspace searches must always be
- * prepared to deal with matches on both sides of the pivot once the
- * leaf level is reached.
+ * Treat truncated heap TID as minus infinity, since scankey has a key
+ * attribute value (scantid) that would otherwise be compared directly
*/
- if (key->heapkeyspace && !key->pivotsearch &&
- key->keysz == ntupatts && heapTid == NULL)
+ Assert(key->keysz == IndexRelationGetNumberOfKeyAttributes(rel));
+ if (heapTid == NULL)
return 1;
- /* All provided scankey arguments found to be equal */
- return 0;
+ Assert(ntupatts >= IndexRelationGetNumberOfKeyAttributes(rel));
+ return ItemPointerCompare(key->scantid, heapTid);
}
+ else
+ {
+ heapTid = BTreeTupleGetHeapTID(itup);
+ if (key->scantid != NULL && heapTid != NULL)
+ {
+ int cmp = ItemPointerCompare(key->scantid, heapTid);
- /*
- * Treat truncated heap TID as minus infinity, since scankey has a key
- * attribute value (scantid) that would otherwise be compared directly
- */
- Assert(key->keysz == IndexRelationGetNumberOfKeyAttributes(rel));
- if (heapTid == NULL)
- return 1;
+ if (cmp == -1 || cmp == 0)
+ {
+ elog(DEBUG4, "offnum %d Scankey (%u,%u) is less than or equal to posting tuple (%u,%u)",
+ offnum, ItemPointerGetBlockNumberNoCheck(key->scantid),
+ ItemPointerGetOffsetNumberNoCheck(key->scantid),
+ ItemPointerGetBlockNumberNoCheck(heapTid),
+ ItemPointerGetOffsetNumberNoCheck(heapTid));
+ return cmp;
+ }
- Assert(ntupatts >= IndexRelationGetNumberOfKeyAttributes(rel));
- return ItemPointerCompare(key->scantid, heapTid);
+ heapTid = BTreeTupleGetMaxTID(itup);
+ cmp = ItemPointerCompare(key->scantid, heapTid);
+ if (cmp == 1)
+ {
+ elog(DEBUG4, "offnum %d Scankey (%u,%u) is greater than posting tuple (%u,%u)",
+ offnum, ItemPointerGetBlockNumberNoCheck(key->scantid),
+ ItemPointerGetOffsetNumberNoCheck(key->scantid),
+ ItemPointerGetBlockNumberNoCheck(heapTid),
+ ItemPointerGetOffsetNumberNoCheck(heapTid));
+ return cmp;
+ }
+
+ /*
+ * If we got here, scantid falls in between the posting items of
+ * the tuple.
+ */
+ elog(DEBUG4, "offnum %d Scankey (%u,%u) is between posting items (%u,%u) and (%u,%u)",
+ offnum, ItemPointerGetBlockNumberNoCheck(key->scantid),
+ ItemPointerGetOffsetNumberNoCheck(key->scantid),
+ ItemPointerGetBlockNumberNoCheck(BTreeTupleGetHeapTID(itup)),
+ ItemPointerGetOffsetNumberNoCheck(BTreeTupleGetHeapTID(itup)),
+ ItemPointerGetBlockNumberNoCheck(heapTid),
+ ItemPointerGetOffsetNumberNoCheck(heapTid));
+ return 0;
+ }
+ }
+
+ return 0;
}
/*
@@ -1456,6 +1568,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
/* initialize tuple workspace to empty */
so->currPos.nextTupleOffset = 0;
+ so->currPos.prevTupleOffset = 0;
/*
* Now that the current page has been made consistent, the macro should be
@@ -1490,8 +1603,22 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
if (_bt_checkkeys(scan, itup, indnatts, dir, &continuescan))
{
/* tuple passes all scan key conditions, so remember it */
- _bt_saveitem(so, itemIndex, offnum, itup);
- itemIndex++;
+ if (!BTreeTupleIsPosting(itup))
+ {
+ _bt_saveitem(so, itemIndex, offnum, itup);
+ itemIndex++;
+ }
+ else
+ {
+ /* Return posting list "logical" tuples */
+ for (int i = 0; i < BTreeTupleGetNPosting(itup); i++)
+ {
+ _bt_savepostingitem(so, itemIndex, offnum,
+ BTreeTupleGetPostingN(itup, i),
+ itup, i);
+ itemIndex++;
+ }
+ }
}
/* When !continuescan, there can't be any more matches, so stop */
if (!continuescan)
@@ -1524,7 +1651,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
if (!continuescan)
so->currPos.moreRight = false;
- Assert(itemIndex <= MaxIndexTuplesPerPage);
+ Assert(itemIndex <= MaxPostingIndexTuplesPerPage);
so->currPos.firstItem = 0;
so->currPos.lastItem = itemIndex - 1;
so->currPos.itemIndex = 0;
@@ -1532,7 +1659,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
else
{
/* load items[] in descending order */
- itemIndex = MaxIndexTuplesPerPage;
+ itemIndex = MaxPostingIndexTuplesPerPage;
offnum = Min(offnum, maxoff);
@@ -1574,8 +1701,23 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
if (passes_quals && tuple_alive)
{
/* tuple passes all scan key conditions, so remember it */
- itemIndex--;
- _bt_saveitem(so, itemIndex, offnum, itup);
+ if (!BTreeTupleIsPosting(itup))
+ {
+ itemIndex--;
+ _bt_saveitem(so, itemIndex, offnum, itup);
+ }
+ else
+ {
+ /* Return posting list "logical" tuples */
+ /* XXX: Maybe this loop should be backwards? */
+ for (int i = 0; i < BTreeTupleGetNPosting(itup); i++)
+ {
+ itemIndex--;
+ _bt_savepostingitem(so, itemIndex, offnum,
+ BTreeTupleGetPostingN(itup, i),
+ itup, i);
+ }
+ }
}
if (!continuescan)
{
@@ -1589,8 +1731,8 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
Assert(itemIndex >= 0);
so->currPos.firstItem = itemIndex;
- so->currPos.lastItem = MaxIndexTuplesPerPage - 1;
- so->currPos.itemIndex = MaxIndexTuplesPerPage - 1;
+ so->currPos.lastItem = MaxPostingIndexTuplesPerPage - 1;
+ so->currPos.itemIndex = MaxPostingIndexTuplesPerPage - 1;
}
return (so->currPos.firstItem <= so->currPos.lastItem);
@@ -1603,6 +1745,8 @@ _bt_saveitem(BTScanOpaque so, int itemIndex,
{
BTScanPosItem *currItem = &so->currPos.items[itemIndex];
+ Assert(!BTreeTupleIsPosting(itup));
+
currItem->heapTid = itup->t_tid;
currItem->indexOffset = offnum;
if (so->currTuples)
@@ -1615,6 +1759,33 @@ _bt_saveitem(BTScanOpaque so, int itemIndex,
}
}
+/* Save an index item into so->currPos.items[itemIndex] for posting tuples. */
+static void
+_bt_savepostingitem(BTScanOpaque so, int itemIndex, OffsetNumber offnum,
+ ItemPointer iptr, IndexTuple itup, int i)
+{
+ BTScanPosItem *currItem = &so->currPos.items[itemIndex];
+
+ currItem->heapTid = *iptr;
+ currItem->indexOffset = offnum;
+
+ if (so->currTuples)
+ {
+ if (i == 0)
+ {
+ /* Save the key; it is the same for all TIDs in the posting list */
+ Size itupsz = BTreeTupleGetPostingOffset(itup);
+
+ currItem->tupleOffset = so->currPos.nextTupleOffset;
+ memcpy(so->currTuples + so->currPos.nextTupleOffset, itup, itupsz);
+ so->currPos.nextTupleOffset += MAXALIGN(itupsz);
+ so->currPos.prevTupleOffset = currItem->tupleOffset;
+ }
+ else
+ currItem->tupleOffset = so->currPos.prevTupleOffset;
+ }
+}
+
/*
* _bt_steppage() -- Step to next page containing valid data for scan
*
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index b30cf9e..b058599 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -288,6 +288,8 @@ static void _bt_sortaddtup(Page page, Size itemsize,
static void _bt_buildadd(BTWriteState *wstate, BTPageState *state,
IndexTuple itup);
static void _bt_uppershutdown(BTWriteState *wstate, BTPageState *state);
+static void _bt_buildadd_posting(BTWriteState *wstate, BTPageState *state,
+ BTCompressState *compressState);
static void _bt_load(BTWriteState *wstate,
BTSpool *btspool, BTSpool *btspool2);
static void _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent,
@@ -972,6 +974,11 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
* only shift the line pointer array back and forth, and overwrite
* the tuple space previously occupied by oitup. This is fairly
* cheap.
+ *
+ * If lastleft tuple was a posting tuple, we'll truncate its
+ * posting list in _bt_truncate as well. Note that it is also
+ * applicable only to leaf pages, since internal pages never
+ * contain posting tuples.
*/
ii = PageGetItemId(opage, OffsetNumberPrev(last_off));
lastleft = (IndexTuple) PageGetItem(opage, ii);
@@ -1011,6 +1018,7 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
* the minimum key for the new page.
*/
state->btps_minkey = CopyIndexTuple(oitup);
+ Assert(BTreeTupleIsPivot(state->btps_minkey));
/*
* Set the sibling links for both pages.
@@ -1052,6 +1060,7 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
Assert(state->btps_minkey == NULL);
state->btps_minkey = CopyIndexTuple(itup);
/* _bt_sortaddtup() will perform full truncation later */
+ BTreeTupleClearBtIsPosting(state->btps_minkey);
BTreeTupleSetNAtts(state->btps_minkey, 0);
}
@@ -1137,6 +1146,91 @@ _bt_uppershutdown(BTWriteState *wstate, BTPageState *state)
}
/*
+ * Add new tuple (posting or non-posting) to the page while building index.
+ */
+static void
+_bt_buildadd_posting(BTWriteState *wstate, BTPageState *state,
+ BTCompressState *compressState)
+{
+ IndexTuple to_insert;
+
+ /* Nothing to do if there is no tuple to insert */
+ if (state == NULL)
+ return;
+
+ if (compressState->ntuples == 0)
+ to_insert = compressState->itupprev;
+ else
+ {
+ IndexTuple postingtuple;
+
+ /* form a tuple with a posting list */
+ postingtuple = BTreeFormPostingTuple(compressState->itupprev,
+ compressState->ipd,
+ compressState->ntuples);
+ to_insert = postingtuple;
+ pfree(compressState->ipd);
+ }
+
+ _bt_buildadd(wstate, state, to_insert);
+
+ if (compressState->ntuples > 0)
+ pfree(to_insert);
+ compressState->ntuples = 0;
+}
+
+/*
+ * Save item pointer(s) of itup to the posting list in compressState.
+ *
+ * Helper function for _bt_load() and _bt_compress_one_page().
+ *
+ * Note: caller is responsible for size check to ensure that resulting tuple
+ * won't exceed BTMaxItemSize.
+ */
+void
+_bt_add_posting_item(BTCompressState *compressState, IndexTuple itup)
+{
+ int nposting = 0;
+
+ if (compressState->ntuples == 0)
+ {
+ compressState->ipd = palloc0(compressState->maxitemsize);
+
+ if (BTreeTupleIsPosting(compressState->itupprev))
+ {
+ /* if itupprev is posting, add all its TIDs to the posting list */
+ nposting = BTreeTupleGetNPosting(compressState->itupprev);
+ memcpy(compressState->ipd,
+ BTreeTupleGetPosting(compressState->itupprev),
+ sizeof(ItemPointerData) * nposting);
+ compressState->ntuples += nposting;
+ }
+ else
+ {
+ memcpy(compressState->ipd, compressState->itupprev,
+ sizeof(ItemPointerData));
+ compressState->ntuples++;
+ }
+ }
+
+ if (BTreeTupleIsPosting(itup))
+ {
+ /* if tuple is posting, add all its TIDs to the posting list */
+ nposting = BTreeTupleGetNPosting(itup);
+ memcpy(compressState->ipd + compressState->ntuples,
+ BTreeTupleGetPosting(itup),
+ sizeof(ItemPointerData) * nposting);
+ compressState->ntuples += nposting;
+ }
+ else
+ {
+ memcpy(compressState->ipd + compressState->ntuples, itup,
+ sizeof(ItemPointerData));
+ compressState->ntuples++;
+ }
+}
+
+/*
* Read tuples in correct sort order from tuplesort, and load them into
* btree leaves.
*/
@@ -1150,9 +1244,20 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
bool load1;
TupleDesc tupdes = RelationGetDescr(wstate->index);
int i,
- keysz = IndexRelationGetNumberOfKeyAttributes(wstate->index);
+ keysz = IndexRelationGetNumberOfKeyAttributes(wstate->index),
+ natts = IndexRelationGetNumberOfAttributes(wstate->index);
SortSupport sortKeys;
int64 tuples_done = 0;
+ bool use_compression = false;
+ BTCompressState *compressState = NULL;
+
+ /*
+ * Don't use compression for indexes with INCLUDEd columns and unique
+ * indexes.
+ */
+ use_compression = (IndexRelationGetNumberOfKeyAttributes(wstate->index) ==
+ IndexRelationGetNumberOfAttributes(wstate->index) &&
+ !wstate->index->rd_index->indisunique);
if (merge)
{
@@ -1266,19 +1371,89 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
}
else
{
- /* merge is unnecessary */
- while ((itup = tuplesort_getindextuple(btspool->sortstate,
- true)) != NULL)
+ if (!use_compression)
{
- /* When we see first tuple, create first index page */
- if (state == NULL)
- state = _bt_pagestate(wstate, 0);
+ /* merge is unnecessary */
+ while ((itup = tuplesort_getindextuple(btspool->sortstate,
+ true)) != NULL)
+ {
+ /* When we see first tuple, create first index page */
+ if (state == NULL)
+ state = _bt_pagestate(wstate, 0);
- _bt_buildadd(wstate, state, itup);
+ _bt_buildadd(wstate, state, itup);
- /* Report progress */
- pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
- ++tuples_done);
+ /* Report progress */
+ pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
+ ++tuples_done);
+ }
+ }
+ else
+ {
+ /* init compress state needed to build posting tuples */
+ compressState = (BTCompressState *) palloc0(sizeof(BTCompressState));
+ compressState->ipd = NULL;
+ compressState->ntuples = 0;
+ compressState->itupprev = NULL;
+ compressState->maxitemsize = 0;
+ compressState->maxpostingsize = 0;
+
+ while ((itup = tuplesort_getindextuple(btspool->sortstate,
+ true)) != NULL)
+ {
+ /* When we see first tuple, create first index page */
+ if (state == NULL)
+ {
+ state = _bt_pagestate(wstate, 0);
+ compressState->maxitemsize = BTMaxItemSize(state->btps_page);
+ }
+
+ if (compressState->itupprev != NULL)
+ {
+ int n_equal_atts = _bt_keep_natts_fast(wstate->index,
+ compressState->itupprev, itup);
+
+ if (n_equal_atts > natts)
+ {
+ /*
+ * Tuples are equal. Create or update the posting list.
+ *
+ * If the posting list would become too big, insert it on
+ * the page and continue.
+ */
+ if ((compressState->ntuples + 1) * sizeof(ItemPointerData) <
+ compressState->maxpostingsize)
+ _bt_add_posting_item(compressState, itup);
+ else
+ _bt_buildadd_posting(wstate, state,
+ compressState);
+ }
+ else
+ {
+ /*
+ * Tuples are not equal. Insert itupprev into index.
+ * Save current tuple for the next iteration.
+ */
+ _bt_buildadd_posting(wstate, state, compressState);
+ }
+ }
+
+ /*
+ * Save the tuple to compare it with the next one and maybe
+ * unite them into a posting tuple.
+ */
+ if (compressState->itupprev)
+ pfree(compressState->itupprev);
+ compressState->itupprev = CopyIndexTuple(itup);
+
+ /* compute max size of posting list */
+ compressState->maxpostingsize = compressState->maxitemsize -
+ IndexInfoFindDataOffset(compressState->itupprev->t_info) -
+ MAXALIGN(IndexTupleSize(compressState->itupprev));
+ }
+
+ /* Handle the last item */
+ _bt_buildadd_posting(wstate, state, compressState);
}
}
diff --git a/src/backend/access/nbtree/nbtsplitloc.c b/src/backend/access/nbtree/nbtsplitloc.c
index a7882fd..77e1d46 100644
--- a/src/backend/access/nbtree/nbtsplitloc.c
+++ b/src/backend/access/nbtree/nbtsplitloc.c
@@ -459,6 +459,7 @@ _bt_recsplitloc(FindSplitData *state,
int16 leftfree,
rightfree;
Size firstrightitemsz;
+ Size postingsubhikey = 0;
bool newitemisfirstonright;
/* Is the new item going to be the first item on the right page? */
@@ -466,10 +467,33 @@ _bt_recsplitloc(FindSplitData *state,
&& !newitemonleft);
if (newitemisfirstonright)
+ {
firstrightitemsz = state->newitemsz;
+
+ /* Calculate posting list overhead, if any */
+ if (state->is_leaf && BTreeTupleIsPosting(state->newitem))
+ postingsubhikey = IndexTupleSize(state->newitem) -
+ BTreeTupleGetPostingOffset(state->newitem);
+ }
else
+ {
firstrightitemsz = firstoldonrightsz;
+ /* Calculate posting list overhead, if any */
+ if (state->is_leaf)
+ {
+ ItemId itemid;
+ IndexTuple newhighkey;
+
+ itemid = PageGetItemId(state->page, firstoldonright);
+ newhighkey = (IndexTuple) PageGetItem(state->page, itemid);
+
+ if (BTreeTupleIsPosting(newhighkey))
+ postingsubhikey = IndexTupleSize(newhighkey) -
+ BTreeTupleGetPostingOffset(newhighkey);
+ }
+ }
+
/* Account for all the old tuples */
leftfree = state->leftspace - olddataitemstoleft;
rightfree = state->rightspace -
@@ -492,9 +516,13 @@ _bt_recsplitloc(FindSplitData *state,
* adding a heap TID to the left half's new high key when splitting at the
* leaf level. In practice the new high key will often be smaller and
* will rarely be larger, but conservatively assume the worst case.
+ * Truncation always truncates away any posting list that appears in the
+ * first right tuple, though, so it's safe to subtract that overhead
+ * (while still conservatively assuming that truncation might have to add
+ * back a single heap TID using the pivot tuple heap TID representation).
*/
if (state->is_leaf)
- leftfree -= (int16) (firstrightitemsz +
+ leftfree -= (int16) ((firstrightitemsz - postingsubhikey) +
MAXALIGN(sizeof(ItemPointerData)));
else
leftfree -= (int16) firstrightitemsz;
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 9b172c1..9552acb 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -111,8 +111,12 @@ _bt_mkscankey(Relation rel, IndexTuple itup)
key->nextkey = false;
key->pivotsearch = false;
key->keysz = Min(indnkeyatts, tupnatts);
- key->scantid = key->heapkeyspace && itup ?
- BTreeTupleGetHeapTID(itup) : NULL;
+
+ if (itup && key->heapkeyspace)
+ key->scantid = BTreeTupleGetHeapTID(itup);
+ else
+ key->scantid = NULL;
+
skey = key->scankeys;
for (i = 0; i < indnkeyatts; i++)
{
@@ -1787,7 +1791,9 @@ _bt_killitems(IndexScanDesc scan)
ItemId iid = PageGetItemId(page, offnum);
IndexTuple ituple = (IndexTuple) PageGetItem(page, iid);
- if (ItemPointerEquals(&ituple->t_tid, &kitem->heapTid))
+ /* No microvacuum for posting tuples */
+ if (!BTreeTupleIsPosting(ituple) &&
+ (ItemPointerEquals(&ituple->t_tid, &kitem->heapTid)))
{
/* found the item */
ItemIdMarkDead(iid);
@@ -2145,6 +2151,16 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
pivot = index_truncate_tuple(itupdesc, firstright, keepnatts);
+ if (BTreeTupleIsPosting(firstright))
+ {
+ BTreeTupleClearBtIsPosting(pivot);
+ BTreeTupleSetNAtts(pivot, keepnatts);
+ pivot->t_info &= ~INDEX_SIZE_MASK;
+ pivot->t_info |= BTreeTupleGetPostingOffset(firstright);
+ }
+
+ Assert(!BTreeTupleIsPosting(pivot));
+
/*
* If there is a distinguishing key attribute within new pivot tuple,
* there is no need to add an explicit heap TID attribute
@@ -2161,6 +2177,8 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
* attribute to the new pivot tuple.
*/
Assert(natts != nkeyatts);
+ Assert(!BTreeTupleIsPosting(lastleft));
+ Assert(!BTreeTupleIsPosting(firstright));
newsize = IndexTupleSize(pivot) + MAXALIGN(sizeof(ItemPointerData));
tidpivot = palloc0(newsize);
memcpy(tidpivot, pivot, IndexTupleSize(pivot));
@@ -2168,6 +2186,27 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
pfree(pivot);
pivot = tidpivot;
}
+ else if (BTreeTupleIsPosting(firstright))
+ {
+ /*
+ * No truncation was possible, since key attributes are all equal. But
+ * the tuple is a compressed tuple with a posting list, so we still
+ * must truncate it.
+ *
+ * It's necessary to add a heap TID attribute to the new pivot tuple.
+ */
+ newsize = BTreeTupleGetPostingOffset(firstright) +
+ MAXALIGN(sizeof(ItemPointerData));
+ pivot = palloc0(newsize);
+ memcpy(pivot, firstright, BTreeTupleGetPostingOffset(firstright));
+
+ pivot->t_info &= ~INDEX_SIZE_MASK;
+ pivot->t_info |= newsize;
+ BTreeTupleClearBtIsPosting(pivot);
+ BTreeTupleSetAltHeapTID(pivot);
+
+ Assert(!BTreeTupleIsPosting(pivot));
+ }
else
{
/*
@@ -2205,7 +2244,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
*/
pivotheaptid = (ItemPointer) ((char *) pivot + newsize -
sizeof(ItemPointerData));
- ItemPointerCopy(&lastleft->t_tid, pivotheaptid);
+ ItemPointerCopy(BTreeTupleGetMaxTID(lastleft), pivotheaptid);
/*
* Lehman and Yao require that the downlink to the right page, which is to
@@ -2216,9 +2255,12 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
* tiebreaker.
*/
#ifndef DEBUG_NO_TRUNCATE
- Assert(ItemPointerCompare(&lastleft->t_tid, &firstright->t_tid) < 0);
- Assert(ItemPointerCompare(pivotheaptid, &lastleft->t_tid) >= 0);
- Assert(ItemPointerCompare(pivotheaptid, &firstright->t_tid) < 0);
+ Assert(ItemPointerCompare(BTreeTupleGetMaxTID(lastleft),
+ BTreeTupleGetHeapTID(firstright)) < 0);
+ Assert(ItemPointerCompare(pivotheaptid,
+ BTreeTupleGetHeapTID(lastleft)) >= 0);
+ Assert(ItemPointerCompare(pivotheaptid,
+ BTreeTupleGetHeapTID(firstright)) < 0);
#else
/*
@@ -2231,7 +2273,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
* attribute values along with lastleft's heap TID value when lastleft's
* TID happens to be greater than firstright's TID.
*/
- ItemPointerCopy(&firstright->t_tid, pivotheaptid);
+ ItemPointerCopy(BTreeTupleGetHeapTID(firstright), pivotheaptid);
/*
* Pivot heap TID should never be fully equal to firstright. Note that
@@ -2240,7 +2282,8 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
*/
ItemPointerSetOffsetNumber(pivotheaptid,
OffsetNumberPrev(ItemPointerGetOffsetNumber(pivotheaptid)));
- Assert(ItemPointerCompare(pivotheaptid, &firstright->t_tid) < 0);
+ Assert(ItemPointerCompare(pivotheaptid,
+ BTreeTupleGetHeapTID(firstright)) < 0);
#endif
BTreeTupleSetNAtts(pivot, nkeyatts);
@@ -2330,6 +2373,25 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
* leaving excessive amounts of free space on either side of page split.
* Callers can rely on the fact that attributes considered equal here are
* definitely also equal according to _bt_keep_natts.
+ *
+ * To build a posting tuple we need to ensure that all attributes
+ * of both tuples are equal. Use this function to compare them.
+ * TODO: it may be worth renaming the function.
+ *
+ * XXX: Obviously we need infrastructure for making sure it is okay to use
+ * this for posting list stuff. For example, non-deterministic collations
+ * cannot use compression, and will not work with what we have now.
+ *
+ * XXX: Even then, we probably also need to worry about TOAST as a special
+ * case. Don't repeat bugs like the amcheck bug that was fixed in commit
+ * eba775345d23d2c999bbb412ae658b6dab36e3e8. As the test case added in that
+ * commit shows, we need to worry about pg_attribute.attstorage changing in
+ * the underlying table due to an ALTER TABLE (and maybe a few other things
+ * like that). In general, the "TOAST input state" of a TOASTable datum isn't
+ * something that we make many guarantees about today, so even with C
+ * collation text we could in theory get different answers from
+ * _bt_keep_natts_fast() and _bt_keep_natts(). This needs to be nailed down
+ * in some way.
*/
int
_bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
@@ -2415,7 +2477,7 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
* Non-pivot tuples currently never use alternative heap TID
* representation -- even those within heapkeyspace indexes
*/
- if ((itup->t_info & INDEX_ALT_TID_MASK) != 0)
+ if (BTreeTupleIsPivot(itup))
return false;
/*
@@ -2470,7 +2532,7 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
* that to decide if the tuple is a pre-v11 tuple.
*/
return tupnatts == 0 ||
- ((itup->t_info & INDEX_ALT_TID_MASK) == 0 &&
+ (!BTreeTupleIsPivot(itup) &&
ItemPointerGetOffsetNumber(&(itup->t_tid)) == P_HIKEY);
}
else
@@ -2497,7 +2559,7 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
* heapkeyspace index pivot tuples, regardless of whether or not there are
* non-key attributes.
*/
- if ((itup->t_info & INDEX_ALT_TID_MASK) == 0)
+ if (!BTreeTupleIsPivot(itup))
return false;
/*
@@ -2549,6 +2611,8 @@ _bt_check_third_page(Relation rel, Relation heap, bool needheaptidspace,
if (!needheaptidspace && itemsz <= BTMaxItemSizeNoHeapTid(page))
return;
+ /* TODO correct error messages for posting tuples */
+
/*
* Internal page insertions cannot fail here, because that would mean that
* an earlier leaf level insertion that should have failed didn't
@@ -2575,3 +2639,79 @@ _bt_check_third_page(Relation rel, Relation heap, bool needheaptidspace,
"or use full text indexing."),
errtableconstraint(heap, RelationGetRelationName(rel))));
}
+
+/*
+ * Given a base tuple that supplies the key and a posting list (ipd),
+ * build a posting tuple.
+ *
+ * The base tuple may itself be a posting tuple, but only its key part is
+ * used; all ItemPointers must be passed via ipd.
+ *
+ * If nipd == 1, fall back to building a non-posting tuple.
+ * This avoids storage overhead after a posting tuple has been vacuumed.
+ */
+IndexTuple
+BTreeFormPostingTuple(IndexTuple tuple, ItemPointerData *ipd, int nipd)
+{
+ uint32 keysize,
+ newsize = 0;
+ IndexTuple itup;
+
+ /* We only need key part of the tuple */
+ if (BTreeTupleIsPosting(tuple))
+ keysize = BTreeTupleGetPostingOffset(tuple);
+ else
+ keysize = IndexTupleSize(tuple);
+
+ Assert(nipd > 0);
+
+ /* Add space needed for posting list */
+ if (nipd > 1)
+ newsize = SHORTALIGN(keysize) + sizeof(ItemPointerData) * nipd;
+ else
+ newsize = keysize;
+
+ newsize = MAXALIGN(newsize);
+ itup = palloc0(newsize);
+ memcpy(itup, tuple, keysize);
+ itup->t_info &= ~INDEX_SIZE_MASK;
+ itup->t_info |= newsize;
+
+ if (nipd > 1)
+ {
+ /* Form posting tuple, fill posting fields */
+
+ /* Set meta info about the posting list */
+ itup->t_info |= INDEX_ALT_TID_MASK;
+ BTreeSetPostingMeta(itup, nipd, SHORTALIGN(keysize));
+
+ /* sort the list to preserve TID order invariant */
+ qsort((void *) ipd, nipd, sizeof(ItemPointerData),
+ (int (*) (const void *, const void *)) ItemPointerCompare);
+
+ /* Copy posting list into the posting tuple */
+ memcpy(BTreeTupleGetPosting(itup), ipd,
+ sizeof(ItemPointerData) * nipd);
+ }
+ else
+ {
+ /* To finish building a non-posting tuple, copy the TID from ipd */
+ itup->t_info &= ~INDEX_ALT_TID_MASK;
+ ItemPointerCopy(ipd, &itup->t_tid);
+ }
+
+ return itup;
+}
+
+/*
+ * Opposite of BTreeFormPostingTuple.
+ * Returns a regular (non-posting) tuple that contains the key.
+ * The TID of the new tuple is the nth TID of the original tuple's posting
+ * list. The result is palloc'd in the caller's memory context.
+ */
+IndexTuple
+BTreeGetNthTupleOfPosting(IndexTuple tuple, int n)
+{
+ Assert(BTreeTupleIsPosting(tuple));
+ return BTreeFormPostingTuple(tuple, BTreeTupleGetPostingN(tuple, n), 1);
+}
diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index dd5315c..538a6bc 100644
--- a/src/backend/access/nbtree/nbtxlog.c
+++ b/src/backend/access/nbtree/nbtxlog.c
@@ -386,8 +386,8 @@ btree_xlog_vacuum(XLogReaderState *record)
Buffer buffer;
Page page;
BTPageOpaque opaque;
-#ifdef UNUSED
xl_btree_vacuum *xlrec = (xl_btree_vacuum *) XLogRecGetData(record);
+#ifdef UNUSED
/*
* This section of code is thought to be no longer needed, after analysis
@@ -478,14 +478,34 @@ btree_xlog_vacuum(XLogReaderState *record)
if (len > 0)
{
- OffsetNumber *unused;
- OffsetNumber *unend;
+ if (xlrec->nremaining)
+ {
+ OffsetNumber *remainingoffset;
+ IndexTuple remaining;
+ Size itemsz;
+
+ remainingoffset = (OffsetNumber *)
+ (ptr + xlrec->ndeleted * sizeof(OffsetNumber));
+ remaining = (IndexTuple) ((char *) remainingoffset +
+ xlrec->nremaining * sizeof(OffsetNumber));
- unused = (OffsetNumber *) ptr;
- unend = (OffsetNumber *) ((char *) ptr + len);
+ /* Handle posting tuples */
+ for (int i = 0; i < xlrec->nremaining; i++)
+ {
+ PageIndexTupleDelete(page, remainingoffset[i]);
+
+ itemsz = MAXALIGN(IndexTupleSize(remaining));
+
+ if (PageAddItem(page, (Item) remaining, itemsz, remainingoffset[i],
+ false, false) == InvalidOffsetNumber)
+ elog(PANIC, "btree_xlog_vacuum: failed to add remaining item");
+
+ remaining = (IndexTuple) ((char *) remaining + itemsz);
+ }
+ }
- if ((unend - unused) > 0)
- PageIndexMultiDelete(page, unused, unend - unused);
+ if (xlrec->ndeleted)
+ PageIndexMultiDelete(page, (OffsetNumber *) ptr, xlrec->ndeleted);
}
/*
diff --git a/src/backend/access/rmgrdesc/nbtdesc.c b/src/backend/access/rmgrdesc/nbtdesc.c
index a14eb79..e4fa99a 100644
--- a/src/backend/access/rmgrdesc/nbtdesc.c
+++ b/src/backend/access/rmgrdesc/nbtdesc.c
@@ -46,8 +46,10 @@ btree_desc(StringInfo buf, XLogReaderState *record)
{
xl_btree_vacuum *xlrec = (xl_btree_vacuum *) rec;
- appendStringInfo(buf, "lastBlockVacuumed %u",
- xlrec->lastBlockVacuumed);
+ appendStringInfo(buf, "lastBlockVacuumed %u; nremaining %u; ndeleted %u",
+ xlrec->lastBlockVacuumed,
+ xlrec->nremaining,
+ xlrec->ndeleted);
break;
}
case XLOG_BTREE_DELETE:
diff --git a/src/include/access/itup.h b/src/include/access/itup.h
index 744ffb6..b10c0d5 100644
--- a/src/include/access/itup.h
+++ b/src/include/access/itup.h
@@ -141,6 +141,10 @@ typedef IndexAttributeBitMapData * IndexAttributeBitMap;
* On such a page, N tuples could take one MAXALIGN quantum less space than
* estimated here, seemingly allowing one more tuple than estimated here.
* But such a page always has at least MAXALIGN special space, so we're safe.
+ *
+ * Note: btree leaf pages may contain posting tuples, which store duplicates
+ * more efficiently, so such pages may hold more items than estimated here.
+ * Use MaxPostingIndexTuplesPerPage for them instead.
*/
#define MaxIndexTuplesPerPage \
((int) ((BLCKSZ - SizeOfPageHeaderData) / \
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 83e0e6c..bacc77b 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -234,8 +234,7 @@ typedef struct BTMetaPageData
* t_tid | t_info | key values | INCLUDE columns, if any
*
* t_tid points to the heap TID, which is a tiebreaker key column as of
- * BTREE_VERSION 4. Currently, the INDEX_ALT_TID_MASK status bit is never
- * set for non-pivot tuples.
+ * BTREE_VERSION 4.
*
* All other types of index tuples ("pivot" tuples) only have key columns,
* since pivot tuples only exist to represent how the key space is
@@ -252,6 +251,39 @@ typedef struct BTMetaPageData
* omitted rather than truncated, since its representation is different to
* the non-pivot representation.)
*
+ * Non-pivot posting tuple format:
+ * t_tid | t_info | key values | INCLUDE columns, if any | posting_list[]
+ *
+ * To store duplicate keys more efficiently, we use a special
+ * tuple format: posting tuples.
+ * posting_list is an array of ItemPointerData.
+ *
+ * This type of compression never applies to system indexes, unique indexes
+ * or indexes with INCLUDEd columns.
+ *
+ * To distinguish posting tuples we use the INDEX_ALT_TID_MASK flag in
+ * t_info and the BT_IS_POSTING flag in t_tid.
+ * These flags redefine the content of the posting tuple's t_tid:
+ * - t_tid.ip_blkid contains the offset of the posting list.
+ * - t_tid offset field contains the number of posting items in the tuple.
+ *
+ * The 12 least significant offset bits from t_tid are used to represent
+ * the number of posting items in posting tuples, leaving 4 status
+ * bits (BT_RESERVED_OFFSET_MASK bits), 3 of which are reserved for
+ * future use.
+ * BT_N_POSTING_OFFSET_MASK is large enough to store any number of posting
+ * items, which is constrained by BTMaxItemSize.
+
+ * If a page contains so many duplicates that they do not fit into one posting
+ * tuple (bounded by BTMaxItemSize and ), the page may contain several posting
+ * tuples with the same key.
+ * A page can also contain both posting and non-posting tuples with the same key.
+ * Currently, posting tuples always contain at least two TIDs in the posting
+ * list.
+ *
+ * Posting tuples always have the same number of attributes as the index has
+ * generally.
+ *
* Pivot tuple format:
*
* t_tid | t_info | key values | [heap TID]
@@ -281,23 +313,144 @@ typedef struct BTMetaPageData
* bits (BT_RESERVED_OFFSET_MASK bits), 3 of which that are reserved for
* future use. BT_N_KEYS_OFFSET_MASK should be large enough to store any
* number of columns/attributes <= INDEX_MAX_KEYS.
+ * The BT_IS_POSTING bit must be unset for pivot tuples, since we use it
+ * to distinguish posting tuples from pivot tuples.
*
* Note well: The macros that deal with the number of attributes in tuples
- * assume that a tuple with INDEX_ALT_TID_MASK set must be a pivot tuple,
- * and that a tuple without INDEX_ALT_TID_MASK set must be a non-pivot
- * tuple (or must have the same number of attributes as the index has
- * generally in the case of !heapkeyspace indexes). They will need to be
- * updated if non-pivot tuples ever get taught to use INDEX_ALT_TID_MASK
- * for something else.
+ * assume that a tuple with INDEX_ALT_TID_MASK set must be a pivot tuple or
+ * non-pivot posting tuple, and that a tuple without INDEX_ALT_TID_MASK set
+ * must be a non-pivot tuple (or must have the same number of attributes as
+ * the index has generally in the case of !heapkeyspace indexes).
*/
#define INDEX_ALT_TID_MASK INDEX_AM_RESERVED_BIT
/* Item pointer offset bits */
#define BT_RESERVED_OFFSET_MASK 0xF000
#define BT_N_KEYS_OFFSET_MASK 0x0FFF
+#define BT_N_POSTING_OFFSET_MASK 0x0FFF
#define BT_HEAP_TID_ATTR 0x1000
+#define BT_IS_POSTING 0x2000
+
+#define BTreeTupleIsPosting(itup) \
+ ( \
+ ((itup)->t_info & INDEX_ALT_TID_MASK && \
+ ((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) != 0))\
+ )
+
+#define BTreeTupleIsPivot(itup) \
+ ( \
+ ((itup)->t_info & INDEX_ALT_TID_MASK && \
+ ((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) == 0))\
+ )
-/* Get/set downlink block number */
+/*
+ * MaxPostingIndexTuplesPerPage is an upper bound on the number of tuples
+ * that can fit on one btree leaf page.
+ *
+ * Btree leaf pages may contain posting tuples, which store duplicates
+ * more efficiently, so MaxPostingIndexTuplesPerPage is larger than
+ * MaxIndexTuplesPerPage.
+ *
+ * Each leaf page must contain at least three items, so estimate it as
+ * if we have three posting tuples with minimal size keys.
+ */
+#define MaxPostingIndexTuplesPerPage \
+ ((int) ((BLCKSZ - SizeOfPageHeaderData - \
+ 3 * (MAXALIGN(sizeof(IndexTupleData) + 1) + sizeof(ItemIdData))) / \
+ sizeof(ItemPointerData)))
+
+/*
+ * Btree-private state needed to build posting tuples.
+ * ipd is a posting list - an array of ItemPointerData.
+ *
+ * Iterating over tuples during index build or applying compression to a
+ * single page, we remember a tuple in itupprev, then compare the next one
+ * with it. If tuples are equal, save their TIDs in the posting list.
+ * ntuples contains the size of the posting list.
+ *
+ * Use maxitemsize and maxpostingsize to ensure that resulting posting tuple
+ * will satisfy BTMaxItemSize.
+ */
+typedef struct BTCompressState
+{
+ Size maxitemsize;
+ Size maxpostingsize;
+ IndexTuple itupprev;
+ int ntuples;
+ ItemPointerData *ipd;
+} BTCompressState;
+
+/* macros to work with posting tuples *BEGIN* */
+#define BTreeTupleSetBtIsPosting(itup) \
+ do { \
+ Assert((itup)->t_info & INDEX_ALT_TID_MASK); \
+ Assert(!((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) != 0)); \
+ ItemPointerSetOffsetNumber(&(itup)->t_tid, \
+ ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) | BT_IS_POSTING); \
+ } while(0)
+
+#define BTreeTupleClearBtIsPosting(itup) \
+ do { \
+ ItemPointerSetOffsetNumber(&(itup)->t_tid, \
+ ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & ~BT_IS_POSTING); \
+ } while(0)
+
+#define BTreeTupleGetNPosting(itup) \
+ ( \
+ AssertMacro(BTreeTupleIsPosting(itup)), \
+ ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_N_POSTING_OFFSET_MASK \
+ )
+
+#define BTreeTupleSetNPosting(itup, n) \
+ do { \
+ ItemPointerSetOffsetNumber(&(itup)->t_tid, (n) & BT_N_POSTING_OFFSET_MASK); \
+ BTreeTupleSetBtIsPosting(itup); \
+ } while(0)
+
+/*
+ * If tuple is posting, t_tid.ip_blkid contains offset of the posting list.
+ * Caller is responsible for checking BTreeTupleIsPosting to ensure that it
+ * will get what is expected.
+ */
+#define BTreeTupleGetPostingOffset(itup) \
+ ( \
+ AssertMacro(BTreeTupleIsPosting(itup)), \
+ ItemPointerGetBlockNumberNoCheck(&((itup)->t_tid)) \
+ )
+#define BTreeTupleSetPostingOffset(itup, offset) \
+ ( \
+ AssertMacro(BTreeTupleIsPosting(itup)), \
+ ItemPointerSetBlockNumber(&((itup)->t_tid), (offset)) \
+ )
+#define BTreeSetPostingMeta(itup, nposting, off) \
+ do { \
+ BTreeTupleSetNPosting(itup, nposting); \
+ BTreeTupleSetPostingOffset(itup, off); \
+ } while(0)
+
+#define BTreeTupleGetPosting(itup) \
+ ((ItemPointerData*) ((char*)(itup) + BTreeTupleGetPostingOffset(itup)))
+#define BTreeTupleGetPostingN(itup,n) \
+ ((ItemPointerData*) (BTreeTupleGetPosting(itup) + (n)))
+
+/*
+ * Posting tuples always contain more than one TID. The minimum TID can be
+ * accessed using BTreeTupleGetHeapTID(). The maximum is accessed using
+ * BTreeTupleGetMaxTID().
+ */
+#define BTreeTupleGetMaxTID(itup) \
+ ( \
+ ((itup)->t_info & INDEX_ALT_TID_MASK && \
+ ((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING))) ? \
+ ( \
+ (ItemPointer) (BTreeTupleGetPosting(itup) + (BTreeTupleGetNPosting(itup)-1)) \
+ ) \
+ : \
+ (ItemPointer) &((itup)->t_tid) \
+ )
+/* macros to work with posting tuples *END* */
+
+/* Get/set downlink block number */
#define BTreeInnerTupleGetDownLink(itup) \
ItemPointerGetBlockNumberNoCheck(&((itup)->t_tid))
#define BTreeInnerTupleSetDownLink(itup, blkno) \
@@ -326,7 +479,8 @@ typedef struct BTMetaPageData
*/
#define BTreeTupleGetNAtts(itup, rel) \
( \
- (itup)->t_info & INDEX_ALT_TID_MASK ? \
+ ((itup)->t_info & INDEX_ALT_TID_MASK && \
+ ((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) == 0)) ? \
( \
ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_N_KEYS_OFFSET_MASK \
) \
@@ -335,6 +489,7 @@ typedef struct BTMetaPageData
)
#define BTreeTupleSetNAtts(itup, n) \
do { \
+ Assert(!BTreeTupleIsPosting(itup)); \
(itup)->t_info |= INDEX_ALT_TID_MASK; \
ItemPointerSetOffsetNumber(&(itup)->t_tid, (n) & BT_N_KEYS_OFFSET_MASK); \
} while(0)
@@ -342,6 +497,8 @@ typedef struct BTMetaPageData
/*
* Get tiebreaker heap TID attribute, if any. Macro works with both pivot
* and non-pivot tuples, despite differences in how heap TID is represented.
+ *
+ * For non-pivot posting tuples this returns the first tid from posting list.
*/
#define BTreeTupleGetHeapTID(itup) \
( \
@@ -351,7 +508,10 @@ typedef struct BTMetaPageData
(ItemPointer) (((char *) (itup) + IndexTupleSize(itup)) - \
sizeof(ItemPointerData)) \
) \
- : (itup)->t_info & INDEX_ALT_TID_MASK ? NULL : (ItemPointer) &((itup)->t_tid) \
+ : (itup)->t_info & INDEX_ALT_TID_MASK ? \
+ (((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) != 0) ? \
+ (ItemPointer) BTreeTupleGetPosting(itup) : NULL) \
+ : (ItemPointer) &((itup)->t_tid) \
)
/*
* Set the heap TID attribute for a tuple that uses the INDEX_ALT_TID_MASK
@@ -360,6 +520,7 @@ typedef struct BTMetaPageData
#define BTreeTupleSetAltHeapTID(itup) \
do { \
Assert((itup)->t_info & INDEX_ALT_TID_MASK); \
+ Assert(!((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) != 0)); \
ItemPointerSetOffsetNumber(&(itup)->t_tid, \
ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) | BT_HEAP_TID_ATTR); \
} while(0)
@@ -501,6 +662,12 @@ typedef struct BTInsertStateData
Buffer buf;
/*
+ * If _bt_binsrch_insert() found the location inside an existing posting
+ * list, save the position inside the list.
+ */
+ int in_posting_offset;
+
+ /*
* Cache of bounds within the current buffer. Only used for insertions
* where _bt_check_unique is called. See _bt_binsrch_insert and
* _bt_findinsertloc for details.
@@ -567,6 +734,8 @@ typedef struct BTScanPosData
* location in the associated tuple storage workspace.
*/
int nextTupleOffset;
+ /* prevTupleOffset is for posting list handling */
+ int prevTupleOffset;
/*
* The items array is always ordered in index order (ie, increasing
@@ -579,7 +748,7 @@ typedef struct BTScanPosData
int lastItem; /* last valid index in items[] */
int itemIndex; /* current index in items[] */
- BTScanPosItem items[MaxIndexTuplesPerPage]; /* MUST BE LAST */
+ BTScanPosItem items[MaxPostingIndexTuplesPerPage]; /* MUST BE LAST */
} BTScanPosData;
typedef BTScanPosData *BTScanPos;
@@ -763,6 +932,8 @@ extern void _bt_delitems_delete(Relation rel, Buffer buf,
OffsetNumber *itemnos, int nitems, Relation heapRel);
extern void _bt_delitems_vacuum(Relation rel, Buffer buf,
OffsetNumber *itemnos, int nitems,
+ OffsetNumber *remainingoffset,
+ IndexTuple *remaining, int nremaining,
BlockNumber lastBlockVacuumed);
extern int _bt_pagedel(Relation rel, Buffer buf);
@@ -775,6 +946,8 @@ extern Buffer _bt_moveright(Relation rel, BTScanInsert key, Buffer buf,
bool forupdate, BTStack stack, int access, Snapshot snapshot);
extern OffsetNumber _bt_binsrch_insert(Relation rel, BTInsertState insertstate);
extern int32 _bt_compare(Relation rel, BTScanInsert key, Page page, OffsetNumber offnum);
+extern int32 _bt_compare_posting(Relation rel, BTScanInsert key, Page page,
+ OffsetNumber offnum, int *in_posting_offset);
extern bool _bt_first(IndexScanDesc scan, ScanDirection dir);
extern bool _bt_next(IndexScanDesc scan, ScanDirection dir);
extern Buffer _bt_get_endpoint(Relation rel, uint32 level, bool rightmost,
@@ -813,6 +986,9 @@ extern bool _bt_check_natts(Relation rel, bool heapkeyspace, Page page,
OffsetNumber offnum);
extern void _bt_check_third_page(Relation rel, Relation heap,
bool needheaptidspace, Page page, IndexTuple newtup);
+extern IndexTuple BTreeFormPostingTuple(IndexTuple tuple, ItemPointerData *ipd,
+ int nipd);
+extern IndexTuple BTreeGetNthTupleOfPosting(IndexTuple tuple, int n);
/*
* prototypes for functions in nbtvalidate.c
@@ -825,5 +1001,7 @@ extern bool btvalidate(Oid opclassoid);
extern IndexBuildResult *btbuild(Relation heap, Relation index,
struct IndexInfo *indexInfo);
extern void _bt_parallel_build_main(dsm_segment *seg, shm_toc *toc);
+extern void _bt_add_posting_item(BTCompressState *compressState,
+ IndexTuple itup);
#endif /* NBTREE_H */
diff --git a/src/include/access/nbtxlog.h b/src/include/access/nbtxlog.h
index afa614d..4b615e0 100644
--- a/src/include/access/nbtxlog.h
+++ b/src/include/access/nbtxlog.h
@@ -173,10 +173,19 @@ typedef struct xl_btree_vacuum
{
BlockNumber lastBlockVacuumed;
- /* TARGET OFFSET NUMBERS FOLLOW */
+ /*
+ * This field helps us to find beginning of the remaining tuples from
+ * postings which follow array of offset numbers.
+ */
+ uint32 nremaining;
+ uint32 ndeleted;
+
+ /* REMAINING OFFSET NUMBERS FOLLOW (nremaining values) */
+ /* REMAINING TUPLES TO INSERT FOLLOW (if nremaining > 0) */
+ /* TARGET OFFSET NUMBERS FOLLOW (if any) */
} xl_btree_vacuum;
-#define SizeOfBtreeVacuum (offsetof(xl_btree_vacuum, lastBlockVacuumed) + sizeof(BlockNumber))
+#define SizeOfBtreeVacuum (offsetof(xl_btree_vacuum, ndeleted) + sizeof(uint32))
/*
* This is what we need to know about marking an empty branch for deletion.
13.08.2019 18:45, Anastasia Lubennikova wrote:
I also added a nearby FIXME comment to
_bt_insertonpg_in_posting() -- I don't think that the code for
splitting a posting list in two is currently crash-safe.

Good catch. It seems that I need to rearrange the code.
I'll send an updated patch this week.
Attached is v7.
In this version of the patch, I heavily refactored the code for insertion
into a posting tuple. The _bt_split() logic is quite complex, so I omitted a
couple of optimizations. They are mentioned in TODO comments.
Now the algorithm is the following:
- If _bt_findinsertloc() finds that the new tuple belongs to an existing
posting tuple's TID interval, it sets the 'in_posting_offset' variable and
passes it to _bt_insertonpg().
- If 'in_posting_offset' is valid and origtup is valid, merge our itup into
origtup. This can result in one tuple, neworigtup, that must replace origtup,
or in two tuples, neworigtup and newrighttup, if the result exceeds
BTMaxItemSize.
- If the new tuple(s) fit into the old page, we're lucky:
call _bt_delete_and_insert(..., neworigtup, newrighttup, newitemoff) to
atomically replace origtup with the new tuple(s) and generate an xlog record.
- If a page split is needed, pass both tuples to _bt_split().
_bt_findsplitloc() is now aware of the upcoming replacement of origtup with
neworigtup, so it uses the correct item size where needed.
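The posting-list split in the second step can be sketched in isolation. This is only an illustration under simplifying assumptions, not code from the patch: Tid, tid_cmp(), posting_find_offset() and posting_split() are hypothetical names, and plain arrays stand in for the IndexTuple and ItemPointerData structures that the patch builds with BTreeFormPostingTuple(). The left half keeps the TIDs before the insertion offset, and the right half starts with the new TID followed by the rest, which matches nipd_right = nipd - in_posting_offset + 1 in the patch.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for ItemPointerData: (heap block, line offset). */
typedef struct Tid
{
	uint32_t	block;
	uint16_t	offset;
} Tid;

/* Three-way comparison in heap order, in the spirit of ItemPointerCompare(). */
static int
tid_cmp(const Tid *a, const Tid *b)
{
	if (a->block != b->block)
		return a->block < b->block ? -1 : 1;
	if (a->offset != b->offset)
		return a->offset < b->offset ? -1 : 1;
	return 0;
}

/* Binary search: index of the first TID in tids[0..n) greater than key. */
static int
posting_find_offset(const Tid *tids, int n, const Tid *key)
{
	int			lo = 0;
	int			hi = n;

	while (lo < hi)
	{
		int			mid = lo + (hi - lo) / 2;

		if (tid_cmp(&tids[mid], key) <= 0)
			lo = mid + 1;
		else
			hi = mid;
	}
	return lo;
}

/*
 * Split the sorted list old[0..n) at pos. left receives old[0..pos);
 * right receives newtid followed by old[pos..n). The caller must provide
 * room for pos entries in left and n - pos + 1 entries in right.
 * Returns the number of TIDs written to right.
 */
static int
posting_split(const Tid *old, int n, int pos, Tid newtid,
			  Tid *left, int *nleft, Tid *right)
{
	/* newtid must fall strictly inside the existing TID range */
	assert(pos > 0 && pos < n);
	assert(tid_cmp(&old[pos - 1], &newtid) < 0);
	assert(tid_cmp(&newtid, &old[pos]) < 0);

	memcpy(left, old, sizeof(Tid) * pos);
	*nleft = pos;

	right[0] = newtid;
	memcpy(right + 1, old + pos, sizeof(Tid) * (n - pos));
	return n - pos + 1;
}
```

Note that every allocation and copy here is sized in units of sizeof(Tid); the real code has to size its buffers the same way, in units of sizeof(ItemPointerData).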
It seems that all replace operations are now crash-safe. The new patch
passes all regression tests, so I think it's ready for review again.
In the meantime, I'll run more stress tests.
--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachments:
v7-0001-Compression-deduplication-in-nbtree.patch (text/x-patch)
diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index 05e7d67..504bca2 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -924,6 +924,7 @@ bt_target_page_check(BtreeCheckState *state)
size_t tupsize;
BTScanInsert skey;
bool lowersizelimit;
+ ItemPointer scantid;
CHECK_FOR_INTERRUPTS();
@@ -994,29 +995,73 @@ bt_target_page_check(BtreeCheckState *state)
/*
* Readonly callers may optionally verify that non-pivot tuples can
- * each be found by an independent search that starts from the root
+ * each be found by an independent search that starts from the root.
+ * Note that we deliberately don't do individual searches for each
+ * "logical" posting list tuple, since the posting list itself is
+ * validated by other checks.
*/
if (state->rootdescend && P_ISLEAF(topaque) &&
!bt_rootdescend(state, itup))
{
char *itid,
*htid;
+ ItemPointer tid = BTreeTupleGetHeapTID(itup);
itid = psprintf("(%u,%u)", state->targetblock, offset);
htid = psprintf("(%u,%u)",
- ItemPointerGetBlockNumber(&(itup->t_tid)),
- ItemPointerGetOffsetNumber(&(itup->t_tid)));
+ ItemPointerGetBlockNumber(tid),
+ ItemPointerGetOffsetNumber(tid));
ereport(ERROR,
(errcode(ERRCODE_INDEX_CORRUPTED),
errmsg("could not find tuple using search from root page in index \"%s\"",
RelationGetRelationName(state->rel)),
- errdetail_internal("Index tid=%s points to heap tid=%s page lsn=%X/%X.",
+ errdetail_internal("Index tid=%s min heap tid=%s page lsn=%X/%X.",
itid, htid,
(uint32) (state->targetlsn >> 32),
(uint32) state->targetlsn)));
}
+ /*
+ * If tuple is actually a posting list, make sure posting list TIDs
+ * are in order.
+ */
+ if (BTreeTupleIsPosting(itup))
+ {
+ ItemPointerData last;
+ ItemPointer current;
+
+ ItemPointerCopy(BTreeTupleGetHeapTID(itup), &last);
+
+ for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
+ {
+
+ current = BTreeTupleGetPostingN(itup, i);
+
+ if (ItemPointerCompare(current, &last) <= 0)
+ {
+ char *itid,
+ *htid;
+
+ itid = psprintf("(%u,%u)", state->targetblock, offset);
+ htid = psprintf("(%u,%u)",
+ ItemPointerGetBlockNumberNoCheck(current),
+ ItemPointerGetOffsetNumberNoCheck(current));
+
+ ereport(ERROR,
+ (errcode(ERRCODE_INDEX_CORRUPTED),
+ errmsg("posting list heap TIDs out of order in index \"%s\"",
+ RelationGetRelationName(state->rel)),
+ errdetail_internal("Index tid=%s min heap tid=%s page lsn=%X/%X.",
+ itid, htid,
+ (uint32) (state->targetlsn >> 32),
+ (uint32) state->targetlsn)));
+ }
+
+ ItemPointerCopy(current, &last);
+ }
+ }
+
/* Build insertion scankey for current page offset */
skey = bt_mkscankey_pivotsearch(state->rel, itup);
@@ -1074,12 +1119,33 @@ bt_target_page_check(BtreeCheckState *state)
{
IndexTuple norm;
- norm = bt_normalize_tuple(state, itup);
- bloom_add_element(state->filter, (unsigned char *) norm,
- IndexTupleSize(norm));
- /* Be tidy */
- if (norm != itup)
- pfree(norm);
+ if (BTreeTupleIsPosting(itup))
+ {
+ IndexTuple onetup;
+
+ /* Fingerprint all elements of posting tuple one by one */
+ for (int i = 0; i < BTreeTupleGetNPosting(itup); i++)
+ {
+ onetup = BTreeGetNthTupleOfPosting(itup, i);
+
+ norm = bt_normalize_tuple(state, onetup);
+ bloom_add_element(state->filter, (unsigned char *) norm,
+ IndexTupleSize(norm));
+ /* Be tidy */
+ if (norm != onetup)
+ pfree(norm);
+ pfree(onetup);
+ }
+ }
+ else
+ {
+ norm = bt_normalize_tuple(state, itup);
+ bloom_add_element(state->filter, (unsigned char *) norm,
+ IndexTupleSize(norm));
+ /* Be tidy */
+ if (norm != itup)
+ pfree(norm);
+ }
}
/*
@@ -1087,7 +1153,8 @@ bt_target_page_check(BtreeCheckState *state)
*
* If there is a high key (if this is not the rightmost page on its
* entire level), check that high key actually is upper bound on all
- * page items.
+ * page items. If this is a posting list tuple, we'll need to set
+ * scantid to be highest TID in posting list.
*
* We prefer to check all items against high key rather than checking
* just the last and trusting that the operator class obeys the
@@ -1127,6 +1194,9 @@ bt_target_page_check(BtreeCheckState *state)
* tuple. (See also: "Notes About Data Representation" in the nbtree
* README.)
*/
+ scantid = skey->scantid;
+ if (!BTreeTupleIsPivot(itup))
+ skey->scantid = BTreeTupleGetMaxTID(itup);
if (!P_RIGHTMOST(topaque) &&
!(P_ISLEAF(topaque) ? invariant_leq_offset(state, skey, P_HIKEY) :
invariant_l_offset(state, skey, P_HIKEY)))
@@ -1150,6 +1220,7 @@ bt_target_page_check(BtreeCheckState *state)
(uint32) (state->targetlsn >> 32),
(uint32) state->targetlsn)));
}
+ skey->scantid = scantid;
/*
* * Item order check *
@@ -1164,11 +1235,13 @@ bt_target_page_check(BtreeCheckState *state)
*htid,
*nitid,
*nhtid;
+ ItemPointer tid;
itid = psprintf("(%u,%u)", state->targetblock, offset);
+ tid = BTreeTupleGetHeapTID(itup);
htid = psprintf("(%u,%u)",
- ItemPointerGetBlockNumberNoCheck(&(itup->t_tid)),
- ItemPointerGetOffsetNumberNoCheck(&(itup->t_tid)));
+ ItemPointerGetBlockNumberNoCheck(tid),
+ ItemPointerGetOffsetNumberNoCheck(tid));
nitid = psprintf("(%u,%u)", state->targetblock,
OffsetNumberNext(offset));
@@ -1177,9 +1250,11 @@ bt_target_page_check(BtreeCheckState *state)
state->target,
OffsetNumberNext(offset));
itup = (IndexTuple) PageGetItem(state->target, itemid);
+
+ tid = BTreeTupleGetHeapTID(itup);
nhtid = psprintf("(%u,%u)",
- ItemPointerGetBlockNumberNoCheck(&(itup->t_tid)),
- ItemPointerGetOffsetNumberNoCheck(&(itup->t_tid)));
+ ItemPointerGetBlockNumberNoCheck(tid),
+ ItemPointerGetOffsetNumberNoCheck(tid));
ereport(ERROR,
(errcode(ERRCODE_INDEX_CORRUPTED),
@@ -1189,10 +1264,10 @@ bt_target_page_check(BtreeCheckState *state)
"higher index tid=%s (points to %s tid=%s) "
"page lsn=%X/%X.",
itid,
- P_ISLEAF(topaque) ? "heap" : "index",
+ P_ISLEAF(topaque) ? "min heap" : "index",
htid,
nitid,
- P_ISLEAF(topaque) ? "heap" : "index",
+ P_ISLEAF(topaque) ? "min heap" : "index",
nhtid,
(uint32) (state->targetlsn >> 32),
(uint32) state->targetlsn)));
@@ -1953,10 +2028,11 @@ bt_tuple_present_callback(Relation index, HeapTuple htup, Datum *values,
* verification. In particular, it won't try to normalize opclass-equal
* datums with potentially distinct representations (e.g., btree/numeric_ops
* index datums will not get their display scale normalized-away here).
- * Normalization may need to be expanded to handle more cases in the future,
- * though. For example, it's possible that non-pivot tuples could in the
- * future have alternative logically equivalent representations due to using
- * the INDEX_ALT_TID_MASK bit to implement intelligent deduplication.
+ * Caller does normalization for non-pivot tuples that have their own posting
+ * list, since dummy CREATE INDEX callback code generates new tuples with the
+ * same normalized representation. Compression is performed
+ * opportunistically, and in general there is no guarantee about how or when
+ * compression will be applied.
*/
static IndexTuple
bt_normalize_tuple(BtreeCheckState *state, IndexTuple itup)
@@ -2560,14 +2636,16 @@ static inline ItemPointer
BTreeTupleGetHeapTIDCareful(BtreeCheckState *state, IndexTuple itup,
bool nonpivot)
{
- ItemPointer result = BTreeTupleGetHeapTID(itup);
+ ItemPointer result;
BlockNumber targetblock = state->targetblock;
- if (result == NULL && nonpivot)
+ if (BTreeTupleIsPivot(itup) == nonpivot)
ereport(ERROR,
(errcode(ERRCODE_INDEX_CORRUPTED),
errmsg("block %u or its right sibling block or child block in index \"%s\" contains non-pivot tuple that lacks a heap TID",
targetblock, RelationGetRelationName(state->rel))));
+ result = BTreeTupleGetHeapTID(itup);
+
return result;
}
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index 5890f39..fed1e86 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -41,21 +41,28 @@ static OffsetNumber _bt_findinsertloc(Relation rel,
BTStack stack,
Relation heapRel);
static void _bt_stepright(Relation rel, BTInsertState insertstate, BTStack stack);
+static void _bt_delete_and_insert(Relation rel,
+ Buffer buf,
+ Page page,
+ IndexTuple newitup, IndexTuple newitupright,
+ OffsetNumber newitemoff);
static void _bt_insertonpg(Relation rel, BTScanInsert itup_key,
Buffer buf,
Buffer cbuf,
BTStack stack,
IndexTuple itup,
OffsetNumber newitemoff,
- bool split_only_page);
+ bool split_only_page, int in_posting_offset);
static Buffer _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf,
Buffer cbuf, OffsetNumber newitemoff, Size newitemsz,
- IndexTuple newitem);
+ IndexTuple newitem, IndexTuple lefttup, IndexTuple righttup);
static void _bt_insert_parent(Relation rel, Buffer buf, Buffer rbuf,
BTStack stack, bool is_root, bool is_only);
static bool _bt_pgaddtup(Page page, Size itemsize, IndexTuple itup,
OffsetNumber itup_off);
static void _bt_vacuum_one_page(Relation rel, Buffer buffer, Relation heapRel);
+static void insert_itupprev_to_page(Page page, BTCompressState *compressState);
+static void _bt_compress_one_page(Relation rel, Buffer buffer, Relation heapRel);
/*
* _bt_doinsert() -- Handle insertion of a single index tuple in the tree.
@@ -297,10 +304,13 @@ top:
* search bounds established within _bt_check_unique when insertion is
* checkingunique.
*/
+ insertstate.in_posting_offset = 0;
newitemoff = _bt_findinsertloc(rel, &insertstate, checkingunique,
stack, heapRel);
- _bt_insertonpg(rel, itup_key, insertstate.buf, InvalidBuffer, stack,
- itup, newitemoff, false);
+
+ _bt_insertonpg(rel, itup_key, insertstate.buf, InvalidBuffer,
+ stack, itup, newitemoff, false,
+ insertstate.in_posting_offset);
}
else
{
@@ -435,6 +445,7 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
/* okay, we gotta fetch the heap tuple ... */
curitup = (IndexTuple) PageGetItem(page, curitemid);
+ Assert(!BTreeTupleIsPosting(curitup));
htid = curitup->t_tid;
/*
@@ -759,6 +770,26 @@ _bt_findinsertloc(Relation rel,
_bt_vacuum_one_page(rel, insertstate->buf, heapRel);
insertstate->bounds_valid = false;
}
+
+ /*
+ * If the target page is full, try to compress the page
+ */
+ if (PageGetFreeSpace(page) < insertstate->itemsz && !checkingunique)
+ {
+ _bt_compress_one_page(rel, insertstate->buf, heapRel);
+ insertstate->bounds_valid = false; /* paranoia */
+
+ /*
+ * FIXME: _bt_vacuum_one_page() won't have cleared the
+ * BTP_HAS_GARBAGE flag when it didn't kill items. Maybe we
+ * should clear the BTP_HAS_GARBAGE flag bit from the page when
+ * compression avoids a page split -- _bt_vacuum_one_page() is
+ * expecting a page split that takes care of it.
+ *
+ * (On the other hand, maybe it doesn't matter very much. A
+ * comment update seems like the bare minimum we should do.)
+ */
+ }
}
else
{
@@ -900,6 +931,77 @@ _bt_stepright(Relation rel, BTInsertState insertstate, BTStack stack)
insertstate->bounds_valid = false;
}
+/*
+ * Delete tuple on newitemoff offset and insert newitup at the same offset.
+ *
+ * If original posting tuple was split, 'newitup' represents left part of
+ * original tuple and 'newitupright' is its right part, that must be inserted
+ * next to newitemoff.
+ * It's essential to do this atomically to be crash-safe.
+ *
+ * NOTE All checks of free space must be done before calling this function.
+ *
+ * For use in posting tuple's update.
+ */
+static void
+_bt_delete_and_insert(Relation rel,
+ Buffer buf,
+ Page page,
+ IndexTuple newitup, IndexTuple newitupright,
+ OffsetNumber newitemoff)
+{
+ Size newitupsz = IndexTupleSize(newitup);
+
+ newitupsz = MAXALIGN(newitupsz);
+
+ elog(DEBUG4, "_bt_delete_and_insert %s newitemoff %d",
+ RelationGetRelationName(rel), newitemoff);
+ START_CRIT_SECTION();
+
+ PageIndexTupleDelete(page, newitemoff);
+
+ if (!_bt_pgaddtup(page, newitupsz, newitup, newitemoff))
+ elog(ERROR, "failed to insert compressed item in index \"%s\"",
+ RelationGetRelationName(rel));
+
+ if (newitupright)
+ {
+ if (!_bt_pgaddtup(page, MAXALIGN(IndexTupleSize(newitupright)),
+ newitupright, OffsetNumberNext(newitemoff)))
+ elog(ERROR, "failed to insert compressed item in index \"%s\"",
+ RelationGetRelationName(rel));
+ }
+
+ if (BufferIsValid(buf))
+ {
+ MarkBufferDirty(buf);
+
+ /* Xlog stuff */
+ if (RelationNeedsWAL(rel))
+ {
+ xl_btree_insert xlrec;
+ XLogRecPtr recptr;
+
+ xlrec.offnum = newitemoff;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, SizeOfBtreeInsert);
+
+ Assert(P_ISLEAF((BTPageOpaque) PageGetSpecialPointer(page)));
+
+ /*
+ * Force full page write to keep code simple
+ *
+ * TODO: think of using XLOG_BTREE_INSERT_LEAF with a new tuple's data
+ */
+ XLogRegisterBuffer(0, buf, REGBUF_STANDARD | REGBUF_FORCE_IMAGE);
+ recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_INSERT_LEAF);
+ PageSetLSN(page, recptr);
+ }
+ }
+ END_CRIT_SECTION();
+}
+
/*----------
* _bt_insertonpg() -- Insert a tuple on a particular page in the index.
*
@@ -936,11 +1038,17 @@ _bt_insertonpg(Relation rel,
BTStack stack,
IndexTuple itup,
OffsetNumber newitemoff,
- bool split_only_page)
+ bool split_only_page,
+ int in_posting_offset)
{
Page page;
BTPageOpaque lpageop;
Size itemsz;
+ IndexTuple origtup;
+ int nipd;
+ IndexTuple neworigtup = NULL;
+ IndexTuple newrighttup = NULL;
+ bool need_split = false;
page = BufferGetPage(buf);
lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -965,13 +1073,184 @@ _bt_insertonpg(Relation rel,
* need to be consistent */
/*
+ * If new tuple's key is equal to the key of a posting tuple that already
+ * exists on the page and its TID falls inside the min/max range of
+ * existing posting list, update the posting tuple.
+ *
+ * TODO Think of moving this to a separate function.
+ *
+ * TODO possible optimization:
+ * if original posting tuple is dead,
+ * reset in_posting_offset and handle itup as a regular tuple
+ */
+ if (in_posting_offset)
+ {
+ /* get old posting tuple */
+ ItemId itemid = PageGetItemId(page, newitemoff);
+ ItemPointerData *ipd;
+ int nipd, nipd_right;
+ bool need_posting_split = false;
+
+ origtup = (IndexTuple) PageGetItem(page, itemid);
+ Assert(BTreeTupleIsPosting(origtup));
+ nipd = BTreeTupleGetNPosting(origtup);
+ Assert(in_posting_offset < nipd);
+ Assert(itup_key->scantid != NULL);
+ Assert(itup_key->heapkeyspace);
+
+ elog(DEBUG4, "(%u,%u) is min, (%u,%u) is max, (%u,%u) is new",
+ ItemPointerGetBlockNumberNoCheck(BTreeTupleGetHeapTID(origtup)),
+ ItemPointerGetOffsetNumberNoCheck(BTreeTupleGetHeapTID(origtup)),
+ ItemPointerGetBlockNumberNoCheck(BTreeTupleGetMaxTID(origtup)),
+ ItemPointerGetOffsetNumberNoCheck(BTreeTupleGetMaxTID(origtup)),
+ ItemPointerGetBlockNumberNoCheck(BTreeTupleGetMaxTID(itup)),
+ ItemPointerGetOffsetNumberNoCheck(BTreeTupleGetMaxTID(itup)));
+
+ /* check if posting tuple must be split */
+ if (BTMaxItemSize(page) < MAXALIGN(IndexTupleSize(origtup)) + sizeof(ItemPointerData))
+ need_posting_split = true;
+
+ /*
+ * If page split is needed, always split posting tuple.
+ * That is probably not optimal,
+ * but it simplifies the _bt_split code.
+ *
+ * TODO Does this decision have any significant drawbacks?
+ */
+ if (PageGetFreeSpace(page) < sizeof(ItemPointerData))
+ need_posting_split = true;
+
+ /*
+ * Handle corner cases (1)
+ * - itup TID is smaller than leftmost origtup TID
+ */
+ if (ItemPointerCompare(BTreeTupleGetHeapTID(itup),
+ BTreeTupleGetHeapTID(origtup)) < 0)
+ {
+ if (need_posting_split)
+ {
+ /*
+ * cannot avoid split, so no need to try to fit itup into posting list.
+ * handle itup insertion as regular tuple insertion
+ */
+ elog(DEBUG4, "split posting tuple. itup is to the left of origtup");
+ in_posting_offset = InvalidOffsetNumber;
+ newitemoff = OffsetNumberPrev(newitemoff);
+ }
+ else
+ {
+ ipd = palloc0(sizeof(ItemPointerData) * (nipd + 1));
+ /* insert new item pointer */
+ memcpy(ipd, itup, sizeof(ItemPointerData));
+ /* copy item pointers from original tuple that belong on right */
+ memcpy(ipd + 1, BTreeTupleGetPosting(origtup), sizeof(ItemPointerData) * nipd);
+ neworigtup = BTreeFormPostingTuple(origtup, ipd, nipd+1);
+ pfree(ipd);
+
+ Assert(ItemPointerCompare(BTreeTupleGetHeapTID(neworigtup),
+ BTreeTupleGetMaxTID(neworigtup)) < 0);
+ }
+ }
+
+ /*
+ * Handle corner cases (2)
+ * - itup TID is larger than rightmost origtup TID
+ */
+ if (ItemPointerCompare(BTreeTupleGetMaxTID(origtup),
+ BTreeTupleGetHeapTID(itup)) < 0)
+ {
+ if (need_posting_split)
+ {
+ /*
+ * cannot avoid split, so no need to try to fit itup into posting list.
+ * handle itup insertion as regular tuple insertion
+ */
+ elog(DEBUG4, "split posting tuple. itup is to the right of origtup");
+ in_posting_offset = InvalidOffsetNumber;
+ }
+ else
+ {
+ ipd = palloc0(sizeof(ItemPointerData) * (nipd + 1));
+ /* copy all item pointers from original tuple first */
+ memcpy(ipd, BTreeTupleGetPosting(origtup), sizeof(ItemPointerData) * nipd);
+ /* append new item pointer at the end */
+ memcpy(ipd+nipd, itup, sizeof(ItemPointerData));
+
+ neworigtup = BTreeFormPostingTuple(origtup, ipd, nipd+1);
+ pfree(ipd);
+
+ Assert(ItemPointerCompare(BTreeTupleGetHeapTID(neworigtup),
+ BTreeTupleGetMaxTID(neworigtup)) < 0);
+ }
+ }
+
+ /*
+ * itup TID belongs to TID range of origtup posting list
+ *
+ * Split posting tuple into two halves.
+ *
+ * neworigtup (left) tuple contains all item pointers less than the new one and
+ * newrighttup tuple contains new item pointer and all to the right.
+ */
+ if (ItemPointerCompare(BTreeTupleGetHeapTID(itup),
+ BTreeTupleGetHeapTID(origtup)) > 0
+ &&
+ ItemPointerCompare(BTreeTupleGetMaxTID(origtup),
+ BTreeTupleGetHeapTID(itup)) > 0)
+ {
+ neworigtup = BTreeFormPostingTuple(origtup, BTreeTupleGetPosting(origtup),
+ in_posting_offset);
+
+ nipd_right = nipd - in_posting_offset + 1;
+
+ elog(DEBUG4, "split posting tuple in_posting_offset %d nipd %d nipd_right %d",
+ in_posting_offset, nipd, nipd_right);
+
+ ipd = palloc0(sizeof(ItemPointerData) * nipd_right);
+ /* insert new item pointer */
+ memcpy(ipd, itup, sizeof(ItemPointerData));
+ /* copy item pointers from original tuple that belong on right */
+ memcpy(ipd + 1,
+ BTreeTupleGetPostingN(origtup, in_posting_offset),
+ sizeof(ItemPointerData) * (nipd - in_posting_offset));
+
+ newrighttup = BTreeFormPostingTuple(origtup, ipd, nipd_right);
+
+ Assert(ItemPointerCompare(BTreeTupleGetMaxTID(neworigtup),
+ BTreeTupleGetHeapTID(newrighttup)) < 0);
+ pfree(ipd);
+
+ elog(DEBUG4, "left N %d (%u,%u) to (%u,%u), right N %d (%u,%u) to (%u,%u) ",
+ BTreeTupleIsPosting(neworigtup)?BTreeTupleGetNPosting(neworigtup):0,
+ ItemPointerGetBlockNumberNoCheck(BTreeTupleGetHeapTID(neworigtup)),
+ ItemPointerGetOffsetNumberNoCheck(BTreeTupleGetHeapTID(neworigtup)),
+ ItemPointerGetBlockNumberNoCheck(BTreeTupleGetMaxTID(neworigtup)),
+ ItemPointerGetOffsetNumberNoCheck(BTreeTupleGetMaxTID(neworigtup)),
+ BTreeTupleIsPosting(newrighttup)?BTreeTupleGetNPosting(newrighttup):0,
+ ItemPointerGetBlockNumberNoCheck(BTreeTupleGetHeapTID(newrighttup)),
+ ItemPointerGetOffsetNumberNoCheck(BTreeTupleGetHeapTID(newrighttup)),
+ ItemPointerGetBlockNumberNoCheck(BTreeTupleGetMaxTID(newrighttup)),
+ ItemPointerGetOffsetNumberNoCheck(BTreeTupleGetMaxTID(newrighttup)));
+
+ /*
+ * check if the split tuples still fit into the original page
+ * TODO should we add sizeof(ItemIdData) in this check?
+ */
+ if (PageGetFreeSpace(page) < (MAXALIGN(IndexTupleSize(neworigtup))
+ + MAXALIGN(IndexTupleSize(newrighttup))
+ - MAXALIGN(IndexTupleSize(origtup))))
+ need_split = true;
+ }
+ }
+
+ /*
* Do we need to split the page to fit the item on it?
*
* Note: PageGetFreeSpace() subtracts sizeof(ItemIdData) from its result,
* so this comparison is correct even though we appear to be accounting
* only for the item and not for its line pointer.
*/
- if (PageGetFreeSpace(page) < itemsz)
+ if (PageGetFreeSpace(page) < itemsz || need_split)
{
bool is_root = P_ISROOT(lpageop);
bool is_only = P_LEFTMOST(lpageop) && P_RIGHTMOST(lpageop);
@@ -996,7 +1275,8 @@ _bt_insertonpg(Relation rel,
BlockNumberIsValid(RelationGetTargetBlock(rel))));
/* split the buffer into left and right halves */
- rbuf = _bt_split(rel, itup_key, buf, cbuf, newitemoff, itemsz, itup);
+ rbuf = _bt_split(rel, itup_key, buf, cbuf, newitemoff, itemsz, itup,
+ neworigtup, newrighttup);
PredicateLockPageSplit(rel,
BufferGetBlockNumber(buf),
BufferGetBlockNumber(rbuf));
@@ -1033,142 +1313,159 @@ _bt_insertonpg(Relation rel,
itup_off = newitemoff;
itup_blkno = BufferGetBlockNumber(buf);
- /*
- * If we are doing this insert because we split a page that was the
- * only one on its tree level, but was not the root, it may have been
- * the "fast root". We need to ensure that the fast root link points
- * at or above the current page. We can safely acquire a lock on the
- * metapage here --- see comments for _bt_newroot().
- */
- if (split_only_page)
+ if (!in_posting_offset)
{
- Assert(!P_ISLEAF(lpageop));
-
- metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_WRITE);
- metapg = BufferGetPage(metabuf);
- metad = BTPageGetMeta(metapg);
-
- if (metad->btm_fastlevel >= lpageop->btpo.level)
+ /*
+ * If we are doing this insert because we split a page that was the
+ * only one on its tree level, but was not the root, it may have been
+ * the "fast root". We need to ensure that the fast root link points
+ * at or above the current page. We can safely acquire a lock on the
+ * metapage here --- see comments for _bt_newroot().
+ */
+ if (split_only_page)
{
- /* no update wanted */
- _bt_relbuf(rel, metabuf);
- metabuf = InvalidBuffer;
- }
- }
-
- /*
- * Every internal page should have exactly one negative infinity item
- * at all times. Only _bt_split() and _bt_newroot() should add items
- * that become negative infinity items through truncation, since
- * they're the only routines that allocate new internal pages. Do not
- * allow a retail insertion of a new item at the negative infinity
- * offset.
- */
- if (!P_ISLEAF(lpageop) && newitemoff == P_FIRSTDATAKEY(lpageop))
- elog(ERROR, "cannot insert second negative infinity item in block %u of index \"%s\"",
- itup_blkno, RelationGetRelationName(rel));
+ Assert(!P_ISLEAF(lpageop));
- /* Do the update. No ereport(ERROR) until changes are logged */
- START_CRIT_SECTION();
+ metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_WRITE);
+ metapg = BufferGetPage(metabuf);
+ metad = BTPageGetMeta(metapg);
- if (!_bt_pgaddtup(page, itemsz, itup, newitemoff))
- elog(PANIC, "failed to add new item to block %u in index \"%s\"",
- itup_blkno, RelationGetRelationName(rel));
+ if (metad->btm_fastlevel >= lpageop->btpo.level)
+ {
+ /* no update wanted */
+ _bt_relbuf(rel, metabuf);
+ metabuf = InvalidBuffer;
+ }
+ }
- MarkBufferDirty(buf);
+ /*
+ * Every internal page should have exactly one negative infinity item
+ * at all times. Only _bt_split() and _bt_newroot() should add items
+ * that become negative infinity items through truncation, since
+ * they're the only routines that allocate new internal pages. Do not
+ * allow a retail insertion of a new item at the negative infinity
+ * offset.
+ */
+ if (!P_ISLEAF(lpageop) && newitemoff == P_FIRSTDATAKEY(lpageop))
+ elog(ERROR, "cannot insert second negative infinity item in block %u of index \"%s\"",
+ itup_blkno, RelationGetRelationName(rel));
+
+ /* Do the update. No ereport(ERROR) until changes are logged */
+ START_CRIT_SECTION();
+
+ if (!_bt_pgaddtup(page, itemsz, itup, newitemoff))
+ elog(PANIC, "failed to add new item to block %u in index \"%s\"",
+ itup_blkno, RelationGetRelationName(rel));
+
+ MarkBufferDirty(buf);
- if (BufferIsValid(metabuf))
- {
- /* upgrade meta-page if needed */
- if (metad->btm_version < BTREE_NOVAC_VERSION)
- _bt_upgrademetapage(metapg);
- metad->btm_fastroot = itup_blkno;
- metad->btm_fastlevel = lpageop->btpo.level;
- MarkBufferDirty(metabuf);
- }
+ if (BufferIsValid(metabuf))
+ {
+ /* upgrade meta-page if needed */
+ if (metad->btm_version < BTREE_NOVAC_VERSION)
+ _bt_upgrademetapage(metapg);
+ metad->btm_fastroot = itup_blkno;
+ metad->btm_fastlevel = lpageop->btpo.level;
+ MarkBufferDirty(metabuf);
+ }
- /* clear INCOMPLETE_SPLIT flag on child if inserting a downlink */
- if (BufferIsValid(cbuf))
- {
- Page cpage = BufferGetPage(cbuf);
- BTPageOpaque cpageop = (BTPageOpaque) PageGetSpecialPointer(cpage);
+ /* clear INCOMPLETE_SPLIT flag on child if inserting a downlink */
+ if (BufferIsValid(cbuf))
+ {
+ Page cpage = BufferGetPage(cbuf);
+ BTPageOpaque cpageop = (BTPageOpaque) PageGetSpecialPointer(cpage);
- Assert(P_INCOMPLETE_SPLIT(cpageop));
- cpageop->btpo_flags &= ~BTP_INCOMPLETE_SPLIT;
- MarkBufferDirty(cbuf);
- }
+ Assert(P_INCOMPLETE_SPLIT(cpageop));
+ cpageop->btpo_flags &= ~BTP_INCOMPLETE_SPLIT;
+ MarkBufferDirty(cbuf);
+ }
- /*
- * Cache the block information if we just inserted into the rightmost
- * leaf page of the index and it's not the root page. For very small
- * index where root is also the leaf, there is no point trying for any
- * optimization.
- */
- if (P_RIGHTMOST(lpageop) && P_ISLEAF(lpageop) && !P_ISROOT(lpageop))
- cachedBlock = BufferGetBlockNumber(buf);
+ /* XLOG stuff */
+ if (RelationNeedsWAL(rel))
+ {
+ xl_btree_insert xlrec;
+ xl_btree_metadata xlmeta;
+ uint8 xlinfo;
+ XLogRecPtr recptr;
- /* XLOG stuff */
- if (RelationNeedsWAL(rel))
- {
- xl_btree_insert xlrec;
- xl_btree_metadata xlmeta;
- uint8 xlinfo;
- XLogRecPtr recptr;
+ xlrec.offnum = itup_off;
- xlrec.offnum = itup_off;
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, SizeOfBtreeInsert);
- XLogBeginInsert();
- XLogRegisterData((char *) &xlrec, SizeOfBtreeInsert);
+ if (P_ISLEAF(lpageop))
+ xlinfo = XLOG_BTREE_INSERT_LEAF;
+ else
+ {
+ /*
+ * Register the left child whose INCOMPLETE_SPLIT flag was
+ * cleared.
+ */
+ XLogRegisterBuffer(1, cbuf, REGBUF_STANDARD);
- if (P_ISLEAF(lpageop))
- xlinfo = XLOG_BTREE_INSERT_LEAF;
- else
- {
- /*
- * Register the left child whose INCOMPLETE_SPLIT flag was
- * cleared.
- */
- XLogRegisterBuffer(1, cbuf, REGBUF_STANDARD);
+ xlinfo = XLOG_BTREE_INSERT_UPPER;
+ }
- xlinfo = XLOG_BTREE_INSERT_UPPER;
- }
+ if (BufferIsValid(metabuf))
+ {
+ Assert(metad->btm_version >= BTREE_NOVAC_VERSION);
+ xlmeta.version = metad->btm_version;
+ xlmeta.root = metad->btm_root;
+ xlmeta.level = metad->btm_level;
+ xlmeta.fastroot = metad->btm_fastroot;
+ xlmeta.fastlevel = metad->btm_fastlevel;
+ xlmeta.oldest_btpo_xact = metad->btm_oldest_btpo_xact;
+ xlmeta.last_cleanup_num_heap_tuples =
+ metad->btm_last_cleanup_num_heap_tuples;
+
+ XLogRegisterBuffer(2, metabuf, REGBUF_WILL_INIT | REGBUF_STANDARD);
+ XLogRegisterBufData(2, (char *) &xlmeta, sizeof(xl_btree_metadata));
+
+ xlinfo = XLOG_BTREE_INSERT_META;
+ }
- if (BufferIsValid(metabuf))
- {
- Assert(metad->btm_version >= BTREE_NOVAC_VERSION);
- xlmeta.version = metad->btm_version;
- xlmeta.root = metad->btm_root;
- xlmeta.level = metad->btm_level;
- xlmeta.fastroot = metad->btm_fastroot;
- xlmeta.fastlevel = metad->btm_fastlevel;
- xlmeta.oldest_btpo_xact = metad->btm_oldest_btpo_xact;
- xlmeta.last_cleanup_num_heap_tuples =
- metad->btm_last_cleanup_num_heap_tuples;
-
- XLogRegisterBuffer(2, metabuf, REGBUF_WILL_INIT | REGBUF_STANDARD);
- XLogRegisterBufData(2, (char *) &xlmeta, sizeof(xl_btree_metadata));
-
- xlinfo = XLOG_BTREE_INSERT_META;
- }
+ XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
+ XLogRegisterBufData(0, (char *) itup, IndexTupleSize(itup));
- XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
- XLogRegisterBufData(0, (char *) itup, IndexTupleSize(itup));
+ recptr = XLogInsert(RM_BTREE_ID, xlinfo);
- recptr = XLogInsert(RM_BTREE_ID, xlinfo);
+ if (BufferIsValid(metabuf))
+ {
+ PageSetLSN(metapg, recptr);
+ }
+ if (BufferIsValid(cbuf))
+ {
+ PageSetLSN(BufferGetPage(cbuf), recptr);
+ }
- if (BufferIsValid(metabuf))
- {
- PageSetLSN(metapg, recptr);
- }
- if (BufferIsValid(cbuf))
- {
- PageSetLSN(BufferGetPage(cbuf), recptr);
+ PageSetLSN(page, recptr);
}
- PageSetLSN(page, recptr);
+ END_CRIT_SECTION();
+ }
+ else
+ {
+ /*
+ * Insert the new tuple in place of an existing posting tuple:
+ * delete the old posting tuple and insert the updated tuple instead.
+ *
+ * If a split was needed, both neworigtup and newrighttup are initialized
+ * and both will be inserted; otherwise newrighttup is NULL.
+ *
+ * This can only happen on a leaf page.
+ */
+ elog(DEBUG4, "_bt_insertonpg. _bt_delete_and_insert %s", RelationGetRelationName(rel));
+ _bt_delete_and_insert(rel, buf, page, neworigtup, newrighttup, newitemoff);
}
- END_CRIT_SECTION();
+ /*
+ * Cache the block information if we just inserted into the rightmost
+ * leaf page of the index and it's not the root page. For very small
+ * index where root is also the leaf, there is no point trying for any
+ * optimization.
+ */
+ if (P_RIGHTMOST(lpageop) && P_ISLEAF(lpageop) && !P_ISROOT(lpageop))
+ cachedBlock = BufferGetBlockNumber(buf);
/* release buffers */
if (BufferIsValid(metabuf))
@@ -1214,7 +1511,8 @@ _bt_insertonpg(Relation rel,
*/
static Buffer
_bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
- OffsetNumber newitemoff, Size newitemsz, IndexTuple newitem)
+ OffsetNumber newitemoff, Size newitemsz, IndexTuple newitem,
+ IndexTuple lefttup, IndexTuple righttup)
{
Buffer rbuf;
Page origpage;
@@ -1236,6 +1534,8 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
OffsetNumber firstright;
OffsetNumber maxoff;
OffsetNumber i;
+ OffsetNumber replaceitemoff = InvalidOffsetNumber;
+ Size replaceitemsz;
bool newitemonleft,
isleaf;
IndexTuple lefthikey;
@@ -1243,6 +1543,24 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
int indnkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
/*
+ * If we're working with a split posting tuple,
+ * the new tuple is actually contained in righttup's posting list.
+ */
+ if (righttup)
+ {
+ newitem = righttup;
+ newitemsz = MAXALIGN(IndexTupleSize(righttup));
+
+ /*
+ * actual insertion is a replacement of origtup with lefttup
+ * and insertion of righttup (as newitem) next to it.
+ */
+ replaceitemoff = newitemoff;
+ replaceitemsz = MAXALIGN(IndexTupleSize(lefttup));
+ newitemoff = OffsetNumberNext(newitemoff);
+ }
+
+ /*
* origpage is the original page to be split. leftpage is a temporary
* buffer that receives the left-sibling data, which will be copied back
* into origpage on success. rightpage is the new page that will receive
@@ -1275,7 +1593,8 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
* (but not always) redundant information.
*/
firstright = _bt_findsplitloc(rel, origpage, newitemoff, newitemsz,
- newitem, &newitemonleft);
+ newitem, replaceitemoff, replaceitemsz,
+ lefttup, &newitemonleft);
/* Allocate temp buffer for leftpage */
leftpage = PageGetTempPage(origpage);
@@ -1364,6 +1683,17 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
/* incoming tuple will become last on left page */
lastleft = newitem;
}
+ else if (!newitemonleft && newitemoff == firstright && lefttup)
+ {
+ /*
+ * If newitem is first on the right page
+ * and a split posting tuple must be handled,
+ * lastleft will be replaced with lefttup,
+ * so use it here.
+ */
+ elog(DEBUG4, "lastleft = lefttup firstright %d", firstright);
+ lastleft = lefttup;
+ }
else
{
OffsetNumber lastleftoff;
@@ -1480,6 +1810,39 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
itemsz = ItemIdGetLength(itemid);
item = (IndexTuple) PageGetItem(origpage, itemid);
+ if (i == replaceitemoff)
+ {
+ if (replaceitemoff <= firstright)
+ {
+ elog(DEBUG4, "_bt_split left. replaceitem block %u %s replaceitemoff %d leftoff %d",
+ origpagenumber, RelationGetRelationName(rel), replaceitemoff, leftoff);
+ if (!_bt_pgaddtup(leftpage, MAXALIGN(IndexTupleSize(lefttup)), lefttup, leftoff))
+ {
+ memset(rightpage, 0, BufferGetPageSize(rbuf));
+ elog(ERROR, "failed to add new item to the left sibling"
+ " while splitting block %u of index \"%s\"",
+ origpagenumber, RelationGetRelationName(rel));
+ }
+ leftoff = OffsetNumberNext(leftoff);
+ }
+ else
+ {
+ elog(DEBUG4, "_bt_split right. replaceitem block %u %s replaceitemoff %d newitemoff %d",
+ origpagenumber, RelationGetRelationName(rel), replaceitemoff, newitemoff);
+ elog(DEBUG4, "_bt_split right. i %d, maxoff %d, rightoff %d", i, maxoff, rightoff);
+
+ if (!_bt_pgaddtup(rightpage, MAXALIGN(IndexTupleSize(lefttup)), lefttup, rightoff))
+ {
+ memset(rightpage, 0, BufferGetPageSize(rbuf));
+ elog(ERROR, "failed to add new item to the right sibling"
+ " while splitting block %u of index \"%s\", rightoff %d",
+ origpagenumber, RelationGetRelationName(rel), rightoff);
+ }
+ rightoff = OffsetNumberNext(rightoff);
+ }
+ continue;
+ }
+
/* does new item belong before this one? */
if (i == newitemoff)
{
@@ -1497,13 +1860,14 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
}
else
{
+ elog(DEBUG4, "insert newitem to the right. i %d, maxoff %d, rightoff %d", i, maxoff, rightoff);
Assert(newitemoff >= firstright);
if (!_bt_pgaddtup(rightpage, newitemsz, newitem, rightoff))
{
memset(rightpage, 0, BufferGetPageSize(rbuf));
elog(ERROR, "failed to add new item to the right sibling"
- " while splitting block %u of index \"%s\"",
- origpagenumber, RelationGetRelationName(rel));
+ " while splitting block %u of index \"%s\", rightoff %d",
+ origpagenumber, RelationGetRelationName(rel), rightoff);
}
rightoff = OffsetNumberNext(rightoff);
}
@@ -1547,8 +1911,8 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
{
memset(rightpage, 0, BufferGetPageSize(rbuf));
elog(ERROR, "failed to add new item to the right sibling"
- " while splitting block %u of index \"%s\"",
- origpagenumber, RelationGetRelationName(rel));
+ " while splitting block %u of index \"%s\", rightoff %d",
+ origpagenumber, RelationGetRelationName(rel), rightoff);
}
rightoff = OffsetNumberNext(rightoff);
}
@@ -1837,7 +2201,7 @@ _bt_insert_parent(Relation rel,
/* Recursively update the parent */
_bt_insertonpg(rel, NULL, pbuf, buf, stack->bts_parent,
new_item, stack->bts_offset + 1,
- is_only);
+ is_only, 0);
/* be tidy */
pfree(new_item);
@@ -2290,3 +2654,206 @@ _bt_vacuum_one_page(Relation rel, Buffer buffer, Relation heapRel)
* the page.
*/
}
+
+/*
+ * Add the pending item (a plain tuple or a newly formed posting tuple)
+ * to the page being rebuilt during compression.
+ * The function returns nothing; if the item cannot be added,
+ * it raises an ERROR instead of reporting failure to the caller.
+ */
+static void
+insert_itupprev_to_page(Page page, BTCompressState *compressState)
+{
+ IndexTuple to_insert;
+ OffsetNumber offnum = PageGetMaxOffsetNumber(page);
+
+ if (compressState->ntuples == 0)
+ to_insert = compressState->itupprev;
+ else
+ {
+ IndexTuple postingtuple;
+
+ /* form a tuple with a posting list */
+ postingtuple = BTreeFormPostingTuple(compressState->itupprev,
+ compressState->ipd,
+ compressState->ntuples);
+ to_insert = postingtuple;
+ pfree(compressState->ipd);
+ }
+
+ /* Add the new item into the page */
+ offnum = OffsetNumberNext(offnum);
+
+ elog(DEBUG4, "insert_itupprev_to_page. compressState->ntuples %d IndexTupleSize %zu free %zu",
+ compressState->ntuples, IndexTupleSize(to_insert), PageGetFreeSpace(page));
+
+ if (PageAddItem(page, (Item) to_insert, IndexTupleSize(to_insert),
+ offnum, false, false) == InvalidOffsetNumber)
+ elog(ERROR, "failed to add tuple to page while compressing it");
+
+ if (compressState->ntuples > 0)
+ pfree(to_insert);
+ compressState->ntuples = 0;
+}
+
+/*
+ * Before splitting the page, try to compress items to free some space.
+ * If compression didn't succeed, buffer will contain old state of the page.
+ * This function should be called after lp_dead items
+ * were removed by _bt_vacuum_one_page().
+ */
+static void
+_bt_compress_one_page(Relation rel, Buffer buffer, Relation heapRel)
+{
+ OffsetNumber offnum,
+ minoff,
+ maxoff;
+ Page page = BufferGetPage(buffer);
+ Page newpage;
+ BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+ bool use_compression = false;
+ BTCompressState *compressState = NULL;
+ int natts = IndexRelationGetNumberOfAttributes(rel);
+ OffsetNumber deletable[MaxOffsetNumber];
+ int ndeletable = 0;
+
+ /*
+ * Don't use compression for indexes with INCLUDEd columns and unique
+ * indexes.
+ */
+ use_compression = (IndexRelationGetNumberOfKeyAttributes(rel) ==
+ IndexRelationGetNumberOfAttributes(rel) &&
+ !rel->rd_index->indisunique);
+ if (!use_compression)
+ return;
+
+ /* init compress state needed to build posting tuples */
+ compressState = (BTCompressState *) palloc0(sizeof(BTCompressState));
+ compressState->ipd = NULL;
+ compressState->ntuples = 0;
+ compressState->itupprev = NULL;
+ compressState->maxitemsize = BTMaxItemSize(page);
+ compressState->maxpostingsize = 0;
+
+ minoff = P_FIRSTDATAKEY(opaque);
+ maxoff = PageGetMaxOffsetNumber(page);
+
+
+ /*
+ * Delete dead tuples if any.
+ * We cannot simply skip them in the loop below, because it's necessary
+ * to generate a special WAL record containing such tuples to compute
+ * latestRemovedXid on a standby server later.
+ *
+ * This should not affect performance, since it can only happen in the rare
+ * situation when the BTP_HAS_GARBAGE flag was not set and _bt_vacuum_one_page
+ * was not called, or _bt_vacuum_one_page didn't remove all dead items.
+ */
+ for (offnum = minoff;
+ offnum <= maxoff;
+ offnum = OffsetNumberNext(offnum))
+ {
+ ItemId itemid = PageGetItemId(page, offnum);
+
+ if (ItemIdIsDead(itemid))
+ deletable[ndeletable++] = offnum;
+ }
+
+ if (ndeletable > 0)
+ _bt_delitems_delete(rel, buffer, deletable, ndeletable, heapRel);
+
+ /*
+ * Scan over all items to see which ones can be compressed
+ */
+ minoff = P_FIRSTDATAKEY(opaque);
+ maxoff = PageGetMaxOffsetNumber(page);
+ newpage = PageGetTempPageCopySpecial(page);
+ elog(DEBUG4, "_bt_compress_one_page rel: %s,blkno: %u",
+ RelationGetRelationName(rel), BufferGetBlockNumber(buffer));
+
+ /* Copy High Key if any */
+ if (!P_RIGHTMOST(opaque))
+ {
+ ItemId itemid = PageGetItemId(page, P_HIKEY);
+ Size itemsz = ItemIdGetLength(itemid);
+ IndexTuple item = (IndexTuple) PageGetItem(page, itemid);
+
+ if (PageAddItem(newpage, (Item) item, itemsz, P_HIKEY,
+ false, false) == InvalidOffsetNumber)
+ elog(ERROR, "failed to add highkey during compression");
+ }
+
+ /*
+ * Iterate over tuples on the page, try to compress them into posting
+ * lists and insert into new page.
+ */
+ for (offnum = minoff;
+ offnum <= maxoff;
+ offnum = OffsetNumberNext(offnum))
+ {
+ ItemId itemId = PageGetItemId(page, offnum);
+ IndexTuple itup = (IndexTuple) PageGetItem(page, itemId);
+
+ if (compressState->itupprev != NULL)
+ {
+ int n_equal_atts =
+ _bt_keep_natts_fast(rel, compressState->itupprev, itup);
+ int itup_ntuples = BTreeTupleIsPosting(itup) ?
+ BTreeTupleGetNPosting(itup) : 1;
+
+ if (n_equal_atts > natts)
+ {
+ /*
+ * When tuples are equal, create or update posting.
+ *
+ * If posting is too big, insert it on page and continue.
+ */
+ if (compressState->maxitemsize >
+ MAXALIGN(((IndexTupleSize(compressState->itupprev)
+ + (compressState->ntuples + itup_ntuples + 1) * sizeof(ItemPointerData)))))
+ {
+ _bt_add_posting_item(compressState, itup);
+ }
+ else
+ {
+ insert_itupprev_to_page(newpage, compressState);
+ }
+ }
+ else
+ {
+ insert_itupprev_to_page(newpage, compressState);
+ }
+ }
+
+ /*
+ * Copy the tuple into temp variable itupprev to compare it with the
+ * following tuple and maybe unite them into a posting tuple
+ */
+ if (compressState->itupprev)
+ pfree(compressState->itupprev);
+ compressState->itupprev = CopyIndexTuple(itup);
+
+ Assert(IndexTupleSize(compressState->itupprev) <= compressState->maxitemsize);
+ }
+
+ /* Handle the last item. */
+ insert_itupprev_to_page(newpage, compressState);
+
+ START_CRIT_SECTION();
+
+ PageRestoreTempPage(newpage, page);
+ MarkBufferDirty(buffer);
+
+ /* Log full page write */
+ if (RelationNeedsWAL(rel))
+ {
+ XLogRecPtr recptr;
+
+ recptr = log_newpage_buffer(buffer, true);
+ PageSetLSN(page, recptr);
+ }
+
+ END_CRIT_SECTION();
+
+ elog(DEBUG4, "_bt_compress_one_page. success");
+}
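
For anyone following the thread without a PostgreSQL tree at hand, the grouping loop in _bt_compress_one_page() can be pictured with a small standalone sketch. This is not code from the patch: Item, MAX_POSTING and group_duplicates() are hypothetical names, an int stands in for both the key and the heap TID, and a fixed cap stands in for the BTMaxItemSize() check. It only shows the idea of merging runs of equal keys and starting a new posting tuple when the current one is full.

```c
#include <stddef.h>

/* Hypothetical stand-in for an index tuple: an int key plus an int "TID". */
typedef struct
{
	int			key;
	int			tid;
} Item;

/* Assumed cap on TIDs per posting tuple (stands in for the size check). */
#define MAX_POSTING 3

/*
 * Walk items that are already sorted by (key, tid). Each run of equal keys
 * becomes one group; a new group is started when the cap is reached.
 * group_len must have room for n entries. Returns the number of groups.
 */
static size_t
group_duplicates(const Item *items, size_t n, size_t *group_len)
{
	size_t		ngroups = 0;
	size_t		i;

	for (i = 0; i < n; i++)
	{
		int			same_key = (i > 0 && items[i].key == items[i - 1].key);

		if (same_key && group_len[ngroups - 1] < MAX_POSTING)
			group_len[ngroups - 1]++;	/* extend the current posting list */
		else
			group_len[ngroups++] = 1;	/* flush and start a new tuple */
	}
	return ngroups;
}
```

With keys 1,1,1,1,2,3,3 and a cap of three, the sketch produces four groups of sizes 3, 1, 1 and 2: the fourth duplicate of key 1 does not fit and starts its own tuple, much as the patch flushes itupprev to the new page and continues.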
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 9c1f7de..86c662d 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -983,14 +983,52 @@ _bt_page_recyclable(Page page)
void
_bt_delitems_vacuum(Relation rel, Buffer buf,
OffsetNumber *itemnos, int nitems,
+ OffsetNumber *remainingoffset,
+ IndexTuple *remaining, int nremaining,
BlockNumber lastBlockVacuumed)
{
Page page = BufferGetPage(buf);
BTPageOpaque opaque;
+ Size itemsz;
+ Size remaining_sz = 0;
+ char *remaining_buf = NULL;
+
+ /* XLOG stuff: build a buffer holding the remaining tuples */
+ if (nremaining && RelationNeedsWAL(rel))
+ {
+ Size offset = 0;
+
+ for (int i = 0; i < nremaining; i++)
+ remaining_sz += MAXALIGN(IndexTupleSize(remaining[i]));
+
+ remaining_buf = palloc0(remaining_sz);
+ for (int i = 0; i < nremaining; i++)
+ {
+ itemsz = IndexTupleSize(remaining[i]);
+ memcpy(remaining_buf + offset, (char *) remaining[i], itemsz);
+ offset += MAXALIGN(itemsz);
+ }
+ Assert(offset == remaining_sz);
+ }
/* No ereport(ERROR) until changes are logged */
START_CRIT_SECTION();
+ /* Handle posting tuples here */
+ for (int i = 0; i < nremaining; i++)
+ {
+ /* First, delete the old tuple. */
+ PageIndexTupleDelete(page, remainingoffset[i]);
+
+ itemsz = IndexTupleSize(remaining[i]);
+ itemsz = MAXALIGN(itemsz);
+
+ /* Add tuple with remaining ItemPointers to the page. */
+ if (PageAddItem(page, (Item) remaining[i], itemsz, remainingoffset[i],
+ false, false) == InvalidOffsetNumber)
+ elog(ERROR, "failed to rewrite compressed item in index while doing vacuum");
+ }
+
/* Fix the page */
if (nitems > 0)
PageIndexMultiDelete(page, itemnos, nitems);
@@ -1020,6 +1058,8 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
xl_btree_vacuum xlrec_vacuum;
xlrec_vacuum.lastBlockVacuumed = lastBlockVacuumed;
+ xlrec_vacuum.nremaining = nremaining;
+ xlrec_vacuum.ndeleted = nitems;
XLogBeginInsert();
XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
@@ -1033,6 +1073,19 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
if (nitems > 0)
XLogRegisterBufData(0, (char *) itemnos, nitems * sizeof(OffsetNumber));
+ /*
+ * Here we save the offsets and the remaining tuples themselves. It's
+ * important to restore them in the correct order: first the
+ * remaining tuples, and only after that the other deleted items.
+ */
+ if (nremaining > 0)
+ {
+ Assert(remaining_buf != NULL);
+ XLogRegisterBufData(0, (char *) remainingoffset,
+ nremaining * sizeof(OffsetNumber));
+ XLogRegisterBufData(0, remaining_buf, remaining_sz);
+ }
+
recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_VACUUM);
PageSetLSN(page, recptr);
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 4cfd528..22fb228 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -97,6 +97,8 @@ static void btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
BTCycleId cycleid, TransactionId *oldestBtpoXact);
static void btvacuumpage(BTVacState *vstate, BlockNumber blkno,
BlockNumber orig_blkno);
+static ItemPointer btreevacuumPosting(BTVacState *vstate, IndexTuple itup,
+ int *nremaining);
/*
@@ -1069,7 +1071,8 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
RBM_NORMAL, info->strategy);
LockBufferForCleanup(buf);
_bt_checkpage(rel, buf);
- _bt_delitems_vacuum(rel, buf, NULL, 0, vstate.lastBlockVacuumed);
+ _bt_delitems_vacuum(rel, buf, NULL, 0, NULL, NULL, 0,
+ vstate.lastBlockVacuumed);
_bt_relbuf(rel, buf);
}
@@ -1193,6 +1196,9 @@ restart:
OffsetNumber offnum,
minoff,
maxoff;
+ IndexTuple remaining[MaxOffsetNumber];
+ OffsetNumber remainingoffset[MaxOffsetNumber];
+ int nremaining;
/*
* Trade in the initial read lock for a super-exclusive write lock on
@@ -1229,6 +1235,7 @@ restart:
* callback function.
*/
ndeletable = 0;
+ nremaining = 0;
minoff = P_FIRSTDATAKEY(opaque);
maxoff = PageGetMaxOffsetNumber(page);
if (callback)
@@ -1242,31 +1249,78 @@ restart:
itup = (IndexTuple) PageGetItem(page,
PageGetItemId(page, offnum));
- htup = &(itup->t_tid);
- /*
- * During Hot Standby we currently assume that
- * XLOG_BTREE_VACUUM records do not produce conflicts. That is
- * only true as long as the callback function depends only
- * upon whether the index tuple refers to heap tuples removed
- * in the initial heap scan. When vacuum starts it derives a
- * value of OldestXmin. Backends taking later snapshots could
- * have a RecentGlobalXmin with a later xid than the vacuum's
- * OldestXmin, so it is possible that row versions deleted
- * after OldestXmin could be marked as killed by other
- * backends. The callback function *could* look at the index
- * tuple state in isolation and decide to delete the index
- * tuple, though currently it does not. If it ever did, we
- * would need to reconsider whether XLOG_BTREE_VACUUM records
- * should cause conflicts. If they did cause conflicts they
- * would be fairly harsh conflicts, since we haven't yet
- * worked out a way to pass a useful value for
- * latestRemovedXid on the XLOG_BTREE_VACUUM records. This
- * applies to *any* type of index that marks index tuples as
- * killed.
- */
- if (callback(htup, callback_state))
- deletable[ndeletable++] = offnum;
+ if (BTreeTupleIsPosting(itup))
+ {
+ int nnewipd = 0;
+ ItemPointer newipd = NULL;
+
+ newipd = btreevacuumPosting(vstate, itup, &nnewipd);
+
+ if (nnewipd == 0)
+ {
+ /*
+ * All TIDs from the posting list must be deleted, so we can
+ * delete the whole tuple in the regular way.
+ */
+ deletable[ndeletable++] = offnum;
+ }
+ else if (nnewipd == BTreeTupleGetNPosting(itup))
+ {
+ /*
+ * All TIDs from the posting tuple must remain. Do
+ * nothing, just clean up.
+ */
+ pfree(newipd);
+ }
+ else if (nnewipd < BTreeTupleGetNPosting(itup))
+ {
+ /* Some TIDs from posting tuple must remain. */
+ Assert(nnewipd > 0);
+ Assert(newipd != NULL);
+
+ /*
+ * Form new tuple that contains only remaining TIDs.
+ * Remember this tuple and the offset of the old tuple
+ * to update it in place.
+ */
+ remainingoffset[nremaining] = offnum;
+ remaining[nremaining] = BTreeFormPostingTuple(itup, newipd, nnewipd);
+ nremaining++;
+ pfree(newipd);
+
+ Assert(IndexTupleSize(itup) <= BTMaxItemSize(page));
+ }
+ }
+ else
+ {
+ htup = &(itup->t_tid);
+
+ /*
+ * During Hot Standby we currently assume that
+ * XLOG_BTREE_VACUUM records do not produce conflicts.
+ * That is only true as long as the callback function
+ * depends only upon whether the index tuple refers to
+ * heap tuples removed in the initial heap scan. When
+ * vacuum starts it derives a value of OldestXmin.
+ * Backends taking later snapshots could have a
+ * RecentGlobalXmin with a later xid than the vacuum's
+ * OldestXmin, so it is possible that row versions deleted
+ * after OldestXmin could be marked as killed by other
+ * backends. The callback function *could* look at the
+ * index tuple state in isolation and decide to delete the
+ * index tuple, though currently it does not. If it ever
+ * did, we would need to reconsider whether
+ * XLOG_BTREE_VACUUM records should cause conflicts. If
+ * they did cause conflicts they would be fairly harsh
+ * conflicts, since we haven't yet worked out a way to
+ * pass a useful value for latestRemovedXid on the
+ * XLOG_BTREE_VACUUM records. This applies to *any* type
+ * of index that marks index tuples as killed.
+ */
+ if (callback(htup, callback_state))
+ deletable[ndeletable++] = offnum;
+ }
}
}
@@ -1274,7 +1328,7 @@ restart:
* Apply any needed deletes. We issue just one _bt_delitems_vacuum()
* call per page, so as to minimize WAL traffic.
*/
- if (ndeletable > 0)
+ if (ndeletable > 0 || nremaining > 0)
{
/*
* Notice that the issued XLOG_BTREE_VACUUM WAL record includes
@@ -1291,6 +1345,7 @@ restart:
* that.
*/
_bt_delitems_vacuum(rel, buf, deletable, ndeletable,
+ remainingoffset, remaining, nremaining,
vstate->lastBlockVacuumed);
/*
@@ -1376,6 +1431,41 @@ restart:
}
/*
+ * btreevacuumPosting() -- vacuums a posting tuple.
+ *
+ * Returns new palloc'd posting list with remaining items.
+ * Posting list size is returned via nremaining.
+ *
+ * If all items are dead,
+ * nremaining is 0 and resulting posting list is NULL.
+ */
+static ItemPointer
+btreevacuumPosting(BTVacState *vstate, IndexTuple itup, int *nremaining)
+{
+ int remaining = 0;
+ int nitem = BTreeTupleGetNPosting(itup);
+ ItemPointer tmpitems = NULL,
+ items = BTreeTupleGetPosting(itup);
+
+ /*
+ * Check each TID in the posting list and save the live ones into tmpitems
+ */
+ for (int i = 0; i < nitem; i++)
+ {
+ if (vstate->callback(items + i, vstate->callback_state))
+ continue;
+
+ if (tmpitems == NULL)
+ tmpitems = palloc(sizeof(ItemPointerData) * nitem);
+
+ tmpitems[remaining++] = items[i];
+ }
+
+ *nremaining = remaining;
+ return tmpitems;
+}
+
+/*
* btcanreturn() -- Check whether btree indexes support index-only scans.
*
* btrees always do, so this is trivial.
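
The filtering done by btreevacuumPosting() is easy to try outside the server. The sketch below is only an illustration under simplifying assumptions: TIDs are plain ints, malloc() replaces palloc(), and filter_posting(), DeadCallback and even_is_dead() are hypothetical names that do not appear in the patch. As in the patch, the output array is allocated lazily, so a posting list whose TIDs are all dead yields NULL and a count of zero.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical callback type: returns true if the TID should be removed. */
typedef bool (*DeadCallback) (int tid, void *state);

/*
 * Copy the TIDs that the callback does not report as dead into a newly
 * allocated array. The array is allocated only when the first live TID is
 * seen, so the result is NULL when nothing survives. The number of
 * surviving TIDs is returned through nremaining.
 */
static int *
filter_posting(const int *tids, int ntids, DeadCallback is_dead,
			   void *state, int *nremaining)
{
	int		   *kept = NULL;
	int			nkept = 0;

	for (int i = 0; i < ntids; i++)
	{
		if (is_dead(tids[i], state))
			continue;

		if (kept == NULL)
		{
			kept = malloc(sizeof(int) * ntids);
			if (kept == NULL)
				abort();		/* out of memory; the sketch just gives up */
		}
		kept[nkept++] = tids[i];
	}

	*nremaining = nkept;
	return kept;
}

/* Example callback for the sketch: treats even "TIDs" as dead. */
static bool
even_is_dead(int tid, void *state)
{
	(void) state;
	return tid % 2 == 0;
}
```

The caller then has the same three cases as btvacuumpage(): a count of zero means the whole tuple can be deleted, a count equal to the original length means nothing changes, and anything in between means the posting tuple is rebuilt from the surviving TIDs.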
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 19735bf..de0af9e 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -30,6 +30,9 @@ static bool _bt_readpage(IndexScanDesc scan, ScanDirection dir,
OffsetNumber offnum);
static void _bt_saveitem(BTScanOpaque so, int itemIndex,
OffsetNumber offnum, IndexTuple itup);
+static void _bt_savepostingitem(BTScanOpaque so, int itemIndex,
+ OffsetNumber offnum, ItemPointer iptr,
+ IndexTuple itup, int i);
static bool _bt_steppage(IndexScanDesc scan, ScanDirection dir);
static bool _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir);
static bool _bt_parallel_readpage(IndexScanDesc scan, BlockNumber blkno,
@@ -504,7 +507,8 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
/* We have low <= mid < high, so mid points at a real slot */
- result = _bt_compare(rel, key, page, mid);
+ result = _bt_compare_posting(rel, key, page, mid,
+ &(insertstate->in_posting_offset));
if (result >= cmpval)
low = mid + 1;
@@ -533,6 +537,60 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
return low;
}
+/*
+ * Compare insertion-type scankey to tuple on a page,
+ * taking into account posting tuples.
+ * If the key of the posting tuple is equal to scankey,
+ * find exact position inside the posting list,
+ * using TID as extra attribute.
+ */
+int32
+_bt_compare_posting(Relation rel,
+ BTScanInsert key,
+ Page page,
+ OffsetNumber offnum,
+ int *in_posting_offset)
+{
+ IndexTuple itup;
+ int result;
+
+ itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
+ result = _bt_compare(rel, key, page, offnum);
+
+ if (BTreeTupleIsPosting(itup) && result == 0)
+ {
+ int low,
+ high,
+ mid,
+ res;
+
+ low = 0;
+ /* "high" is past end of posting list for loop invariant */
+ high = BTreeTupleGetNPosting(itup);
+
+ while (high > low)
+ {
+ mid = low + ((high - low) / 2);
+ res = ItemPointerCompare(key->scantid,
+ BTreeTupleGetPostingN(itup, mid));
+
+ if (res >= 1)
+ low = mid + 1;
+ else
+ high = mid;
+ }
+
+ *in_posting_offset = high;
+ elog(DEBUG4, "_bt_compare_posting in_posting_offset %d", *in_posting_offset);
+ Assert(ItemPointerCompare(BTreeTupleGetHeapTID(itup),
+ key->scantid) < 0);
+ Assert(ItemPointerCompare(key->scantid,
+ BTreeTupleGetMaxTID(itup)) < 0);
+ }
+
+ return result;
+}
+
/*----------
* _bt_compare() -- Compare insertion-type scankey to tuple on a page.
*
@@ -665,61 +723,120 @@ _bt_compare(Relation rel,
* Use the heap TID attribute and scantid to try to break the tie. The
* rules are the same as any other key attribute -- only the
* representation differs.
+ *
+ * When itup is a posting tuple, the check becomes more complex. It is
+ * possible that the scankey belongs to the tuple's posting list TID
+ * range.
+ *
+ * _bt_compare() is multipurpose, so it just returns 0 to indicate that the
+ * key matches the tuple at this offset.
+ *
+ * Use the special _bt_compare_posting() wrapper function to handle this case
+ * and recheck the posting tuple, finding the exact position of the
+ * scankey.
*/
- heapTid = BTreeTupleGetHeapTID(itup);
- if (key->scantid == NULL)
+ if (!BTreeTupleIsPosting(itup))
{
+ heapTid = BTreeTupleGetHeapTID(itup);
+ if (key->scantid == NULL)
+ {
+ /*
+ * Most searches have a scankey that is considered greater than a
+ * truncated pivot tuple if and when the scankey has equal values
+ * for attributes up to and including the least significant
+ * untruncated attribute in tuple.
+ *
+ * For example, if an index has the minimum two attributes (single
+ * user key attribute, plus heap TID attribute), and a page's high
+ * key is ('foo', -inf), and scankey is ('foo', <omitted>), the
+ * search will not descend to the page to the left. The search
+ * will descend right instead. The truncated attribute in pivot
+ * tuple means that all non-pivot tuples on the page to the left
+ * are strictly < 'foo', so it isn't necessary to descend left. In
+ * other words, search doesn't have to descend left because it
+ * isn't interested in a match that has a heap TID value of -inf.
+ *
+ * However, some searches (pivotsearch searches) actually require
+ * that we descend left when this happens. -inf is treated as a
+ * possible match for omitted scankey attribute(s). This is
+ * needed by page deletion, which must re-find leaf pages that are
+ * targets for deletion using their high keys.
+ *
+ * Note: the heap TID part of the test ensures that scankey is
+ * being compared to a pivot tuple with one or more truncated key
+ * attributes.
+ *
+ * Note: pg_upgrade'd !heapkeyspace indexes must always descend to
+ * the left here, since they have no heap TID attribute (and
+ * cannot have any -inf key values in any case, since truncation
+ * can only remove non-key attributes). !heapkeyspace searches
+ * must always be prepared to deal with matches on both sides of
+ * the pivot once the leaf level is reached.
+ */
+ if (key->heapkeyspace && !key->pivotsearch &&
+ key->keysz == ntupatts && heapTid == NULL)
+ return 1;
+
+ /* All provided scankey arguments found to be equal */
+ return 0;
+ }
+
/*
- * Most searches have a scankey that is considered greater than a
- * truncated pivot tuple if and when the scankey has equal values for
- * attributes up to and including the least significant untruncated
- * attribute in tuple.
- *
- * For example, if an index has the minimum two attributes (single
- * user key attribute, plus heap TID attribute), and a page's high key
- * is ('foo', -inf), and scankey is ('foo', <omitted>), the search
- * will not descend to the page to the left. The search will descend
- * right instead. The truncated attribute in pivot tuple means that
- * all non-pivot tuples on the page to the left are strictly < 'foo',
- * so it isn't necessary to descend left. In other words, search
- * doesn't have to descend left because it isn't interested in a match
- * that has a heap TID value of -inf.
- *
- * However, some searches (pivotsearch searches) actually require that
- * we descend left when this happens. -inf is treated as a possible
- * match for omitted scankey attribute(s). This is needed by page
- * deletion, which must re-find leaf pages that are targets for
- * deletion using their high keys.
- *
- * Note: the heap TID part of the test ensures that scankey is being
- * compared to a pivot tuple with one or more truncated key
- * attributes.
- *
- * Note: pg_upgrade'd !heapkeyspace indexes must always descend to the
- * left here, since they have no heap TID attribute (and cannot have
- * any -inf key values in any case, since truncation can only remove
- * non-key attributes). !heapkeyspace searches must always be
- * prepared to deal with matches on both sides of the pivot once the
- * leaf level is reached.
+ * Treat truncated heap TID as minus infinity, since scankey has a key
+ * attribute value (scantid) that would otherwise be compared directly
*/
- if (key->heapkeyspace && !key->pivotsearch &&
- key->keysz == ntupatts && heapTid == NULL)
+ Assert(key->keysz == IndexRelationGetNumberOfKeyAttributes(rel));
+ if (heapTid == NULL)
return 1;
- /* All provided scankey arguments found to be equal */
- return 0;
+ Assert(ntupatts >= IndexRelationGetNumberOfKeyAttributes(rel));
+ return ItemPointerCompare(key->scantid, heapTid);
}
+ else
+ {
+ heapTid = BTreeTupleGetHeapTID(itup);
+ if (key->scantid != NULL && heapTid != NULL)
+ {
+ int cmp = ItemPointerCompare(key->scantid, heapTid);
- /*
- * Treat truncated heap TID as minus infinity, since scankey has a key
- * attribute value (scantid) that would otherwise be compared directly
- */
- Assert(key->keysz == IndexRelationGetNumberOfKeyAttributes(rel));
- if (heapTid == NULL)
- return 1;
+ if (cmp == -1 || cmp == 0)
+ {
+ elog(DEBUG4, "offnum %d Scankey (%u,%u) is less than or equal to posting tuple (%u,%u)",
+ offnum, ItemPointerGetBlockNumberNoCheck(key->scantid),
+ ItemPointerGetOffsetNumberNoCheck(key->scantid),
+ ItemPointerGetBlockNumberNoCheck(heapTid),
+ ItemPointerGetOffsetNumberNoCheck(heapTid));
+ return cmp;
+ }
- Assert(ntupatts >= IndexRelationGetNumberOfKeyAttributes(rel));
- return ItemPointerCompare(key->scantid, heapTid);
+ heapTid = BTreeTupleGetMaxTID(itup);
+ cmp = ItemPointerCompare(key->scantid, heapTid);
+ if (cmp == 1)
+ {
+ elog(DEBUG4, "offnum %d Scankey (%u,%u) is greater than posting tuple (%u,%u)",
+ offnum, ItemPointerGetBlockNumberNoCheck(key->scantid),
+ ItemPointerGetOffsetNumberNoCheck(key->scantid),
+ ItemPointerGetBlockNumberNoCheck(heapTid),
+ ItemPointerGetOffsetNumberNoCheck(heapTid));
+ return cmp;
+ }
+
+ /*
+ * If we got here, scantid falls within the TID range of the
+ * tuple's posting list.
+ */
+ elog(DEBUG4, "offnum %d Scankey (%u,%u) is between posting items (%u,%u) and (%u,%u)",
+ offnum, ItemPointerGetBlockNumberNoCheck(key->scantid),
+ ItemPointerGetOffsetNumberNoCheck(key->scantid),
+ ItemPointerGetBlockNumberNoCheck(BTreeTupleGetHeapTID(itup)),
+ ItemPointerGetOffsetNumberNoCheck(BTreeTupleGetHeapTID(itup)),
+ ItemPointerGetBlockNumberNoCheck(heapTid),
+ ItemPointerGetOffsetNumberNoCheck(heapTid));
+ return 0;
+ }
+ }
+
+ return 0;
}
/*
@@ -1456,6 +1573,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
/* initialize tuple workspace to empty */
so->currPos.nextTupleOffset = 0;
+ so->currPos.prevTupleOffset = 0;
/*
* Now that the current page has been made consistent, the macro should be
@@ -1490,8 +1608,22 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
if (_bt_checkkeys(scan, itup, indnatts, dir, &continuescan))
{
/* tuple passes all scan key conditions, so remember it */
- _bt_saveitem(so, itemIndex, offnum, itup);
- itemIndex++;
+ if (!BTreeTupleIsPosting(itup))
+ {
+ _bt_saveitem(so, itemIndex, offnum, itup);
+ itemIndex++;
+ }
+ else
+ {
+ /* Return posting list "logical" tuples */
+ for (int i = 0; i < BTreeTupleGetNPosting(itup); i++)
+ {
+ _bt_savepostingitem(so, itemIndex, offnum,
+ BTreeTupleGetPostingN(itup, i),
+ itup, i);
+ itemIndex++;
+ }
+ }
}
/* When !continuescan, there can't be any more matches, so stop */
if (!continuescan)
@@ -1524,7 +1656,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
if (!continuescan)
so->currPos.moreRight = false;
- Assert(itemIndex <= MaxIndexTuplesPerPage);
+ Assert(itemIndex <= MaxPostingIndexTuplesPerPage);
so->currPos.firstItem = 0;
so->currPos.lastItem = itemIndex - 1;
so->currPos.itemIndex = 0;
@@ -1532,7 +1664,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
else
{
/* load items[] in descending order */
- itemIndex = MaxIndexTuplesPerPage;
+ itemIndex = MaxPostingIndexTuplesPerPage;
offnum = Min(offnum, maxoff);
@@ -1574,8 +1706,23 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
if (passes_quals && tuple_alive)
{
/* tuple passes all scan key conditions, so remember it */
- itemIndex--;
- _bt_saveitem(so, itemIndex, offnum, itup);
+ if (!BTreeTupleIsPosting(itup))
+ {
+ itemIndex--;
+ _bt_saveitem(so, itemIndex, offnum, itup);
+ }
+ else
+ {
+ /* Return posting list "logical" tuples */
+ /* XXX: Maybe this loop should be backwards? */
+ for (int i = 0; i < BTreeTupleGetNPosting(itup); i++)
+ {
+ itemIndex--;
+ _bt_savepostingitem(so, itemIndex, offnum,
+ BTreeTupleGetPostingN(itup, i),
+ itup, i);
+ }
+ }
}
if (!continuescan)
{
@@ -1589,8 +1736,8 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
Assert(itemIndex >= 0);
so->currPos.firstItem = itemIndex;
- so->currPos.lastItem = MaxIndexTuplesPerPage - 1;
- so->currPos.itemIndex = MaxIndexTuplesPerPage - 1;
+ so->currPos.lastItem = MaxPostingIndexTuplesPerPage - 1;
+ so->currPos.itemIndex = MaxPostingIndexTuplesPerPage - 1;
}
return (so->currPos.firstItem <= so->currPos.lastItem);
@@ -1603,6 +1750,8 @@ _bt_saveitem(BTScanOpaque so, int itemIndex,
{
BTScanPosItem *currItem = &so->currPos.items[itemIndex];
+ Assert(!BTreeTupleIsPosting(itup));
+
currItem->heapTid = itup->t_tid;
currItem->indexOffset = offnum;
if (so->currTuples)
@@ -1615,6 +1764,33 @@ _bt_saveitem(BTScanOpaque so, int itemIndex,
}
}
+/* Save an index item into so->currPos.items[itemIndex] for posting tuples. */
+static void
+_bt_savepostingitem(BTScanOpaque so, int itemIndex, OffsetNumber offnum,
+ ItemPointer iptr, IndexTuple itup, int i)
+{
+ BTScanPosItem *currItem = &so->currPos.items[itemIndex];
+
+ currItem->heapTid = *iptr;
+ currItem->indexOffset = offnum;
+
+ if (so->currTuples)
+ {
+ if (i == 0)
+ {
+ /* Save the key; it is the same for all TIDs in the posting list */
+ Size itupsz = BTreeTupleGetPostingOffset(itup);
+
+ currItem->tupleOffset = so->currPos.nextTupleOffset;
+ memcpy(so->currTuples + so->currPos.nextTupleOffset, itup, itupsz);
+ so->currPos.nextTupleOffset += MAXALIGN(itupsz);
+ so->currPos.prevTupleOffset = currItem->tupleOffset;
+ }
+ else
+ currItem->tupleOffset = so->currPos.prevTupleOffset;
+ }
+}
+
/*
* _bt_steppage() -- Step to next page containing valid data for scan
*
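For readers following the nbtsearch.c hunks above: the idea behind _bt_savepostingitem() is that a posting tuple expands into one scan item per TID, while the key is copied into the workspace only once and every item points at that single copy. A minimal standalone sketch of that bookkeeping follows. The names SketchScanItem and sketch_save_posting are hypothetical, TIDs are plain ints, and this is not the patch's code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for BTScanPosItem. */
typedef struct
{
    int     tid;            /* stand-in for a heap TID */
    size_t  tuple_offset;   /* where this item's key copy lives */
} SketchScanItem;

/*
 * Expand one posting tuple holding ntids TIDs into ntids scan items.
 * Space for the key is reserved once at *next_offset; every item records
 * that same offset.  Returns the next free index in items[].
 */
static int
sketch_save_posting(SketchScanItem *items, int item_index,
                    const int *tids, int ntids,
                    size_t key_size, size_t *next_offset)
{
    size_t  key_offset = *next_offset;

    assert(ntids > 0);
    *next_offset += key_size;   /* the key is stored only once */

    for (int i = 0; i < ntids; i++)
    {
        items[item_index].tid = tids[i];
        items[item_index].tuple_offset = key_offset;
        item_index++;
    }
    return item_index;
}
```

The patch gets the same effect by remembering prevTupleOffset in the scan position instead of passing the offset around.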
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index b30cf9e..b058599 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -288,6 +288,8 @@ static void _bt_sortaddtup(Page page, Size itemsize,
static void _bt_buildadd(BTWriteState *wstate, BTPageState *state,
IndexTuple itup);
static void _bt_uppershutdown(BTWriteState *wstate, BTPageState *state);
+static void _bt_buildadd_posting(BTWriteState *wstate, BTPageState *state,
+ BTCompressState *compressState);
static void _bt_load(BTWriteState *wstate,
BTSpool *btspool, BTSpool *btspool2);
static void _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent,
@@ -972,6 +974,11 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
* only shift the line pointer array back and forth, and overwrite
* the tuple space previously occupied by oitup. This is fairly
* cheap.
+ *
+ * If lastleft tuple was a posting tuple, we'll truncate its
+ * posting list in _bt_truncate as well. Note that it is also
+ * applicable only to leaf pages, since internal pages never
+ * contain posting tuples.
*/
ii = PageGetItemId(opage, OffsetNumberPrev(last_off));
lastleft = (IndexTuple) PageGetItem(opage, ii);
@@ -1011,6 +1018,7 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
* the minimum key for the new page.
*/
state->btps_minkey = CopyIndexTuple(oitup);
+ Assert(BTreeTupleIsPivot(state->btps_minkey));
/*
* Set the sibling links for both pages.
@@ -1052,6 +1060,7 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
Assert(state->btps_minkey == NULL);
state->btps_minkey = CopyIndexTuple(itup);
/* _bt_sortaddtup() will perform full truncation later */
+ BTreeTupleClearBtIsPosting(state->btps_minkey);
BTreeTupleSetNAtts(state->btps_minkey, 0);
}
@@ -1137,6 +1146,91 @@ _bt_uppershutdown(BTWriteState *wstate, BTPageState *state)
}
/*
+ * Add new tuple (posting or non-posting) to the page while building index.
+ */
+static void
+_bt_buildadd_posting(BTWriteState *wstate, BTPageState *state,
+ BTCompressState *compressState)
+{
+ IndexTuple to_insert;
+
+ /* Return, if there is no tuple to insert */
+ if (state == NULL)
+ return;
+
+ if (compressState->ntuples == 0)
+ to_insert = compressState->itupprev;
+ else
+ {
+ IndexTuple postingtuple;
+
+ /* form a tuple with a posting list */
+ postingtuple = BTreeFormPostingTuple(compressState->itupprev,
+ compressState->ipd,
+ compressState->ntuples);
+ to_insert = postingtuple;
+ pfree(compressState->ipd);
+ }
+
+ _bt_buildadd(wstate, state, to_insert);
+
+ if (compressState->ntuples > 0)
+ pfree(to_insert);
+ compressState->ntuples = 0;
+}
+
+/*
+ * Save item pointer(s) of itup to the posting list in compressState.
+ *
+ * Helper function for _bt_load() and _bt_compress_one_page().
+ *
+ * Note: caller is responsible for size check to ensure that resulting tuple
+ * won't exceed BTMaxItemSize.
+ */
+void
+_bt_add_posting_item(BTCompressState *compressState, IndexTuple itup)
+{
+ int nposting = 0;
+
+ if (compressState->ntuples == 0)
+ {
+ compressState->ipd = palloc0(compressState->maxitemsize);
+
+ if (BTreeTupleIsPosting(compressState->itupprev))
+ {
+ /* if itupprev is posting, add all its TIDs to the posting list */
+ nposting = BTreeTupleGetNPosting(compressState->itupprev);
+ memcpy(compressState->ipd,
+ BTreeTupleGetPosting(compressState->itupprev),
+ sizeof(ItemPointerData) * nposting);
+ compressState->ntuples += nposting;
+ }
+ else
+ {
+ memcpy(compressState->ipd, compressState->itupprev,
+ sizeof(ItemPointerData));
+ compressState->ntuples++;
+ }
+ }
+
+ if (BTreeTupleIsPosting(itup))
+ {
+ /* if tuple is posting, add all its TIDs to the posting list */
+ nposting = BTreeTupleGetNPosting(itup);
+ memcpy(compressState->ipd + compressState->ntuples,
+ BTreeTupleGetPosting(itup),
+ sizeof(ItemPointerData) * nposting);
+ compressState->ntuples += nposting;
+ }
+ else
+ {
+ memcpy(compressState->ipd + compressState->ntuples, itup,
+ sizeof(ItemPointerData));
+ compressState->ntuples++;
+ }
+}
+
+/*
* Read tuples in correct sort order from tuplesort, and load them into
* btree leaves.
*/
@@ -1150,9 +1244,20 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
bool load1;
TupleDesc tupdes = RelationGetDescr(wstate->index);
int i,
- keysz = IndexRelationGetNumberOfKeyAttributes(wstate->index);
+ keysz = IndexRelationGetNumberOfKeyAttributes(wstate->index),
+ natts = IndexRelationGetNumberOfAttributes(wstate->index);
SortSupport sortKeys;
int64 tuples_done = 0;
+ bool use_compression = false;
+ BTCompressState *compressState = NULL;
+
+ /*
+ * Don't use compression for indexes with INCLUDEd columns and unique
+ * indexes.
+ */
+ use_compression = (IndexRelationGetNumberOfKeyAttributes(wstate->index) ==
+ IndexRelationGetNumberOfAttributes(wstate->index) &&
+ !wstate->index->rd_index->indisunique);
if (merge)
{
@@ -1266,19 +1371,89 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
}
else
{
- /* merge is unnecessary */
- while ((itup = tuplesort_getindextuple(btspool->sortstate,
- true)) != NULL)
+ if (!use_compression)
{
- /* When we see first tuple, create first index page */
- if (state == NULL)
- state = _bt_pagestate(wstate, 0);
+ /* merge is unnecessary */
+ while ((itup = tuplesort_getindextuple(btspool->sortstate,
+ true)) != NULL)
+ {
+ /* When we see first tuple, create first index page */
+ if (state == NULL)
+ state = _bt_pagestate(wstate, 0);
- _bt_buildadd(wstate, state, itup);
+ _bt_buildadd(wstate, state, itup);
- /* Report progress */
- pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
- ++tuples_done);
+ /* Report progress */
+ pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
+ ++tuples_done);
+ }
+ }
+ else
+ {
+ /* init compress state needed to build posting tuples */
+ compressState = (BTCompressState *) palloc0(sizeof(BTCompressState));
+ compressState->ipd = NULL;
+ compressState->ntuples = 0;
+ compressState->itupprev = NULL;
+ compressState->maxitemsize = 0;
+ compressState->maxpostingsize = 0;
+
+ while ((itup = tuplesort_getindextuple(btspool->sortstate,
+ true)) != NULL)
+ {
+ /* When we see first tuple, create first index page */
+ if (state == NULL)
+ {
+ state = _bt_pagestate(wstate, 0);
+ compressState->maxitemsize = BTMaxItemSize(state->btps_page);
+ }
+
+ if (compressState->itupprev != NULL)
+ {
+ int n_equal_atts = _bt_keep_natts_fast(wstate->index,
+ compressState->itupprev, itup);
+
+ if (n_equal_atts > natts)
+ {
+ /*
+ * Tuples are equal. Create or extend the posting list.
+ *
+ * If the posting list would become too big, insert it on
+ * the page and continue.
+ */
+ if ((compressState->ntuples + 1) * sizeof(ItemPointerData) <
+ compressState->maxpostingsize)
+ _bt_add_posting_item(compressState, itup);
+ else
+ _bt_buildadd_posting(wstate, state,
+ compressState);
+ }
+ else
+ {
+ /*
+ * Tuples are not equal. Insert itupprev into index.
+ * Save current tuple for the next iteration.
+ */
+ _bt_buildadd_posting(wstate, state, compressState);
+ }
+ }
+
+ /*
+ * Save the tuple to compare it with the next one and maybe
+ * unite them into a posting tuple.
+ */
+ if (compressState->itupprev)
+ pfree(compressState->itupprev);
+ compressState->itupprev = CopyIndexTuple(itup);
+
+ /* compute max size of posting list */
+ compressState->maxpostingsize = compressState->maxitemsize -
+ IndexInfoFindDataOffset(compressState->itupprev->t_info) -
+ MAXALIGN(IndexTupleSize(compressState->itupprev));
+ }
+
+ /* Handle the last item */
+ _bt_buildadd_posting(wstate, state, compressState);
}
}
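The _bt_load() loop above is a run-grouping pass over sorted input: consecutive equal tuples are merged into one posting list until the list would exceed its size limit. A simplified standalone sketch of that grouping is below. Keys are plain ints, the limit is a TID count rather than a byte size, and SketchGroup and sketch_group_duplicates are hypothetical names. Unlike the patch, it does not keep a pending itupprev; it extends the last emitted group directly.

```c
#include <assert.h>
#include <stddef.h>

/* One output "tuple": a key plus the number of TIDs merged into it. */
typedef struct
{
    int     key;
    int     ntids;
} SketchGroup;

/*
 * Walk keys[] (already sorted) and emit one group per run of equal keys.
 * A run longer than max_tids is split into several groups with the same
 * key.  out[] must have room for nkeys entries.  Returns the group count.
 */
static size_t
sketch_group_duplicates(const int *keys, size_t nkeys, int max_tids,
                        SketchGroup *out)
{
    size_t  ngroups = 0;

    assert(max_tids >= 1);

    for (size_t i = 0; i < nkeys; i++)
    {
        if (ngroups > 0 &&
            out[ngroups - 1].key == keys[i] &&
            out[ngroups - 1].ntids < max_tids)
        {
            /* same key and still room: extend the current posting list */
            out[ngroups - 1].ntids++;
        }
        else
        {
            /* new key, or current posting list is full: start a new one */
            out[ngroups].key = keys[i];
            out[ngroups].ntids = 1;
            ngroups++;
        }
    }
    return ngroups;
}
```

With keys {1, 1, 1, 2, 3, 3} and max_tids = 2, the run of three 1s is split, so a page can end up with several tuples for the same key, as the nbtree.h comments in this patch describe.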
diff --git a/src/backend/access/nbtree/nbtsplitloc.c b/src/backend/access/nbtree/nbtsplitloc.c
index a7882fd..c492b04 100644
--- a/src/backend/access/nbtree/nbtsplitloc.c
+++ b/src/backend/access/nbtree/nbtsplitloc.c
@@ -62,6 +62,11 @@ typedef struct
int nsplits; /* current number of splits */
SplitPoint *splits; /* all candidate split points for page */
int interval; /* current range of acceptable split points */
+
+ /* fields only valid when the insert splits a posting tuple */
+ OffsetNumber replaceitemoff;
+ IndexTuple replaceitem;
+ Size replaceitemsz;
} FindSplitData;
static void _bt_recsplitloc(FindSplitData *state,
@@ -129,6 +134,9 @@ _bt_findsplitloc(Relation rel,
OffsetNumber newitemoff,
Size newitemsz,
IndexTuple newitem,
+ OffsetNumber replaceitemoff,
+ Size replaceitemsz,
+ IndexTuple replaceitem,
bool *newitemonleft)
{
BTPageOpaque opaque;
@@ -183,6 +191,10 @@ _bt_findsplitloc(Relation rel,
state.minfirstrightsz = SIZE_MAX;
state.newitemoff = newitemoff;
+ state.replaceitemoff = replaceitemoff;
+ state.replaceitemsz = replaceitemsz;
+ state.replaceitem = replaceitem;
+
/*
* maxsplits should never exceed maxoff because there will be at most as
* many candidate split points as there are points _between_ tuples, once
@@ -207,7 +219,17 @@ _bt_findsplitloc(Relation rel,
Size itemsz;
itemid = PageGetItemId(page, offnum);
- itemsz = MAXALIGN(ItemIdGetLength(itemid)) + sizeof(ItemIdData);
+
+ /* use size of replacing item for calculations */
+ if (offnum == replaceitemoff)
+ {
+ itemsz = replaceitemsz + sizeof(ItemIdData);
+ olddataitemstotal = state.olddataitemstotal = state.olddataitemstotal
+ - MAXALIGN(ItemIdGetLength(itemid))
+ + replaceitemsz;
+ }
+ else
+ itemsz = MAXALIGN(ItemIdGetLength(itemid)) + sizeof(ItemIdData);
/*
* When item offset number is not newitemoff, neither side of the
@@ -466,9 +488,13 @@ _bt_recsplitloc(FindSplitData *state,
&& !newitemonleft);
if (newitemisfirstonright)
+ {
firstrightitemsz = state->newitemsz;
+ }
else
+ {
firstrightitemsz = firstoldonrightsz;
+ }
/* Account for all the old tuples */
leftfree = state->leftspace - olddataitemstoleft;
@@ -492,12 +518,12 @@ _bt_recsplitloc(FindSplitData *state,
* adding a heap TID to the left half's new high key when splitting at the
* leaf level. In practice the new high key will often be smaller and
* will rarely be larger, but conservatively assume the worst case.
+ * Truncation always truncates away any posting list that appears in the
+ * first right tuple, though, so it's safe to subtract that overhead
+ * (while still conservatively assuming that truncation might have to add
+ * back a single heap TID using the pivot tuple heap TID representation).
*/
- if (state->is_leaf)
- leftfree -= (int16) (firstrightitemsz +
- MAXALIGN(sizeof(ItemPointerData)));
- else
- leftfree -= (int16) firstrightitemsz;
+ leftfree -= (int16) firstrightitemsz;
/* account for the new item */
if (newitemonleft)
@@ -1066,13 +1092,20 @@ static inline IndexTuple
_bt_split_lastleft(FindSplitData *state, SplitPoint *split)
{
ItemId itemid;
+ OffsetNumber offset;
if (split->newitemonleft && split->firstoldonright == state->newitemoff)
return state->newitem;
- itemid = PageGetItemId(state->page,
- OffsetNumberPrev(split->firstoldonright));
- return (IndexTuple) PageGetItem(state->page, itemid);
+ offset = OffsetNumberPrev(split->firstoldonright);
+ if (offset == state->replaceitemoff)
+ return state->replaceitem;
+ else
+ {
+ itemid = PageGetItemId(state->page,
+ OffsetNumberPrev(split->firstoldonright));
+ return (IndexTuple) PageGetItem(state->page, itemid);
+ }
}
/*
@@ -1086,6 +1119,11 @@ _bt_split_firstright(FindSplitData *state, SplitPoint *split)
if (!split->newitemonleft && split->firstoldonright == state->newitemoff)
return state->newitem;
- itemid = PageGetItemId(state->page, split->firstoldonright);
- return (IndexTuple) PageGetItem(state->page, itemid);
+ if (split->firstoldonright == state->replaceitemoff)
+ return state->replaceitem;
+ else
+ {
+ itemid = PageGetItemId(state->page, split->firstoldonright);
+ return (IndexTuple) PageGetItem(state->page, itemid);
+ }
}
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 9b172c1..c56f5ab 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -111,8 +111,12 @@ _bt_mkscankey(Relation rel, IndexTuple itup)
key->nextkey = false;
key->pivotsearch = false;
key->keysz = Min(indnkeyatts, tupnatts);
- key->scantid = key->heapkeyspace && itup ?
- BTreeTupleGetHeapTID(itup) : NULL;
+
+ if (itup && key->heapkeyspace)
+ key->scantid = BTreeTupleGetHeapTID(itup);
+ else
+ key->scantid = NULL;
+
skey = key->scankeys;
for (i = 0; i < indnkeyatts; i++)
{
@@ -1787,7 +1791,9 @@ _bt_killitems(IndexScanDesc scan)
ItemId iid = PageGetItemId(page, offnum);
IndexTuple ituple = (IndexTuple) PageGetItem(page, iid);
- if (ItemPointerEquals(&ituple->t_tid, &kitem->heapTid))
+ /* No microvacuum for posting tuples */
+ if (!BTreeTupleIsPosting(ituple) &&
+ (ItemPointerEquals(&ituple->t_tid, &kitem->heapTid)))
{
/* found the item */
ItemIdMarkDead(iid);
@@ -2112,6 +2118,7 @@ btbuildphasename(int64 phasenum)
* returning an enlarged tuple to caller when truncation + TOAST compression
* ends up enlarging the final datum.
*/
+
IndexTuple
_bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
BTScanInsert itup_key)
@@ -2124,6 +2131,17 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
ItemPointer pivotheaptid;
Size newsize;
+ elog(DEBUG4, "_bt_truncate left N %d (%u,%u) to (%u,%u), right N %d (%u,%u) to (%u,%u) ",
+ BTreeTupleIsPosting(lastleft)?BTreeTupleGetNPosting(lastleft):0,
+ ItemPointerGetBlockNumberNoCheck(BTreeTupleGetHeapTID(lastleft)),
+ ItemPointerGetOffsetNumberNoCheck(BTreeTupleGetHeapTID(lastleft)),
+ ItemPointerGetBlockNumberNoCheck(BTreeTupleGetMaxTID(lastleft)),
+ ItemPointerGetOffsetNumberNoCheck(BTreeTupleGetMaxTID(lastleft)),
+ BTreeTupleIsPosting(firstright)?BTreeTupleGetNPosting(firstright):0,
+ ItemPointerGetBlockNumberNoCheck(BTreeTupleGetHeapTID(firstright)),
+ ItemPointerGetOffsetNumberNoCheck(BTreeTupleGetHeapTID(firstright)),
+ ItemPointerGetBlockNumberNoCheck(BTreeTupleGetMaxTID(firstright)),
+ ItemPointerGetOffsetNumberNoCheck(BTreeTupleGetMaxTID(firstright)));
/*
* We should only ever truncate leaf index tuples. It's never okay to
* truncate a second time.
@@ -2145,6 +2163,16 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
pivot = index_truncate_tuple(itupdesc, firstright, keepnatts);
+ if (BTreeTupleIsPosting(firstright))
+ {
+ BTreeTupleClearBtIsPosting(pivot);
+ BTreeTupleSetNAtts(pivot, keepnatts);
+ pivot->t_info &= ~INDEX_SIZE_MASK;
+ pivot->t_info |= BTreeTupleGetPostingOffset(firstright);
+ }
+
+ Assert(!BTreeTupleIsPosting(pivot));
+
/*
* If there is a distinguishing key attribute within new pivot tuple,
* there is no need to add an explicit heap TID attribute
@@ -2161,6 +2189,8 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
* attribute to the new pivot tuple.
*/
Assert(natts != nkeyatts);
+ Assert(!BTreeTupleIsPosting(lastleft));
+ Assert(!BTreeTupleIsPosting(firstright));
newsize = IndexTupleSize(pivot) + MAXALIGN(sizeof(ItemPointerData));
tidpivot = palloc0(newsize);
memcpy(tidpivot, pivot, IndexTupleSize(pivot));
@@ -2168,6 +2198,27 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
pfree(pivot);
pivot = tidpivot;
}
+ else if (BTreeTupleIsPosting(firstright))
+ {
+ /*
+ * No truncation was possible, since key attributes are all equal. But
+ * the tuple is a compressed tuple with a posting list, so we still
+ * must truncate it.
+ *
+ * It's necessary to add a heap TID attribute to the new pivot tuple.
+ */
+ newsize = BTreeTupleGetPostingOffset(firstright) +
+ MAXALIGN(sizeof(ItemPointerData));
+ pivot = palloc0(newsize);
+ memcpy(pivot, firstright, BTreeTupleGetPostingOffset(firstright));
+
+ pivot->t_info &= ~INDEX_SIZE_MASK;
+ pivot->t_info |= newsize;
+ BTreeTupleClearBtIsPosting(pivot);
+ BTreeTupleSetAltHeapTID(pivot);
+
+ Assert(!BTreeTupleIsPosting(pivot));
+ }
else
{
/*
@@ -2205,7 +2256,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
*/
pivotheaptid = (ItemPointer) ((char *) pivot + newsize -
sizeof(ItemPointerData));
- ItemPointerCopy(&lastleft->t_tid, pivotheaptid);
+ ItemPointerCopy(BTreeTupleGetMaxTID(lastleft), pivotheaptid);
/*
* Lehman and Yao require that the downlink to the right page, which is to
@@ -2216,9 +2267,12 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
* tiebreaker.
*/
#ifndef DEBUG_NO_TRUNCATE
- Assert(ItemPointerCompare(&lastleft->t_tid, &firstright->t_tid) < 0);
- Assert(ItemPointerCompare(pivotheaptid, &lastleft->t_tid) >= 0);
- Assert(ItemPointerCompare(pivotheaptid, &firstright->t_tid) < 0);
+ Assert(ItemPointerCompare(BTreeTupleGetMaxTID(lastleft),
+ BTreeTupleGetHeapTID(firstright)) < 0);
+ Assert(ItemPointerCompare(pivotheaptid,
+ BTreeTupleGetHeapTID(lastleft)) >= 0);
+ Assert(ItemPointerCompare(pivotheaptid,
+ BTreeTupleGetHeapTID(firstright)) < 0);
#else
/*
@@ -2231,7 +2285,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
* attribute values along with lastleft's heap TID value when lastleft's
* TID happens to be greater than firstright's TID.
*/
- ItemPointerCopy(&firstright->t_tid, pivotheaptid);
+ ItemPointerCopy(BTreeTupleGetHeapTID(firstright), pivotheaptid);
/*
* Pivot heap TID should never be fully equal to firstright. Note that
@@ -2240,7 +2294,8 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
*/
ItemPointerSetOffsetNumber(pivotheaptid,
OffsetNumberPrev(ItemPointerGetOffsetNumber(pivotheaptid)));
- Assert(ItemPointerCompare(pivotheaptid, &firstright->t_tid) < 0);
+ Assert(ItemPointerCompare(pivotheaptid,
+ BTreeTupleGetHeapTID(firstright)) < 0);
#endif
BTreeTupleSetNAtts(pivot, nkeyatts);
@@ -2330,6 +2385,25 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
* leaving excessive amounts of free space on either side of page split.
* Callers can rely on the fact that attributes considered equal here are
* definitely also equal according to _bt_keep_natts.
+ *
+ * To build a posting tuple we need to ensure that all attributes
+ * of both tuples are equal. Use this function to compare them.
+ * TODO: maybe it's worth renaming the function.
+ *
+ * XXX: Obviously we need infrastructure for making sure it is okay to use
+ * this for posting list stuff. For example, non-deterministic collations
+ * cannot use compression, and will not work with what we have now.
+ *
+ * XXX: Even then, we probably also need to worry about TOAST as a special
+ * case. Don't repeat bugs like the amcheck bug that was fixed in commit
+ * eba775345d23d2c999bbb412ae658b6dab36e3e8. As the test case added in that
+ * commit shows, we need to worry about pg_attribute.attstorage changing in
+ * the underlying table due to an ALTER TABLE (and maybe a few other things
+ * like that). In general, the "TOAST input state" of a TOASTable datum isn't
+ * something that we make many guarantees about today, so even with C
+ * collation text we could in theory get different answers from
+ * _bt_keep_natts_fast() and _bt_keep_natts(). This needs to be nailed down
+ * in some way.
*/
int
_bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
@@ -2415,7 +2489,7 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
* Non-pivot tuples currently never use alternative heap TID
* representation -- even those within heapkeyspace indexes
*/
- if ((itup->t_info & INDEX_ALT_TID_MASK) != 0)
+ if (BTreeTupleIsPivot(itup))
return false;
/*
@@ -2470,7 +2544,7 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
* that to decide if the tuple is a pre-v11 tuple.
*/
return tupnatts == 0 ||
- ((itup->t_info & INDEX_ALT_TID_MASK) == 0 &&
+ (!BTreeTupleIsPivot(itup) &&
ItemPointerGetOffsetNumber(&(itup->t_tid)) == P_HIKEY);
}
else
@@ -2497,7 +2571,7 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
* heapkeyspace index pivot tuples, regardless of whether or not there are
* non-key attributes.
*/
- if ((itup->t_info & INDEX_ALT_TID_MASK) == 0)
+ if (!BTreeTupleIsPivot(itup))
return false;
/*
@@ -2549,6 +2623,8 @@ _bt_check_third_page(Relation rel, Relation heap, bool needheaptidspace,
if (!needheaptidspace && itemsz <= BTMaxItemSizeNoHeapTid(page))
return;
+ /* TODO correct error messages for posting tuples */
+
/*
* Internal page insertions cannot fail here, because that would mean that
* an earlier leaf level insertion that should have failed didn't
@@ -2575,3 +2651,79 @@ _bt_check_third_page(Relation rel, Relation heap, bool needheaptidspace,
"or use full text indexing."),
errtableconstraint(heap, RelationGetRelationName(rel))));
}
+
+/*
+ * Given a basic tuple that contains key datum and posting list,
+ * build a posting tuple.
+ *
+ * Basic tuple can be a posting tuple, but we only use key part of it,
+ * all ItemPointers must be passed via ipd.
+ *
+ * If nipd == 1, fall back to building a non-posting tuple.
+ * This is necessary to avoid storage overhead after a posting tuple was
+ * vacuumed.
+ */
+IndexTuple
+BTreeFormPostingTuple(IndexTuple tuple, ItemPointerData *ipd, int nipd)
+{
+ uint32 keysize,
+ newsize = 0;
+ IndexTuple itup;
+
+ /* We only need key part of the tuple */
+ if (BTreeTupleIsPosting(tuple))
+ keysize = BTreeTupleGetPostingOffset(tuple);
+ else
+ keysize = IndexTupleSize(tuple);
+
+ Assert(nipd > 0);
+
+ /* Add space needed for posting list */
+ if (nipd > 1)
+ newsize = SHORTALIGN(keysize) + sizeof(ItemPointerData) * nipd;
+ else
+ newsize = keysize;
+
+ newsize = MAXALIGN(newsize);
+ itup = palloc0(newsize);
+ memcpy(itup, tuple, keysize);
+ itup->t_info &= ~INDEX_SIZE_MASK;
+ itup->t_info |= newsize;
+
+ if (nipd > 1)
+ {
+ /* Form posting tuple, fill posting fields */
+
+ /* Set meta info about the posting list */
+ itup->t_info |= INDEX_ALT_TID_MASK;
+ BTreeSetPostingMeta(itup, nipd, SHORTALIGN(keysize));
+
+ /* sort the list to preserve TID order invariant */
+ qsort((void *) ipd, nipd, sizeof(ItemPointerData),
+ (int (*) (const void *, const void *)) ItemPointerCompare);
+
+ /* Copy posting list into the posting tuple */
+ memcpy(BTreeTupleGetPosting(itup), ipd,
+ sizeof(ItemPointerData) * nipd);
+ }
+ else
+ {
+ /* To finish building of a non-posting tuple, copy TID from ipd */
+ itup->t_info &= ~INDEX_ALT_TID_MASK;
+ ItemPointerCopy(ipd, &itup->t_tid);
+ }
+
+ return itup;
+}
+
+/*
+ * Opposite of BTreeFormPostingTuple.
+ * Returns a regular tuple that contains the key; the TID of the new tuple
+ * is the nth TID of the original tuple's posting list.
+ * The result tuple is palloc'd in the caller's context.
+ */
+IndexTuple
+BTreeGetNthTupleOfPosting(IndexTuple tuple, int n)
+{
+ Assert(BTreeTupleIsPosting(tuple));
+ return BTreeFormPostingTuple(tuple, BTreeTupleGetPostingN(tuple, n), 1);
+}
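As a reading aid for BTreeFormPostingTuple() above, here is a standalone sketch of its size arithmetic and TID sort. It assumes a 6-byte TID and 8-byte MAXALIGN. SketchTid and the sketch_ functions and SKETCH_ macros are hypothetical stand-ins for illustration, not PostgreSQL definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Stand-in for ItemPointerData: 6 bytes (32-bit block number + offset). */
typedef struct
{
    uint16_t    bi_hi;
    uint16_t    bi_lo;
    uint16_t    offset;
} SketchTid;

#define SKETCH_SHORTALIGN(x)  (((size_t) (x) + 1) & ~(size_t) 1)
#define SKETCH_MAXALIGN(x)    (((size_t) (x) + 7) & ~(size_t) 7)  /* assumed 8 */

/* Order TIDs by block number, then by offset. */
static int
sketch_tid_cmp(const void *a, const void *b)
{
    const SketchTid *ta = (const SketchTid *) a;
    const SketchTid *tb = (const SketchTid *) b;
    uint32_t    blka = ((uint32_t) ta->bi_hi << 16) | ta->bi_lo;
    uint32_t    blkb = ((uint32_t) tb->bi_hi << 16) | tb->bi_lo;

    if (blka != blkb)
        return (blka < blkb) ? -1 : 1;
    if (ta->offset != tb->offset)
        return (ta->offset < tb->offset) ? -1 : 1;
    return 0;
}

/*
 * Size of the tuple built from a key of keysize bytes and ntids TIDs.
 * A single TID needs no posting list, so the plain key size is used.
 */
static size_t
sketch_posting_tuple_size(size_t keysize, int ntids)
{
    size_t  sz = keysize;

    assert(ntids > 0);
    if (ntids > 1)
        sz = SKETCH_SHORTALIGN(keysize) + sizeof(SketchTid) * (size_t) ntids;
    return SKETCH_MAXALIGN(sz);
}

/* Sort a TID array in place, preserving the TID-order invariant. */
static void
sketch_sort_tids(SketchTid *tids, int ntids)
{
    qsort(tids, (size_t) ntids, sizeof(SketchTid), sketch_tid_cmp);
}
```

Under these assumptions, a 16-byte key with three TIDs takes MAXALIGN(16 + 18) = 40 bytes, compared with three separate 16-byte tuples plus their line pointers.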
diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index dd5315c..538a6bc 100644
--- a/src/backend/access/nbtree/nbtxlog.c
+++ b/src/backend/access/nbtree/nbtxlog.c
@@ -386,8 +386,8 @@ btree_xlog_vacuum(XLogReaderState *record)
Buffer buffer;
Page page;
BTPageOpaque opaque;
-#ifdef UNUSED
xl_btree_vacuum *xlrec = (xl_btree_vacuum *) XLogRecGetData(record);
+#ifdef UNUSED
/*
* This section of code is thought to be no longer needed, after analysis
@@ -478,14 +478,34 @@ btree_xlog_vacuum(XLogReaderState *record)
if (len > 0)
{
- OffsetNumber *unused;
- OffsetNumber *unend;
+ if (xlrec->nremaining)
+ {
+ OffsetNumber *remainingoffset;
+ IndexTuple remaining;
+ Size itemsz;
+
+ remainingoffset = (OffsetNumber *)
+ (ptr + xlrec->ndeleted * sizeof(OffsetNumber));
+ remaining = (IndexTuple) ((char *) remainingoffset +
+ xlrec->nremaining * sizeof(OffsetNumber));
- unused = (OffsetNumber *) ptr;
- unend = (OffsetNumber *) ((char *) ptr + len);
+ /* Handle posting tuples */
+ for (int i = 0; i < xlrec->nremaining; i++)
+ {
+ PageIndexTupleDelete(page, remainingoffset[i]);
+
+ itemsz = MAXALIGN(IndexTupleSize(remaining));
+
+ if (PageAddItem(page, (Item) remaining, itemsz, remainingoffset[i],
+ false, false) == InvalidOffsetNumber)
+ elog(PANIC, "btree_xlog_vacuum: failed to add remaining item");
+
+ remaining = (IndexTuple) ((char *) remaining + itemsz);
+ }
+ }
- if ((unend - unused) > 0)
- PageIndexMultiDelete(page, unused, unend - unused);
+ if (xlrec->ndeleted)
+ PageIndexMultiDelete(page, (OffsetNumber *) ptr, xlrec->ndeleted);
}
/*
diff --git a/src/backend/access/rmgrdesc/nbtdesc.c b/src/backend/access/rmgrdesc/nbtdesc.c
index a14eb79..e4fa99a 100644
--- a/src/backend/access/rmgrdesc/nbtdesc.c
+++ b/src/backend/access/rmgrdesc/nbtdesc.c
@@ -46,8 +46,10 @@ btree_desc(StringInfo buf, XLogReaderState *record)
{
xl_btree_vacuum *xlrec = (xl_btree_vacuum *) rec;
- appendStringInfo(buf, "lastBlockVacuumed %u",
- xlrec->lastBlockVacuumed);
+ appendStringInfo(buf, "lastBlockVacuumed %u; nremaining %u; ndeleted %u",
+ xlrec->lastBlockVacuumed,
+ xlrec->nremaining,
+ xlrec->ndeleted);
break;
}
case XLOG_BTREE_DELETE:
diff --git a/src/include/access/itup.h b/src/include/access/itup.h
index 744ffb6..b10c0d5 100644
--- a/src/include/access/itup.h
+++ b/src/include/access/itup.h
@@ -141,6 +141,10 @@ typedef IndexAttributeBitMapData * IndexAttributeBitMap;
* On such a page, N tuples could take one MAXALIGN quantum less space than
* estimated here, seemingly allowing one more tuple than estimated here.
* But such a page always has at least MAXALIGN special space, so we're safe.
+ *
+ * Note: btree leaf pages may contain posting tuples, which store duplicates
+ * more compactly, so such pages may hold more TIDs than this bound.
+ * Use MaxPostingIndexTuplesPerPage for them instead.
*/
#define MaxIndexTuplesPerPage \
((int) ((BLCKSZ - SizeOfPageHeaderData) / \
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 83e0e6c..3064afb 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -234,8 +234,7 @@ typedef struct BTMetaPageData
* t_tid | t_info | key values | INCLUDE columns, if any
*
* t_tid points to the heap TID, which is a tiebreaker key column as of
- * BTREE_VERSION 4. Currently, the INDEX_ALT_TID_MASK status bit is never
- * set for non-pivot tuples.
+ * BTREE_VERSION 4.
*
* All other types of index tuples ("pivot" tuples) only have key columns,
* since pivot tuples only exist to represent how the key space is
@@ -252,6 +251,39 @@ typedef struct BTMetaPageData
* omitted rather than truncated, since its representation is different to
* the non-pivot representation.)
*
+ * Non-pivot posting tuple format:
+ * t_tid | t_info | key values | INCLUDE columns, if any | posting_list[]
+ *
+ * To store duplicate keys more efficiently, we use a special tuple format:
+ * posting tuples.
+ * posting_list is an array of ItemPointerData.
+ *
+ * This type of compression never applies to system indexes, unique indexes
+ * or indexes with INCLUDEd columns.
+ *
+ * To distinguish posting tuples we use the INDEX_ALT_TID_MASK flag in t_info
+ * and the BT_IS_POSTING flag in t_tid.
+ * These flags redefine the content of the posting tuple's t_tid:
+ * - t_tid.ip_blkid contains the offset of the posting list.
+ * - the t_tid offset field contains the number of posting items this tuple
+ * contains.
+ *
+ * The 12 least significant offset bits from t_tid are used to represent
+ * the number of posting items in posting tuples, leaving 4 status
+ * bits (BT_RESERVED_OFFSET_MASK bits), 3 of which that are reserved for
+ * future use.
+ * BT_N_POSTING_OFFSET_MASK is large enough to store any number of posting
+ * items, which is constrained by BTMaxItemSize.
+ *
+ * If page contains so many duplicates, that they do not fit into one posting
+ * tuple (bounded by BTMaxItemSize and ), page may contain several posting
+ * tuples with the same key.
+ * Also page can contain both posting and non-posting tuples with the same key.
+ * Currently, posting tuples always contain at least two TIDs in the posting
+ * list.
+ *
+ * Posting tuples always have the same number of attributes as the index has
+ * generally.
+ *
* Pivot tuple format:
*
* t_tid | t_info | key values | [heap TID]
@@ -281,23 +313,144 @@ typedef struct BTMetaPageData
* bits (BT_RESERVED_OFFSET_MASK bits), 3 of which that are reserved for
* future use. BT_N_KEYS_OFFSET_MASK should be large enough to store any
* number of columns/attributes <= INDEX_MAX_KEYS.
+ * BT_IS_POSTING bit must be unset for pivot tuples, since we use it
+ * to distinguish posting tuples from pivot tuples.
*
* Note well: The macros that deal with the number of attributes in tuples
- * assume that a tuple with INDEX_ALT_TID_MASK set must be a pivot tuple,
- * and that a tuple without INDEX_ALT_TID_MASK set must be a non-pivot
- * tuple (or must have the same number of attributes as the index has
- * generally in the case of !heapkeyspace indexes). They will need to be
- * updated if non-pivot tuples ever get taught to use INDEX_ALT_TID_MASK
- * for something else.
+ * assume that a tuple with INDEX_ALT_TID_MASK set must be a pivot tuple or
+ * non-pivot posting tuple, and that a tuple without INDEX_ALT_TID_MASK set
+ * must be a non-pivot tuple (or must have the same number of attributes as
+ * the index has generally in the case of !heapkeyspace indexes).
*/
#define INDEX_ALT_TID_MASK INDEX_AM_RESERVED_BIT
/* Item pointer offset bits */
#define BT_RESERVED_OFFSET_MASK 0xF000
#define BT_N_KEYS_OFFSET_MASK 0x0FFF
+#define BT_N_POSTING_OFFSET_MASK 0x0FFF
#define BT_HEAP_TID_ATTR 0x1000
+#define BT_IS_POSTING 0x2000
+
+#define BTreeTupleIsPosting(itup) \
+ ( \
+ ((itup)->t_info & INDEX_ALT_TID_MASK && \
+ ((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) != 0))\
+ )
+
+#define BTreeTupleIsPivot(itup) \
+ ( \
+ ((itup)->t_info & INDEX_ALT_TID_MASK && \
+ ((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) == 0))\
+ )
-/* Get/set downlink block number */
+/*
+ * MaxPostingIndexTuplesPerPage is an upper bound on the number of tuples
+ * that can fit on one btree leaf page.
+ *
+ * Btree leaf pages may contain posting tuples, which store duplicates
+ * in a more effective way, so MaxPostingIndexTuplesPerPage is larger than
+ * MaxIndexTuplesPerPage.
+ *
+ * Each leaf page must contain at least three items, so estimate it as
+ * if we have three posting tuples with minimal size keys.
+ */
+#define MaxPostingIndexTuplesPerPage \
+ ((int) ((BLCKSZ - SizeOfPageHeaderData - \
+ 3*((MAXALIGN(sizeof(IndexTupleData) + 1) + sizeof(ItemIdData))) )) / \
+ (sizeof(ItemPointerData)))
+
+/*
+ * Btree-private state needed to build posting tuples.
+ * ipd is a posting list - an array of ItemPointerData.
+ *
+ * Iterating over tuples during index build or applying compression to a
+ * single page, we remember a tuple in itupprev, then compare the next one
+ * with it. If tuples are equal, save their TIDs in the posting list.
+ * ntuples contains the size of the posting list.
+ *
+ * Use maxitemsize and maxpostingsize to ensure that resulting posting tuple
+ * will satisfy BTMaxItemSize.
+ */
+typedef struct BTCompressState
+{
+ Size maxitemsize;
+ Size maxpostingsize;
+ IndexTuple itupprev;
+ int ntuples;
+ ItemPointerData *ipd;
+} BTCompressState;
+
+/* macros to work with posting tuples *BEGIN* */
+#define BTreeTupleSetBtIsPosting(itup) \
+ do { \
+ Assert((itup)->t_info & INDEX_ALT_TID_MASK); \
+ Assert(!((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) != 0)); \
+ ItemPointerSetOffsetNumber(&(itup)->t_tid, \
+ ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) | BT_IS_POSTING); \
+ } while(0)
+
+#define BTreeTupleClearBtIsPosting(itup) \
+ do { \
+ ItemPointerSetOffsetNumber(&(itup)->t_tid, \
+ ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & ~BT_IS_POSTING); \
+ } while(0)
+
+#define BTreeTupleGetNPosting(itup) \
+ ( \
+ AssertMacro(BTreeTupleIsPosting(itup)), \
+ ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_N_POSTING_OFFSET_MASK \
+ )
+
+#define BTreeTupleSetNPosting(itup, n) \
+ do { \
+ ItemPointerSetOffsetNumber(&(itup)->t_tid, (n) & BT_N_POSTING_OFFSET_MASK); \
+ BTreeTupleSetBtIsPosting(itup); \
+ } while(0)
+
+/*
+ * If tuple is posting, t_tid.ip_blkid contains offset of the posting list.
+ * Caller is responsible for checking BTreeTupleIsPosting to ensure that it
+ * will get what is expected.
+ */
+#define BTreeTupleGetPostingOffset(itup) \
+ ( \
+ AssertMacro(BTreeTupleIsPosting(itup)), \
+ ItemPointerGetBlockNumberNoCheck(&((itup)->t_tid)) \
+ )
+#define BTreeTupleSetPostingOffset(itup, offset) \
+ ( \
+ AssertMacro(BTreeTupleIsPosting(itup)), \
+ ItemPointerSetBlockNumber(&((itup)->t_tid), (offset)) \
+ )
+#define BTreeSetPostingMeta(itup, nposting, off) \
+ do { \
+ BTreeTupleSetNPosting(itup, nposting); \
+ BTreeTupleSetPostingOffset(itup, off); \
+ } while(0)
+
+#define BTreeTupleGetPosting(itup) \
+ (ItemPointerData*) ((char*)(itup) + BTreeTupleGetPostingOffset(itup))
+#define BTreeTupleGetPostingN(itup,n) \
+ (ItemPointerData*) (BTreeTupleGetPosting(itup) + (n))
+
+/*
+ * Posting tuples always contain more than one TID. The minimum TID can be
+ * accessed using BTreeTupleGetHeapTID(). The maximum is accessed using
+ * BTreeTupleGetMaxTID().
+ */
+#define BTreeTupleGetMaxTID(itup) \
+ ( \
+ ((itup)->t_info & INDEX_ALT_TID_MASK && \
+ ((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING))) ? \
+ ( \
+ (ItemPointer) (BTreeTupleGetPosting(itup) + (BTreeTupleGetNPosting(itup)-1)) \
+ ) \
+ : \
+ (ItemPointer) &((itup)->t_tid) \
+ )
+/* macros to work with posting tuples *END* */
+
+/* Get/set downlink block number */
#define BTreeInnerTupleGetDownLink(itup) \
ItemPointerGetBlockNumberNoCheck(&((itup)->t_tid))
#define BTreeInnerTupleSetDownLink(itup, blkno) \
@@ -326,7 +479,8 @@ typedef struct BTMetaPageData
*/
#define BTreeTupleGetNAtts(itup, rel) \
( \
- (itup)->t_info & INDEX_ALT_TID_MASK ? \
+ ((itup)->t_info & INDEX_ALT_TID_MASK && \
+ ((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) == 0)) ? \
( \
ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_N_KEYS_OFFSET_MASK \
) \
@@ -335,6 +489,7 @@ typedef struct BTMetaPageData
)
#define BTreeTupleSetNAtts(itup, n) \
do { \
+ Assert(!BTreeTupleIsPosting(itup)); \
(itup)->t_info |= INDEX_ALT_TID_MASK; \
ItemPointerSetOffsetNumber(&(itup)->t_tid, (n) & BT_N_KEYS_OFFSET_MASK); \
} while(0)
@@ -342,6 +497,8 @@ typedef struct BTMetaPageData
/*
* Get tiebreaker heap TID attribute, if any. Macro works with both pivot
* and non-pivot tuples, despite differences in how heap TID is represented.
+ *
+ * For non-pivot posting tuples this returns the first tid from posting list.
*/
#define BTreeTupleGetHeapTID(itup) \
( \
@@ -351,7 +508,10 @@ typedef struct BTMetaPageData
(ItemPointer) (((char *) (itup) + IndexTupleSize(itup)) - \
sizeof(ItemPointerData)) \
) \
- : (itup)->t_info & INDEX_ALT_TID_MASK ? NULL : (ItemPointer) &((itup)->t_tid) \
+ : (itup)->t_info & INDEX_ALT_TID_MASK ? \
+ (((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) != 0) ? \
+ (ItemPointer) BTreeTupleGetPosting(itup) : NULL) \
+ : (ItemPointer) &((itup)->t_tid) \
)
/*
* Set the heap TID attribute for a tuple that uses the INDEX_ALT_TID_MASK
@@ -360,6 +520,7 @@ typedef struct BTMetaPageData
#define BTreeTupleSetAltHeapTID(itup) \
do { \
Assert((itup)->t_info & INDEX_ALT_TID_MASK); \
+ Assert(!((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) != 0)); \
ItemPointerSetOffsetNumber(&(itup)->t_tid, \
ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) | BT_HEAP_TID_ATTR); \
} while(0)
@@ -501,6 +662,12 @@ typedef struct BTInsertStateData
Buffer buf;
/*
+ * if _bt_binsrch_insert() found the location inside existing posting
+ * list, save the position inside the list.
+ */
+ int in_posting_offset;
+
+ /*
* Cache of bounds within the current buffer. Only used for insertions
* where _bt_check_unique is called. See _bt_binsrch_insert and
* _bt_findinsertloc for details.
@@ -567,6 +734,8 @@ typedef struct BTScanPosData
* location in the associated tuple storage workspace.
*/
int nextTupleOffset;
+ /* prevTupleOffset is for posting list handling */
+ int prevTupleOffset;
/*
* The items array is always ordered in index order (ie, increasing
@@ -579,7 +748,7 @@ typedef struct BTScanPosData
int lastItem; /* last valid index in items[] */
int itemIndex; /* current index in items[] */
- BTScanPosItem items[MaxIndexTuplesPerPage]; /* MUST BE LAST */
+ BTScanPosItem items[MaxPostingIndexTuplesPerPage]; /* MUST BE LAST */
} BTScanPosData;
typedef BTScanPosData *BTScanPos;
@@ -739,6 +908,8 @@ extern void _bt_finish_split(Relation rel, Buffer bbuf, BTStack stack);
*/
extern OffsetNumber _bt_findsplitloc(Relation rel, Page page,
OffsetNumber newitemoff, Size newitemsz, IndexTuple newitem,
+ OffsetNumber replaceitemoff, Size replaceitemsz,
+ IndexTuple replaceitem,
bool *newitemonleft);
/*
@@ -763,6 +934,8 @@ extern void _bt_delitems_delete(Relation rel, Buffer buf,
OffsetNumber *itemnos, int nitems, Relation heapRel);
extern void _bt_delitems_vacuum(Relation rel, Buffer buf,
OffsetNumber *itemnos, int nitems,
+ OffsetNumber *remainingoffset,
+ IndexTuple *remaining, int nremaining,
BlockNumber lastBlockVacuumed);
extern int _bt_pagedel(Relation rel, Buffer buf);
@@ -775,6 +948,8 @@ extern Buffer _bt_moveright(Relation rel, BTScanInsert key, Buffer buf,
bool forupdate, BTStack stack, int access, Snapshot snapshot);
extern OffsetNumber _bt_binsrch_insert(Relation rel, BTInsertState insertstate);
extern int32 _bt_compare(Relation rel, BTScanInsert key, Page page, OffsetNumber offnum);
+extern int32 _bt_compare_posting(Relation rel, BTScanInsert key, Page page,
+ OffsetNumber offnum, int *in_posting_offset);
extern bool _bt_first(IndexScanDesc scan, ScanDirection dir);
extern bool _bt_next(IndexScanDesc scan, ScanDirection dir);
extern Buffer _bt_get_endpoint(Relation rel, uint32 level, bool rightmost,
@@ -813,6 +988,9 @@ extern bool _bt_check_natts(Relation rel, bool heapkeyspace, Page page,
OffsetNumber offnum);
extern void _bt_check_third_page(Relation rel, Relation heap,
bool needheaptidspace, Page page, IndexTuple newtup);
+extern IndexTuple BTreeFormPostingTuple(IndexTuple tuple, ItemPointerData *ipd,
+ int nipd);
+extern IndexTuple BTreeGetNthTupleOfPosting(IndexTuple tuple, int n);
/*
* prototypes for functions in nbtvalidate.c
@@ -825,5 +1003,7 @@ extern bool btvalidate(Oid opclassoid);
extern IndexBuildResult *btbuild(Relation heap, Relation index,
struct IndexInfo *indexInfo);
extern void _bt_parallel_build_main(dsm_segment *seg, shm_toc *toc);
+extern void _bt_add_posting_item(BTCompressState *compressState,
+ IndexTuple itup);
#endif /* NBTREE_H */
diff --git a/src/include/access/nbtxlog.h b/src/include/access/nbtxlog.h
index afa614d..4b615e0 100644
--- a/src/include/access/nbtxlog.h
+++ b/src/include/access/nbtxlog.h
@@ -173,10 +173,19 @@ typedef struct xl_btree_vacuum
{
BlockNumber lastBlockVacuumed;
- /* TARGET OFFSET NUMBERS FOLLOW */
+ /*
+ * This field helps us to find beginning of the remaining tuples from
+ * postings which follow array of offset numbers.
+ */
+ uint32 nremaining;
+ uint32 ndeleted;
+
+ /* REMAINING OFFSET NUMBERS FOLLOW (nremaining values) */
+ /* REMAINING TUPLES TO INSERT FOLLOW (if nremaining > 0) */
+ /* TARGET OFFSET NUMBERS FOLLOW (if any) */
} xl_btree_vacuum;
-#define SizeOfBtreeVacuum (offsetof(xl_btree_vacuum, lastBlockVacuumed) + sizeof(BlockNumber))
+#define SizeOfBtreeVacuum (offsetof(xl_btree_vacuum, ndeleted) + sizeof(uint32))
/*
* This is what we need to know about marking an empty branch for deletion.
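The posting-tuple macros in the nbtree.h hunk above pack two things into the 16-bit offset number of t_tid: the BT_IS_POSTING flag and, in the low 12 bits, the number of TIDs in the posting list. The following is a minimal standalone sketch of that bit layout, not the patch's code. The mask values are copied from the patch; DemoTuple and the demo_* functions are hypothetical stand-ins, and the sketch sets the "alt TID" flag itself, whereas the real BTreeTupleSetNPosting expects INDEX_ALT_TID_MASK to be set already.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Mask values copied from the nbtree.h hunk in the patch. */
#define BT_N_POSTING_OFFSET_MASK 0x0FFF
#define BT_HEAP_TID_ATTR         0x1000
#define BT_IS_POSTING            0x2000

/* Hypothetical stand-in for a tuple header; not PostgreSQL's IndexTupleData. */
typedef struct DemoTuple
{
    bool     alt_tid;    /* models INDEX_ALT_TID_MASK being set in t_info */
    uint16_t tid_offset; /* models the offset number stored in t_tid */
} DemoTuple;

/* Store the posting-list length in the low 12 bits and set the posting bit. */
static void
demo_set_nposting(DemoTuple *tup, int n)
{
    assert(n >= 0 && n <= BT_N_POSTING_OFFSET_MASK);
    tup->alt_tid = true;    /* simplification: the real macro only asserts this */
    tup->tid_offset = (uint16_t) ((n & BT_N_POSTING_OFFSET_MASK) | BT_IS_POSTING);
}

/* Same test as BTreeTupleIsPosting: alt-TID flag set and posting bit set. */
static bool
demo_is_posting(const DemoTuple *tup)
{
    return tup->alt_tid && (tup->tid_offset & BT_IS_POSTING) != 0;
}

/* Same test as BTreeTupleIsPivot: alt-TID flag set and posting bit clear. */
static bool
demo_is_pivot(const DemoTuple *tup)
{
    return tup->alt_tid && (tup->tid_offset & BT_IS_POSTING) == 0;
}

/* Read back the posting-list length from the low 12 bits. */
static int
demo_get_nposting(const DemoTuple *tup)
{
    return tup->tid_offset & BT_N_POSTING_OFFSET_MASK;
}
```

Because pivot and posting tuples both set the alt-TID flag, the posting bit is the only thing that tells them apart, which is why the patch asserts that it is unset before writing a pivot tuple's attribute count.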
On Tue, Aug 13, 2019 at 8:45 AM Anastasia Lubennikova
<a.lubennikova@postgrespro.ru> wrote:
I still need to think about the exact details of alignment within
_bt_insertonpg_in_posting(). I'm worried about boundary cases there. I
could be wrong.
Could you explain more about these cases?
Now I don't understand the problem.
Maybe there is no problem.
Thank you for the patch.
Still, I'd suggest leaving it as a possible future improvement, so that
it doesn't distract us from the original feature.
I don't even think that it's useful work for the future. It's just
nice to be sure that we could support unique index deduplication if it
made sense. Which it doesn't. If I didn't write the patch that
implements deduplication for unique indexes, I might still not realize
that we need the index_compute_xid_horizon_for_tuples() stuff in
certain other places. I'm not serious about it at all, except as a
learning exercise/experiment.
I added another related fix for _bt_compress_one_page() to v6.
The previous code implicitly deleted DEAD items without
calling index_compute_xid_horizon_for_tuples().
The new code checks whether DEAD items exist on the page and removes
them if any.
Another possible solution is to copy dead items as is from the old page
to the new one, but I think it's good to remove dead tuples as fast as
possible.
I think that what you've done in v7 is probably the best way to do it.
It's certainly simple, which is appropriate given that we're not
really expecting to see LP_DEAD items within _bt_compress_one_page()
(we just need to be prepared for them).
v5 makes _bt_insertonpg_in_posting() prepared to overwrite an
existing item if it's an LP_DEAD item that falls in the same TID range
(that's _bt_compare()-wise "equal" to an existing tuple, which may or
may not be a posting list tuple already). I haven't made this code do
something like call index_compute_xid_horizon_for_tuples(), even
though that's needed for correctness (i.e. this new code is currently
broken in the same way that I mentioned unique index support is
broken).
Is it possible that the DEAD tuple to delete was smaller than itup?
I'm not sure what you mean by this. I suppose that it doesn't matter,
since we both prefer the alternative that you came up with anyway.
How do you feel about officially calling this deduplication, not
compression? I think that it's a more accurate name for the technique.
I agree.
Should I rename all related names of functions and variables in the patch?
Please rename them when convenient.
--
Peter Geoghegan
On Fri, Aug 16, 2019 at 8:56 AM Anastasia Lubennikova
<a.lubennikova@postgrespro.ru> wrote:
Now the algorithm is the following:
- If _bt_findinsertloc() finds that the tuple belongs to an existing posting
tuple's TID interval, it sets the 'in_posting_offset' variable and passes it to
_bt_insertonpg().
- If 'in_posting_offset' is valid and origtup is valid,
merge our itup into origtup.
This can result in one tuple, neworigtup, that must replace origtup; or in two
tuples, neworigtup and newrighttup, if the result exceeds BTMaxItemSize.
That sounds like the right way to do it.
- If the new tuple(s) fit into the old page, we're lucky:
call _bt_delete_and_insert(..., neworigtup, newrighttup, newitemoff) to
atomically replace oldtup with the new tuple(s) and generate an xlog record.
- In case a page split is needed, pass both tuples to _bt_split().
_bt_findsplitloc() is now aware of upcoming replacement of origtup with
neworigtup, so it uses correct item size where needed.
That makes sense, since _bt_split() is responsible for both splitting
the page, and inserting the new item on either the left or right page,
as part of the first phase of a page split. In other words, if you're
adding something new to _bt_insertonpg(), you probably also need to
add something new to _bt_split(). So that's what you did.
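The split step described above (neworigtup keeps the TIDs below the new one, newrighttup starts with the new TID followed by the rest) can be sketched on plain arrays. This is only an illustration under stated assumptions: DemoTid and the demo_* functions are hypothetical, and the sketch ignores BTMaxItemSize, alignment and the tuple header, all of which the patch handles through BTreeFormPostingTuple().

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a heap TID (block number, offset number). */
typedef struct DemoTid
{
    uint32_t block;
    uint16_t offset;
} DemoTid;

/* Order TIDs by block number, then by offset within the block. */
static int
demo_tid_cmp(const DemoTid *a, const DemoTid *b)
{
    if (a->block != b->block)
        return a->block < b->block ? -1 : 1;
    if (a->offset != b->offset)
        return a->offset < b->offset ? -1 : 1;
    return 0;
}

/*
 * Return the index of the first TID in the sorted list that is not less
 * than newtid. This plays the role that in_posting_offset plays in the patch.
 */
static int
demo_find_split(const DemoTid *list, int n, const DemoTid *newtid)
{
    int i = 0;

    while (i < n && demo_tid_cmp(&list[i], newtid) < 0)
        i++;
    return i;
}

/*
 * Split a sorted posting list at 'split'. 'left' receives the first 'split'
 * TIDs; 'right' receives newtid followed by the remaining TIDs. The caller
 * provides the buffers; the function returns the number of TIDs in 'right'.
 */
static int
demo_split_posting(const DemoTid *orig, int n, int split,
                   const DemoTid *newtid, DemoTid *left, DemoTid *right)
{
    assert(split > 0 && split < n);  /* new TID falls strictly inside the range */

    memcpy(left, orig, sizeof(DemoTid) * (size_t) split);
    right[0] = *newtid;
    memcpy(right + 1, orig + split, sizeof(DemoTid) * (size_t) (n - split));
    return n - split + 1;
}
```

For a posting list (1,1), (1,3), (2,1) and a new TID (1,2), the split index is 1, so the left part holds (1,1) and the right part holds (1,2), (1,3), (2,1). Both parts stay sorted, and the largest TID on the left is smaller than the smallest TID on the right.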
It seems that now all replace operations are crash-safe. The new patch passes
all regression tests, so I think it's ready for review again.
I'm looking at it now. I'm going to spend a significant amount of time
on this tomorrow.
I think that we should start to think about efficient WAL-logging now.
In the meantime, I'll run more stress-tests.
As you probably realize, wal_consistency_checking is a good thing to
use with your tests here.
--
Peter Geoghegan
20.08.2019 4:04, Peter Geoghegan wrote:
On Fri, Aug 16, 2019 at 8:56 AM Anastasia Lubennikova
<a.lubennikova@postgrespro.ru> wrote:
It seems that now all replace operations are crash-safe. The new patch passes
all regression tests, so I think it's ready for review again.
I'm looking at it now. I'm going to spend a significant amount of time
on this tomorrow.
I think that we should start to think about efficient WAL-logging now.
Thank you for the review.
The new version v8 is attached. Compared to the previous version, this patch
includes updated btree_xlog_insert() and btree_xlog_split(), so that WAL
records now only contain data about the updated posting tuple and don't
require full page writes.
I haven't updated pg_waldump yet. It is postponed until we agree on
nbtxlog changes.
Also in this patch I renamed all 'compress' keywords to 'deduplicate'
and did minor cleanup of outdated comments.
I'm going to look through the patch once more to update nbtxlog
comments where needed, and to respond to your remarks that are still left
in the comments.
--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachments:
v8-0001-Deduplication-in-nbtree.patch (text/x-patch)
commit d73c1b8e10177dfb55ff1b1bac999f85d2a0298d
Author: Anastasia <a.lubennikova@postgrespro.ru>
Date: Wed Aug 21 20:00:54 2019 +0300
v8-0001-Deduplication-in-nbtree.patch
diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index 05e7d67..ddc511a 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -924,6 +924,7 @@ bt_target_page_check(BtreeCheckState *state)
size_t tupsize;
BTScanInsert skey;
bool lowersizelimit;
+ ItemPointer scantid;
CHECK_FOR_INTERRUPTS();
@@ -994,29 +995,73 @@ bt_target_page_check(BtreeCheckState *state)
/*
* Readonly callers may optionally verify that non-pivot tuples can
- * each be found by an independent search that starts from the root
+ * each be found by an independent search that starts from the root.
+ * Note that we deliberately don't do individual searches for each
+ * "logical" posting list tuple, since the posting list itself is
+ * validated by other checks.
*/
if (state->rootdescend && P_ISLEAF(topaque) &&
!bt_rootdescend(state, itup))
{
char *itid,
*htid;
+ ItemPointer tid = BTreeTupleGetHeapTID(itup);
itid = psprintf("(%u,%u)", state->targetblock, offset);
htid = psprintf("(%u,%u)",
- ItemPointerGetBlockNumber(&(itup->t_tid)),
- ItemPointerGetOffsetNumber(&(itup->t_tid)));
+ ItemPointerGetBlockNumber(tid),
+ ItemPointerGetOffsetNumber(tid));
ereport(ERROR,
(errcode(ERRCODE_INDEX_CORRUPTED),
errmsg("could not find tuple using search from root page in index \"%s\"",
RelationGetRelationName(state->rel)),
- errdetail_internal("Index tid=%s points to heap tid=%s page lsn=%X/%X.",
+ errdetail_internal("Index tid=%s min heap tid=%s page lsn=%X/%X.",
itid, htid,
(uint32) (state->targetlsn >> 32),
(uint32) state->targetlsn)));
}
+ /*
+ * If tuple is actually a posting list, make sure posting list TIDs
+ * are in order.
+ */
+ if (BTreeTupleIsPosting(itup))
+ {
+ ItemPointerData last;
+ ItemPointer current;
+
+ ItemPointerCopy(BTreeTupleGetHeapTID(itup), &last);
+
+ for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
+ {
+
+ current = BTreeTupleGetPostingN(itup, i);
+
+ if (ItemPointerCompare(current, &last) <= 0)
+ {
+ char *itid,
+ *htid;
+
+ itid = psprintf("(%u,%u)", state->targetblock, offset);
+ htid = psprintf("(%u,%u)",
+ ItemPointerGetBlockNumberNoCheck(current),
+ ItemPointerGetOffsetNumberNoCheck(current));
+
+ ereport(ERROR,
+ (errcode(ERRCODE_INDEX_CORRUPTED),
+ errmsg("posting list heap TIDs out of order in index \"%s\"",
+ RelationGetRelationName(state->rel)),
+ errdetail_internal("Index tid=%s min heap tid=%s page lsn=%X/%X.",
+ itid, htid,
+ (uint32) (state->targetlsn >> 32),
+ (uint32) state->targetlsn)));
+ }
+
+ ItemPointerCopy(current, &last);
+ }
+ }
+
/* Build insertion scankey for current page offset */
skey = bt_mkscankey_pivotsearch(state->rel, itup);
@@ -1074,12 +1119,33 @@ bt_target_page_check(BtreeCheckState *state)
{
IndexTuple norm;
- norm = bt_normalize_tuple(state, itup);
- bloom_add_element(state->filter, (unsigned char *) norm,
- IndexTupleSize(norm));
- /* Be tidy */
- if (norm != itup)
- pfree(norm);
+ if (BTreeTupleIsPosting(itup))
+ {
+ IndexTuple onetup;
+
+ /* Fingerprint all elements of posting tuple one by one */
+ for (int i = 0; i < BTreeTupleGetNPosting(itup); i++)
+ {
+ onetup = BTreeGetNthTupleOfPosting(itup, i);
+
+ norm = bt_normalize_tuple(state, onetup);
+ bloom_add_element(state->filter, (unsigned char *) norm,
+ IndexTupleSize(norm));
+ /* Be tidy */
+ if (norm != onetup)
+ pfree(norm);
+ pfree(onetup);
+ }
+ }
+ else
+ {
+ norm = bt_normalize_tuple(state, itup);
+ bloom_add_element(state->filter, (unsigned char *) norm,
+ IndexTupleSize(norm));
+ /* Be tidy */
+ if (norm != itup)
+ pfree(norm);
+ }
}
/*
@@ -1087,7 +1153,8 @@ bt_target_page_check(BtreeCheckState *state)
*
* If there is a high key (if this is not the rightmost page on its
* entire level), check that high key actually is upper bound on all
- * page items.
+ * page items. If this is a posting list tuple, we'll need to set
+ * scantid to be highest TID in posting list.
*
* We prefer to check all items against high key rather than checking
* just the last and trusting that the operator class obeys the
@@ -1127,6 +1194,9 @@ bt_target_page_check(BtreeCheckState *state)
* tuple. (See also: "Notes About Data Representation" in the nbtree
* README.)
*/
+ scantid = skey->scantid;
+ if (!BTreeTupleIsPivot(itup))
+ skey->scantid = BTreeTupleGetMaxTID(itup);
if (!P_RIGHTMOST(topaque) &&
!(P_ISLEAF(topaque) ? invariant_leq_offset(state, skey, P_HIKEY) :
invariant_l_offset(state, skey, P_HIKEY)))
@@ -1150,6 +1220,7 @@ bt_target_page_check(BtreeCheckState *state)
(uint32) (state->targetlsn >> 32),
(uint32) state->targetlsn)));
}
+ skey->scantid = scantid;
/*
* * Item order check *
@@ -1164,11 +1235,13 @@ bt_target_page_check(BtreeCheckState *state)
*htid,
*nitid,
*nhtid;
+ ItemPointer tid;
itid = psprintf("(%u,%u)", state->targetblock, offset);
+ tid = BTreeTupleGetHeapTID(itup);
htid = psprintf("(%u,%u)",
- ItemPointerGetBlockNumberNoCheck(&(itup->t_tid)),
- ItemPointerGetOffsetNumberNoCheck(&(itup->t_tid)));
+ ItemPointerGetBlockNumberNoCheck(tid),
+ ItemPointerGetOffsetNumberNoCheck(tid));
nitid = psprintf("(%u,%u)", state->targetblock,
OffsetNumberNext(offset));
@@ -1177,9 +1250,11 @@ bt_target_page_check(BtreeCheckState *state)
state->target,
OffsetNumberNext(offset));
itup = (IndexTuple) PageGetItem(state->target, itemid);
+
+ tid = BTreeTupleGetHeapTID(itup);
nhtid = psprintf("(%u,%u)",
- ItemPointerGetBlockNumberNoCheck(&(itup->t_tid)),
- ItemPointerGetOffsetNumberNoCheck(&(itup->t_tid)));
+ ItemPointerGetBlockNumberNoCheck(tid),
+ ItemPointerGetOffsetNumberNoCheck(tid));
ereport(ERROR,
(errcode(ERRCODE_INDEX_CORRUPTED),
@@ -1189,10 +1264,10 @@ bt_target_page_check(BtreeCheckState *state)
"higher index tid=%s (points to %s tid=%s) "
"page lsn=%X/%X.",
itid,
- P_ISLEAF(topaque) ? "heap" : "index",
+ P_ISLEAF(topaque) ? "min heap" : "index",
htid,
nitid,
- P_ISLEAF(topaque) ? "heap" : "index",
+ P_ISLEAF(topaque) ? "min heap" : "index",
nhtid,
(uint32) (state->targetlsn >> 32),
(uint32) state->targetlsn)));
@@ -1953,10 +2028,11 @@ bt_tuple_present_callback(Relation index, HeapTuple htup, Datum *values,
* verification. In particular, it won't try to normalize opclass-equal
* datums with potentially distinct representations (e.g., btree/numeric_ops
* index datums will not get their display scale normalized-away here).
- * Normalization may need to be expanded to handle more cases in the future,
- * though. For example, it's possible that non-pivot tuples could in the
- * future have alternative logically equivalent representations due to using
- * the INDEX_ALT_TID_MASK bit to implement intelligent deduplication.
+ * Caller does normalization for non-pivot tuples that have their own posting
+ * list, since dummy CREATE INDEX callback code generates new tuples with the
+ * same normalized representation. Deduplication is performed
+ * opportunistically, and in general there is no guarantee about how or when
+ * deduplication will be applied.
*/
static IndexTuple
bt_normalize_tuple(BtreeCheckState *state, IndexTuple itup)
@@ -2560,14 +2636,16 @@ static inline ItemPointer
BTreeTupleGetHeapTIDCareful(BtreeCheckState *state, IndexTuple itup,
bool nonpivot)
{
- ItemPointer result = BTreeTupleGetHeapTID(itup);
+ ItemPointer result;
BlockNumber targetblock = state->targetblock;
- if (result == NULL && nonpivot)
+ if (BTreeTupleIsPivot(itup) == nonpivot)
ereport(ERROR,
(errcode(ERRCODE_INDEX_CORRUPTED),
errmsg("block %u or its right sibling block or child block in index \"%s\" contains non-pivot tuple that lacks a heap TID",
targetblock, RelationGetRelationName(state->rel))));
+ result = BTreeTupleGetHeapTID(itup);
+
return result;
}
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index 48d19be..9af59c1 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -47,15 +47,17 @@ static void _bt_insertonpg(Relation rel, BTScanInsert itup_key,
BTStack stack,
IndexTuple itup,
OffsetNumber newitemoff,
- bool split_only_page);
+ bool split_only_page, int in_posting_offset);
static Buffer _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf,
Buffer cbuf, OffsetNumber newitemoff, Size newitemsz,
- IndexTuple newitem);
+ IndexTuple newitem, IndexTuple lefttup, IndexTuple righttup);
static void _bt_insert_parent(Relation rel, Buffer buf, Buffer rbuf,
BTStack stack, bool is_root, bool is_only);
static bool _bt_pgaddtup(Page page, Size itemsize, IndexTuple itup,
OffsetNumber itup_off);
static void _bt_vacuum_one_page(Relation rel, Buffer buffer, Relation heapRel);
+static void insert_itupprev_to_page(Page page, BTDeduplicateState *deduplicateState);
+static void _bt_deduplicate_one_page(Relation rel, Buffer buffer, Relation heapRel);
/*
* _bt_doinsert() -- Handle insertion of a single index tuple in the tree.
@@ -297,10 +299,13 @@ top:
* search bounds established within _bt_check_unique when insertion is
* checkingunique.
*/
+ insertstate.in_posting_offset = 0;
newitemoff = _bt_findinsertloc(rel, &insertstate, checkingunique,
stack, heapRel);
- _bt_insertonpg(rel, itup_key, insertstate.buf, InvalidBuffer, stack,
- itup, newitemoff, false);
+
+ _bt_insertonpg(rel, itup_key, insertstate.buf, InvalidBuffer,
+ stack, itup, newitemoff, false,
+ insertstate.in_posting_offset);
}
else
{
@@ -435,6 +440,7 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
/* okay, we gotta fetch the heap tuple ... */
curitup = (IndexTuple) PageGetItem(page, curitemid);
+ Assert(!BTreeTupleIsPosting(curitup));
htid = curitup->t_tid;
/*
@@ -759,6 +765,26 @@ _bt_findinsertloc(Relation rel,
_bt_vacuum_one_page(rel, insertstate->buf, heapRel);
insertstate->bounds_valid = false;
}
+
+ /*
+ * If the target page is full, try to deduplicate the page
+ */
+ if (PageGetFreeSpace(page) < insertstate->itemsz && !checkingunique)
+ {
+ _bt_deduplicate_one_page(rel, insertstate->buf, heapRel);
+ insertstate->bounds_valid = false; /* paranoia */
+
+ /*
+ * FIXME: _bt_vacuum_one_page() won't have cleared the
+ * BTP_HAS_GARBAGE flag when it didn't kill items. Maybe we
+ * should clear the BTP_HAS_GARBAGE flag bit from the page when
+ * deduplication avoids a page split -- _bt_vacuum_one_page() is
+ * expecting a page split that takes care of it.
+ *
+ * (On the other hand, maybe it doesn't matter very much. A
+ * comment update seems like the bare minimum we should do.)
+ */
+ }
}
else
{
@@ -900,6 +926,75 @@ _bt_stepright(Relation rel, BTInsertState insertstate, BTStack stack)
insertstate->bounds_valid = false;
}
+/*
+ * Delete tuple on newitemoff offset and insert newitup at the same offset.
+ *
+ * If original posting tuple was split, 'newitup' represents left part of
+ * original tuple and 'newitupright' is its right part, that must be inserted
+ * next to newitemoff.
+ * It's essential to do this atomically to be crash safe.
+ *
+ * NOTE All checks of free space must be done before calling this function.
+ *
+ * For use in posting tuple's update.
+ */
+void
+_bt_delete_and_insert(Buffer buf,
+ Page page,
+ IndexTuple newitup, IndexTuple newitupright,
+ OffsetNumber newitemoff, bool need_xlog)
+{
+ Size newitupsz = IndexTupleSize(newitup);
+
+ newitupsz = MAXALIGN(newitupsz);
+
+ START_CRIT_SECTION();
+
+ PageIndexTupleDelete(page, newitemoff);
+
+ if (!_bt_pgaddtup(page, newitupsz, newitup, newitemoff))
+ elog(ERROR, "failed to insert posting item in index");
+
+ if (newitupright)
+ {
+ if (!_bt_pgaddtup(page, MAXALIGN(IndexTupleSize(newitupright)),
+ newitupright, OffsetNumberNext(newitemoff)))
+ elog(ERROR, "failed to insert posting item in index");
+ }
+
+ if (BufferIsValid(buf))
+ {
+ MarkBufferDirty(buf);
+
+ /* Xlog stuff */
+ if (need_xlog)
+ {
+ xl_btree_insert xlrec;
+ XLogRecPtr recptr;
+
+ xlrec.offnum = newitemoff;
+ xlrec.righttupoffset = 1;
+ if (newitupright)
+ xlrec.righttupoffset = IndexTupleSize(newitup);
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, SizeOfBtreeInsert);
+
+ Assert(P_ISLEAF((BTPageOpaque) PageGetSpecialPointer(page)));
+
+ XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
+ XLogRegisterBufData(0, (char *) newitup, IndexTupleSize(newitup));
+ if (newitupright)
+ XLogRegisterBufData(0, (char *) newitupright, IndexTupleSize(newitupright));
+
+ recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_INSERT_LEAF);
+
+ PageSetLSN(page, recptr);
+ }
+ }
+ END_CRIT_SECTION();
+}
+
/*----------
* _bt_insertonpg() -- Insert a tuple on a particular page in the index.
*
@@ -936,11 +1031,16 @@ _bt_insertonpg(Relation rel,
BTStack stack,
IndexTuple itup,
OffsetNumber newitemoff,
- bool split_only_page)
+ bool split_only_page,
+ int in_posting_offset)
{
Page page;
BTPageOpaque lpageop;
Size itemsz;
+ IndexTuple origtup;
+ IndexTuple neworigtup = NULL;
+ IndexTuple newrighttup = NULL;
+ bool need_split = false;
page = BufferGetPage(buf);
lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -965,13 +1065,184 @@ _bt_insertonpg(Relation rel,
* need to be consistent */
/*
+ * If new tuple's key is equal to the key of a posting tuple that already
+ * exists on the page and its TID falls inside the min/max range of
+ * existing posting list, update the posting tuple.
+ *
+ * TODO Think of moving this to a separate function.
+ *
+ * TODO possible optimization:
+ * if original posting tuple is dead,
+ * reset in_posting_offset and handle itup as a regular tuple
+ */
+ if (in_posting_offset)
+ {
+ /* get old posting tuple */
+ ItemId itemid = PageGetItemId(page, newitemoff);
+ ItemPointerData *ipd;
+ int nipd, nipd_right;
+ bool need_posting_split = false;
+
+ origtup = (IndexTuple) PageGetItem(page, itemid);
+ Assert(BTreeTupleIsPosting(origtup));
+ nipd = BTreeTupleGetNPosting(origtup);
+ Assert(in_posting_offset < nipd);
+ Assert(itup_key->scantid != NULL);
+ Assert(itup_key->heapkeyspace);
+
+ elog(DEBUG4, "(%u,%u) is min, (%u,%u) is max, (%u,%u) is new",
+ ItemPointerGetBlockNumberNoCheck(BTreeTupleGetHeapTID(origtup)),
+ ItemPointerGetOffsetNumberNoCheck(BTreeTupleGetHeapTID(origtup)),
+ ItemPointerGetBlockNumberNoCheck(BTreeTupleGetMaxTID(origtup)),
+ ItemPointerGetOffsetNumberNoCheck(BTreeTupleGetMaxTID(origtup)),
+ ItemPointerGetBlockNumberNoCheck(BTreeTupleGetMaxTID(itup)),
+ ItemPointerGetOffsetNumberNoCheck(BTreeTupleGetMaxTID(itup)));
+
+ /* check if posting tuple must be split */
+ if (BTMaxItemSize(page) < MAXALIGN(IndexTupleSize(origtup)) + sizeof(ItemPointerData))
+ need_posting_split = true;
+
+ /*
+ * If page split is needed, always split posting tuple.
+ * Probably that is not the most optimal,
+ * but it allows to simplify _bt_split code.
+ *
+ * TODO Does this decision have any significant drawbacks?
+ */
+ if (PageGetFreeSpace(page) < sizeof(ItemPointerData))
+ need_posting_split = true;
+
+ /*
+ * Handle corner cases (1)
+ * - itup TID is smaller than leftmost origtup TID
+ */
+ if (ItemPointerCompare(BTreeTupleGetHeapTID(itup),
+ BTreeTupleGetHeapTID(origtup)) < 0)
+ {
+ if (need_posting_split)
+ {
+ /*
+ * cannot avoid split, so no need in trying to fit itup into posting list.
+ * handle itup insertion as regular tuple insertion
+ */
+ elog(DEBUG4, "split posting tuple. itup is to the left of origtup");
+ in_posting_offset = InvalidOffsetNumber;
+ newitemoff = OffsetNumberPrev(newitemoff);
+ }
+ else
+ {
+ ipd = palloc0(sizeof(ItemPointerData) * (nipd + 1));
+ /* insert new item pointer */
+ memcpy(ipd, itup, sizeof(ItemPointerData));
+ /* copy item pointers from original tuple that belong on right */
+ memcpy(ipd + 1, BTreeTupleGetPosting(origtup), sizeof(ItemPointerData) * nipd);
+ neworigtup = BTreeFormPostingTuple(origtup, ipd, nipd+1);
+ pfree(ipd);
+
+ Assert(ItemPointerCompare(BTreeTupleGetHeapTID(neworigtup),
+ BTreeTupleGetMaxTID(neworigtup)) < 0);
+ }
+ }
+
+ /*
+ * Handle corner cases (2)
+ * - itup TID is larger than rightmost origtup TID
+ */
+ if (ItemPointerCompare(BTreeTupleGetMaxTID(origtup),
+ BTreeTupleGetHeapTID(itup)) < 0)
+ {
+ if (need_posting_split)
+ {
+ /*
+ * cannot avoid split, so no need in trying to fit itup into posting list.
+ * handle itup insertion as regular tuple insertion
+ */
+ elog(DEBUG4, "split posting tuple. itup is to the right of origtup");
+ in_posting_offset = InvalidOffsetNumber;
+ }
+ else
+ {
+ ipd = palloc0(sizeof(ItemPointerData) * (nipd + 1));
+ /* copy all item pointers from original tuple */
+ memcpy(ipd, BTreeTupleGetPosting(origtup), sizeof(ItemPointerData) * nipd);
+ /* append new item pointer */
+ memcpy(ipd+nipd, itup, sizeof(ItemPointerData));
+
+ neworigtup = BTreeFormPostingTuple(origtup, ipd, nipd+1);
+ pfree(ipd);
+
+ Assert(ItemPointerCompare(BTreeTupleGetHeapTID(neworigtup),
+ BTreeTupleGetMaxTID(neworigtup)) < 0);
+ }
+ }
+
+ /*
+ * itup TID belongs to TID range of origtup posting list
+ *
+ * Split posting tuple into two halves.
+ *
+ * neworigtup (left) tuple contains all item pointers less than the new one and
+ * newrighttup tuple contains new item pointer and all to the right.
+ */
+ if (ItemPointerCompare(BTreeTupleGetHeapTID(itup),
+ BTreeTupleGetHeapTID(origtup)) > 0
+ &&
+ ItemPointerCompare(BTreeTupleGetMaxTID(origtup),
+ BTreeTupleGetHeapTID(itup)) > 0)
+ {
+ neworigtup = BTreeFormPostingTuple(origtup, BTreeTupleGetPosting(origtup),
+ in_posting_offset);
+
+ nipd_right = nipd - in_posting_offset + 1;
+
+ elog(DEBUG4, "split posting tuple in_posting_offset %d nipd %d nipd_right %d",
+ in_posting_offset, nipd, nipd_right);
+
+ ipd = palloc0(sizeof(ItemPointerData) * nipd_right);
+ /* insert new item pointer */
+ memcpy(ipd, itup, sizeof(ItemPointerData));
+ /* copy item pointers from original tuple that belong on right */
+ memcpy(ipd + 1,
+ BTreeTupleGetPostingN(origtup, in_posting_offset),
+ sizeof(ItemPointerData) * (nipd - in_posting_offset));
+
+ newrighttup = BTreeFormPostingTuple(origtup, ipd, nipd_right);
+
+ Assert(ItemPointerCompare(BTreeTupleGetMaxTID(neworigtup),
+ BTreeTupleGetHeapTID(newrighttup)) < 0);
+ pfree(ipd);
+
+ elog(DEBUG4, "left N %d (%u,%u) to (%u,%u), right N %d (%u,%u) to (%u,%u) ",
+ BTreeTupleIsPosting(neworigtup)?BTreeTupleGetNPosting(neworigtup):0,
+ ItemPointerGetBlockNumberNoCheck(BTreeTupleGetHeapTID(neworigtup)),
+ ItemPointerGetOffsetNumberNoCheck(BTreeTupleGetHeapTID(neworigtup)),
+ ItemPointerGetBlockNumberNoCheck(BTreeTupleGetMaxTID(neworigtup)),
+ ItemPointerGetOffsetNumberNoCheck(BTreeTupleGetMaxTID(neworigtup)),
+ BTreeTupleIsPosting(newrighttup)?BTreeTupleGetNPosting(newrighttup):0,
+ ItemPointerGetBlockNumberNoCheck(BTreeTupleGetHeapTID(newrighttup)),
+ ItemPointerGetOffsetNumberNoCheck(BTreeTupleGetHeapTID(newrighttup)),
+ ItemPointerGetBlockNumberNoCheck(BTreeTupleGetMaxTID(newrighttup)),
+ ItemPointerGetOffsetNumberNoCheck(BTreeTupleGetMaxTID(newrighttup)));
+
+ /*
+ * Check whether the split tuples still fit on the original page.
+ * TODO should we add sizeof(ItemIdData) in this check?
+ */
+ if (PageGetFreeSpace(page) < (MAXALIGN(IndexTupleSize(neworigtup))
+ + MAXALIGN(IndexTupleSize(newrighttup))
+ - MAXALIGN(IndexTupleSize(origtup))))
+ need_split = true;
+ }
+ }
+
+ /*
* Do we need to split the page to fit the item on it?
*
* Note: PageGetFreeSpace() subtracts sizeof(ItemIdData) from its result,
* so this comparison is correct even though we appear to be accounting
* only for the item and not for its line pointer.
*/
- if (PageGetFreeSpace(page) < itemsz)
+ if (PageGetFreeSpace(page) < itemsz || need_split)
{
bool is_root = P_ISROOT(lpageop);
bool is_only = P_LEFTMOST(lpageop) && P_RIGHTMOST(lpageop);
@@ -996,7 +1267,8 @@ _bt_insertonpg(Relation rel,
BlockNumberIsValid(RelationGetTargetBlock(rel))));
/* split the buffer into left and right halves */
- rbuf = _bt_split(rel, itup_key, buf, cbuf, newitemoff, itemsz, itup);
+ rbuf = _bt_split(rel, itup_key, buf, cbuf, newitemoff, itemsz, itup,
+ neworigtup, newrighttup);
PredicateLockPageSplit(rel,
BufferGetBlockNumber(buf),
BufferGetBlockNumber(rbuf));
@@ -1033,142 +1305,161 @@ _bt_insertonpg(Relation rel,
itup_off = newitemoff;
itup_blkno = BufferGetBlockNumber(buf);
- /*
- * If we are doing this insert because we split a page that was the
- * only one on its tree level, but was not the root, it may have been
- * the "fast root". We need to ensure that the fast root link points
- * at or above the current page. We can safely acquire a lock on the
- * metapage here --- see comments for _bt_newroot().
- */
- if (split_only_page)
+ if (!in_posting_offset)
{
- Assert(!P_ISLEAF(lpageop));
-
- metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_WRITE);
- metapg = BufferGetPage(metabuf);
- metad = BTPageGetMeta(metapg);
-
- if (metad->btm_fastlevel >= lpageop->btpo.level)
+ /*
+ * If we are doing this insert because we split a page that was the
+ * only one on its tree level, but was not the root, it may have been
+ * the "fast root". We need to ensure that the fast root link points
+ * at or above the current page. We can safely acquire a lock on the
+ * metapage here --- see comments for _bt_newroot().
+ */
+ if (split_only_page)
{
- /* no update wanted */
- _bt_relbuf(rel, metabuf);
- metabuf = InvalidBuffer;
- }
- }
-
- /*
- * Every internal page should have exactly one negative infinity item
- * at all times. Only _bt_split() and _bt_newroot() should add items
- * that become negative infinity items through truncation, since
- * they're the only routines that allocate new internal pages. Do not
- * allow a retail insertion of a new item at the negative infinity
- * offset.
- */
- if (!P_ISLEAF(lpageop) && newitemoff == P_FIRSTDATAKEY(lpageop))
- elog(ERROR, "cannot insert second negative infinity item in block %u of index \"%s\"",
- itup_blkno, RelationGetRelationName(rel));
+ Assert(!P_ISLEAF(lpageop));
- /* Do the update. No ereport(ERROR) until changes are logged */
- START_CRIT_SECTION();
+ metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_WRITE);
+ metapg = BufferGetPage(metabuf);
+ metad = BTPageGetMeta(metapg);
- if (!_bt_pgaddtup(page, itemsz, itup, newitemoff))
- elog(PANIC, "failed to add new item to block %u in index \"%s\"",
- itup_blkno, RelationGetRelationName(rel));
+ if (metad->btm_fastlevel >= lpageop->btpo.level)
+ {
+ /* no update wanted */
+ _bt_relbuf(rel, metabuf);
+ metabuf = InvalidBuffer;
+ }
+ }
- MarkBufferDirty(buf);
+ /*
+ * Every internal page should have exactly one negative infinity item
+ * at all times. Only _bt_split() and _bt_newroot() should add items
+ * that become negative infinity items through truncation, since
+ * they're the only routines that allocate new internal pages. Do not
+ * allow a retail insertion of a new item at the negative infinity
+ * offset.
+ */
+ if (!P_ISLEAF(lpageop) && newitemoff == P_FIRSTDATAKEY(lpageop))
+ elog(ERROR, "cannot insert second negative infinity item in block %u of index \"%s\"",
+ itup_blkno, RelationGetRelationName(rel));
+
+ /* Do the update. No ereport(ERROR) until changes are logged */
+ START_CRIT_SECTION();
+
+ if (!_bt_pgaddtup(page, itemsz, itup, newitemoff))
+ elog(PANIC, "failed to add new item to block %u in index \"%s\"",
+ itup_blkno, RelationGetRelationName(rel));
+
+ MarkBufferDirty(buf);
- if (BufferIsValid(metabuf))
- {
- /* upgrade meta-page if needed */
- if (metad->btm_version < BTREE_NOVAC_VERSION)
- _bt_upgrademetapage(metapg);
- metad->btm_fastroot = itup_blkno;
- metad->btm_fastlevel = lpageop->btpo.level;
- MarkBufferDirty(metabuf);
- }
+ if (BufferIsValid(metabuf))
+ {
+ /* upgrade meta-page if needed */
+ if (metad->btm_version < BTREE_NOVAC_VERSION)
+ _bt_upgrademetapage(metapg);
+ metad->btm_fastroot = itup_blkno;
+ metad->btm_fastlevel = lpageop->btpo.level;
+ MarkBufferDirty(metabuf);
+ }
- /* clear INCOMPLETE_SPLIT flag on child if inserting a downlink */
- if (BufferIsValid(cbuf))
- {
- Page cpage = BufferGetPage(cbuf);
- BTPageOpaque cpageop = (BTPageOpaque) PageGetSpecialPointer(cpage);
+ /* clear INCOMPLETE_SPLIT flag on child if inserting a downlink */
+ if (BufferIsValid(cbuf))
+ {
+ Page cpage = BufferGetPage(cbuf);
+ BTPageOpaque cpageop = (BTPageOpaque) PageGetSpecialPointer(cpage);
- Assert(P_INCOMPLETE_SPLIT(cpageop));
- cpageop->btpo_flags &= ~BTP_INCOMPLETE_SPLIT;
- MarkBufferDirty(cbuf);
- }
+ Assert(P_INCOMPLETE_SPLIT(cpageop));
+ cpageop->btpo_flags &= ~BTP_INCOMPLETE_SPLIT;
+ MarkBufferDirty(cbuf);
+ }
- /*
- * Cache the block information if we just inserted into the rightmost
- * leaf page of the index and it's not the root page. For very small
- * index where root is also the leaf, there is no point trying for any
- * optimization.
- */
- if (P_RIGHTMOST(lpageop) && P_ISLEAF(lpageop) && !P_ISROOT(lpageop))
- cachedBlock = BufferGetBlockNumber(buf);
+ /* XLOG stuff */
+ if (RelationNeedsWAL(rel))
+ {
+ xl_btree_insert xlrec;
+ xl_btree_metadata xlmeta;
+ uint8 xlinfo;
+ XLogRecPtr recptr;
- /* XLOG stuff */
- if (RelationNeedsWAL(rel))
- {
- xl_btree_insert xlrec;
- xl_btree_metadata xlmeta;
- uint8 xlinfo;
- XLogRecPtr recptr;
+ xlrec.offnum = itup_off;
+ xlrec.righttupoffset = 0;
- xlrec.offnum = itup_off;
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, SizeOfBtreeInsert);
- XLogBeginInsert();
- XLogRegisterData((char *) &xlrec, SizeOfBtreeInsert);
+ if (P_ISLEAF(lpageop))
+ xlinfo = XLOG_BTREE_INSERT_LEAF;
+ else
+ {
+ /*
+ * Register the left child whose INCOMPLETE_SPLIT flag was
+ * cleared.
+ */
+ XLogRegisterBuffer(1, cbuf, REGBUF_STANDARD);
- if (P_ISLEAF(lpageop))
- xlinfo = XLOG_BTREE_INSERT_LEAF;
- else
- {
- /*
- * Register the left child whose INCOMPLETE_SPLIT flag was
- * cleared.
- */
- XLogRegisterBuffer(1, cbuf, REGBUF_STANDARD);
+ xlinfo = XLOG_BTREE_INSERT_UPPER;
+ }
- xlinfo = XLOG_BTREE_INSERT_UPPER;
- }
+ if (BufferIsValid(metabuf))
+ {
+ Assert(metad->btm_version >= BTREE_NOVAC_VERSION);
+ xlmeta.version = metad->btm_version;
+ xlmeta.root = metad->btm_root;
+ xlmeta.level = metad->btm_level;
+ xlmeta.fastroot = metad->btm_fastroot;
+ xlmeta.fastlevel = metad->btm_fastlevel;
+ xlmeta.oldest_btpo_xact = metad->btm_oldest_btpo_xact;
+ xlmeta.last_cleanup_num_heap_tuples =
+ metad->btm_last_cleanup_num_heap_tuples;
+
+ XLogRegisterBuffer(2, metabuf, REGBUF_WILL_INIT | REGBUF_STANDARD);
+ XLogRegisterBufData(2, (char *) &xlmeta, sizeof(xl_btree_metadata));
+
+ xlinfo = XLOG_BTREE_INSERT_META;
+ }
- if (BufferIsValid(metabuf))
- {
- Assert(metad->btm_version >= BTREE_NOVAC_VERSION);
- xlmeta.version = metad->btm_version;
- xlmeta.root = metad->btm_root;
- xlmeta.level = metad->btm_level;
- xlmeta.fastroot = metad->btm_fastroot;
- xlmeta.fastlevel = metad->btm_fastlevel;
- xlmeta.oldest_btpo_xact = metad->btm_oldest_btpo_xact;
- xlmeta.last_cleanup_num_heap_tuples =
- metad->btm_last_cleanup_num_heap_tuples;
-
- XLogRegisterBuffer(2, metabuf, REGBUF_WILL_INIT | REGBUF_STANDARD);
- XLogRegisterBufData(2, (char *) &xlmeta, sizeof(xl_btree_metadata));
-
- xlinfo = XLOG_BTREE_INSERT_META;
- }
+ XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
+ XLogRegisterBufData(0, (char *) itup, IndexTupleSize(itup));
- XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
- XLogRegisterBufData(0, (char *) itup, IndexTupleSize(itup));
+ recptr = XLogInsert(RM_BTREE_ID, xlinfo);
- recptr = XLogInsert(RM_BTREE_ID, xlinfo);
+ if (BufferIsValid(metabuf))
+ {
+ PageSetLSN(metapg, recptr);
+ }
+ if (BufferIsValid(cbuf))
+ {
+ PageSetLSN(BufferGetPage(cbuf), recptr);
+ }
- if (BufferIsValid(metabuf))
- {
- PageSetLSN(metapg, recptr);
- }
- if (BufferIsValid(cbuf))
- {
- PageSetLSN(BufferGetPage(cbuf), recptr);
+ PageSetLSN(page, recptr);
}
- PageSetLSN(page, recptr);
+ END_CRIT_SECTION();
+ }
+ else
+ {
+ /*
+ * Insert the new tuple in place of the existing posting tuple:
+ * delete the old posting tuple and insert the updated tuple instead.
+ *
+ * If a split was needed, both neworigtup and newrighttup are initialized
+ * and both will be inserted; otherwise newrighttup is NULL.
+ *
+ * This can only happen on a leaf page.
+ */
+ elog(DEBUG4, "_bt_insertonpg. _bt_delete_and_insert %s", RelationGetRelationName(rel));
+ _bt_delete_and_insert(buf, page, neworigtup,
+ newrighttup, newitemoff, RelationNeedsWAL(rel));
}
- END_CRIT_SECTION();
+ /*
+ * Cache the block information if we just inserted into the rightmost
+ * leaf page of the index and it's not the root page. For very small
+ * index where root is also the leaf, there is no point trying for any
+ * optimization.
+ */
+ if (P_RIGHTMOST(lpageop) && P_ISLEAF(lpageop) && !P_ISROOT(lpageop))
+ cachedBlock = BufferGetBlockNumber(buf);
/* release buffers */
if (BufferIsValid(metabuf))
@@ -1214,7 +1505,8 @@ _bt_insertonpg(Relation rel,
*/
static Buffer
_bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
- OffsetNumber newitemoff, Size newitemsz, IndexTuple newitem)
+ OffsetNumber newitemoff, Size newitemsz, IndexTuple newitem,
+ IndexTuple lefttup, IndexTuple righttup)
{
Buffer rbuf;
Page origpage;
@@ -1236,6 +1528,8 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
OffsetNumber firstright;
OffsetNumber maxoff;
OffsetNumber i;
+ OffsetNumber replaceitemoff = InvalidOffsetNumber;
+ Size replaceitemsz;
bool newitemonleft,
isleaf;
IndexTuple lefthikey;
@@ -1243,6 +1537,24 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
int indnkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
/*
+ * If we're working with a split posting tuple,
+ * the new tuple is actually contained in the righttup posting list.
+ */
+ if (righttup)
+ {
+ newitem = righttup;
+ newitemsz = MAXALIGN(IndexTupleSize(righttup));
+
+ /*
+ * The actual insertion replaces origtup with lefttup
+ * and inserts righttup (as newitem) next to it.
+ */
+ replaceitemoff = newitemoff;
+ replaceitemsz = MAXALIGN(IndexTupleSize(lefttup));
+ newitemoff = OffsetNumberNext(newitemoff);
+ }
+
+ /*
* origpage is the original page to be split. leftpage is a temporary
* buffer that receives the left-sibling data, which will be copied back
* into origpage on success. rightpage is the new page that will receive
@@ -1275,7 +1587,8 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
* (but not always) redundant information.
*/
firstright = _bt_findsplitloc(rel, origpage, newitemoff, newitemsz,
- newitem, &newitemonleft);
+ newitem, replaceitemoff, replaceitemsz,
+ lefttup, &newitemonleft);
/* Allocate temp buffer for leftpage */
leftpage = PageGetTempPage(origpage);
@@ -1364,6 +1677,17 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
/* incoming tuple will become last on left page */
lastleft = newitem;
}
+ else if (!newitemonleft && newitemoff == firstright && lefttup)
+ {
+ /*
+ * If newitem is first on the right page
+ * and posting tuple split handling is required,
+ * lastleft will be replaced with lefttup,
+ * so use it here.
+ */
+ elog(DEBUG4, "lastleft = lefttup firstright %d", firstright);
+ lastleft = lefttup;
+ }
else
{
OffsetNumber lastleftoff;
@@ -1480,6 +1804,39 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
itemsz = ItemIdGetLength(itemid);
item = (IndexTuple) PageGetItem(origpage, itemid);
+ if (i == replaceitemoff)
+ {
+ if (replaceitemoff <= firstright)
+ {
+ elog(DEBUG4, "_bt_split left. replaceitem block %u %s replaceitemoff %d leftoff %d",
+ origpagenumber, RelationGetRelationName(rel), replaceitemoff, leftoff);
+ if (!_bt_pgaddtup(leftpage, MAXALIGN(IndexTupleSize(lefttup)), lefttup, leftoff))
+ {
+ memset(rightpage, 0, BufferGetPageSize(rbuf));
+ elog(ERROR, "failed to add new item to the left sibling"
+ " while splitting block %u of index \"%s\"",
+ origpagenumber, RelationGetRelationName(rel));
+ }
+ leftoff = OffsetNumberNext(leftoff);
+ }
+ else
+ {
+ elog(DEBUG4, "_bt_split right. replaceitem block %u %s replaceitemoff %d newitemoff %d",
+ origpagenumber, RelationGetRelationName(rel), replaceitemoff, newitemoff);
+ elog(DEBUG4, "_bt_split right. i %d, maxoff %d, rightoff %d", i, maxoff, rightoff);
+
+ if (!_bt_pgaddtup(rightpage, MAXALIGN(IndexTupleSize(lefttup)), lefttup, rightoff))
+ {
+ memset(rightpage, 0, BufferGetPageSize(rbuf));
+ elog(ERROR, "failed to add new item to the right sibling"
+ " while splitting block %u of index \"%s\", rightoff %d",
+ origpagenumber, RelationGetRelationName(rel), rightoff);
+ }
+ rightoff = OffsetNumberNext(rightoff);
+ }
+ continue;
+ }
+
/* does new item belong before this one? */
if (i == newitemoff)
{
@@ -1497,13 +1854,14 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
}
else
{
+ elog(DEBUG4, "insert newitem to the right. i %d, maxoff %d, rightoff %d", i, maxoff, rightoff);
Assert(newitemoff >= firstright);
if (!_bt_pgaddtup(rightpage, newitemsz, newitem, rightoff))
{
memset(rightpage, 0, BufferGetPageSize(rbuf));
elog(ERROR, "failed to add new item to the right sibling"
- " while splitting block %u of index \"%s\"",
- origpagenumber, RelationGetRelationName(rel));
+ " while splitting block %u of index \"%s\", rightoff %d",
+ origpagenumber, RelationGetRelationName(rel), rightoff);
}
rightoff = OffsetNumberNext(rightoff);
}
@@ -1547,8 +1905,8 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
{
memset(rightpage, 0, BufferGetPageSize(rbuf));
elog(ERROR, "failed to add new item to the right sibling"
- " while splitting block %u of index \"%s\"",
- origpagenumber, RelationGetRelationName(rel));
+ " while splitting block %u of index \"%s\" rightoff %d",
+ origpagenumber, RelationGetRelationName(rel), rightoff);
}
rightoff = OffsetNumberNext(rightoff);
}
@@ -1652,6 +2010,7 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
xlrec.level = ropaque->btpo.level;
xlrec.firstright = firstright;
xlrec.newitemoff = newitemoff;
+ xlrec.replaceitemoff = replaceitemoff;
XLogBeginInsert();
XLogRegisterData((char *) &xlrec, SizeOfBtreeSplit);
@@ -1681,6 +2040,9 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
item = (IndexTuple) PageGetItem(origpage, itemid);
XLogRegisterBufData(0, (char *) item, MAXALIGN(IndexTupleSize(item)));
+ if (replaceitemoff)
+ XLogRegisterBufData(0, (char *) lefttup, MAXALIGN(IndexTupleSize(lefttup)));
+
/*
* Log the contents of the right page in the format understood by
* _bt_restore_page(). The whole right page will be recreated.
@@ -1835,7 +2197,7 @@ _bt_insert_parent(Relation rel,
/* Recursively insert into the parent */
_bt_insertonpg(rel, NULL, pbuf, buf, stack->bts_parent,
new_item, stack->bts_offset + 1,
- is_only);
+ is_only, 0);
/* be tidy */
pfree(new_item);
@@ -2304,3 +2666,206 @@ _bt_vacuum_one_page(Relation rel, Buffer buffer, Relation heapRel)
* the page.
*/
}
+
+/*
+ * Add new item (posting or not) to the page, while applying deduplication
+ * to it.
+ */
+static void
+insert_itupprev_to_page(Page page, BTDeduplicateState *deduplicateState)
+{
+ IndexTuple to_insert;
+ OffsetNumber offnum = PageGetMaxOffsetNumber(page);
+
+ if (deduplicateState->ntuples == 0)
+ to_insert = deduplicateState->itupprev;
+ else
+ {
+ IndexTuple postingtuple;
+
+ /* form a tuple with a posting list */
+ postingtuple = BTreeFormPostingTuple(deduplicateState->itupprev,
+ deduplicateState->ipd,
+ deduplicateState->ntuples);
+ to_insert = postingtuple;
+ pfree(deduplicateState->ipd);
+ }
+
+ /* Add the new item into the page */
+ offnum = OffsetNumberNext(offnum);
+
+ elog(DEBUG4, "insert_itupprev_to_page. deduplicateState->ntuples %d IndexTupleSize %zu free %zu",
+ deduplicateState->ntuples, IndexTupleSize(to_insert), PageGetFreeSpace(page));
+
+ if (PageAddItem(page, (Item) to_insert, IndexTupleSize(to_insert),
+ offnum, false, false) == InvalidOffsetNumber)
+ elog(ERROR, "failed to add tuple to page while compressing it");
+
+ if (deduplicateState->ntuples > 0)
+ pfree(to_insert);
+ deduplicateState->ntuples = 0;
+}
+
+/*
+ * Before splitting the page, try to deduplicate items to free some space.
+ *
+ * If deduplication was not applied, buffer contains old state of the page.
+ *
+ * It's expected that this function is called after lp_dead items were
+ * removed by _bt_vacuum_one_page(). In case some dead items are still left,
+ * this function cleans them up before applying deduplication.
+ */
+static void
+_bt_deduplicate_one_page(Relation rel, Buffer buffer, Relation heapRel)
+{
+ OffsetNumber offnum,
+ minoff,
+ maxoff;
+ Page page = BufferGetPage(buffer);
+ Page newpage;
+ BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+ bool use_deduplication = false;
+ BTDeduplicateState *deduplicateState = NULL;
+ int natts = IndexRelationGetNumberOfAttributes(rel);
+ OffsetNumber deletable[MaxOffsetNumber];
+ int ndeletable = 0;
+
+ /*
+ * Don't use deduplication for indexes with INCLUDEd columns and unique
+ * indexes.
+ */
+ use_deduplication = (IndexRelationGetNumberOfKeyAttributes(rel) ==
+ IndexRelationGetNumberOfAttributes(rel) &&
+ !rel->rd_index->indisunique);
+ if (!use_deduplication)
+ return;
+
+ /* init state needed to build posting tuples */
+ deduplicateState = (BTDeduplicateState *) palloc0(sizeof(BTDeduplicateState));
+ deduplicateState->ipd = NULL;
+ deduplicateState->ntuples = 0;
+ deduplicateState->itupprev = NULL;
+ deduplicateState->maxitemsize = BTMaxItemSize(page);
+ deduplicateState->maxpostingsize = 0;
+
+ minoff = P_FIRSTDATAKEY(opaque);
+ maxoff = PageGetMaxOffsetNumber(page);
+
+ /*
+ * Delete dead tuples, if any.
+ * We cannot simply skip them in the loop below, because it's necessary
+ * to generate a special XLOG record containing such tuples to compute
+ * latestRemovedXid on a standby server later.
+ *
+ * This should not affect performance, since it can only happen in the rare
+ * situation when the BTP_HAS_GARBAGE flag was not set and _bt_vacuum_one_page
+ * was not called, or _bt_vacuum_one_page didn't remove all dead items.
+ */
+ for (offnum = minoff;
+ offnum <= maxoff;
+ offnum = OffsetNumberNext(offnum))
+ {
+ /* check the item at the current offset, not the high key */
+ ItemId itemid = PageGetItemId(page, offnum);
+
+ if (ItemIdIsDead(itemid))
+ deletable[ndeletable++] = offnum;
+ }
+
+ if (ndeletable > 0)
+ _bt_delitems_delete(rel, buffer, deletable, ndeletable, heapRel);
+
+ /*
+ * Scan over all items to see which ones can be deduplicated
+ */
+ minoff = P_FIRSTDATAKEY(opaque);
+ maxoff = PageGetMaxOffsetNumber(page);
+ newpage = PageGetTempPageCopySpecial(page);
+ elog(DEBUG4, "_bt_deduplicate_one_page rel: %s,blkno: %u",
+ RelationGetRelationName(rel), BufferGetBlockNumber(buffer));
+
+ /* Copy High Key if any */
+ if (!P_RIGHTMOST(opaque))
+ {
+ ItemId itemid = PageGetItemId(page, P_HIKEY);
+ Size itemsz = ItemIdGetLength(itemid);
+ IndexTuple item = (IndexTuple) PageGetItem(page, itemid);
+
+ if (PageAddItem(newpage, (Item) item, itemsz, P_HIKEY,
+ false, false) == InvalidOffsetNumber)
+ elog(ERROR, "failed to add highkey during deduplication");
+ }
+
+ /*
+ * Iterate over tuples on the page, try to collect them into posting
+ * lists and insert into new page.
+ */
+ for (offnum = minoff;
+ offnum <= maxoff;
+ offnum = OffsetNumberNext(offnum))
+ {
+ ItemId itemId = PageGetItemId(page, offnum);
+ IndexTuple itup = (IndexTuple) PageGetItem(page, itemId);
+
+ if (deduplicateState->itupprev != NULL)
+ {
+ int n_equal_atts =
+ _bt_keep_natts_fast(rel, deduplicateState->itupprev, itup);
+ int itup_ntuples = BTreeTupleIsPosting(itup) ?
+ BTreeTupleGetNPosting(itup) : 1;
+
+ if (n_equal_atts > natts)
+ {
+ /*
+ * When tuples are equal, create or update posting.
+ *
+ * If posting is too big, insert it on page and continue.
+ */
+ if (deduplicateState->maxitemsize >
+ MAXALIGN(((IndexTupleSize(deduplicateState->itupprev)
+ + (deduplicateState->ntuples + itup_ntuples + 1) * sizeof(ItemPointerData)))))
+ {
+ _bt_add_posting_item(deduplicateState, itup);
+ }
+ else
+ {
+ insert_itupprev_to_page(newpage, deduplicateState);
+ }
+ }
+ else
+ {
+ insert_itupprev_to_page(newpage, deduplicateState);
+ }
+ }
+
+ /*
+ * Copy the tuple into temp variable itupprev to compare it with the
+ * following tuple and maybe unite them into a posting tuple
+ */
+ if (deduplicateState->itupprev)
+ pfree(deduplicateState->itupprev);
+ deduplicateState->itupprev = CopyIndexTuple(itup);
+
+ Assert(IndexTupleSize(deduplicateState->itupprev) <= deduplicateState->maxitemsize);
+ }
+
+ /* Handle the last item. */
+ insert_itupprev_to_page(newpage, deduplicateState);
+
+ START_CRIT_SECTION();
+
+ PageRestoreTempPage(newpage, page);
+ MarkBufferDirty(buffer);
+
+ /* Log full page write */
+ if (RelationNeedsWAL(rel))
+ {
+ XLogRecPtr recptr;
+
+ recptr = log_newpage_buffer(buffer, true);
+ PageSetLSN(page, recptr);
+ }
+
+ END_CRIT_SECTION();
+
+ elog(DEBUG4, "_bt_deduplicate_one_page. success");
+}
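
For readers following the deduplication pass above, here is a minimal standalone sketch of the grouping idea behind _bt_deduplicate_one_page: walk the sorted leaf entries, extend the pending posting list while the key repeats, and flush it when the key changes or the list is full. Everything in it (DemoEntry, DemoGroup, demo_deduplicate, the max_tids cap) is hypothetical and not part of the patch. In particular, the patch limits a posting tuple by byte size (maxitemsize), while this sketch assumes a simple TID count.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical simplified leaf entry: an integer key plus one heap TID. */
typedef struct DemoEntry
{
    int         key;
    unsigned    tid;
} DemoEntry;

/* One output tuple: a key and a run of TIDs taken from the input. */
typedef struct DemoGroup
{
    int         key;
    size_t      first;      /* index of the first merged input entry */
    size_t      ntids;      /* number of TIDs in this posting list */
} DemoGroup;

/*
 * Merge adjacent equal keys of a sorted entries[] array into groups of at
 * most max_tids TIDs each. groups[] must have room for nentries elements.
 * Returns the number of groups produced.
 */
static size_t
demo_deduplicate(const DemoEntry *entries, size_t nentries,
                 size_t max_tids, DemoGroup *groups)
{
    size_t      ngroups = 0;

    assert(max_tids >= 1);

    for (size_t i = 0; i < nentries; i++)
    {
        DemoGroup  *cur = (ngroups > 0) ? &groups[ngroups - 1] : NULL;

        if (cur != NULL && cur->key == entries[i].key &&
            cur->ntids < max_tids)
        {
            /* Same key and still room: extend the pending posting list. */
            cur->ntids++;
        }
        else
        {
            /* New key, or the pending list is full: start a new tuple. */
            groups[ngroups].key = entries[i].key;
            groups[ngroups].first = i;
            groups[ngroups].ntids = 1;
            ngroups++;
        }
    }
    return ngroups;
}
```

With keys 1,1,1,2,3,3 and max_tids = 2, the sketch yields four groups: two for key 1 (the second holds the overflow TID), one for key 2 and one for key 3. This mirrors the "posting is too big, insert it on page and continue" branch in the loop above.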
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 18c6de2..bd41592 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -983,14 +983,52 @@ _bt_page_recyclable(Page page)
void
_bt_delitems_vacuum(Relation rel, Buffer buf,
OffsetNumber *itemnos, int nitems,
+ OffsetNumber *remainingoffset,
+ IndexTuple *remaining, int nremaining,
BlockNumber lastBlockVacuumed)
{
Page page = BufferGetPage(buf);
BTPageOpaque opaque;
+ Size itemsz;
+ Size remaining_sz = 0;
+ char *remaining_buf = NULL;
+
+ /* XLOG stuff: build a buffer holding the remaining tuples */
+ if (nremaining && RelationNeedsWAL(rel))
+ {
+ Size offset = 0;
+
+ for (int i = 0; i < nremaining; i++)
+ remaining_sz += MAXALIGN(IndexTupleSize(remaining[i]));
+
+ remaining_buf = palloc0(remaining_sz);
+ for (int i = 0; i < nremaining; i++)
+ {
+ itemsz = IndexTupleSize(remaining[i]);
+ memcpy(remaining_buf + offset, (char *) remaining[i], itemsz);
+ offset += MAXALIGN(itemsz);
+ }
+ Assert(offset == remaining_sz);
+ }
/* No ereport(ERROR) until changes are logged */
START_CRIT_SECTION();
+ /* Handle posting tuples here */
+ for (int i = 0; i < nremaining; i++)
+ {
+ /* First, delete the old tuple. */
+ PageIndexTupleDelete(page, remainingoffset[i]);
+
+ itemsz = IndexTupleSize(remaining[i]);
+ itemsz = MAXALIGN(itemsz);
+
+ /* Add tuple with remaining ItemPointers to the page. */
+ if (PageAddItem(page, (Item) remaining[i], itemsz, remainingoffset[i],
+ false, false) == InvalidOffsetNumber)
+ elog(ERROR, "failed to rewrite posting item in index while doing vacuum");
+ }
+
/* Fix the page */
if (nitems > 0)
PageIndexMultiDelete(page, itemnos, nitems);
@@ -1020,6 +1058,8 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
xl_btree_vacuum xlrec_vacuum;
xlrec_vacuum.lastBlockVacuumed = lastBlockVacuumed;
+ xlrec_vacuum.nremaining = nremaining;
+ xlrec_vacuum.ndeleted = nitems;
XLogBeginInsert();
XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
@@ -1033,6 +1073,19 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
if (nitems > 0)
XLogRegisterBufData(0, (char *) itemnos, nitems * sizeof(OffsetNumber));
+ /*
+ * Save the offsets and the remaining tuples themselves. It's
+ * important to restore them in the correct order: first handle the
+ * remaining tuples, and only after that the other deleted items.
+ */
+ if (nremaining > 0)
+ {
+ Assert(remaining_buf != NULL);
+ XLogRegisterBufData(0, (char *) remainingoffset,
+ nremaining * sizeof(OffsetNumber));
+ XLogRegisterBufData(0, remaining_buf, remaining_sz);
+ }
+
recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_VACUUM);
PageSetLSN(page, recptr);
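
The WAL buffer built at the top of _bt_delitems_vacuum packs variable-length tuples back to back, each starting on a MAXALIGN boundary. A rough standalone sketch of that packing follows. DEMO_MAXALIGN, DEMO_ALIGNOF and demo_pack_items are made-up names, the 8-byte alignment is an assumption (the real MAXALIGN is platform-dependent), and calloc stands in for palloc0.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Assumed alignment for this sketch only. */
#define DEMO_ALIGNOF 8
#define DEMO_MAXALIGN(len) \
    (((size_t) (len) + (DEMO_ALIGNOF - 1)) & ~((size_t) (DEMO_ALIGNOF - 1)))

/*
 * Copy nitems variable-length items into one zero-filled buffer, each item
 * starting on an aligned boundary. Returns a malloc'd buffer, or NULL when
 * there is nothing to pack or allocation fails. The total size is stored
 * in *total.
 */
static char *
demo_pack_items(const void *const *items, const size_t *sizes,
                size_t nitems, size_t *total)
{
    size_t      offset = 0;
    char       *buf;

    /* First pass: compute the aligned total size. */
    *total = 0;
    for (size_t i = 0; i < nitems; i++)
        *total += DEMO_MAXALIGN(sizes[i]);

    if (*total == 0)
        return NULL;

    buf = calloc(1, *total);
    if (buf == NULL)
        return NULL;

    /* Second pass: copy each item and advance by its aligned size. */
    for (size_t i = 0; i < nitems; i++)
    {
        memcpy(buf + offset, items[i], sizes[i]);
        offset += DEMO_MAXALIGN(sizes[i]);
    }
    assert(offset == *total);
    return buf;
}
```

Zero-filling the buffer keeps the padding bytes between items deterministic, which is presumably why the patch uses palloc0 rather than palloc here.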
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 4cfd528..22fb228 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -97,6 +97,8 @@ static void btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
BTCycleId cycleid, TransactionId *oldestBtpoXact);
static void btvacuumpage(BTVacState *vstate, BlockNumber blkno,
BlockNumber orig_blkno);
+static ItemPointer btreevacuumPosting(BTVacState *vstate, IndexTuple itup,
+ int *nremaining);
/*
@@ -1069,7 +1071,8 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
RBM_NORMAL, info->strategy);
LockBufferForCleanup(buf);
_bt_checkpage(rel, buf);
- _bt_delitems_vacuum(rel, buf, NULL, 0, vstate.lastBlockVacuumed);
+ _bt_delitems_vacuum(rel, buf, NULL, 0, NULL, NULL, 0,
+ vstate.lastBlockVacuumed);
_bt_relbuf(rel, buf);
}
@@ -1193,6 +1196,9 @@ restart:
OffsetNumber offnum,
minoff,
maxoff;
+ IndexTuple remaining[MaxOffsetNumber];
+ OffsetNumber remainingoffset[MaxOffsetNumber];
+ int nremaining;
/*
* Trade in the initial read lock for a super-exclusive write lock on
@@ -1229,6 +1235,7 @@ restart:
* callback function.
*/
ndeletable = 0;
+ nremaining = 0;
minoff = P_FIRSTDATAKEY(opaque);
maxoff = PageGetMaxOffsetNumber(page);
if (callback)
@@ -1242,31 +1249,78 @@ restart:
itup = (IndexTuple) PageGetItem(page,
PageGetItemId(page, offnum));
- htup = &(itup->t_tid);
- /*
- * During Hot Standby we currently assume that
- * XLOG_BTREE_VACUUM records do not produce conflicts. That is
- * only true as long as the callback function depends only
- * upon whether the index tuple refers to heap tuples removed
- * in the initial heap scan. When vacuum starts it derives a
- * value of OldestXmin. Backends taking later snapshots could
- * have a RecentGlobalXmin with a later xid than the vacuum's
- * OldestXmin, so it is possible that row versions deleted
- * after OldestXmin could be marked as killed by other
- * backends. The callback function *could* look at the index
- * tuple state in isolation and decide to delete the index
- * tuple, though currently it does not. If it ever did, we
- * would need to reconsider whether XLOG_BTREE_VACUUM records
- * should cause conflicts. If they did cause conflicts they
- * would be fairly harsh conflicts, since we haven't yet
- * worked out a way to pass a useful value for
- * latestRemovedXid on the XLOG_BTREE_VACUUM records. This
- * applies to *any* type of index that marks index tuples as
- * killed.
- */
- if (callback(htup, callback_state))
- deletable[ndeletable++] = offnum;
+ if (BTreeTupleIsPosting(itup))
+ {
+ int nnewipd = 0;
+ ItemPointer newipd = NULL;
+
+ newipd = btreevacuumPosting(vstate, itup, &nnewipd);
+
+ if (nnewipd == 0)
+ {
+ /*
+ * All TIDs from the posting list must be deleted, so we can
+ * delete the whole tuple in the regular way.
+ */
+ deletable[ndeletable++] = offnum;
+ }
+ else if (nnewipd == BTreeTupleGetNPosting(itup))
+ {
+ /*
+ * All TIDs from the posting tuple must remain. Do
+ * nothing, just clean up.
+ */
+ pfree(newipd);
+ }
+ else if (nnewipd < BTreeTupleGetNPosting(itup))
+ {
+ /* Some TIDs from posting tuple must remain. */
+ Assert(nnewipd > 0);
+ Assert(newipd != NULL);
+
+ /*
+ * Form new tuple that contains only remaining TIDs.
+ * Remember this tuple and the offset of the old tuple
+ * to update it in place.
+ */
+ remainingoffset[nremaining] = offnum;
+ remaining[nremaining] = BTreeFormPostingTuple(itup, newipd, nnewipd);
+ nremaining++;
+ pfree(newipd);
+
+ Assert(IndexTupleSize(itup) <= BTMaxItemSize(page));
+ }
+ }
+ else
+ {
+ htup = &(itup->t_tid);
+
+ /*
+ * During Hot Standby we currently assume that
+ * XLOG_BTREE_VACUUM records do not produce conflicts.
+ * That is only true as long as the callback function
+ * depends only upon whether the index tuple refers to
+ * heap tuples removed in the initial heap scan. When
+ * vacuum starts it derives a value of OldestXmin.
+ * Backends taking later snapshots could have a
+ * RecentGlobalXmin with a later xid than the vacuum's
+ * OldestXmin, so it is possible that row versions deleted
+ * after OldestXmin could be marked as killed by other
+ * backends. The callback function *could* look at the
+ * index tuple state in isolation and decide to delete the
+ * index tuple, though currently it does not. If it ever
+ * did, we would need to reconsider whether
+ * XLOG_BTREE_VACUUM records should cause conflicts. If
+ * they did cause conflicts they would be fairly harsh
+ * conflicts, since we haven't yet worked out a way to
+ * pass a useful value for latestRemovedXid on the
+ * XLOG_BTREE_VACUUM records. This applies to *any* type
+ * of index that marks index tuples as killed.
+ */
+ if (callback(htup, callback_state))
+ deletable[ndeletable++] = offnum;
+ }
}
}
@@ -1274,7 +1328,7 @@ restart:
* Apply any needed deletes. We issue just one _bt_delitems_vacuum()
* call per page, so as to minimize WAL traffic.
*/
- if (ndeletable > 0)
+ if (ndeletable > 0 || nremaining > 0)
{
/*
* Notice that the issued XLOG_BTREE_VACUUM WAL record includes
@@ -1291,6 +1345,7 @@ restart:
* that.
*/
_bt_delitems_vacuum(rel, buf, deletable, ndeletable,
+ remainingoffset, remaining, nremaining,
vstate->lastBlockVacuumed);
/*
@@ -1376,6 +1431,41 @@ restart:
}
/*
+ * btreevacuumPosting() -- vacuums a posting tuple.
+ *
+ * Returns new palloc'd posting list with remaining items.
+ * Posting list size is returned via nremaining.
+ *
+ * If all items are dead,
+ * nremaining is 0 and resulting posting list is NULL.
+ */
+static ItemPointer
+btreevacuumPosting(BTVacState *vstate, IndexTuple itup, int *nremaining)
+{
+ int remaining = 0;
+ int nitem = BTreeTupleGetNPosting(itup);
+ ItemPointer tmpitems = NULL,
+ items = BTreeTupleGetPosting(itup);
+
+ /*
+ * Check each TID in the posting list; save live TIDs into tmpitems.
+ */
+ for (int i = 0; i < nitem; i++)
+ {
+ if (vstate->callback(items + i, vstate->callback_state))
+ continue;
+
+ if (tmpitems == NULL)
+ tmpitems = palloc(sizeof(ItemPointerData) * nitem);
+
+ tmpitems[remaining++] = items[i];
+ }
+
+ *nremaining = remaining;
+ return tmpitems;
+}
+
+/*
* btcanreturn() -- Check whether btree indexes support index-only scans.
*
* btrees always do, so this is trivial.
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 7f77ed2..6282c6b 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -30,6 +30,9 @@ static bool _bt_readpage(IndexScanDesc scan, ScanDirection dir,
OffsetNumber offnum);
static void _bt_saveitem(BTScanOpaque so, int itemIndex,
OffsetNumber offnum, IndexTuple itup);
+static void _bt_savepostingitem(BTScanOpaque so, int itemIndex,
+ OffsetNumber offnum, ItemPointer iptr,
+ IndexTuple itup, int i);
static bool _bt_steppage(IndexScanDesc scan, ScanDirection dir);
static bool _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir);
static bool _bt_parallel_readpage(IndexScanDesc scan, BlockNumber blkno,
@@ -497,7 +500,8 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
/* We have low <= mid < high, so mid points at a real slot */
- result = _bt_compare(rel, key, page, mid);
+ result = _bt_compare_posting(rel, key, page, mid,
+ &(insertstate->in_posting_offset));
if (result >= cmpval)
low = mid + 1;
@@ -526,6 +530,60 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
return low;
}
+/*
+ * Compare insertion-type scankey to tuple on a page,
+ * taking into account posting tuples.
+ * If the key of the posting tuple is equal to scankey,
+ * find exact position inside the posting list,
+ * using TID as extra attribute.
+ */
+int32
+_bt_compare_posting(Relation rel,
+ BTScanInsert key,
+ Page page,
+ OffsetNumber offnum,
+ int *in_posting_offset)
+{
+ IndexTuple itup;
+ int result;
+
+ itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
+ result = _bt_compare(rel, key, page, offnum);
+
+ if (BTreeTupleIsPosting(itup) && result == 0)
+ {
+ int low,
+ high,
+ mid,
+ res;
+
+ low = 0;
+ /* "high" is past end of posting list for loop invariant */
+ high = BTreeTupleGetNPosting(itup);
+
+ while (high > low)
+ {
+ mid = low + ((high - low) / 2);
+ res = ItemPointerCompare(key->scantid,
+ BTreeTupleGetPostingN(itup, mid));
+
+ if (res >= 1)
+ low = mid + 1;
+ else
+ high = mid;
+ }
+
+ *in_posting_offset = high;
+ elog(DEBUG4, "_bt_compare_posting in_posting_offset %d", *in_posting_offset);
+ Assert(ItemPointerCompare(BTreeTupleGetHeapTID(itup),
+ key->scantid) < 0);
+ Assert(ItemPointerCompare(key->scantid,
+ BTreeTupleGetMaxTID(itup)) < 0);
+ }
+
+ return result;
+}
+
/*----------
* _bt_compare() -- Compare insertion-type scankey to tuple on a page.
*
@@ -658,61 +716,120 @@ _bt_compare(Relation rel,
* Use the heap TID attribute and scantid to try to break the tie. The
* rules are the same as any other key attribute -- only the
* representation differs.
+ *
+ * When itup is a posting tuple, the check becomes more complex. It is
+ * possible that the scankey belongs to the tuple's posting list TID
+ * range.
+ *
+ * _bt_compare() is multipurpose, so it just returns 0 to indicate that
+ * the key matches the tuple at this offset.
+ *
+ * Use the special _bt_compare_posting() wrapper function to handle this
+ * case and recheck the posting tuple, finding the exact position of the
+ * scankey.
*/
- heapTid = BTreeTupleGetHeapTID(itup);
- if (key->scantid == NULL)
+ if (!BTreeTupleIsPosting(itup))
{
+ heapTid = BTreeTupleGetHeapTID(itup);
+ if (key->scantid == NULL)
+ {
+ /*
+ * Most searches have a scankey that is considered greater than a
+ * truncated pivot tuple if and when the scankey has equal values
+ * for attributes up to and including the least significant
+ * untruncated attribute in tuple.
+ *
+ * For example, if an index has the minimum two attributes (single
+ * user key attribute, plus heap TID attribute), and a page's high
+ * key is ('foo', -inf), and scankey is ('foo', <omitted>), the
+ * search will not descend to the page to the left. The search
+ * will descend right instead. The truncated attribute in pivot
+ * tuple means that all non-pivot tuples on the page to the left
+ * are strictly < 'foo', so it isn't necessary to descend left. In
+ * other words, search doesn't have to descend left because it
+ * isn't interested in a match that has a heap TID value of -inf.
+ *
+ * However, some searches (pivotsearch searches) actually require
+ * that we descend left when this happens. -inf is treated as a
+ * possible match for omitted scankey attribute(s). This is
+ * needed by page deletion, which must re-find leaf pages that are
+ * targets for deletion using their high keys.
+ *
+ * Note: the heap TID part of the test ensures that scankey is
+ * being compared to a pivot tuple with one or more truncated key
+ * attributes.
+ *
+ * Note: pg_upgrade'd !heapkeyspace indexes must always descend to
+ * the left here, since they have no heap TID attribute (and
+ * cannot have any -inf key values in any case, since truncation
+ * can only remove non-key attributes). !heapkeyspace searches
+ * must always be prepared to deal with matches on both sides of
+ * the pivot once the leaf level is reached.
+ */
+ if (key->heapkeyspace && !key->pivotsearch &&
+ key->keysz == ntupatts && heapTid == NULL)
+ return 1;
+
+ /* All provided scankey arguments found to be equal */
+ return 0;
+ }
+
/*
- * Most searches have a scankey that is considered greater than a
- * truncated pivot tuple if and when the scankey has equal values for
- * attributes up to and including the least significant untruncated
- * attribute in tuple.
- *
- * For example, if an index has the minimum two attributes (single
- * user key attribute, plus heap TID attribute), and a page's high key
- * is ('foo', -inf), and scankey is ('foo', <omitted>), the search
- * will not descend to the page to the left. The search will descend
- * right instead. The truncated attribute in pivot tuple means that
- * all non-pivot tuples on the page to the left are strictly < 'foo',
- * so it isn't necessary to descend left. In other words, search
- * doesn't have to descend left because it isn't interested in a match
- * that has a heap TID value of -inf.
- *
- * However, some searches (pivotsearch searches) actually require that
- * we descend left when this happens. -inf is treated as a possible
- * match for omitted scankey attribute(s). This is needed by page
- * deletion, which must re-find leaf pages that are targets for
- * deletion using their high keys.
- *
- * Note: the heap TID part of the test ensures that scankey is being
- * compared to a pivot tuple with one or more truncated key
- * attributes.
- *
- * Note: pg_upgrade'd !heapkeyspace indexes must always descend to the
- * left here, since they have no heap TID attribute (and cannot have
- * any -inf key values in any case, since truncation can only remove
- * non-key attributes). !heapkeyspace searches must always be
- * prepared to deal with matches on both sides of the pivot once the
- * leaf level is reached.
+ * Treat truncated heap TID as minus infinity, since scankey has a key
+ * attribute value (scantid) that would otherwise be compared directly
*/
- if (key->heapkeyspace && !key->pivotsearch &&
- key->keysz == ntupatts && heapTid == NULL)
+ Assert(key->keysz == IndexRelationGetNumberOfKeyAttributes(rel));
+ if (heapTid == NULL)
return 1;
- /* All provided scankey arguments found to be equal */
- return 0;
+ Assert(ntupatts >= IndexRelationGetNumberOfKeyAttributes(rel));
+ return ItemPointerCompare(key->scantid, heapTid);
}
+ else
+ {
+ heapTid = BTreeTupleGetHeapTID(itup);
+ if (key->scantid != NULL && heapTid != NULL)
+ {
+ int cmp = ItemPointerCompare(key->scantid, heapTid);
- /*
- * Treat truncated heap TID as minus infinity, since scankey has a key
- * attribute value (scantid) that would otherwise be compared directly
- */
- Assert(key->keysz == IndexRelationGetNumberOfKeyAttributes(rel));
- if (heapTid == NULL)
- return 1;
+ if (cmp == -1 || cmp == 0)
+ {
+ elog(DEBUG4, "offnum %d Scankey (%u,%u) is less than or equal to posting tuple (%u,%u)",
+ offnum, ItemPointerGetBlockNumberNoCheck(key->scantid),
+ ItemPointerGetOffsetNumberNoCheck(key->scantid),
+ ItemPointerGetBlockNumberNoCheck(heapTid),
+ ItemPointerGetOffsetNumberNoCheck(heapTid));
+ return cmp;
+ }
- Assert(ntupatts >= IndexRelationGetNumberOfKeyAttributes(rel));
- return ItemPointerCompare(key->scantid, heapTid);
+ heapTid = BTreeTupleGetMaxTID(itup);
+ cmp = ItemPointerCompare(key->scantid, heapTid);
+ if (cmp == 1)
+ {
+ elog(DEBUG4, "offnum %d Scankey (%u,%u) is greater than posting tuple (%u,%u)",
+ offnum, ItemPointerGetBlockNumberNoCheck(key->scantid),
+ ItemPointerGetOffsetNumberNoCheck(key->scantid),
+ ItemPointerGetBlockNumberNoCheck(heapTid),
+ ItemPointerGetOffsetNumberNoCheck(heapTid));
+ return cmp;
+ }
+
+ /*
+ * If we got here, scantid falls between the posting items of the
+ * tuple.
+ */
+ elog(DEBUG4, "offnum %d Scankey (%u,%u) is between posting items (%u,%u) and (%u,%u)",
+ offnum, ItemPointerGetBlockNumberNoCheck(key->scantid),
+ ItemPointerGetOffsetNumberNoCheck(key->scantid),
+ ItemPointerGetBlockNumberNoCheck(BTreeTupleGetHeapTID(itup)),
+ ItemPointerGetOffsetNumberNoCheck(BTreeTupleGetHeapTID(itup)),
+ ItemPointerGetBlockNumberNoCheck(heapTid),
+ ItemPointerGetOffsetNumberNoCheck(heapTid));
+ return 0;
+ }
+ }
+
+ return 0;
}
/*
@@ -1449,6 +1566,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
/* initialize tuple workspace to empty */
so->currPos.nextTupleOffset = 0;
+ so->currPos.prevTupleOffset = 0;
/*
* Now that the current page has been made consistent, the macro should be
@@ -1483,8 +1601,22 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
if (_bt_checkkeys(scan, itup, indnatts, dir, &continuescan))
{
/* tuple passes all scan key conditions, so remember it */
- _bt_saveitem(so, itemIndex, offnum, itup);
- itemIndex++;
+ if (!BTreeTupleIsPosting(itup))
+ {
+ _bt_saveitem(so, itemIndex, offnum, itup);
+ itemIndex++;
+ }
+ else
+ {
+ /* Return posting list "logical" tuples */
+ for (int i = 0; i < BTreeTupleGetNPosting(itup); i++)
+ {
+ _bt_savepostingitem(so, itemIndex, offnum,
+ BTreeTupleGetPostingN(itup, i),
+ itup, i);
+ itemIndex++;
+ }
+ }
}
/* When !continuescan, there can't be any more matches, so stop */
if (!continuescan)
@@ -1517,7 +1649,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
if (!continuescan)
so->currPos.moreRight = false;
- Assert(itemIndex <= MaxIndexTuplesPerPage);
+ Assert(itemIndex <= MaxPostingIndexTuplesPerPage);
so->currPos.firstItem = 0;
so->currPos.lastItem = itemIndex - 1;
so->currPos.itemIndex = 0;
@@ -1525,7 +1657,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
else
{
/* load items[] in descending order */
- itemIndex = MaxIndexTuplesPerPage;
+ itemIndex = MaxPostingIndexTuplesPerPage;
offnum = Min(offnum, maxoff);
@@ -1567,8 +1699,23 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
if (passes_quals && tuple_alive)
{
/* tuple passes all scan key conditions, so remember it */
- itemIndex--;
- _bt_saveitem(so, itemIndex, offnum, itup);
+ if (!BTreeTupleIsPosting(itup))
+ {
+ itemIndex--;
+ _bt_saveitem(so, itemIndex, offnum, itup);
+ }
+ else
+ {
+ /* Return posting list "logical" tuples */
+ /* XXX: Maybe this loop should be backwards? */
+ for (int i = 0; i < BTreeTupleGetNPosting(itup); i++)
+ {
+ itemIndex--;
+ _bt_savepostingitem(so, itemIndex, offnum,
+ BTreeTupleGetPostingN(itup, i),
+ itup, i);
+ }
+ }
}
if (!continuescan)
{
@@ -1582,8 +1729,8 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
Assert(itemIndex >= 0);
so->currPos.firstItem = itemIndex;
- so->currPos.lastItem = MaxIndexTuplesPerPage - 1;
- so->currPos.itemIndex = MaxIndexTuplesPerPage - 1;
+ so->currPos.lastItem = MaxPostingIndexTuplesPerPage - 1;
+ so->currPos.itemIndex = MaxPostingIndexTuplesPerPage - 1;
}
return (so->currPos.firstItem <= so->currPos.lastItem);
@@ -1596,6 +1743,8 @@ _bt_saveitem(BTScanOpaque so, int itemIndex,
{
BTScanPosItem *currItem = &so->currPos.items[itemIndex];
+ Assert(!BTreeTupleIsPosting(itup));
+
currItem->heapTid = itup->t_tid;
currItem->indexOffset = offnum;
if (so->currTuples)
@@ -1608,6 +1757,33 @@ _bt_saveitem(BTScanOpaque so, int itemIndex,
}
}
+/* Save an index item into so->currPos.items[itemIndex] for posting tuples. */
+static void
+_bt_savepostingitem(BTScanOpaque so, int itemIndex, OffsetNumber offnum,
+ ItemPointer iptr, IndexTuple itup, int i)
+{
+ BTScanPosItem *currItem = &so->currPos.items[itemIndex];
+
+ currItem->heapTid = *iptr;
+ currItem->indexOffset = offnum;
+
+ if (so->currTuples)
+ {
+ if (i == 0)
+ {
+ /* Save the key; it is the same for all tuples in the posting list */
+ Size itupsz = BTreeTupleGetPostingOffset(itup);
+
+ currItem->tupleOffset = so->currPos.nextTupleOffset;
+ memcpy(so->currTuples + so->currPos.nextTupleOffset, itup, itupsz);
+ so->currPos.nextTupleOffset += MAXALIGN(itupsz);
+ so->currPos.prevTupleOffset = currItem->tupleOffset;
+ }
+ else
+ currItem->tupleOffset = so->currPos.prevTupleOffset;
+ }
+}
+
/*
* _bt_steppage() -- Step to next page containing valid data for scan
*
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index e678690..7c3a42b 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -288,6 +288,8 @@ static void _bt_sortaddtup(Page page, Size itemsize,
static void _bt_buildadd(BTWriteState *wstate, BTPageState *state,
IndexTuple itup);
static void _bt_uppershutdown(BTWriteState *wstate, BTPageState *state);
+static void _bt_buildadd_posting(BTWriteState *wstate, BTPageState *state,
+ BTDeduplicateState *deduplicateState);
static void _bt_load(BTWriteState *wstate,
BTSpool *btspool, BTSpool *btspool2);
static void _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent,
@@ -963,6 +965,11 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
* Overwrite the old item with new truncated high key directly.
* oitup is already located at the physical beginning of tuple
* space, so this should directly reuse the existing tuple space.
+ *
+ * If the lastleft tuple was a posting tuple, we'll truncate its
+ * posting list in _bt_truncate as well. Note that this applies
+ * only to leaf pages, since internal pages never contain posting
+ * tuples.
*/
ii = PageGetItemId(opage, OffsetNumberPrev(last_off));
lastleft = (IndexTuple) PageGetItem(opage, ii);
@@ -1002,6 +1009,7 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
* the minimum key for the new page.
*/
state->btps_minkey = CopyIndexTuple(oitup);
+ Assert(BTreeTupleIsPivot(state->btps_minkey));
/*
* Set the sibling links for both pages.
@@ -1043,6 +1051,7 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
Assert(state->btps_minkey == NULL);
state->btps_minkey = CopyIndexTuple(itup);
/* _bt_sortaddtup() will perform full truncation later */
+ BTreeTupleClearBtIsPosting(state->btps_minkey);
BTreeTupleSetNAtts(state->btps_minkey, 0);
}
@@ -1128,6 +1137,91 @@ _bt_uppershutdown(BTWriteState *wstate, BTPageState *state)
}
/*
+ * Add new tuple (posting or non-posting) to the page while building index.
+ */
+static void
+_bt_buildadd_posting(BTWriteState *wstate, BTPageState *state,
+ BTDeduplicateState *deduplicateState)
+{
+ IndexTuple to_insert;
+
+ /* Return if there is no tuple to insert */
+ if (state == NULL)
+ return;
+
+ if (deduplicateState->ntuples == 0)
+ to_insert = deduplicateState->itupprev;
+ else
+ {
+ IndexTuple postingtuple;
+
+ /* form a tuple with a posting list */
+ postingtuple = BTreeFormPostingTuple(deduplicateState->itupprev,
+ deduplicateState->ipd,
+ deduplicateState->ntuples);
+ to_insert = postingtuple;
+ pfree(deduplicateState->ipd);
+ }
+
+ _bt_buildadd(wstate, state, to_insert);
+
+ if (deduplicateState->ntuples > 0)
+ pfree(to_insert);
+ deduplicateState->ntuples = 0;
+}
+
+/*
+ * Save item pointer(s) of itup to the posting list in deduplicateState.
+ *
+ * Helper function for _bt_load() and _bt_deduplicate_one_page().
+ *
+ * Note: caller is responsible for size check to ensure that resulting tuple
+ * won't exceed BTMaxItemSize.
+ */
+void
+_bt_add_posting_item(BTDeduplicateState *deduplicateState, IndexTuple itup)
+{
+ int nposting = 0;
+
+ if (deduplicateState->ntuples == 0)
+ {
+ deduplicateState->ipd = palloc0(deduplicateState->maxitemsize);
+
+ if (BTreeTupleIsPosting(deduplicateState->itupprev))
+ {
+ /* if itupprev is posting, add all its TIDs to the posting list */
+ nposting = BTreeTupleGetNPosting(deduplicateState->itupprev);
+ memcpy(deduplicateState->ipd,
+ BTreeTupleGetPosting(deduplicateState->itupprev),
+ sizeof(ItemPointerData) * nposting);
+ deduplicateState->ntuples += nposting;
+ }
+ else
+ {
+ memcpy(deduplicateState->ipd, deduplicateState->itupprev,
+ sizeof(ItemPointerData));
+ deduplicateState->ntuples++;
+ }
+ }
+
+ if (BTreeTupleIsPosting(itup))
+ {
+ /* if tuple is posting, add all its TIDs to the posting list */
+ nposting = BTreeTupleGetNPosting(itup);
+ memcpy(deduplicateState->ipd + deduplicateState->ntuples,
+ BTreeTupleGetPosting(itup),
+ sizeof(ItemPointerData) * nposting);
+ deduplicateState->ntuples += nposting;
+ }
+ else
+ {
+ memcpy(deduplicateState->ipd + deduplicateState->ntuples, itup,
+ sizeof(ItemPointerData));
+ deduplicateState->ntuples++;
+ }
+}
+
+/*
* Read tuples in correct sort order from tuplesort, and load them into
* btree leaves.
*/
@@ -1141,9 +1235,20 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
bool load1;
TupleDesc tupdes = RelationGetDescr(wstate->index);
int i,
- keysz = IndexRelationGetNumberOfKeyAttributes(wstate->index);
+ keysz = IndexRelationGetNumberOfKeyAttributes(wstate->index),
+ natts = IndexRelationGetNumberOfAttributes(wstate->index);
SortSupport sortKeys;
int64 tuples_done = 0;
+ bool use_deduplication = false;
+ BTDeduplicateState *deduplicateState = NULL;
+
+ /*
+ * Don't use deduplication for indexes with INCLUDEd columns and unique
+ * indexes.
+ */
+ use_deduplication = (IndexRelationGetNumberOfKeyAttributes(wstate->index) ==
+ IndexRelationGetNumberOfAttributes(wstate->index) &&
+ !wstate->index->rd_index->indisunique);
if (merge)
{
@@ -1257,19 +1362,89 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
}
else
{
- /* merge is unnecessary */
- while ((itup = tuplesort_getindextuple(btspool->sortstate,
- true)) != NULL)
+ if (!use_deduplication)
{
- /* When we see first tuple, create first index page */
- if (state == NULL)
- state = _bt_pagestate(wstate, 0);
+ /* merge is unnecessary */
+ while ((itup = tuplesort_getindextuple(btspool->sortstate,
+ true)) != NULL)
+ {
+ /* When we see first tuple, create first index page */
+ if (state == NULL)
+ state = _bt_pagestate(wstate, 0);
- _bt_buildadd(wstate, state, itup);
+ _bt_buildadd(wstate, state, itup);
- /* Report progress */
- pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
- ++tuples_done);
+ /* Report progress */
+ pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
+ ++tuples_done);
+ }
+ }
+ else
+ {
+ /* init state needed to build posting tuples */
+ deduplicateState = (BTDeduplicateState *) palloc0(sizeof(BTDeduplicateState));
+ deduplicateState->ipd = NULL;
+ deduplicateState->ntuples = 0;
+ deduplicateState->itupprev = NULL;
+ deduplicateState->maxitemsize = 0;
+ deduplicateState->maxpostingsize = 0;
+
+ while ((itup = tuplesort_getindextuple(btspool->sortstate,
+ true)) != NULL)
+ {
+ /* When we see first tuple, create first index page */
+ if (state == NULL)
+ {
+ state = _bt_pagestate(wstate, 0);
+ deduplicateState->maxitemsize = BTMaxItemSize(state->btps_page);
+ }
+
+ if (deduplicateState->itupprev != NULL)
+ {
+ int n_equal_atts = _bt_keep_natts_fast(wstate->index,
+ deduplicateState->itupprev, itup);
+
+ if (n_equal_atts > natts)
+ {
+ /*
+ * Tuples are equal. Create or update the posting list.
+ *
+ * If the posting list would become too big, insert it on
+ * the page and continue.
+ */
+ if ((deduplicateState->ntuples + 1) * sizeof(ItemPointerData) <
+ deduplicateState->maxpostingsize)
+ _bt_add_posting_item(deduplicateState, itup);
+ else
+ _bt_buildadd_posting(wstate, state,
+ deduplicateState);
+ }
+ else
+ {
+ /*
+ * Tuples are not equal. Insert itupprev into index.
+ * Save current tuple for the next iteration.
+ */
+ _bt_buildadd_posting(wstate, state, deduplicateState);
+ }
+ }
+
+ /*
+ * Save the tuple to compare it with the next one and maybe
+ * unite them into a posting tuple.
+ */
+ if (deduplicateState->itupprev)
+ pfree(deduplicateState->itupprev);
+ deduplicateState->itupprev = CopyIndexTuple(itup);
+
+ /* compute max size of posting list */
+ deduplicateState->maxpostingsize = deduplicateState->maxitemsize -
+ IndexInfoFindDataOffset(deduplicateState->itupprev->t_info) -
+ MAXALIGN(IndexTupleSize(deduplicateState->itupprev));
+ }
+
+ /* Handle the last item */
+ _bt_buildadd_posting(wstate, state, deduplicateState);
}
}
diff --git a/src/backend/access/nbtree/nbtsplitloc.c b/src/backend/access/nbtree/nbtsplitloc.c
index a7882fd..c492b04 100644
--- a/src/backend/access/nbtree/nbtsplitloc.c
+++ b/src/backend/access/nbtree/nbtsplitloc.c
@@ -62,6 +62,11 @@ typedef struct
int nsplits; /* current number of splits */
SplitPoint *splits; /* all candidate split points for page */
int interval; /* current range of acceptable split points */
+
+ /* fields only valid when the insert splits a posting tuple */
+ OffsetNumber replaceitemoff;
+ IndexTuple replaceitem;
+ Size replaceitemsz;
} FindSplitData;
static void _bt_recsplitloc(FindSplitData *state,
@@ -129,6 +134,9 @@ _bt_findsplitloc(Relation rel,
OffsetNumber newitemoff,
Size newitemsz,
IndexTuple newitem,
+ OffsetNumber replaceitemoff,
+ Size replaceitemsz,
+ IndexTuple replaceitem,
bool *newitemonleft)
{
BTPageOpaque opaque;
@@ -183,6 +191,10 @@ _bt_findsplitloc(Relation rel,
state.minfirstrightsz = SIZE_MAX;
state.newitemoff = newitemoff;
+ state.replaceitemoff = replaceitemoff;
+ state.replaceitemsz = replaceitemsz;
+ state.replaceitem = replaceitem;
+
/*
* maxsplits should never exceed maxoff because there will be at most as
* many candidate split points as there are points _between_ tuples, once
@@ -207,7 +219,17 @@ _bt_findsplitloc(Relation rel,
Size itemsz;
itemid = PageGetItemId(page, offnum);
- itemsz = MAXALIGN(ItemIdGetLength(itemid)) + sizeof(ItemIdData);
+
+ /* use the size of the replacement item for calculations */
+ if (offnum == replaceitemoff)
+ {
+ itemsz = replaceitemsz + sizeof(ItemIdData);
+ olddataitemstotal = state.olddataitemstotal = state.olddataitemstotal
+ - MAXALIGN(ItemIdGetLength(itemid))
+ + replaceitemsz;
+ }
+ else
+ itemsz = MAXALIGN(ItemIdGetLength(itemid)) + sizeof(ItemIdData);
/*
* When item offset number is not newitemoff, neither side of the
@@ -466,9 +488,13 @@ _bt_recsplitloc(FindSplitData *state,
&& !newitemonleft);
if (newitemisfirstonright)
+ {
firstrightitemsz = state->newitemsz;
+ }
else
+ {
firstrightitemsz = firstoldonrightsz;
+ }
/* Account for all the old tuples */
leftfree = state->leftspace - olddataitemstoleft;
@@ -492,12 +518,12 @@ _bt_recsplitloc(FindSplitData *state,
* adding a heap TID to the left half's new high key when splitting at the
* leaf level. In practice the new high key will often be smaller and
* will rarely be larger, but conservatively assume the worst case.
+ * Truncation always truncates away any posting list that appears in the
+ * first right tuple, though, so it's safe to subtract that overhead
+ * (while still conservatively assuming that truncation might have to add
+ * back a single heap TID using the pivot tuple heap TID representation).
*/
- if (state->is_leaf)
- leftfree -= (int16) (firstrightitemsz +
- MAXALIGN(sizeof(ItemPointerData)));
- else
- leftfree -= (int16) firstrightitemsz;
+ leftfree -= (int16) firstrightitemsz;
/* account for the new item */
if (newitemonleft)
@@ -1066,13 +1092,20 @@ static inline IndexTuple
_bt_split_lastleft(FindSplitData *state, SplitPoint *split)
{
ItemId itemid;
+ OffsetNumber offset;
if (split->newitemonleft && split->firstoldonright == state->newitemoff)
return state->newitem;
- itemid = PageGetItemId(state->page,
- OffsetNumberPrev(split->firstoldonright));
- return (IndexTuple) PageGetItem(state->page, itemid);
+ offset = OffsetNumberPrev(split->firstoldonright);
+ if (offset == state->replaceitemoff)
+ return state->replaceitem;
+ else
+ {
+ itemid = PageGetItemId(state->page,
+ OffsetNumberPrev(split->firstoldonright));
+ return (IndexTuple) PageGetItem(state->page, itemid);
+ }
}
/*
@@ -1086,6 +1119,11 @@ _bt_split_firstright(FindSplitData *state, SplitPoint *split)
if (!split->newitemonleft && split->firstoldonright == state->newitemoff)
return state->newitem;
- itemid = PageGetItemId(state->page, split->firstoldonright);
- return (IndexTuple) PageGetItem(state->page, itemid);
+ if (split->firstoldonright == state->replaceitemoff)
+ return state->replaceitem;
+ else
+ {
+ itemid = PageGetItemId(state->page, split->firstoldonright);
+ return (IndexTuple) PageGetItem(state->page, itemid);
+ }
}
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 9b172c1..c506cca 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -111,8 +111,12 @@ _bt_mkscankey(Relation rel, IndexTuple itup)
key->nextkey = false;
key->pivotsearch = false;
key->keysz = Min(indnkeyatts, tupnatts);
- key->scantid = key->heapkeyspace && itup ?
- BTreeTupleGetHeapTID(itup) : NULL;
+
+ if (itup && key->heapkeyspace)
+ key->scantid = BTreeTupleGetHeapTID(itup);
+ else
+ key->scantid = NULL;
+
skey = key->scankeys;
for (i = 0; i < indnkeyatts; i++)
{
@@ -1787,7 +1791,9 @@ _bt_killitems(IndexScanDesc scan)
ItemId iid = PageGetItemId(page, offnum);
IndexTuple ituple = (IndexTuple) PageGetItem(page, iid);
- if (ItemPointerEquals(&ituple->t_tid, &kitem->heapTid))
+ /* No microvacuum for posting tuples */
+ if (!BTreeTupleIsPosting(ituple) &&
+ (ItemPointerEquals(&ituple->t_tid, &kitem->heapTid)))
{
/* found the item */
ItemIdMarkDead(iid);
@@ -2112,6 +2118,7 @@ btbuildphasename(int64 phasenum)
* returning an enlarged tuple to caller when truncation + TOAST compression
* ends up enlarging the final datum.
*/
+
IndexTuple
_bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
BTScanInsert itup_key)
@@ -2124,6 +2131,17 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
ItemPointer pivotheaptid;
Size newsize;
+ elog(DEBUG4, "_bt_truncate left N %d (%u,%u) to (%u,%u), right N %d (%u,%u) to (%u,%u) ",
+ BTreeTupleIsPosting(lastleft) ? BTreeTupleGetNPosting(lastleft) : 0,
+ ItemPointerGetBlockNumberNoCheck(BTreeTupleGetHeapTID(lastleft)),
+ ItemPointerGetOffsetNumberNoCheck(BTreeTupleGetHeapTID(lastleft)),
+ ItemPointerGetBlockNumberNoCheck(BTreeTupleGetMaxTID(lastleft)),
+ ItemPointerGetOffsetNumberNoCheck(BTreeTupleGetMaxTID(lastleft)),
+ BTreeTupleIsPosting(firstright) ? BTreeTupleGetNPosting(firstright) : 0,
+ ItemPointerGetBlockNumberNoCheck(BTreeTupleGetHeapTID(firstright)),
+ ItemPointerGetOffsetNumberNoCheck(BTreeTupleGetHeapTID(firstright)),
+ ItemPointerGetBlockNumberNoCheck(BTreeTupleGetMaxTID(firstright)),
+ ItemPointerGetOffsetNumberNoCheck(BTreeTupleGetMaxTID(firstright)));
/*
* We should only ever truncate leaf index tuples. It's never okay to
* truncate a second time.
@@ -2145,6 +2163,16 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
pivot = index_truncate_tuple(itupdesc, firstright, keepnatts);
+ if (BTreeTupleIsPosting(firstright))
+ {
+ BTreeTupleClearBtIsPosting(pivot);
+ BTreeTupleSetNAtts(pivot, keepnatts);
+ pivot->t_info &= ~INDEX_SIZE_MASK;
+ pivot->t_info |= BTreeTupleGetPostingOffset(firstright);
+ }
+
+ Assert(!BTreeTupleIsPosting(pivot));
+
/*
* If there is a distinguishing key attribute within new pivot tuple,
* there is no need to add an explicit heap TID attribute
@@ -2161,6 +2189,8 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
* attribute to the new pivot tuple.
*/
Assert(natts != nkeyatts);
+ Assert(!BTreeTupleIsPosting(lastleft));
+ Assert(!BTreeTupleIsPosting(firstright));
newsize = IndexTupleSize(pivot) + MAXALIGN(sizeof(ItemPointerData));
tidpivot = palloc0(newsize);
memcpy(tidpivot, pivot, IndexTupleSize(pivot));
@@ -2168,6 +2198,27 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
pfree(pivot);
pivot = tidpivot;
}
+ else if (BTreeTupleIsPosting(firstright))
+ {
+ /*
+ * No truncation was possible, since key attributes are all equal. But
+ * the tuple is a posting tuple with a posting list, so we still
+ * must truncate it.
+ *
+ * It's necessary to add a heap TID attribute to the new pivot tuple.
+ */
+ newsize = BTreeTupleGetPostingOffset(firstright) +
+ MAXALIGN(sizeof(ItemPointerData));
+ pivot = palloc0(newsize);
+ memcpy(pivot, firstright, BTreeTupleGetPostingOffset(firstright));
+
+ pivot->t_info &= ~INDEX_SIZE_MASK;
+ pivot->t_info |= newsize;
+ BTreeTupleClearBtIsPosting(pivot);
+ BTreeTupleSetAltHeapTID(pivot);
+
+ Assert(!BTreeTupleIsPosting(pivot));
+ }
else
{
/*
@@ -2205,7 +2256,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
*/
pivotheaptid = (ItemPointer) ((char *) pivot + newsize -
sizeof(ItemPointerData));
- ItemPointerCopy(&lastleft->t_tid, pivotheaptid);
+ ItemPointerCopy(BTreeTupleGetMaxTID(lastleft), pivotheaptid);
/*
* Lehman and Yao require that the downlink to the right page, which is to
@@ -2216,9 +2267,12 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
* tiebreaker.
*/
#ifndef DEBUG_NO_TRUNCATE
- Assert(ItemPointerCompare(&lastleft->t_tid, &firstright->t_tid) < 0);
- Assert(ItemPointerCompare(pivotheaptid, &lastleft->t_tid) >= 0);
- Assert(ItemPointerCompare(pivotheaptid, &firstright->t_tid) < 0);
+ Assert(ItemPointerCompare(BTreeTupleGetMaxTID(lastleft),
+ BTreeTupleGetHeapTID(firstright)) < 0);
+ Assert(ItemPointerCompare(pivotheaptid,
+ BTreeTupleGetHeapTID(lastleft)) >= 0);
+ Assert(ItemPointerCompare(pivotheaptid,
+ BTreeTupleGetHeapTID(firstright)) < 0);
#else
/*
@@ -2231,7 +2285,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
* attribute values along with lastleft's heap TID value when lastleft's
* TID happens to be greater than firstright's TID.
*/
- ItemPointerCopy(&firstright->t_tid, pivotheaptid);
+ ItemPointerCopy(BTreeTupleGetHeapTID(firstright), pivotheaptid);
/*
* Pivot heap TID should never be fully equal to firstright. Note that
@@ -2240,7 +2294,8 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
*/
ItemPointerSetOffsetNumber(pivotheaptid,
OffsetNumberPrev(ItemPointerGetOffsetNumber(pivotheaptid)));
- Assert(ItemPointerCompare(pivotheaptid, &firstright->t_tid) < 0);
+ Assert(ItemPointerCompare(pivotheaptid,
+ BTreeTupleGetHeapTID(firstright)) < 0);
#endif
BTreeTupleSetNAtts(pivot, nkeyatts);
@@ -2330,6 +2385,25 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
* leaving excessive amounts of free space on either side of page split.
* Callers can rely on the fact that attributes considered equal here are
* definitely also equal according to _bt_keep_natts.
+ *
+ * To build a posting tuple we need to ensure that all attributes
+ * of both tuples are equal. Use this function to compare them.
+ * TODO: maybe it's worth renaming the function.
+ *
+ * XXX: Obviously we need infrastructure for making sure it is okay to use
+ * this for posting list stuff. For example, non-deterministic collations
+ * cannot use deduplication, and will not work with what we have now.
+ *
+ * XXX: Even then, we probably also need to worry about TOAST as a special
+ * case. Don't repeat bugs like the amcheck bug that was fixed in commit
+ * eba775345d23d2c999bbb412ae658b6dab36e3e8. As the test case added in that
+ * commit shows, we need to worry about pg_attribute.attstorage changing in
+ * the underlying table due to an ALTER TABLE (and maybe a few other things
+ * like that). In general, the "TOAST input state" of a TOASTable datum isn't
+ * something that we make many guarantees about today, so even with C
+ * collation text we could in theory get different answers from
+ * _bt_keep_natts_fast() and _bt_keep_natts(). This needs to be nailed down
+ * in some way.
*/
int
_bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
@@ -2415,7 +2489,7 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
* Non-pivot tuples currently never use alternative heap TID
* representation -- even those within heapkeyspace indexes
*/
- if ((itup->t_info & INDEX_ALT_TID_MASK) != 0)
+ if (BTreeTupleIsPivot(itup))
return false;
/*
@@ -2470,7 +2544,7 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
* that to decide if the tuple is a pre-v11 tuple.
*/
return tupnatts == 0 ||
- ((itup->t_info & INDEX_ALT_TID_MASK) == 0 &&
+ (!BTreeTupleIsPivot(itup) &&
ItemPointerGetOffsetNumber(&(itup->t_tid)) == P_HIKEY);
}
else
@@ -2497,7 +2571,7 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
* heapkeyspace index pivot tuples, regardless of whether or not there are
* non-key attributes.
*/
- if ((itup->t_info & INDEX_ALT_TID_MASK) == 0)
+ if (!BTreeTupleIsPivot(itup))
return false;
/*
@@ -2549,6 +2623,8 @@ _bt_check_third_page(Relation rel, Relation heap, bool needheaptidspace,
if (!needheaptidspace && itemsz <= BTMaxItemSizeNoHeapTid(page))
return;
+ /* TODO correct error messages for posting tuples */
+
/*
* Internal page insertions cannot fail here, because that would mean that
* an earlier leaf level insertion that should have failed didn't
@@ -2575,3 +2651,79 @@ _bt_check_third_page(Relation rel, Relation heap, bool needheaptidspace,
"or use full text indexing."),
errtableconstraint(heap, RelationGetRelationName(rel))));
}
+
+/*
+ * Given a basic tuple that contains key datum and posting list,
+ * build a posting tuple.
+ *
+ * Basic tuple can be a posting tuple, but we only use key part of it,
+ * all ItemPointers must be passed via ipd.
+ *
+ * If nipd == 1, fall back to building a non-posting tuple.
+ * This is necessary to avoid storage overhead after a posting tuple was vacuumed.
+ */
+IndexTuple
+BTreeFormPostingTuple(IndexTuple tuple, ItemPointerData *ipd, int nipd)
+{
+ uint32 keysize,
+ newsize = 0;
+ IndexTuple itup;
+
+ /* We only need key part of the tuple */
+ if (BTreeTupleIsPosting(tuple))
+ keysize = BTreeTupleGetPostingOffset(tuple);
+ else
+ keysize = IndexTupleSize(tuple);
+
+ Assert(nipd > 0);
+
+ /* Add space needed for posting list */
+ if (nipd > 1)
+ newsize = SHORTALIGN(keysize) + sizeof(ItemPointerData) * nipd;
+ else
+ newsize = keysize;
+
+ newsize = MAXALIGN(newsize);
+ itup = palloc0(newsize);
+ memcpy(itup, tuple, keysize);
+ itup->t_info &= ~INDEX_SIZE_MASK;
+ itup->t_info |= newsize;
+
+ if (nipd > 1)
+ {
+ /* Form posting tuple, fill posting fields */
+
+ /* Set meta info about the posting list */
+ itup->t_info |= INDEX_ALT_TID_MASK;
+ BTreeSetPostingMeta(itup, nipd, SHORTALIGN(keysize));
+
+ /* sort the list to preserve TID order invariant */
+ qsort((void *) ipd, nipd, sizeof(ItemPointerData),
+ (int (*) (const void *, const void *)) ItemPointerCompare);
+
+ /* Copy posting list into the posting tuple */
+ memcpy(BTreeTupleGetPosting(itup), ipd,
+ sizeof(ItemPointerData) * nipd);
+ }
+ else
+ {
+ /* To finish building of a non-posting tuple, copy TID from ipd */
+ itup->t_info &= ~INDEX_ALT_TID_MASK;
+ ItemPointerCopy(ipd, &itup->t_tid);
+ }
+
+ return itup;
+}
+
+/*
+ * Opposite of BTreeFormPostingTuple.
+ * Returns a regular tuple that contains the key; the TID of the new tuple
+ * is the nth TID of the original tuple's posting list.
+ * The result tuple is palloc'd in the caller's context.
+ */
+IndexTuple
+BTreeGetNthTupleOfPosting(IndexTuple tuple, int n)
+{
+ Assert(BTreeTupleIsPosting(tuple));
+ return BTreeFormPostingTuple(tuple, BTreeTupleGetPostingN(tuple, n), 1);
+}
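For readers following the patch, the size calculation and TID sorting done by BTreeFormPostingTuple() above can be tried out in isolation. The following is a minimal standalone sketch, not PostgreSQL code: SketchTid, the SKETCH_* macros and the sketch_* functions are hypothetical stand-ins, and it assumes a 6-byte TID, 2-byte SHORTALIGN and 8-byte MAXALIGN.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Assumed alignment rules: 2-byte SHORTALIGN, 8-byte MAXALIGN. */
#define SKETCH_SHORTALIGN(len) (((size_t) (len) + 1) & ~((size_t) 1))
#define SKETCH_MAXALIGN(len)   (((size_t) (len) + 7) & ~((size_t) 7))

/* Hypothetical stand-in for ItemPointerData: 6 bytes, 2-byte aligned. */
typedef struct SketchTid
{
    uint16_t blk_hi;    /* high half of the heap block number */
    uint16_t blk_lo;    /* low half of the heap block number */
    uint16_t offset;    /* line pointer offset within the heap block */
} SketchTid;

static uint32_t
sketch_tid_block(const SketchTid *tid)
{
    return ((uint32_t) tid->blk_hi << 16) | tid->blk_lo;
}

/* Order TIDs by block number first, then by offset within the block. */
static int
sketch_tid_compare(const void *a, const void *b)
{
    const SketchTid *ta = (const SketchTid *) a;
    const SketchTid *tb = (const SketchTid *) b;
    uint32_t    blka = sketch_tid_block(ta);
    uint32_t    blkb = sketch_tid_block(tb);

    if (blka != blkb)
        return (blka < blkb) ? -1 : 1;
    if (ta->offset != tb->offset)
        return (ta->offset < tb->offset) ? -1 : 1;
    return 0;
}

/*
 * Total tuple size for a key part of keysize bytes and ntids heap TIDs.
 * With a single TID no posting list is appended, mirroring the
 * nipd == 1 fallback in the patch.
 */
static size_t
sketch_posting_tuple_size(size_t keysize, int ntids)
{
    size_t      size = keysize;

    assert(ntids > 0);
    if (ntids > 1)
        size = SKETCH_SHORTALIGN(keysize) + sizeof(SketchTid) * (size_t) ntids;
    return SKETCH_MAXALIGN(size);
}

/* Sort a posting list in place so that TIDs are in ascending order. */
static void
sketch_sort_tids(SketchTid *tids, int ntids)
{
    qsort(tids, (size_t) ntids, sizeof(SketchTid), sketch_tid_compare);
}
```

Under these assumptions, a 16-byte key part with three TIDs needs SHORTALIGN(16) + 3 * 6 = 34 bytes, which MAXALIGN rounds up to 40, while a single TID falls back to the plain 16-byte tuple.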
diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index dd5315c..06ac688 100644
--- a/src/backend/access/nbtree/nbtxlog.c
+++ b/src/backend/access/nbtree/nbtxlog.c
@@ -174,16 +174,39 @@ btree_xlog_insert(bool isleaf, bool ismeta, XLogReaderState *record)
*/
if (!isleaf)
_bt_clear_incomplete_split(record, 1);
+
if (XLogReadBufferForRedo(record, 0, &buffer) == BLK_NEEDS_REDO)
{
- Size datalen;
- char *datapos = XLogRecGetBlockData(record, 0, &datalen);
-
page = BufferGetPage(buffer);
+ if (isleaf && xlrec->righttupoffset)
+ {
+ Size datalen, lefttuplen;
+ char *datapos = XLogRecGetBlockData(record, 0, &datalen);
+ IndexTuple lefttup = NULL;
+ IndexTuple righttup = NULL;
- if (PageAddItem(page, (Item) datapos, datalen, xlrec->offnum,
- false, false) == InvalidOffsetNumber)
- elog(PANIC, "btree_xlog_insert: failed to add item");
+ lefttup = (IndexTuple) datapos;
+
+ if (xlrec->righttupoffset > 1)
+ {
+ lefttuplen = xlrec->righttupoffset;
+ righttup = (IndexTuple) (datapos + lefttuplen);
+ }
+ else
+ lefttuplen = datalen;
+
+ _bt_delete_and_insert(InvalidBuffer, page,
+ lefttup, righttup, xlrec->offnum, false);
+ }
+ else
+ {
+ Size datalen;
+ char *datapos = XLogRecGetBlockData(record, 0, &datalen);
+
+ if (PageAddItem(page, (Item) datapos, datalen, xlrec->offnum,
+ false, false) == InvalidOffsetNumber)
+ elog(PANIC, "btree_xlog_insert: failed to add item");
+ }
PageSetLSN(page, lsn);
MarkBufferDirty(buffer);
@@ -265,9 +288,11 @@ btree_xlog_split(bool onleft, XLogReaderState *record)
BTPageOpaque lopaque = (BTPageOpaque) PageGetSpecialPointer(lpage);
OffsetNumber off;
IndexTuple newitem = NULL,
- left_hikey = NULL;
+ left_hikey = NULL,
+ replaceitem = NULL;
Size newitemsz = 0,
- left_hikeysz = 0;
+ left_hikeysz = 0,
+ replaceitemsz = 0;
Page newlpage;
OffsetNumber leftoff;
@@ -287,6 +312,13 @@ btree_xlog_split(bool onleft, XLogReaderState *record)
datapos += left_hikeysz;
datalen -= left_hikeysz;
+ if (xlrec->replaceitemoff)
+ {
+ replaceitem = (IndexTuple) datapos;
+ replaceitemsz = MAXALIGN(IndexTupleSize(replaceitem));
+ datapos += replaceitemsz;
+ datalen -= replaceitemsz;
+ }
Assert(datalen == 0);
newlpage = PageGetTempPageCopySpecial(lpage);
@@ -304,6 +336,15 @@ btree_xlog_split(bool onleft, XLogReaderState *record)
Size itemsz;
IndexTuple item;
+ if (off == xlrec->replaceitemoff)
+ {
+ if (PageAddItem(newlpage, (Item) replaceitem, replaceitemsz, leftoff,
+ false, false) == InvalidOffsetNumber)
+ elog(ERROR, "failed to add new item to left page after split");
+ leftoff = OffsetNumberNext(leftoff);
+ continue;
+ }
+
/* add the new item if it was inserted on left page */
if (onleft && off == xlrec->newitemoff)
{
@@ -386,8 +427,8 @@ btree_xlog_vacuum(XLogReaderState *record)
Buffer buffer;
Page page;
BTPageOpaque opaque;
-#ifdef UNUSED
xl_btree_vacuum *xlrec = (xl_btree_vacuum *) XLogRecGetData(record);
+#ifdef UNUSED
/*
* This section of code is thought to be no longer needed, after analysis
@@ -478,14 +519,34 @@ btree_xlog_vacuum(XLogReaderState *record)
if (len > 0)
{
- OffsetNumber *unused;
- OffsetNumber *unend;
+ if (xlrec->nremaining)
+ {
+ OffsetNumber *remainingoffset;
+ IndexTuple remaining;
+ Size itemsz;
+
+ remainingoffset = (OffsetNumber *)
+ (ptr + xlrec->ndeleted * sizeof(OffsetNumber));
+ remaining = (IndexTuple) ((char *) remainingoffset +
+ xlrec->nremaining * sizeof(OffsetNumber));
+
+ /* Handle posting tuples */
+ for (int i = 0; i < xlrec->nremaining; i++)
+ {
+ PageIndexTupleDelete(page, remainingoffset[i]);
- unused = (OffsetNumber *) ptr;
- unend = (OffsetNumber *) ((char *) ptr + len);
+ itemsz = MAXALIGN(IndexTupleSize(remaining));
+
+ if (PageAddItem(page, (Item) remaining, itemsz, remainingoffset[i],
+ false, false) == InvalidOffsetNumber)
+ elog(PANIC, "btree_xlog_vacuum: failed to add remaining item");
+
+ remaining = (IndexTuple) ((char *) remaining + itemsz);
+ }
+ }
- if ((unend - unused) > 0)
- PageIndexMultiDelete(page, unused, unend - unused);
+ if (xlrec->ndeleted)
+ PageIndexMultiDelete(page, (OffsetNumber *) ptr, xlrec->ndeleted);
}
/*
diff --git a/src/backend/access/rmgrdesc/nbtdesc.c b/src/backend/access/rmgrdesc/nbtdesc.c
index a14eb79..e4fa99a 100644
--- a/src/backend/access/rmgrdesc/nbtdesc.c
+++ b/src/backend/access/rmgrdesc/nbtdesc.c
@@ -46,8 +46,10 @@ btree_desc(StringInfo buf, XLogReaderState *record)
{
xl_btree_vacuum *xlrec = (xl_btree_vacuum *) rec;
- appendStringInfo(buf, "lastBlockVacuumed %u",
- xlrec->lastBlockVacuumed);
+ appendStringInfo(buf, "lastBlockVacuumed %u; nremaining %u; ndeleted %u",
+ xlrec->lastBlockVacuumed,
+ xlrec->nremaining,
+ xlrec->ndeleted);
break;
}
case XLOG_BTREE_DELETE:
diff --git a/src/include/access/itup.h b/src/include/access/itup.h
index 744ffb6..b10c0d5 100644
--- a/src/include/access/itup.h
+++ b/src/include/access/itup.h
@@ -141,6 +141,10 @@ typedef IndexAttributeBitMapData * IndexAttributeBitMap;
* On such a page, N tuples could take one MAXALIGN quantum less space than
* estimated here, seemingly allowing one more tuple than estimated here.
* But such a page always has at least MAXALIGN special space, so we're safe.
+ *
+ * Note: btree leaf pages may contain posting tuples, which store duplicates
+ * in a more effective way, so they may contain more tuples.
+ * Use MaxPostingIndexTuplesPerPage instead.
*/
#define MaxIndexTuplesPerPage \
((int) ((BLCKSZ - SizeOfPageHeaderData) / \
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 7e54c45..d76fbe9 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -234,8 +234,7 @@ typedef struct BTMetaPageData
* t_tid | t_info | key values | INCLUDE columns, if any
*
* t_tid points to the heap TID, which is a tiebreaker key column as of
- * BTREE_VERSION 4. Currently, the INDEX_ALT_TID_MASK status bit is never
- * set for non-pivot tuples.
+ * BTREE_VERSION 4.
*
* All other types of index tuples ("pivot" tuples) only have key columns,
* since pivot tuples only exist to represent how the key space is
@@ -252,6 +251,39 @@ typedef struct BTMetaPageData
* omitted rather than truncated, since its representation is different to
* the non-pivot representation.)
*
+ * Non-pivot posting tuple format:
+ * t_tid | t_info | key values | INCLUDE columns, if any | posting_list[]
+ *
+ * In order to store duplicated keys more effectively,
+ * we use special format of tuples - posting tuples.
+ * posting_list is an array of ItemPointerData.
+ *
+ * This type of deduplication never applies to unique indexes or indexes
+ * with INCLUDEd columns.
+ *
+ * To distinguish posting tuples we use the INDEX_ALT_TID_MASK flag in t_info
+ * and the BT_IS_POSTING flag in t_tid.
+ * These flags redefine the content of the posting tuple's tid:
+ * - t_tid.ip_blkid contains the offset of the posting list.
+ * - t_tid offset field contains the number of posting items in this tuple.
+ *
+ * The 12 least significant offset bits from t_tid are used to represent
+ * the number of posting items in posting tuples, leaving 4 status
+ * bits (BT_RESERVED_OFFSET_MASK bits), 3 of which are reserved for
+ * future use.
+ * BT_N_POSTING_OFFSET_MASK is large enough to store any number of posting
+ * items, which is constrained by BTMaxItemSize.
+ *
+ * If page contains so many duplicates, that they do not fit into one posting
+ * tuple (bounded by BTMaxItemSize and ), page may contain several posting
+ * tuples with the same key.
+ * Also page can contain both posting and non-posting tuples with the same key.
+ * Currently, posting tuples always contain at least two TIDs in the posting
+ * list.
+ *
+ * Posting tuples always have the same number of attributes as the index has
+ * generally.
+ *
* Pivot tuple format:
*
* t_tid | t_info | key values | [heap TID]
@@ -281,23 +313,144 @@ typedef struct BTMetaPageData
* bits (BT_RESERVED_OFFSET_MASK bits), 3 of which that are reserved for
* future use. BT_N_KEYS_OFFSET_MASK should be large enough to store any
* number of columns/attributes <= INDEX_MAX_KEYS.
+ * BT_IS_POSTING bit must be unset for pivot tuples, since we use it
+ * to distinguish posting tuples from pivot tuples.
*
* Note well: The macros that deal with the number of attributes in tuples
- * assume that a tuple with INDEX_ALT_TID_MASK set must be a pivot tuple,
- * and that a tuple without INDEX_ALT_TID_MASK set must be a non-pivot
- * tuple (or must have the same number of attributes as the index has
- * generally in the case of !heapkeyspace indexes). They will need to be
- * updated if non-pivot tuples ever get taught to use INDEX_ALT_TID_MASK
- * for something else.
+ * assume that a tuple with INDEX_ALT_TID_MASK set must be a pivot tuple or
+ * non-pivot posting tuple, and that a tuple without INDEX_ALT_TID_MASK set
+ * must be a non-pivot tuple (or must have the same number of attributes as
+ * the index has generally in the case of !heapkeyspace indexes).
*/
#define INDEX_ALT_TID_MASK INDEX_AM_RESERVED_BIT
/* Item pointer offset bits */
#define BT_RESERVED_OFFSET_MASK 0xF000
#define BT_N_KEYS_OFFSET_MASK 0x0FFF
+#define BT_N_POSTING_OFFSET_MASK 0x0FFF
#define BT_HEAP_TID_ATTR 0x1000
+#define BT_IS_POSTING 0x2000
+
+#define BTreeTupleIsPosting(itup) \
+ ( \
+ ((itup)->t_info & INDEX_ALT_TID_MASK && \
+ ((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) != 0))\
+ )
+
+#define BTreeTupleIsPivot(itup) \
+ ( \
+ ((itup)->t_info & INDEX_ALT_TID_MASK && \
+ ((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) == 0))\
+ )
-/* Get/set downlink block number */
+/*
+ * MaxPostingIndexTuplesPerPage is an upper bound on the number of tuples
+ * that can fit on one btree leaf page.
+ *
+ * Btree leaf pages may contain posting tuples, which store duplicates
+ * in a more effective way, so MaxPostingIndexTuplesPerPage is larger than
+ * MaxIndexTuplesPerPage.
+ *
+ * Each leaf page must contain at least three items, so estimate it as
+ * if we have three posting tuples with minimal size keys.
+ */
+#define MaxPostingIndexTuplesPerPage \
+ ((int) ((BLCKSZ - SizeOfPageHeaderData - \
+ 3*((MAXALIGN(sizeof(IndexTupleData) + 1) + sizeof(ItemIdData))) )) / \
+ (sizeof(ItemPointerData)))
+
+/*
+ * Btree-private state needed to build posting tuples.
+ * ipd is a posting list - an array of ItemPointerData.
+ *
+ * Iterating over tuples during index build or applying deduplication to a
+ * single page, we remember a tuple in itupprev, then compare the next one
+ * with it. If tuples are equal, save their TIDs in the posting list.
+ * ntuples contains the size of the posting list.
+ *
+ * Use maxitemsize and maxpostingsize to ensure that resulting posting tuple
+ * will satisfy BTMaxItemSize.
+ */
+typedef struct BTDeduplicateState
+{
+ Size maxitemsize;
+ Size maxpostingsize;
+ IndexTuple itupprev;
+ int ntuples;
+ ItemPointerData *ipd;
+} BTDeduplicateState;
+
+/* macros to work with posting tuples *BEGIN* */
+#define BTreeTupleSetBtIsPosting(itup) \
+ do { \
+ Assert((itup)->t_info & INDEX_ALT_TID_MASK); \
+ Assert(!((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) != 0)); \
+ ItemPointerSetOffsetNumber(&(itup)->t_tid, \
+ ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) | BT_IS_POSTING); \
+ } while(0)
+
+#define BTreeTupleClearBtIsPosting(itup) \
+ do { \
+ ItemPointerSetOffsetNumber(&(itup)->t_tid, \
+ ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & ~BT_IS_POSTING); \
+ } while(0)
+
+#define BTreeTupleGetNPosting(itup) \
+ ( \
+ AssertMacro(BTreeTupleIsPosting(itup)), \
+ ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_N_POSTING_OFFSET_MASK \
+ )
+
+#define BTreeTupleSetNPosting(itup, n) \
+ do { \
+ ItemPointerSetOffsetNumber(&(itup)->t_tid, (n) & BT_N_POSTING_OFFSET_MASK); \
+ BTreeTupleSetBtIsPosting(itup); \
+ } while(0)
+
+/*
+ * If tuple is posting, t_tid.ip_blkid contains offset of the posting list.
+ * Caller is responsible for checking BTreeTupleIsPosting to ensure that it
+ * will get what is expected.
+ */
+#define BTreeTupleGetPostingOffset(itup) \
+ ( \
+ AssertMacro(BTreeTupleIsPosting(itup)), \
+ ItemPointerGetBlockNumberNoCheck(&((itup)->t_tid)) \
+ )
+#define BTreeTupleSetPostingOffset(itup, offset) \
+ ( \
+ AssertMacro(BTreeTupleIsPosting(itup)), \
+ ItemPointerSetBlockNumber(&((itup)->t_tid), (offset)) \
+ )
+#define BTreeSetPostingMeta(itup, nposting, off) \
+ do { \
+ BTreeTupleSetNPosting(itup, nposting); \
+ BTreeTupleSetPostingOffset(itup, off); \
+ } while(0)
+
+#define BTreeTupleGetPosting(itup) \
+ (ItemPointerData*) ((char*)(itup) + BTreeTupleGetPostingOffset(itup))
+#define BTreeTupleGetPostingN(itup,n) \
+ (ItemPointerData*) (BTreeTupleGetPosting(itup) + (n))
+
+/*
+ * Posting tuples always contain more than one TID. The minimum TID can be
+ * accessed using BTreeTupleGetHeapTID(). The maximum is accessed using
+ * BTreeTupleGetMaxTID().
+ */
+#define BTreeTupleGetMaxTID(itup) \
+ ( \
+ ((itup)->t_info & INDEX_ALT_TID_MASK && \
+ ((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING))) ? \
+ ( \
+ (ItemPointer) (BTreeTupleGetPosting(itup) + (BTreeTupleGetNPosting(itup)-1)) \
+ ) \
+ : \
+ (ItemPointer) &((itup)->t_tid) \
+ )
+/* macros to work with posting tuples *END* */
+
+/* Get/set downlink block number */
#define BTreeInnerTupleGetDownLink(itup) \
ItemPointerGetBlockNumberNoCheck(&((itup)->t_tid))
#define BTreeInnerTupleSetDownLink(itup, blkno) \
@@ -326,7 +479,8 @@ typedef struct BTMetaPageData
*/
#define BTreeTupleGetNAtts(itup, rel) \
( \
- (itup)->t_info & INDEX_ALT_TID_MASK ? \
+ ((itup)->t_info & INDEX_ALT_TID_MASK && \
+ ((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) == 0)) ? \
( \
ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_N_KEYS_OFFSET_MASK \
) \
@@ -335,6 +489,7 @@ typedef struct BTMetaPageData
)
#define BTreeTupleSetNAtts(itup, n) \
do { \
+ Assert(!BTreeTupleIsPosting(itup)); \
(itup)->t_info |= INDEX_ALT_TID_MASK; \
ItemPointerSetOffsetNumber(&(itup)->t_tid, (n) & BT_N_KEYS_OFFSET_MASK); \
} while(0)
@@ -342,6 +497,8 @@ typedef struct BTMetaPageData
/*
* Get tiebreaker heap TID attribute, if any. Macro works with both pivot
* and non-pivot tuples, despite differences in how heap TID is represented.
+ *
+ * For non-pivot posting tuples this returns the first tid from posting list.
*/
#define BTreeTupleGetHeapTID(itup) \
( \
@@ -351,7 +508,10 @@ typedef struct BTMetaPageData
(ItemPointer) (((char *) (itup) + IndexTupleSize(itup)) - \
sizeof(ItemPointerData)) \
) \
- : (itup)->t_info & INDEX_ALT_TID_MASK ? NULL : (ItemPointer) &((itup)->t_tid) \
+ : (itup)->t_info & INDEX_ALT_TID_MASK ? \
+ (((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) != 0) ? \
+ (ItemPointer) BTreeTupleGetPosting(itup) : NULL) \
+ : (ItemPointer) &((itup)->t_tid) \
)
/*
* Set the heap TID attribute for a tuple that uses the INDEX_ALT_TID_MASK
@@ -360,6 +520,7 @@ typedef struct BTMetaPageData
#define BTreeTupleSetAltHeapTID(itup) \
do { \
Assert((itup)->t_info & INDEX_ALT_TID_MASK); \
+ Assert(!((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) != 0)); \
ItemPointerSetOffsetNumber(&(itup)->t_tid, \
ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) | BT_HEAP_TID_ATTR); \
} while(0)
@@ -497,6 +658,12 @@ typedef struct BTInsertStateData
Buffer buf;
/*
+ * if _bt_binsrch_insert() found the location inside existing posting
+ * list, save the position inside the list.
+ */
+ int in_posting_offset;
+
+ /*
* Cache of bounds within the current buffer. Only used for insertions
* where _bt_check_unique is called. See _bt_binsrch_insert and
* _bt_findinsertloc for details.
@@ -563,6 +730,8 @@ typedef struct BTScanPosData
* location in the associated tuple storage workspace.
*/
int nextTupleOffset;
+ /* prevTupleOffset is for posting list handling */
+ int prevTupleOffset;
/*
* The items array is always ordered in index order (ie, increasing
@@ -575,7 +744,7 @@ typedef struct BTScanPosData
int lastItem; /* last valid index in items[] */
int itemIndex; /* current index in items[] */
- BTScanPosItem items[MaxIndexTuplesPerPage]; /* MUST BE LAST */
+ BTScanPosItem items[MaxPostingIndexTuplesPerPage]; /* MUST BE LAST */
} BTScanPosData;
typedef BTScanPosData *BTScanPos;
@@ -729,12 +898,17 @@ extern bool _bt_doinsert(Relation rel, IndexTuple itup,
IndexUniqueCheck checkUnique, Relation heapRel);
extern Buffer _bt_getstackbuf(Relation rel, BTStack stack, BlockNumber child);
extern void _bt_finish_split(Relation rel, Buffer bbuf, BTStack stack);
+extern void _bt_delete_and_insert(Buffer buf, Page page,
+ IndexTuple newitup, IndexTuple newitupright,
+ OffsetNumber newitemoff, bool need_xlog);
/*
* prototypes for functions in nbtsplitloc.c
*/
extern OffsetNumber _bt_findsplitloc(Relation rel, Page page,
OffsetNumber newitemoff, Size newitemsz, IndexTuple newitem,
+ OffsetNumber replaceitemoff, Size replaceitemsz,
+ IndexTuple replaceitem,
bool *newitemonleft);
/*
@@ -759,6 +933,8 @@ extern void _bt_delitems_delete(Relation rel, Buffer buf,
OffsetNumber *itemnos, int nitems, Relation heapRel);
extern void _bt_delitems_vacuum(Relation rel, Buffer buf,
OffsetNumber *itemnos, int nitems,
+ OffsetNumber *remainingoffset,
+ IndexTuple *remaining, int nremaining,
BlockNumber lastBlockVacuumed);
extern int _bt_pagedel(Relation rel, Buffer buf);
@@ -771,6 +947,8 @@ extern Buffer _bt_moveright(Relation rel, BTScanInsert key, Buffer buf,
bool forupdate, BTStack stack, int access, Snapshot snapshot);
extern OffsetNumber _bt_binsrch_insert(Relation rel, BTInsertState insertstate);
extern int32 _bt_compare(Relation rel, BTScanInsert key, Page page, OffsetNumber offnum);
+extern int32 _bt_compare_posting(Relation rel, BTScanInsert key, Page page,
+ OffsetNumber offnum, int *in_posting_offset);
extern bool _bt_first(IndexScanDesc scan, ScanDirection dir);
extern bool _bt_next(IndexScanDesc scan, ScanDirection dir);
extern Buffer _bt_get_endpoint(Relation rel, uint32 level, bool rightmost,
@@ -809,6 +987,9 @@ extern bool _bt_check_natts(Relation rel, bool heapkeyspace, Page page,
OffsetNumber offnum);
extern void _bt_check_third_page(Relation rel, Relation heap,
bool needheaptidspace, Page page, IndexTuple newtup);
+extern IndexTuple BTreeFormPostingTuple(IndexTuple tuple, ItemPointerData *ipd,
+ int nipd);
+extern IndexTuple BTreeGetNthTupleOfPosting(IndexTuple tuple, int n);
/*
* prototypes for functions in nbtvalidate.c
@@ -821,5 +1002,7 @@ extern bool btvalidate(Oid opclassoid);
extern IndexBuildResult *btbuild(Relation heap, Relation index,
struct IndexInfo *indexInfo);
extern void _bt_parallel_build_main(dsm_segment *seg, shm_toc *toc);
+extern void _bt_add_posting_item(BTDeduplicateState *deduplicateState,
+ IndexTuple itup);
#endif /* NBTREE_H */
diff --git a/src/include/access/nbtxlog.h b/src/include/access/nbtxlog.h
index afa614d..312e780 100644
--- a/src/include/access/nbtxlog.h
+++ b/src/include/access/nbtxlog.h
@@ -64,13 +64,16 @@ typedef struct xl_btree_metadata
* Backup Blk 0: original page (data contains the inserted tuple)
* Backup Blk 1: child's left sibling, if INSERT_UPPER or INSERT_META
* Backup Blk 2: xl_btree_metadata, if INSERT_META
+ *
+ * INSERT_LEAF case
*/
typedef struct xl_btree_insert
{
OffsetNumber offnum;
+ Size righttupoffset;
} xl_btree_insert;
-#define SizeOfBtreeInsert (offsetof(xl_btree_insert, offnum) + sizeof(OffsetNumber))
+#define SizeOfBtreeInsert (offsetof(xl_btree_insert, righttupoffset) + sizeof(Size))
/*
* On insert with split, we save all the items going into the right sibling
@@ -113,9 +116,10 @@ typedef struct xl_btree_split
uint32 level; /* tree level of page being split */
OffsetNumber firstright; /* first item moved to right page */
OffsetNumber newitemoff; /* new item's offset (if placed on left page) */
+ OffsetNumber replaceitemoff; /* offset of the posting item to replace with (replaceitem) */
} xl_btree_split;
-#define SizeOfBtreeSplit (offsetof(xl_btree_split, newitemoff) + sizeof(OffsetNumber))
+#define SizeOfBtreeSplit (offsetof(xl_btree_split, replaceitemoff) + sizeof(OffsetNumber))
/*
* This is what we need to know about delete of individual leaf index tuples.
@@ -173,10 +177,19 @@ typedef struct xl_btree_vacuum
{
BlockNumber lastBlockVacuumed;
- /* TARGET OFFSET NUMBERS FOLLOW */
+ /*
+ * This field helps us to find beginning of the remaining tuples from
+ * postings which follow array of offset numbers.
+ */
+ uint32 nremaining;
+ uint32 ndeleted;
+
+ /* REMAINING OFFSET NUMBERS FOLLOW (nremaining values) */
+ /* REMAINING TUPLES TO INSERT FOLLOW (if nremaining > 0) */
+ /* TARGET OFFSET NUMBERS FOLLOW (if any) */
} xl_btree_vacuum;
-#define SizeOfBtreeVacuum (offsetof(xl_btree_vacuum, lastBlockVacuumed) + sizeof(BlockNumber))
+#define SizeOfBtreeVacuum (offsetof(xl_btree_vacuum, ndeleted) + sizeof(uint32))
/*
* This is what we need to know about marking an empty branch for deletion.
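The t_tid encoding described in the nbtree.h comments of the patch above can be illustrated outside of PostgreSQL. What follows is a minimal sketch under stated assumptions, not the patch's code: SketchTidField and the sketch_* functions are hypothetical, only the two mask values are copied from the patch, and the additional INDEX_ALT_TID_MASK check on t_info is left out. The last function reproduces the arithmetic of MaxPostingIndexTuplesPerPage with the sizes passed in explicitly.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_N_POSTING_MASK 0x0FFF    /* same value as BT_N_POSTING_OFFSET_MASK */
#define SKETCH_IS_POSTING     0x2000    /* same value as BT_IS_POSTING */

/* Hypothetical stand-in for the t_tid field of an index tuple. */
typedef struct SketchTidField
{
    uint32_t    block;      /* posting tuple: byte offset of the posting list */
    uint16_t    offset;     /* posting tuple: status bits plus number of TIDs */
} SketchTidField;

/* Store the TID count and the posting list offset, and mark as posting. */
static void
sketch_set_posting_meta(SketchTidField *tid, unsigned nposting,
                        uint32_t list_offset)
{
    tid->offset = (uint16_t) ((nposting & SKETCH_N_POSTING_MASK) |
                              SKETCH_IS_POSTING);
    tid->block = list_offset;
}

static int
sketch_is_posting(const SketchTidField *tid)
{
    return (tid->offset & SKETCH_IS_POSTING) != 0;
}

static unsigned
sketch_get_nposting(const SketchTidField *tid)
{
    return tid->offset & SKETCH_N_POSTING_MASK;
}

/*
 * Upper bound on TIDs per leaf page, following the shape of
 * MaxPostingIndexTuplesPerPage: reserve room for three minimal tuples and
 * their line pointers, then fill the rest of the page with TIDs.
 */
static int
sketch_max_posting_tids(int blcksz, int page_header, int min_tuple,
                        int item_id, int tid_size)
{
    return (blcksz - page_header - 3 * (min_tuple + item_id)) / tid_size;
}
```

Assuming a default build (8192-byte blocks, 24-byte page header, 16-byte minimal tuple, 4-byte line pointer, 6-byte TID), the bound works out to (8192 - 24 - 3 * 20) / 6 = 1351, which fits in the 12-bit count field whose maximum is 4095.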
On Wed, Aug 21, 2019 at 10:19 AM Anastasia Lubennikova
<a.lubennikova@postgrespro.ru> wrote:
I'm going to look through the patch once more to update nbtxlog
comments, where needed and
answer to your remarks that are still left in the comments.
Have you been using amcheck's rootdescend verification? I see this
problem with v8, with the TPC-H test data:
DEBUG: finished verifying presence of 1500000 tuples from table
"customer" with bitset 51.09% set
ERROR: could not find tuple using search from root page in index
"idx_customer_nationkey2"
I've been running my standard amcheck query with these databases, which is:
SELECT bt_index_parent_check(index => c.oid, heapallindexed => true,
rootdescend => true),
c.relname,
c.relpages
FROM pg_index i
JOIN pg_opclass op ON i.indclass[0] = op.oid
JOIN pg_am am ON op.opcmethod = am.oid
JOIN pg_class c ON i.indexrelid = c.oid
JOIN pg_namespace n ON c.relnamespace = n.oid
WHERE am.amname = 'btree'
AND c.relpersistence != 't'
AND c.relkind = 'i' AND i.indisready AND i.indisvalid
ORDER BY c.relpages DESC;
There were many large indexes that amcheck didn't detect a problem
with. I don't yet understand what the problem is, or why we only see
the problem for a small number of indexes. Note that all of these
indexes passed verification with v5, so this is some kind of
regression.
I also noticed that there were some regressions in the size of indexes
-- indexes were not nearly as small as they were in v5 in some cases.
The overall picture was a clear regression in how effective
deduplication is.
I think that it would save time if you had direct access to my test
data, even though it's a bit cumbersome. You'll have to download about
10GB of dumps, which require plenty of disk space when restored:
regression=# \l+
                                                        List of databases
    Name    | Owner | Encoding |  Collate   |   Ctype    | Access privileges |  Size   | Tablespace |                Description
------------+-------+----------+------------+------------+-------------------+---------+------------+--------------------------------------------
 land       | pg    | UTF8     | en_US.UTF8 | en_US.UTF8 |                   | 6425 MB | pg_default |
 mgd        | pg    | UTF8     | en_US.UTF8 | en_US.UTF8 |                   | 61 GB   | pg_default |
 postgres   | pg    | UTF8     | en_US.UTF8 | en_US.UTF8 |                   | 7753 kB | pg_default | default administrative connection database
 regression | pg    | UTF8     | en_US.UTF8 | en_US.UTF8 |                   | 886 MB  | pg_default |
 template0  | pg    | UTF8     | en_US.UTF8 | en_US.UTF8 | =c/pg            +| 7609 kB | pg_default | unmodifiable empty database
            |       |          |            |            | pg=CTc/pg         |         |            |
 template1  | pg    | UTF8     | en_US.UTF8 | en_US.UTF8 | =c/pg            +| 7609 kB | pg_default | default template for new databases
            |       |          |            |            | pg=CTc/pg         |         |            |
 tpcc       | pg    | UTF8     | en_US.UTF8 | en_US.UTF8 |                   | 10 GB   | pg_default |
 tpce       | pg    | UTF8     | en_US.UTF8 | en_US.UTF8 |                   | 26 GB   | pg_default |
 tpch       | pg    | UTF8     | en_US.UTF8 | en_US.UTF8 |                   | 32 GB   | pg_default |
(9 rows)
I have found it very valuable to use this test data when changing
nbtsplitloc.c, or anything that could affect where page splits make
free space available. If this is too much data to handle conveniently,
then you could skip "mgd" and still have almost as much test coverage. There
really does seem to be a benefit to using diverse test cases like
this, because sometimes regressions only affect a small number of
specific indexes for specific reasons. For example, only TPC-H has a
small number of indexes that have tuples that are inserted in order,
but also have many duplicates. Removing the BT_COMPRESS_THRESHOLD
stuff really helped with those indexes.
Want me to send this data and the associated test script over to you?
--
Peter Geoghegan
23.08.2019 7:33, Peter Geoghegan wrote:
On Wed, Aug 21, 2019 at 10:19 AM Anastasia Lubennikova
<a.lubennikova@postgrespro.ru> wrote:
I'm going to look through the patch once more to update nbtxlog
comments, where needed and
answer to your remarks that are still left in the comments.
Have you been using amcheck's rootdescend verification?
No, I haven't checked it with the latest version yet.
There were many large indexes that amcheck didn't detect a problem
with. I don't yet understand what the problem is, or why we only see
the problem for a small number of indexes. Note that all of these
indexes passed verification with v5, so this is some kind of
regression.
I also noticed that there were some regressions in the size of indexes
-- indexes were not nearly as small as they were in v5 in some cases.
The overall picture was a clear regression in how effective
deduplication is.
Do these indexes have something in common? Maybe some specific workload?
Are there any error messages in the log?
I'd like to pin down what caused the problem.
There were several major changes between v5 and v8:
- dead tuples handling added in v6;
- _bt_split changes for posting tuples in v7;
- WAL logging of posting tuple changes in v8.
I don't think the last one could break regular indexes on master.
Do you see the same regression in v6, v7?
I think that it would save time if you had direct access to my test
data, even though it's a bit cumbersome. You'll have to download about
10GB of dumps, which require plenty of disk space when restored:
Want me to send this data and the associated test script over to you?
Yes, I think it will help me to debug the patch faster.
--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On Fri, Aug 16, 2019 at 8:56 AM Anastasia Lubennikova
<a.lubennikova@postgrespro.ru> wrote:
Now the algorithm is the following:
- In case page split is needed, pass both tuples to _bt_split().
_bt_findsplitloc() is now aware of upcoming replacement of origtup with
neworigtup, so it uses correct item size where needed.
It seems that now all replace operations are crash-safe. The new patch passes
all regression tests, so I think it's ready for review again.
I think that the way this works within nbtsplitloc.c is too
complicated. In v5, the only thing that nbtsplitloc.c knew about
deduplication was that it could be sure that suffix truncation would
at least make a posting list into a single heap TID in the worst case.
This consideration was mostly about suffix truncation, not
deduplication, which seemed like a good thing to me. _bt_split() and
_bt_findsplitloc() should know as little as possible about posting
lists.
Obviously it will sometimes be necessary to deal with the case where a
posting list is about to become too big (i.e. it's about to go over
BTMaxItemSize()), and so must be split. Less often, a page split will
be needed because of one of these posting list splits. These are two
complicated areas (posting list splits and page splits), and it would
be a good idea to find a way to separate them as much as possible.
Remember, nbtsplitloc.c works by pretending that the new item that
cannot fit on the page is already on its own imaginary version of the
page that *can* fit the new item, along with everything else from the
original/actual page. That gets *way* too complicated when it has to
deal with the fact that the new item is being merged with an existing
item. Perhaps nbtsplitloc.c could also "pretend" that the new item is
always a plain tuple, without knowing anything about posting lists.
Almost like how it worked in v5.
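To make the "imaginary page" accounting concrete, here is a minimal standalone sketch in C. It is a toy model rather than the nbtsplitloc.c code: `SplitUsage` and `split_usage()` are hypothetical names, only item sizes are tracked, and line pointers, the high key and fillfactor are ignored. The point is that the new item is just a size at an offset, so the accounting never needs to know whether posting lists are involved.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes that would land on each half for one candidate split point. */
typedef struct
{
    size_t left;
    size_t right;
} SplitUsage;

/*
 * itemsz[0..n-1] are the sizes of the items already on the page.
 * The new item (newsz bytes) is treated as if it were already placed at
 * position newoff on an imaginary page holding n + 1 items.
 * Items at imaginary positions [0, split) go left, the rest go right.
 */
static SplitUsage
split_usage(const size_t *itemsz, int n, int newoff, size_t newsz, int split)
{
    SplitUsage u = {0, 0};

    assert(newoff >= 0 && newoff <= n);
    assert(split >= 0 && split <= n + 1);

    for (int i = 0; i <= n; i++)
    {
        size_t sz;

        if (i < newoff)
            sz = itemsz[i];          /* item before the new item */
        else if (i == newoff)
            sz = newsz;              /* the new item itself */
        else
            sz = itemsz[i - 1];      /* item shifted right by the new item */

        if (i < split)
            u.left += sz;
        else
            u.right += sz;
    }
    return u;
}
```

If the caller passes the size of a plain tuple as `newsz`, a large posting list elsewhere on the page is just another entry in `itemsz`.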
We always want posting lists to be as close to the BTMaxItemSize()
size as possible, because that helps with space utilization. In v5 of
the patch, this was what happened, because, in effect, we didn't try
to do anything complicated with the new item. This worked well, apart
from the crash safety issue. Maybe we can simulate the v5 approach,
giving us the best of all worlds (good space utilization, simplicity,
and crash safety). Something like this:
* Posting list splits should always result in one posting list that is
at or just under BTMaxItemSize() in size, plus one plain tuple to its
immediate right on the page. This is similar to the more common case
where we cannot add additional tuples to a posting list due to the
BTMaxItemSize() restriction, and so end up with a single tuple (or a
smaller posting list with the same value) to the right of a
BTMaxItemSize()-sized posting list tuple. I don't see a reason to
split a posting list in the middle -- we should always split to the
right, leaving the posting list as large as possible.
* When there is a simple posting list split, with no page split, the
logic required is fairly straightforward: We rewrite the posting list
in-place so that our new item goes wherever it belongs in the existing
posting list on the page (we memmove() the posting list to make space
for the new TID, basically). The old last/rightmost TID in the
original posting list becomes a new, plain tuple. We may need a new
WAL record for this, but it's not that different to a regular leaf
page insert.
* When this happens to result in a page split, we then have a "fake"
new item -- the right half of the posting list that we split, which is
always a plain item. Obviously we need to be a bit careful with the
WAL logging, but the space accounting within _bt_split() and
_bt_findsplitloc() can work just the same as now. nbtsplitloc.c can
work like it did in v5, when the only thing it knew about posting
lists was that _bt_truncate() always removes them, maybe leaving a
single TID behind in the new high key. (Note also that it's not okay
to remove the conservative assumption about at least having space for
one heap TID within _bt_recsplitloc() -- that needs to be restored to
its v5 state in the next version of the patch.)
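The in-place posting list split described in the bullets above could be sketched roughly as follows. This is a standalone toy model, not patch code: `Tid`, `tid_cmp()` and `posting_split()` are hypothetical names, and a plain sorted array stands in for the on-page posting list. It assumes the new TID sorts before the current rightmost TID, which is the case that makes a split necessary.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Simplified stand-in for a heap TID: (block number, offset number). */
typedef struct
{
    uint32_t block;
    uint16_t offset;
} Tid;

/* Three-way comparison of two TIDs, block number first. */
static int
tid_cmp(const Tid *a, const Tid *b)
{
    if (a->block != b->block)
        return a->block < b->block ? -1 : 1;
    if (a->offset != b->offset)
        return a->offset < b->offset ? -1 : 1;
    return 0;
}

/*
 * Insert newtid into the sorted posting list posting[0..n-1] in place,
 * keeping the list at exactly n entries. The old rightmost TID is pushed
 * out and returned, so the caller can turn it into a plain tuple that goes
 * immediately to the right of the posting list.
 */
static Tid
posting_split(Tid *posting, int n, Tid newtid)
{
    Tid displaced;
    int pos = 0;

    assert(n > 0);
    assert(tid_cmp(&newtid, &posting[n - 1]) < 0);

    displaced = posting[n - 1];

    /* Find the first entry that does not sort before newtid. */
    while (tid_cmp(&posting[pos], &newtid) < 0)
        pos++;

    /* Shift posting[pos..n-2] one slot right, overwriting the old last TID. */
    memmove(&posting[pos + 1], &posting[pos],
            (size_t) (n - 1 - pos) * sizeof(Tid));
    posting[pos] = newtid;

    return displaced;
}
```

Because the posting list keeps the same number of TIDs, the posting list tuple keeps its size, and the only item that needs new space on the page is the plain tuple built from the returned TID.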
Because deduplication is lazy, there is little value in doing
deduplication of the new item (which may or may not be the fake new
item). The nbtsplitloc.c logic will "trap" duplicates on the same page
today, so we can just let deduplication of the new item happen at a
later time. _bt_split() can almost pretend that posting lists don't
exist, and nbtsplitloc.c needs to know nothing about posting lists
(apart from the way that _bt_truncate() behaves with posting lists).
We "lie" to _bt_findsplitloc(), and tell it that the new item is our
fake new item -- it doesn't do anything that will be broken by that
lie, because it doesn't care about the actual content of posting
lists. And, we can fix the "fake new item is not actually real new
item" issue at one point within _bt_split(), just as we're about to
WAL log.
What do you think of that approach?
--
Peter Geoghegan
28.08.2019 6:19, Peter Geoghegan wrote:
What do you think of that approach?
I think it's a good idea. Thank you for such a detailed description of the
various cases. I had already started to simplify this code while debugging the
amcheck error in v8. At first, I rewrote it to split a posting tuple into a
posting tuple and a regular tuple instead of two posting tuples.
Your explanation helped me understand that this approach can be extended to
the case of insertion into a posting list that doesn't trigger a posting
split, and that nbtsplitloc indeed doesn't need to know about posting tuple
specifics.
The code is much cleaner now.
The new version is attached. It passes regression tests. I also ran the land
and tpch tests. They pass amcheck rootdescend, and if I interpreted the results
correctly, the new version shows slightly better compression.
\l+
tpch | anastasia | UTF8 | ru_RU.UTF-8 | ru_RU.UTF-8 | | 31 GB   | pg_default |
land | anastasia | UTF8 | ru_RU.UTF-8 | ru_RU.UTF-8 | | 6380 MB | pg_default |
Some individual indexes are larger, some are smaller compared to the
expected output.
This patch is based on v6, so it again contains "compression" instead of
"deduplication" in variable names and comments. I will rename them when the
code becomes more stable.
--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachments:
v9-0001-Compression-deduplication-in-nbtree.patch (text/x-patch)
diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index 05e7d67..504bca2 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -924,6 +924,7 @@ bt_target_page_check(BtreeCheckState *state)
size_t tupsize;
BTScanInsert skey;
bool lowersizelimit;
+ ItemPointer scantid;
CHECK_FOR_INTERRUPTS();
@@ -994,29 +995,73 @@ bt_target_page_check(BtreeCheckState *state)
/*
* Readonly callers may optionally verify that non-pivot tuples can
- * each be found by an independent search that starts from the root
+ * each be found by an independent search that starts from the root.
+ * Note that we deliberately don't do individual searches for each
+ * "logical" posting list tuple, since the posting list itself is
+ * validated by other checks.
*/
if (state->rootdescend && P_ISLEAF(topaque) &&
!bt_rootdescend(state, itup))
{
char *itid,
*htid;
+ ItemPointer tid = BTreeTupleGetHeapTID(itup);
itid = psprintf("(%u,%u)", state->targetblock, offset);
htid = psprintf("(%u,%u)",
- ItemPointerGetBlockNumber(&(itup->t_tid)),
- ItemPointerGetOffsetNumber(&(itup->t_tid)));
+ ItemPointerGetBlockNumber(tid),
+ ItemPointerGetOffsetNumber(tid));
ereport(ERROR,
(errcode(ERRCODE_INDEX_CORRUPTED),
errmsg("could not find tuple using search from root page in index \"%s\"",
RelationGetRelationName(state->rel)),
- errdetail_internal("Index tid=%s points to heap tid=%s page lsn=%X/%X.",
+ errdetail_internal("Index tid=%s min heap tid=%s page lsn=%X/%X.",
itid, htid,
(uint32) (state->targetlsn >> 32),
(uint32) state->targetlsn)));
}
+ /*
+ * If tuple is actually a posting list, make sure posting list TIDs
+ * are in order.
+ */
+ if (BTreeTupleIsPosting(itup))
+ {
+ ItemPointerData last;
+ ItemPointer current;
+
+ ItemPointerCopy(BTreeTupleGetHeapTID(itup), &last);
+
+ for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
+ {
+
+ current = BTreeTupleGetPostingN(itup, i);
+
+ if (ItemPointerCompare(current, &last) <= 0)
+ {
+ char *itid,
+ *htid;
+
+ itid = psprintf("(%u,%u)", state->targetblock, offset);
+ htid = psprintf("(%u,%u)",
+ ItemPointerGetBlockNumberNoCheck(current),
+ ItemPointerGetOffsetNumberNoCheck(current));
+
+ ereport(ERROR,
+ (errcode(ERRCODE_INDEX_CORRUPTED),
+ errmsg("posting list heap TIDs out of order in index \"%s\"",
+ RelationGetRelationName(state->rel)),
+ errdetail_internal("Index tid=%s min heap tid=%s page lsn=%X/%X.",
+ itid, htid,
+ (uint32) (state->targetlsn >> 32),
+ (uint32) state->targetlsn)));
+ }
+
+ ItemPointerCopy(current, &last);
+ }
+ }
+
/* Build insertion scankey for current page offset */
skey = bt_mkscankey_pivotsearch(state->rel, itup);
@@ -1074,12 +1119,33 @@ bt_target_page_check(BtreeCheckState *state)
{
IndexTuple norm;
- norm = bt_normalize_tuple(state, itup);
- bloom_add_element(state->filter, (unsigned char *) norm,
- IndexTupleSize(norm));
- /* Be tidy */
- if (norm != itup)
- pfree(norm);
+ if (BTreeTupleIsPosting(itup))
+ {
+ IndexTuple onetup;
+
+ /* Fingerprint all elements of posting tuple one by one */
+ for (int i = 0; i < BTreeTupleGetNPosting(itup); i++)
+ {
+ onetup = BTreeGetNthTupleOfPosting(itup, i);
+
+ norm = bt_normalize_tuple(state, onetup);
+ bloom_add_element(state->filter, (unsigned char *) norm,
+ IndexTupleSize(norm));
+ /* Be tidy */
+ if (norm != onetup)
+ pfree(norm);
+ pfree(onetup);
+ }
+ }
+ else
+ {
+ norm = bt_normalize_tuple(state, itup);
+ bloom_add_element(state->filter, (unsigned char *) norm,
+ IndexTupleSize(norm));
+ /* Be tidy */
+ if (norm != itup)
+ pfree(norm);
+ }
}
/*
@@ -1087,7 +1153,8 @@ bt_target_page_check(BtreeCheckState *state)
*
* If there is a high key (if this is not the rightmost page on its
* entire level), check that high key actually is upper bound on all
- * page items.
+ * page items. If this is a posting list tuple, we'll need to set
+ * scantid to be highest TID in posting list.
*
* We prefer to check all items against high key rather than checking
* just the last and trusting that the operator class obeys the
@@ -1127,6 +1194,9 @@ bt_target_page_check(BtreeCheckState *state)
* tuple. (See also: "Notes About Data Representation" in the nbtree
* README.)
*/
+ scantid = skey->scantid;
+ if (!BTreeTupleIsPivot(itup))
+ skey->scantid = BTreeTupleGetMaxTID(itup);
if (!P_RIGHTMOST(topaque) &&
!(P_ISLEAF(topaque) ? invariant_leq_offset(state, skey, P_HIKEY) :
invariant_l_offset(state, skey, P_HIKEY)))
@@ -1150,6 +1220,7 @@ bt_target_page_check(BtreeCheckState *state)
(uint32) (state->targetlsn >> 32),
(uint32) state->targetlsn)));
}
+ skey->scantid = scantid;
/*
* * Item order check *
@@ -1164,11 +1235,13 @@ bt_target_page_check(BtreeCheckState *state)
*htid,
*nitid,
*nhtid;
+ ItemPointer tid;
itid = psprintf("(%u,%u)", state->targetblock, offset);
+ tid = BTreeTupleGetHeapTID(itup);
htid = psprintf("(%u,%u)",
- ItemPointerGetBlockNumberNoCheck(&(itup->t_tid)),
- ItemPointerGetOffsetNumberNoCheck(&(itup->t_tid)));
+ ItemPointerGetBlockNumberNoCheck(tid),
+ ItemPointerGetOffsetNumberNoCheck(tid));
nitid = psprintf("(%u,%u)", state->targetblock,
OffsetNumberNext(offset));
@@ -1177,9 +1250,11 @@ bt_target_page_check(BtreeCheckState *state)
state->target,
OffsetNumberNext(offset));
itup = (IndexTuple) PageGetItem(state->target, itemid);
+
+ tid = BTreeTupleGetHeapTID(itup);
nhtid = psprintf("(%u,%u)",
- ItemPointerGetBlockNumberNoCheck(&(itup->t_tid)),
- ItemPointerGetOffsetNumberNoCheck(&(itup->t_tid)));
+ ItemPointerGetBlockNumberNoCheck(tid),
+ ItemPointerGetOffsetNumberNoCheck(tid));
ereport(ERROR,
(errcode(ERRCODE_INDEX_CORRUPTED),
@@ -1189,10 +1264,10 @@ bt_target_page_check(BtreeCheckState *state)
"higher index tid=%s (points to %s tid=%s) "
"page lsn=%X/%X.",
itid,
- P_ISLEAF(topaque) ? "heap" : "index",
+ P_ISLEAF(topaque) ? "min heap" : "index",
htid,
nitid,
- P_ISLEAF(topaque) ? "heap" : "index",
+ P_ISLEAF(topaque) ? "min heap" : "index",
nhtid,
(uint32) (state->targetlsn >> 32),
(uint32) state->targetlsn)));
@@ -1953,10 +2028,11 @@ bt_tuple_present_callback(Relation index, HeapTuple htup, Datum *values,
* verification. In particular, it won't try to normalize opclass-equal
* datums with potentially distinct representations (e.g., btree/numeric_ops
* index datums will not get their display scale normalized-away here).
- * Normalization may need to be expanded to handle more cases in the future,
- * though. For example, it's possible that non-pivot tuples could in the
- * future have alternative logically equivalent representations due to using
- * the INDEX_ALT_TID_MASK bit to implement intelligent deduplication.
+ * Caller does normalization for non-pivot tuples that have their own posting
+ * list, since dummy CREATE INDEX callback code generates new tuples with the
+ * same normalized representation. Compression is performed
+ * opportunistically, and in general there is no guarantee about how or when
+ * compression will be applied.
*/
static IndexTuple
bt_normalize_tuple(BtreeCheckState *state, IndexTuple itup)
@@ -2560,14 +2636,16 @@ static inline ItemPointer
BTreeTupleGetHeapTIDCareful(BtreeCheckState *state, IndexTuple itup,
bool nonpivot)
{
- ItemPointer result = BTreeTupleGetHeapTID(itup);
+ ItemPointer result;
BlockNumber targetblock = state->targetblock;
- if (result == NULL && nonpivot)
+ if (BTreeTupleIsPivot(itup) == nonpivot)
ereport(ERROR,
(errcode(ERRCODE_INDEX_CORRUPTED),
errmsg("block %u or its right sibling block or child block in index \"%s\" contains non-pivot tuple that lacks a heap TID",
targetblock, RelationGetRelationName(state->rel))));
+ result = BTreeTupleGetHeapTID(itup);
+
return result;
}
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index b84bf1c..1751133 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -47,15 +47,17 @@ static void _bt_insertonpg(Relation rel, BTScanInsert itup_key,
BTStack stack,
IndexTuple itup,
OffsetNumber newitemoff,
- bool split_only_page);
+ bool split_only_page, int in_posting_offset);
static Buffer _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf,
Buffer cbuf, OffsetNumber newitemoff, Size newitemsz,
- IndexTuple newitem);
+ IndexTuple newitem, IndexTuple neworigtup);
static void _bt_insert_parent(Relation rel, Buffer buf, Buffer rbuf,
BTStack stack, bool is_root, bool is_only);
static bool _bt_pgaddtup(Page page, Size itemsize, IndexTuple itup,
OffsetNumber itup_off);
static void _bt_vacuum_one_page(Relation rel, Buffer buffer, Relation heapRel);
+static void insert_itupprev_to_page(Page page, BTCompressState *compressState);
+static void _bt_compress_one_page(Relation rel, Buffer buffer, Relation heapRel);
/*
* _bt_doinsert() -- Handle insertion of a single index tuple in the tree.
@@ -297,10 +299,12 @@ top:
* search bounds established within _bt_check_unique when insertion is
* checkingunique.
*/
+ insertstate.in_posting_offset = 0;
newitemoff = _bt_findinsertloc(rel, &insertstate, checkingunique,
stack, heapRel);
- _bt_insertonpg(rel, itup_key, insertstate.buf, InvalidBuffer, stack,
- itup, newitemoff, false);
+
+ _bt_insertonpg(rel, itup_key, insertstate.buf, InvalidBuffer,
+ stack, itup, newitemoff, false, insertstate.in_posting_offset);
}
else
{
@@ -435,6 +439,7 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
/* okay, we gotta fetch the heap tuple ... */
curitup = (IndexTuple) PageGetItem(page, curitemid);
+ Assert(!BTreeTupleIsPosting(curitup));
htid = curitup->t_tid;
/*
@@ -759,6 +764,26 @@ _bt_findinsertloc(Relation rel,
_bt_vacuum_one_page(rel, insertstate->buf, heapRel);
insertstate->bounds_valid = false;
}
+
+ /*
+ * If the target page is full, try to compress the page
+ */
+ if (PageGetFreeSpace(page) < insertstate->itemsz && !checkingunique)
+ {
+ _bt_compress_one_page(rel, insertstate->buf, heapRel);
+ insertstate->bounds_valid = false; /* paranoia */
+
+ /*
+ * FIXME: _bt_vacuum_one_page() won't have cleared the
+ * BTP_HAS_GARBAGE flag when it didn't kill items. Maybe we
+ * should clear the BTP_HAS_GARBAGE flag bit from the page when
+ * compression avoids a page split -- _bt_vacuum_one_page() is
+ * expecting a page split that takes care of it.
+ *
+ * (On the other hand, maybe it doesn't matter very much. A
+ * comment update seems like the bare minimum we should do.)
+ */
+ }
}
else
{
@@ -900,6 +925,75 @@ _bt_stepright(Relation rel, BTInsertState insertstate, BTStack stack)
insertstate->bounds_valid = false;
}
+
+/*
+ * Replace tuple on newitemoff offset with neworigtup,
+ * and insert newitup right after it.
+ *
+ * It's essential to do this atomically to be crash safe.
+ *
+ * NOTE All checks of free space must be done before calling this function.
+ *
+ * For use in posting tuple's update.
+ */
+void
+_bt_replace_and_insert(Buffer buf,
+ Page page,
+ IndexTuple neworigtup, IndexTuple newitup,
+ OffsetNumber newitemoff, bool need_xlog)
+{
+ Size newitupsz = IndexTupleSize(newitup);
+ IndexTuple origtup = (IndexTuple) PageGetItem(page,
+ PageGetItemId(page, newitemoff));
+
+ Assert(BTreeTupleIsPosting(origtup));
+ Assert(BTreeTupleIsPosting(neworigtup));
+ Assert(!BTreeTupleIsPosting(newitup));
+ Assert(MAXALIGN(IndexTupleSize(origtup)) == MAXALIGN(IndexTupleSize(neworigtup)));
+
+ newitupsz = MAXALIGN(newitupsz);
+
+ START_CRIT_SECTION();
+
+ /*
+ * Since we always replace posting tuple with tuple of same size
+ * (only the posting list may change), we can do a simple in-place update.
+ */
+ memcpy(origtup, neworigtup, MAXALIGN(IndexTupleSize(neworigtup)));
+
+ if (!_bt_pgaddtup(page, newitupsz, newitup, OffsetNumberNext(newitemoff)))
+ elog(ERROR, "failed to insert compressed item in index");
+
+ if (BufferIsValid(buf))
+ {
+ MarkBufferDirty(buf);
+
+ /* Xlog stuff */
+ if (need_xlog)
+ {
+ xl_btree_insert xlrec;
+ XLogRecPtr recptr;
+
+ xlrec.offnum = newitemoff;
+ xlrec.origtup_off = MAXALIGN(IndexTupleSize(newitup));
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, SizeOfBtreeInsert);
+
+ Assert(P_ISLEAF((BTPageOpaque) PageGetSpecialPointer(page)));
+
+ XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
+ XLogRegisterBufData(0, (char *) newitup, MAXALIGN(IndexTupleSize(newitup)));
+ XLogRegisterBufData(0, (char *) neworigtup, MAXALIGN(IndexTupleSize(neworigtup)));
+
+ recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_INSERT_LEAF);
+
+ PageSetLSN(page, recptr);
+ }
+ }
+ END_CRIT_SECTION();
+}
+
/*----------
* _bt_insertonpg() -- Insert a tuple on a particular page in the index.
*
@@ -936,11 +1030,13 @@ _bt_insertonpg(Relation rel,
BTStack stack,
IndexTuple itup,
OffsetNumber newitemoff,
- bool split_only_page)
+ bool split_only_page,
+ int in_posting_offset)
{
Page page;
BTPageOpaque lpageop;
Size itemsz;
+ IndexTuple neworigtup = NULL;
page = BufferGetPage(buf);
lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -964,6 +1060,120 @@ _bt_insertonpg(Relation rel,
itemsz = MAXALIGN(itemsz); /* be safe, PageAddItem will do this but we
* need to be consistent */
+ if (in_posting_offset)
+ {
+ /* get old posting tuple */
+ ItemId itemid = PageGetItemId(page, newitemoff);
+ int nipd;
+ IndexTuple origtup;
+ char *src;
+ char *dest;
+ size_t ntocopy;
+
+ origtup = (IndexTuple) PageGetItem(page, itemid);
+ Assert(BTreeTupleIsPosting(origtup));
+ nipd = BTreeTupleGetNPosting(origtup);
+ Assert(in_posting_offset < nipd);
+ Assert(itup_key->scantid != NULL);
+ Assert(itup_key->heapkeyspace);
+
+ elog(DEBUG4, "origtup (%u,%u) is min, (%u,%u) is max, (%u,%u) is new",
+ ItemPointerGetBlockNumberNoCheck(BTreeTupleGetHeapTID(origtup)),
+ ItemPointerGetOffsetNumberNoCheck(BTreeTupleGetHeapTID(origtup)),
+ ItemPointerGetBlockNumberNoCheck(BTreeTupleGetMaxTID(origtup)),
+ ItemPointerGetOffsetNumberNoCheck(BTreeTupleGetMaxTID(origtup)),
+ ItemPointerGetBlockNumberNoCheck(BTreeTupleGetMaxTID(itup)),
+ ItemPointerGetOffsetNumberNoCheck(BTreeTupleGetMaxTID(itup)));
+
+ /* generate neworigtup */
+
+ /*
+ * Handle corner cases (1)
+ * - itup TID is smaller than leftmost origtup TID
+ */
+ if (ItemPointerCompare(BTreeTupleGetHeapTID(itup),
+ BTreeTupleGetHeapTID(origtup)) < 0)
+ {
+ in_posting_offset = InvalidOffsetNumber;
+ newitemoff = OffsetNumberPrev(newitemoff); //TODO Is it needed?
+ elog(DEBUG4, "itup is to the left of origtup newitemoff %u", newitemoff);
+ }
+ /*
+ * Handle corner cases (2)
+ * - itup TID is larger than rightmost origtup TID
+ */
+ else if (ItemPointerCompare(BTreeTupleGetMaxTID(origtup),
+ BTreeTupleGetHeapTID(itup)) < 0)
+ {
+ /* do nothing */
+ in_posting_offset = InvalidOffsetNumber;
+ //newitemoff = OffsetNumberNext(newitemoff); //TODO Is it needed?
+ elog(DEBUG4, "itup is to the right of origtup newitemoff %u", newitemoff);
+ }
+ /* Handle insertion into the middle of the posting list */
+ else
+ {
+ neworigtup = CopyIndexTuple(origtup);
+ src = (char *) BTreeTupleGetPostingN(neworigtup, in_posting_offset);
+ dest = (char *) src + sizeof(ItemPointerData);
+ ntocopy = (nipd - in_posting_offset - 1)*sizeof(ItemPointerData);
+
+ elog(DEBUG4, "itup is inside origtup"
+ " nipd %d in_posting_offset %d ntocopy %lu newitemoff %u",
+ nipd, in_posting_offset, ntocopy, newitemoff);
+ elog(DEBUG4, "neworigtup before N %d (%u,%u) to (%u,%u)",
+ BTreeTupleIsPosting(neworigtup)?BTreeTupleGetNPosting(neworigtup):0,
+ ItemPointerGetBlockNumberNoCheck(BTreeTupleGetHeapTID(neworigtup)),
+ ItemPointerGetOffsetNumberNoCheck(BTreeTupleGetHeapTID(neworigtup)),
+ ItemPointerGetBlockNumberNoCheck(BTreeTupleGetMaxTID(neworigtup)),
+ ItemPointerGetOffsetNumberNoCheck(BTreeTupleGetMaxTID(neworigtup)));
+
+ elog(DEBUG4, "itup before (%u,%u)",
+ ItemPointerGetBlockNumberNoCheck(BTreeTupleGetHeapTID(itup)),
+ ItemPointerGetOffsetNumberNoCheck(BTreeTupleGetHeapTID(itup)));
+ elog(DEBUG4, "src before (%u,%u)",
+ ItemPointerGetBlockNumberNoCheck((ItemPointer) src),
+ ItemPointerGetOffsetNumberNoCheck((ItemPointer) src));
+ elog(DEBUG4, "dest before (%u,%u)",
+ ItemPointerGetBlockNumberNoCheck((ItemPointer) dest),
+ ItemPointerGetOffsetNumberNoCheck((ItemPointer) dest));
+ /* move item pointers in posting list to make room for the incoming one */
+ memmove(dest, src, ntocopy);
+
+ /* copy new item pointer to posting list */
+ ItemPointerCopy(&itup->t_tid, (ItemPointer) src);
+
+ /* copy old rightmost item pointer to new tuple, that we're going to insert */
+ ItemPointerCopy(BTreeTupleGetPostingN(origtup, nipd-1), &itup->t_tid);
+
+ elog(DEBUG4, "neworigtup N %d (%u,%u) to (%u,%u)",
+ BTreeTupleIsPosting(neworigtup)?BTreeTupleGetNPosting(neworigtup):0,
+ ItemPointerGetBlockNumberNoCheck(BTreeTupleGetHeapTID(neworigtup)),
+ ItemPointerGetOffsetNumberNoCheck(BTreeTupleGetHeapTID(neworigtup)),
+ ItemPointerGetBlockNumberNoCheck(BTreeTupleGetMaxTID(neworigtup)),
+ ItemPointerGetOffsetNumberNoCheck(BTreeTupleGetMaxTID(neworigtup)));
+
+// for (int i = 0; i < BTreeTupleGetNPosting(neworigtup); i++)
+// {
+// elog(WARNING, "neworigtup item n %d (%u,%u)",
+// i,
+// ItemPointerGetBlockNumberNoCheck(BTreeTupleGetPostingN(neworigtup, i)),
+// ItemPointerGetOffsetNumberNoCheck(BTreeTupleGetPostingN(neworigtup, i)));
+// }
+
+ elog(DEBUG4, "itup (%u,%u)",
+ ItemPointerGetBlockNumberNoCheck(BTreeTupleGetHeapTID(itup)),
+ ItemPointerGetOffsetNumberNoCheck(BTreeTupleGetHeapTID(itup)));
+
+ Assert(!BTreeTupleIsPosting(itup));
+ Assert(ItemPointerCompare(BTreeTupleGetHeapTID(neworigtup),
+ BTreeTupleGetMaxTID(neworigtup)) < 0);
+
+ Assert(ItemPointerCompare(BTreeTupleGetMaxTID(neworigtup),
+ BTreeTupleGetHeapTID(itup)) < 0);
+ }
+ }
+
/*
* Do we need to split the page to fit the item on it?
*
@@ -996,7 +1206,8 @@ _bt_insertonpg(Relation rel,
BlockNumberIsValid(RelationGetTargetBlock(rel))));
/* split the buffer into left and right halves */
- rbuf = _bt_split(rel, itup_key, buf, cbuf, newitemoff, itemsz, itup);
+ rbuf = _bt_split(rel, itup_key, buf, cbuf,
+ newitemoff, itemsz, itup, neworigtup);
PredicateLockPageSplit(rel,
BufferGetBlockNumber(buf),
BufferGetBlockNumber(rbuf));
@@ -1033,70 +1244,152 @@ _bt_insertonpg(Relation rel,
itup_off = newitemoff;
itup_blkno = BufferGetBlockNumber(buf);
- /*
- * If we are doing this insert because we split a page that was the
- * only one on its tree level, but was not the root, it may have been
- * the "fast root". We need to ensure that the fast root link points
- * at or above the current page. We can safely acquire a lock on the
- * metapage here --- see comments for _bt_newroot().
- */
- if (split_only_page)
+ if (neworigtup == NULL)
{
- Assert(!P_ISLEAF(lpageop));
+ /*
+ * If we are doing this insert because we split a page that was the
+ * only one on its tree level, but was not the root, it may have been
+ * the "fast root". We need to ensure that the fast root link points
+ * at or above the current page. We can safely acquire a lock on the
+ * metapage here --- see comments for _bt_newroot().
+ */
+ if (split_only_page)
+ {
+ Assert(!P_ISLEAF(lpageop));
+
+ metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_WRITE);
+ metapg = BufferGetPage(metabuf);
+ metad = BTPageGetMeta(metapg);
+
+ if (metad->btm_fastlevel >= lpageop->btpo.level)
+ {
+ /* no update wanted */
+ _bt_relbuf(rel, metabuf);
+ metabuf = InvalidBuffer;
+ }
+ }
- metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_WRITE);
- metapg = BufferGetPage(metabuf);
- metad = BTPageGetMeta(metapg);
+ /*
+ * Every internal page should have exactly one negative infinity item
+ * at all times. Only _bt_split() and _bt_newroot() should add items
+ * that become negative infinity items through truncation, since
+ * they're the only routines that allocate new internal pages. Do not
+ * allow a retail insertion of a new item at the negative infinity
+ * offset.
+ */
+ if (!P_ISLEAF(lpageop) && newitemoff == P_FIRSTDATAKEY(lpageop))
+ elog(ERROR, "cannot insert second negative infinity item in block %u of index \"%s\"",
+ itup_blkno, RelationGetRelationName(rel));
+
+ /* Do the update. No ereport(ERROR) until changes are logged */
+ START_CRIT_SECTION();
+
+ if (!_bt_pgaddtup(page, itemsz, itup, newitemoff))
+ elog(PANIC, "failed to add new item to block %u in index \"%s\"",
+ itup_blkno, RelationGetRelationName(rel));
+
+ MarkBufferDirty(buf);
- if (metad->btm_fastlevel >= lpageop->btpo.level)
+ if (BufferIsValid(metabuf))
{
- /* no update wanted */
- _bt_relbuf(rel, metabuf);
- metabuf = InvalidBuffer;
+ /* upgrade meta-page if needed */
+ if (metad->btm_version < BTREE_NOVAC_VERSION)
+ _bt_upgrademetapage(metapg);
+ metad->btm_fastroot = itup_blkno;
+ metad->btm_fastlevel = lpageop->btpo.level;
+ MarkBufferDirty(metabuf);
}
- }
- /*
- * Every internal page should have exactly one negative infinity item
- * at all times. Only _bt_split() and _bt_newroot() should add items
- * that become negative infinity items through truncation, since
- * they're the only routines that allocate new internal pages. Do not
- * allow a retail insertion of a new item at the negative infinity
- * offset.
- */
- if (!P_ISLEAF(lpageop) && newitemoff == P_FIRSTDATAKEY(lpageop))
- elog(ERROR, "cannot insert second negative infinity item in block %u of index \"%s\"",
- itup_blkno, RelationGetRelationName(rel));
+ /* clear INCOMPLETE_SPLIT flag on child if inserting a downlink */
+ if (BufferIsValid(cbuf))
+ {
+ Page cpage = BufferGetPage(cbuf);
+ BTPageOpaque cpageop = (BTPageOpaque) PageGetSpecialPointer(cpage);
- /* Do the update. No ereport(ERROR) until changes are logged */
- START_CRIT_SECTION();
+ Assert(P_INCOMPLETE_SPLIT(cpageop));
+ cpageop->btpo_flags &= ~BTP_INCOMPLETE_SPLIT;
+ MarkBufferDirty(cbuf);
+ }
- if (!_bt_pgaddtup(page, itemsz, itup, newitemoff))
- elog(PANIC, "failed to add new item to block %u in index \"%s\"",
- itup_blkno, RelationGetRelationName(rel));
+ /* XLOG stuff */
+ if (RelationNeedsWAL(rel))
+ {
+ xl_btree_insert xlrec;
+ xl_btree_metadata xlmeta;
+ uint8 xlinfo;
+ XLogRecPtr recptr;
- MarkBufferDirty(buf);
+ xlrec.offnum = itup_off;
+ xlrec.origtup_off = 0;
- if (BufferIsValid(metabuf))
- {
- /* upgrade meta-page if needed */
- if (metad->btm_version < BTREE_NOVAC_VERSION)
- _bt_upgrademetapage(metapg);
- metad->btm_fastroot = itup_blkno;
- metad->btm_fastlevel = lpageop->btpo.level;
- MarkBufferDirty(metabuf);
- }
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, SizeOfBtreeInsert);
- /* clear INCOMPLETE_SPLIT flag on child if inserting a downlink */
- if (BufferIsValid(cbuf))
- {
- Page cpage = BufferGetPage(cbuf);
- BTPageOpaque cpageop = (BTPageOpaque) PageGetSpecialPointer(cpage);
+ if (P_ISLEAF(lpageop))
+ xlinfo = XLOG_BTREE_INSERT_LEAF;
+ else
+ {
+ /*
+ * Register the left child whose INCOMPLETE_SPLIT flag was
+ * cleared.
+ */
+ XLogRegisterBuffer(1, cbuf, REGBUF_STANDARD);
+
+ xlinfo = XLOG_BTREE_INSERT_UPPER;
+ }
+
+ if (BufferIsValid(metabuf))
+ {
+ Assert(metad->btm_version >= BTREE_NOVAC_VERSION);
+ xlmeta.version = metad->btm_version;
+ xlmeta.root = metad->btm_root;
+ xlmeta.level = metad->btm_level;
+ xlmeta.fastroot = metad->btm_fastroot;
+ xlmeta.fastlevel = metad->btm_fastlevel;
+ xlmeta.oldest_btpo_xact = metad->btm_oldest_btpo_xact;
+ xlmeta.last_cleanup_num_heap_tuples =
+ metad->btm_last_cleanup_num_heap_tuples;
+
+ XLogRegisterBuffer(2, metabuf, REGBUF_WILL_INIT | REGBUF_STANDARD);
+ XLogRegisterBufData(2, (char *) &xlmeta, sizeof(xl_btree_metadata));
+
+ xlinfo = XLOG_BTREE_INSERT_META;
+ }
+
+ XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
+ XLogRegisterBufData(0, (char *) itup, IndexTupleSize(itup));
+
+ recptr = XLogInsert(RM_BTREE_ID, xlinfo);
+
+ if (BufferIsValid(metabuf))
+ {
+ PageSetLSN(metapg, recptr);
+ }
+ if (BufferIsValid(cbuf))
+ {
+ PageSetLSN(BufferGetPage(cbuf), recptr);
+ }
- Assert(P_INCOMPLETE_SPLIT(cpageop));
- cpageop->btpo_flags &= ~BTP_INCOMPLETE_SPLIT;
- MarkBufferDirty(cbuf);
+ PageSetLSN(page, recptr);
+ }
+ END_CRIT_SECTION();
}
+ else
+ {
+ /*
+ * Insert the new tuple in place of an existing posting tuple:
+ * delete the old posting tuple and insert the updated tuple instead.
+ *
+ * If split was needed, both neworigtup and newrighttup are initialized
+ * and both will be inserted, otherwise newrighttup is NULL.
+ *
+ * This can only happen on a leaf page.
+ */
+ elog(DEBUG4, "_bt_insertonpg. _bt_replace_and_insert %s newitemoff %u",
+ RelationGetRelationName(rel), newitemoff);
+ _bt_replace_and_insert(buf, page, neworigtup,
+ itup, newitemoff, RelationNeedsWAL(rel));
+ }
/*
* Cache the block information if we just inserted into the rightmost
@@ -1107,69 +1400,6 @@ _bt_insertonpg(Relation rel,
if (P_RIGHTMOST(lpageop) && P_ISLEAF(lpageop) && !P_ISROOT(lpageop))
cachedBlock = BufferGetBlockNumber(buf);
- /* XLOG stuff */
- if (RelationNeedsWAL(rel))
- {
- xl_btree_insert xlrec;
- xl_btree_metadata xlmeta;
- uint8 xlinfo;
- XLogRecPtr recptr;
-
- xlrec.offnum = itup_off;
-
- XLogBeginInsert();
- XLogRegisterData((char *) &xlrec, SizeOfBtreeInsert);
-
- if (P_ISLEAF(lpageop))
- xlinfo = XLOG_BTREE_INSERT_LEAF;
- else
- {
- /*
- * Register the left child whose INCOMPLETE_SPLIT flag was
- * cleared.
- */
- XLogRegisterBuffer(1, cbuf, REGBUF_STANDARD);
-
- xlinfo = XLOG_BTREE_INSERT_UPPER;
- }
-
- if (BufferIsValid(metabuf))
- {
- Assert(metad->btm_version >= BTREE_NOVAC_VERSION);
- xlmeta.version = metad->btm_version;
- xlmeta.root = metad->btm_root;
- xlmeta.level = metad->btm_level;
- xlmeta.fastroot = metad->btm_fastroot;
- xlmeta.fastlevel = metad->btm_fastlevel;
- xlmeta.oldest_btpo_xact = metad->btm_oldest_btpo_xact;
- xlmeta.last_cleanup_num_heap_tuples =
- metad->btm_last_cleanup_num_heap_tuples;
-
- XLogRegisterBuffer(2, metabuf, REGBUF_WILL_INIT | REGBUF_STANDARD);
- XLogRegisterBufData(2, (char *) &xlmeta, sizeof(xl_btree_metadata));
-
- xlinfo = XLOG_BTREE_INSERT_META;
- }
-
- XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
- XLogRegisterBufData(0, (char *) itup, IndexTupleSize(itup));
-
- recptr = XLogInsert(RM_BTREE_ID, xlinfo);
-
- if (BufferIsValid(metabuf))
- {
- PageSetLSN(metapg, recptr);
- }
- if (BufferIsValid(cbuf))
- {
- PageSetLSN(BufferGetPage(cbuf), recptr);
- }
-
- PageSetLSN(page, recptr);
- }
-
- END_CRIT_SECTION();
-
/* release buffers */
if (BufferIsValid(metabuf))
_bt_relbuf(rel, metabuf);
@@ -1211,10 +1441,20 @@ _bt_insertonpg(Relation rel,
*
* Returns the new right sibling of buf, pinned and write-locked.
* The pin and lock on buf are maintained.
+ *
+ * TODO improve comment
+ * The real *new* item is already inside neworigtup, in the correct place according to TID order,
+ * and "newitem" contains the rightmost ItemPointerData trimmed from the posting list.
+ * Insertion consists of two steps:
+ * - replace the original item at newitemoff with neworigtup.
+ * This operation doesn't change origtup size, so all calculations
+ * of splitloc remain the same.
+ * - insert newitem right after that, as if we inserted a regular tuple.
*/
static Buffer
_bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
- OffsetNumber newitemoff, Size newitemsz, IndexTuple newitem)
+ OffsetNumber newitemoff, Size newitemsz, IndexTuple newitem,
+ IndexTuple neworigtup)
{
Buffer rbuf;
Page origpage;
@@ -1236,12 +1476,19 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
OffsetNumber firstright;
OffsetNumber maxoff;
OffsetNumber i;
+ OffsetNumber replaceitemoff = InvalidOffsetNumber;
bool newitemonleft,
isleaf;
IndexTuple lefthikey;
int indnatts = IndexRelationGetNumberOfAttributes(rel);
int indnkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
+ if (neworigtup != NULL)
+ {
+ replaceitemoff = newitemoff;
+ newitemoff = OffsetNumberNext(newitemoff);
+ }
+
/*
* origpage is the original page to be split. leftpage is a temporary
* buffer that receives the left-sibling data, which will be copied back
@@ -1340,6 +1587,8 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
itemid = PageGetItemId(origpage, firstright);
itemsz = ItemIdGetLength(itemid);
item = (IndexTuple) PageGetItem(origpage, itemid);
+ if (firstright == replaceitemoff)
+ item = neworigtup;
}
/*
@@ -1373,6 +1622,8 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
Assert(lastleftoff >= P_FIRSTDATAKEY(oopaque));
itemid = PageGetItemId(origpage, lastleftoff);
lastleft = (IndexTuple) PageGetItem(origpage, itemid);
+ if (lastleftoff == replaceitemoff)
+ lastleft = neworigtup;
}
Assert(lastleft != item);
@@ -1480,6 +1731,13 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
itemsz = ItemIdGetLength(itemid);
item = (IndexTuple) PageGetItem(origpage, itemid);
+ /* TODO add comment */
+ if (i == replaceitemoff)
+ {
+ item = neworigtup;
+ Assert(neworigtup != NULL);
+ }
+
/* does new item belong before this one? */
if (i == newitemoff)
{
@@ -1652,6 +1910,7 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
xlrec.level = ropaque->btpo.level;
xlrec.firstright = firstright;
xlrec.newitemoff = newitemoff;
+ xlrec.replaceitemoff = replaceitemoff;
XLogBeginInsert();
XLogRegisterData((char *) &xlrec, SizeOfBtreeSplit);
@@ -1681,6 +1940,10 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
item = (IndexTuple) PageGetItem(origpage, itemid);
XLogRegisterBufData(0, (char *) item, MAXALIGN(IndexTupleSize(item)));
+ if (replaceitemoff)
+ XLogRegisterBufData(0, (char *) neworigtup,
+ MAXALIGN(IndexTupleSize(neworigtup)));
+
/*
* Log the contents of the right page in the format understood by
* _bt_restore_page(). The whole right page will be recreated.
@@ -1835,7 +2098,7 @@ _bt_insert_parent(Relation rel,
/* Recursively insert into the parent */
_bt_insertonpg(rel, NULL, pbuf, buf, stack->bts_parent,
new_item, stack->bts_offset + 1,
- is_only);
+ is_only, InvalidOffsetNumber);
/* be tidy */
pfree(new_item);
@@ -2307,3 +2570,206 @@ _bt_vacuum_one_page(Relation rel, Buffer buffer, Relation heapRel)
* the page.
*/
}
+
+/*
+ * Add new item (compressed or not) to the page, while compressing it.
+ * If insertion fails, elog(ERROR) is raised; nothing is returned.
+ * The item is added to a temporary copy of the page, so a failure
+ * leaves the original page uncompressed.
+ */
+static void
+insert_itupprev_to_page(Page page, BTCompressState *compressState)
+{
+ IndexTuple to_insert;
+ OffsetNumber offnum = PageGetMaxOffsetNumber(page);
+
+ if (compressState->ntuples == 0)
+ to_insert = compressState->itupprev;
+ else
+ {
+ IndexTuple postingtuple;
+
+ /* form a tuple with a posting list */
+ postingtuple = BTreeFormPostingTuple(compressState->itupprev,
+ compressState->ipd,
+ compressState->ntuples);
+ to_insert = postingtuple;
+ pfree(compressState->ipd);
+ }
+
+ /* Add the new item into the page */
+ offnum = OffsetNumberNext(offnum);
+
+ elog(DEBUG4, "insert_itupprev_to_page. compressState->ntuples %d IndexTupleSize %zu free %zu",
+ compressState->ntuples, IndexTupleSize(to_insert), PageGetFreeSpace(page));
+
+ if (PageAddItem(page, (Item) to_insert, IndexTupleSize(to_insert),
+ offnum, false, false) == InvalidOffsetNumber)
+ elog(ERROR, "failed to add tuple to page while compressing it");
+
+ if (compressState->ntuples > 0)
+ pfree(to_insert);
+ compressState->ntuples = 0;
+}
+
+/*
+ * Before splitting the page, try to compress items to free some space.
+ * If compression didn't succeed, buffer will contain old state of the page.
+ * This function should be called after lp_dead items
+ * were removed by _bt_vacuum_one_page().
+ */
+static void
+_bt_compress_one_page(Relation rel, Buffer buffer, Relation heapRel)
+{
+ OffsetNumber offnum,
+ minoff,
+ maxoff;
+ Page page = BufferGetPage(buffer);
+ Page newpage;
+ BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+ bool use_compression = false;
+ BTCompressState *compressState = NULL;
+ int natts = IndexRelationGetNumberOfAttributes(rel);
+ OffsetNumber deletable[MaxOffsetNumber];
+ int ndeletable = 0;
+
+ /*
+ * Don't use compression for indexes with INCLUDEd columns and unique
+ * indexes.
+ */
+ use_compression = (IndexRelationGetNumberOfKeyAttributes(rel) ==
+ IndexRelationGetNumberOfAttributes(rel) &&
+ !rel->rd_index->indisunique);
+ if (!use_compression)
+ return;
+
+ /* init compress state needed to build posting tuples */
+ compressState = (BTCompressState *) palloc0(sizeof(BTCompressState));
+ compressState->ipd = NULL;
+ compressState->ntuples = 0;
+ compressState->itupprev = NULL;
+ compressState->maxitemsize = BTMaxItemSize(page);
+ compressState->maxpostingsize = 0;
+
+ minoff = P_FIRSTDATAKEY(opaque);
+ maxoff = PageGetMaxOffsetNumber(page);
+
+
+ /*
+ * Delete dead tuples if any.
+ * We cannot simply skip them in the loop below, because it's necessary
+ * to generate a special XLOG record containing such tuples to compute
+ * latestRemovedXid on a standby server later.
+ *
+ * This should not affect performance, since it only can happen in a rare
+ * situation when BTP_HAS_GARBAGE flag was not set and _bt_vacuum_one_page
+ * was not called, or _bt_vacuum_one_page didn't remove all dead items.
+ */
+ for (offnum = minoff;
+ offnum <= maxoff;
+ offnum = OffsetNumberNext(offnum))
+ {
+ ItemId itemid = PageGetItemId(page, offnum);
+
+ if (ItemIdIsDead(itemid))
+ deletable[ndeletable++] = offnum;
+ }
+
+ if (ndeletable > 0)
+ _bt_delitems_delete(rel, buffer, deletable, ndeletable, heapRel);
+
+ /*
+ * Scan over all items to see which ones can be compressed
+ */
+ minoff = P_FIRSTDATAKEY(opaque);
+ maxoff = PageGetMaxOffsetNumber(page);
+ newpage = PageGetTempPageCopySpecial(page);
+ elog(DEBUG4, "_bt_compress_one_page rel: %s,blkno: %u",
+ RelationGetRelationName(rel), BufferGetBlockNumber(buffer));
+
+ /* Copy High Key if any */
+ if (!P_RIGHTMOST(opaque))
+ {
+ ItemId itemid = PageGetItemId(page, P_HIKEY);
+ Size itemsz = ItemIdGetLength(itemid);
+ IndexTuple item = (IndexTuple) PageGetItem(page, itemid);
+
+ if (PageAddItem(newpage, (Item) item, itemsz, P_HIKEY,
+ false, false) == InvalidOffsetNumber)
+ elog(ERROR, "failed to add highkey during compression");
+ }
+
+ /*
+ * Iterate over tuples on the page, try to compress them into posting
+ * lists and insert into new page.
+ */
+ for (offnum = minoff;
+ offnum <= maxoff;
+ offnum = OffsetNumberNext(offnum))
+ {
+ ItemId itemId = PageGetItemId(page, offnum);
+ IndexTuple itup = (IndexTuple) PageGetItem(page, itemId);
+
+ if (compressState->itupprev != NULL)
+ {
+ int n_equal_atts =
+ _bt_keep_natts_fast(rel, compressState->itupprev, itup);
+ int itup_ntuples = BTreeTupleIsPosting(itup) ?
+ BTreeTupleGetNPosting(itup) : 1;
+
+ if (n_equal_atts > natts)
+ {
+ /*
+ * When tuples are equal, create or update posting.
+ *
+ * If posting is too big, insert it on page and continue.
+ */
+ if (compressState->maxitemsize >
+ MAXALIGN(((IndexTupleSize(compressState->itupprev)
+ + (compressState->ntuples + itup_ntuples + 1) * sizeof(ItemPointerData)))))
+ {
+ _bt_add_posting_item(compressState, itup);
+ }
+ else
+ {
+ insert_itupprev_to_page(newpage, compressState);
+ }
+ }
+ else
+ {
+ insert_itupprev_to_page(newpage, compressState);
+ }
+ }
+
+ /*
+ * Copy the tuple into temp variable itupprev to compare it with the
+ * following tuple and maybe unite them into a posting tuple
+ */
+ if (compressState->itupprev)
+ pfree(compressState->itupprev);
+ compressState->itupprev = CopyIndexTuple(itup);
+
+ Assert(IndexTupleSize(compressState->itupprev) <= compressState->maxitemsize);
+ }
+
+ /* Handle the last item. */
+ insert_itupprev_to_page(newpage, compressState);
+
+ START_CRIT_SECTION();
+
+ PageRestoreTempPage(newpage, page);
+ MarkBufferDirty(buffer);
+
+ /* Log full page write */
+ if (RelationNeedsWAL(rel))
+ {
+ XLogRecPtr recptr;
+
+ recptr = log_newpage_buffer(buffer, true);
+ PageSetLSN(page, recptr);
+ }
+
+ END_CRIT_SECTION();
+
+ elog(DEBUG4, "_bt_compress_one_page. success");
+}
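Note for readers of the patch: the core of _bt_compress_one_page() above is a single pass over key-ordered items that folds runs of equal keys into one posting entry and starts a new entry when the key changes or the entry is full. Below is a minimal standalone sketch of that idea only. ToyItem, ToyPosting, TOY_MAX_TIDS and toy_compress() are hypothetical names, not part of the patch or of PostgreSQL; the fixed TID cap merely stands in for the BTMaxItemSize()-based size check, and the sketch ignores the high key, existing posting tuples, and WAL logging.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_TIDS 8			/* stand-in for the item size limit */

/* Hypothetical stand-ins for an index tuple and a posting tuple. */
typedef struct
{
	int			key;
	unsigned	tid;
} ToyItem;

typedef struct
{
	int			key;
	size_t		ntids;
	unsigned	tids[TOY_MAX_TIDS];
} ToyPosting;

/*
 * Fold adjacent items with equal keys into posting entries.
 * 'in' must be sorted by (key, tid); 'out' must have room for n entries.
 * Returns the number of entries written to 'out'.
 */
size_t
toy_compress(const ToyItem *in, size_t n, ToyPosting *out)
{
	size_t		nout = 0;

	assert(n == 0 || (in != NULL && out != NULL));

	for (size_t i = 0; i < n; i++)
	{
		ToyPosting *cur = (nout > 0) ? &out[nout - 1] : NULL;

		if (cur != NULL && cur->key == in[i].key &&
			cur->ntids < TOY_MAX_TIDS)
		{
			/* same key and still room: extend the pending posting list */
			cur->tids[cur->ntids++] = in[i].tid;
		}
		else
		{
			/* key changed or posting list is full: start a new entry */
			out[nout].key = in[i].key;
			out[nout].ntids = 1;
			out[nout].tids[0] = in[i].tid;
			nout++;
		}
	}
	return nout;
}
```

Unlike the patch, the sketch writes finished entries directly to the output array instead of keeping a pending itupprev, which is enough to show why a page full of duplicates shrinks to a handful of items.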
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 268f869..fca35a4 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -983,14 +983,52 @@ _bt_page_recyclable(Page page)
void
_bt_delitems_vacuum(Relation rel, Buffer buf,
OffsetNumber *itemnos, int nitems,
+ OffsetNumber *remainingoffset,
+ IndexTuple *remaining, int nremaining,
BlockNumber lastBlockVacuumed)
{
Page page = BufferGetPage(buf);
BTPageOpaque opaque;
+ Size itemsz;
+ Size remaining_sz = 0;
+ char *remaining_buf = NULL;
+
+ /* XLOG stuff: build a buffer holding the remaining tuples */
+ if (nremaining && RelationNeedsWAL(rel))
+ {
+ Size offset = 0;
+
+ for (int i = 0; i < nremaining; i++)
+ remaining_sz += MAXALIGN(IndexTupleSize(remaining[i]));
+
+ remaining_buf = palloc0(remaining_sz);
+ for (int i = 0; i < nremaining; i++)
+ {
+ itemsz = IndexTupleSize(remaining[i]);
+ memcpy(remaining_buf + offset, (char *) remaining[i], itemsz);
+ offset += MAXALIGN(itemsz);
+ }
+ Assert(offset == remaining_sz);
+ }
/* No ereport(ERROR) until changes are logged */
START_CRIT_SECTION();
+ /* Handle posting tuples here */
+ for (int i = 0; i < nremaining; i++)
+ {
+ /* First, delete the old tuple. */
+ PageIndexTupleDelete(page, remainingoffset[i]);
+
+ itemsz = IndexTupleSize(remaining[i]);
+ itemsz = MAXALIGN(itemsz);
+
+ /* Add tuple with remaining ItemPointers to the page. */
+ if (PageAddItem(page, (Item) remaining[i], itemsz, remainingoffset[i],
+ false, false) == InvalidOffsetNumber)
+ elog(PANIC, "failed to rewrite compressed item in index while doing vacuum");
+ }
+
/* Fix the page */
if (nitems > 0)
PageIndexMultiDelete(page, itemnos, nitems);
@@ -1020,6 +1058,8 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
xl_btree_vacuum xlrec_vacuum;
xlrec_vacuum.lastBlockVacuumed = lastBlockVacuumed;
+ xlrec_vacuum.nremaining = nremaining;
+ xlrec_vacuum.ndeleted = nitems;
XLogBeginInsert();
XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
@@ -1033,6 +1073,19 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
if (nitems > 0)
XLogRegisterBufData(0, (char *) itemnos, nitems * sizeof(OffsetNumber));
+ /*
+ * Here we should save offnums and the remaining tuples themselves. It's
+ * important to restore them in the correct order: first handle the
+ * remaining tuples, and only after that the other deleted items.
+ */
+ if (nremaining > 0)
+ {
+ Assert(remaining_buf != NULL);
+ XLogRegisterBufData(0, (char *) remainingoffset,
+ nremaining * sizeof(OffsetNumber));
+ XLogRegisterBufData(0, remaining_buf, remaining_sz);
+ }
+
recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_VACUUM);
PageSetLSN(page, recptr);
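Note for readers of the patch: the "buffer for remainings" code in _bt_delitems_vacuum() above packs variable-length tuples back to back, each starting on a MAXALIGN boundary, so that replay can walk the buffer tuple by tuple. The following is a minimal standalone sketch of that packing step under stated assumptions. toy_pack(), TOY_ALIGNOF and TOY_MAXALIGN are hypothetical names; the 8-byte alignment is an assumption for illustration, and calloc() stands in for palloc0().

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define TOY_ALIGNOF 8			/* assumed maximum alignment */
#define TOY_MAXALIGN(len) \
	(((size_t) (len) + (TOY_ALIGNOF - 1)) & ~((size_t) (TOY_ALIGNOF - 1)))

/*
 * Copy nrec records into one zero-filled buffer, each record starting on
 * an aligned boundary. Returns the allocated buffer (caller frees it)
 * and stores its total size in *total.
 */
char *
toy_pack(const char *const *recs, const size_t *lens, size_t nrec,
		 size_t *total)
{
	size_t		sz = 0;
	size_t		off = 0;
	char	   *buf;

	/* first pass: compute the aligned total size */
	for (size_t i = 0; i < nrec; i++)
		sz += TOY_MAXALIGN(lens[i]);

	buf = calloc(sz > 0 ? sz : 1, 1);
	if (buf == NULL)
		abort();

	/* second pass: copy each record, advancing by its aligned size */
	for (size_t i = 0; i < nrec; i++)
	{
		memcpy(buf + off, recs[i], lens[i]);
		off += TOY_MAXALIGN(lens[i]);
	}
	assert(off == sz);

	*total = sz;
	return buf;
}
```

Zero-filling the buffer keeps the padding bytes between records deterministic, which is the same reason the patch uses palloc0() rather than palloc().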
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 4cfd528..a85c67b 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -97,6 +97,8 @@ static void btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
BTCycleId cycleid, TransactionId *oldestBtpoXact);
static void btvacuumpage(BTVacState *vstate, BlockNumber blkno,
BlockNumber orig_blkno);
+static ItemPointer btreevacuumPosting(BTVacState *vstate, IndexTuple itup,
+ int *nremaining);
/*
@@ -1069,7 +1071,8 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
RBM_NORMAL, info->strategy);
LockBufferForCleanup(buf);
_bt_checkpage(rel, buf);
- _bt_delitems_vacuum(rel, buf, NULL, 0, vstate.lastBlockVacuumed);
+ _bt_delitems_vacuum(rel, buf, NULL, 0, NULL, NULL, 0,
+ vstate.lastBlockVacuumed);
_bt_relbuf(rel, buf);
}
@@ -1193,6 +1196,9 @@ restart:
OffsetNumber offnum,
minoff,
maxoff;
+ IndexTuple remaining[MaxOffsetNumber];
+ OffsetNumber remainingoffset[MaxOffsetNumber];
+ int nremaining;
/*
* Trade in the initial read lock for a super-exclusive write lock on
@@ -1229,6 +1235,7 @@ restart:
* callback function.
*/
ndeletable = 0;
+ nremaining = 0;
minoff = P_FIRSTDATAKEY(opaque);
maxoff = PageGetMaxOffsetNumber(page);
if (callback)
@@ -1242,31 +1249,81 @@ restart:
itup = (IndexTuple) PageGetItem(page,
PageGetItemId(page, offnum));
- htup = &(itup->t_tid);
- /*
- * During Hot Standby we currently assume that
- * XLOG_BTREE_VACUUM records do not produce conflicts. That is
- * only true as long as the callback function depends only
- * upon whether the index tuple refers to heap tuples removed
- * in the initial heap scan. When vacuum starts it derives a
- * value of OldestXmin. Backends taking later snapshots could
- * have a RecentGlobalXmin with a later xid than the vacuum's
- * OldestXmin, so it is possible that row versions deleted
- * after OldestXmin could be marked as killed by other
- * backends. The callback function *could* look at the index
- * tuple state in isolation and decide to delete the index
- * tuple, though currently it does not. If it ever did, we
- * would need to reconsider whether XLOG_BTREE_VACUUM records
- * should cause conflicts. If they did cause conflicts they
- * would be fairly harsh conflicts, since we haven't yet
- * worked out a way to pass a useful value for
- * latestRemovedXid on the XLOG_BTREE_VACUUM records. This
- * applies to *any* type of index that marks index tuples as
- * killed.
- */
- if (callback(htup, callback_state))
- deletable[ndeletable++] = offnum;
+ if (BTreeTupleIsPosting(itup))
+ {
+ int nnewipd = 0;
+ ItemPointer newipd = NULL;
+
+ elog(DEBUG4, "rel %s btreevacuumPosting offnum %u",
+ RelationGetRelationName(vstate->info->index), offnum);
+
+ newipd = btreevacuumPosting(vstate, itup, &nnewipd);
+
+ if (nnewipd == 0)
+ {
+ /*
+ * All TIDs from the posting list must be deleted, so we
+ * can delete the whole tuple in the regular way.
+ */
+ deletable[ndeletable++] = offnum;
+ }
+ else if (nnewipd == BTreeTupleGetNPosting(itup))
+ {
+ /*
+ * All TIDs from posting tuple must remain. Do
+ * nothing, just clean up.
+ */
+ pfree(newipd);
+ }
+ else if (nnewipd < BTreeTupleGetNPosting(itup))
+ {
+ /* Some TIDs from posting tuple must remain. */
+ Assert(nnewipd > 0);
+ Assert(newipd != NULL);
+
+ /*
+ * Form new tuple that contains only remaining TIDs.
+ * Remember this tuple and the offset of the old tuple
+ * to update it in place.
+ */
+ remainingoffset[nremaining] = offnum;
+ remaining[nremaining] = BTreeFormPostingTuple(itup, newipd, nnewipd);
+ nremaining++;
+ pfree(newipd);
+
+ Assert(IndexTupleSize(itup) <= BTMaxItemSize(page));
+ }
+ }
+ else
+ {
+ htup = &(itup->t_tid);
+
+ /*
+ * During Hot Standby we currently assume that
+ * XLOG_BTREE_VACUUM records do not produce conflicts.
+ * That is only true as long as the callback function
+ * depends only upon whether the index tuple refers to
+ * heap tuples removed in the initial heap scan. When
+ * vacuum starts it derives a value of OldestXmin.
+ * Backends taking later snapshots could have a
+ * RecentGlobalXmin with a later xid than the vacuum's
+ * OldestXmin, so it is possible that row versions deleted
+ * after OldestXmin could be marked as killed by other
+ * backends. The callback function *could* look at the
+ * index tuple state in isolation and decide to delete the
+ * index tuple, though currently it does not. If it ever
+ * did, we would need to reconsider whether
+ * XLOG_BTREE_VACUUM records should cause conflicts. If
+ * they did cause conflicts they would be fairly harsh
+ * conflicts, since we haven't yet worked out a way to
+ * pass a useful value for latestRemovedXid on the
+ * XLOG_BTREE_VACUUM records. This applies to *any* type
+ * of index that marks index tuples as killed.
+ */
+ if (callback(htup, callback_state))
+ deletable[ndeletable++] = offnum;
+ }
}
}
@@ -1274,7 +1331,7 @@ restart:
* Apply any needed deletes. We issue just one _bt_delitems_vacuum()
* call per page, so as to minimize WAL traffic.
*/
- if (ndeletable > 0)
+ if (ndeletable > 0 || nremaining > 0)
{
/*
* Notice that the issued XLOG_BTREE_VACUUM WAL record includes
@@ -1291,6 +1348,7 @@ restart:
* that.
*/
_bt_delitems_vacuum(rel, buf, deletable, ndeletable,
+ remainingoffset, remaining, nremaining,
vstate->lastBlockVacuumed);
/*
@@ -1376,6 +1434,47 @@ restart:
}
/*
+ * btreevacuumPosting() -- vacuums a posting tuple.
+ *
+ * Returns new palloc'd posting list with remaining items.
+ * Posting list size is returned via nremaining.
+ *
+ * If all items are dead,
+ * nremaining is 0 and resulting posting list is NULL.
+ */
+static ItemPointer
+btreevacuumPosting(BTVacState *vstate, IndexTuple itup, int *nremaining)
+{
+ int remaining = 0;
+ int nitem = BTreeTupleGetNPosting(itup);
+ ItemPointer tmpitems = NULL,
+ items = BTreeTupleGetPosting(itup);
+
+ /*
+ * Check each TID in the posting list, save live TIDs into tmpitems
+ */
+ for (int i = 0; i < nitem; i++)
+ {
+ elog(DEBUG4, "rel %s btreevacuumPosting i %d, (%u,%u)",
+ RelationGetRelationName(vstate->info->index),
+ i,
+ ItemPointerGetBlockNumberNoCheck((items + i)),
+ ItemPointerGetOffsetNumberNoCheck((items + i)));
+
+ if (vstate->callback(items + i, vstate->callback_state))
+ continue;
+
+ if (tmpitems == NULL)
+ tmpitems = palloc(sizeof(ItemPointerData) * nitem);
+
+ tmpitems[remaining++] = items[i];
+ }
+
+ *nremaining = remaining;
+ return tmpitems;
+}
+
+/*
* btcanreturn() -- Check whether btree indexes support index-only scans.
*
* btrees always do, so this is trivial.
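Note for readers of the patch: btreevacuumPosting() above is a filter over the posting list. It asks the vacuum callback about each TID, copies the survivors into a lazily allocated array, and reports how many remain; the caller then distinguishes "none remain", "all remain", and "some remain". Below is a minimal standalone sketch of that contract. ToyTid, ToyDeadCallback, toy_vacuum_posting() and toy_block_is_dead() are hypothetical names, and malloc() stands in for palloc().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for ItemPointerData. */
typedef struct
{
	unsigned	block;
	unsigned short offset;
} ToyTid;

/* Returns true when the TID points to a dead heap tuple. */
typedef bool (*ToyDeadCallback) (const ToyTid *tid, void *state);

/*
 * Return a newly allocated array with the TIDs that are still alive and
 * store their count in *nremaining. If every TID is dead, the result is
 * NULL and *nremaining is 0. The caller frees the result.
 */
ToyTid *
toy_vacuum_posting(const ToyTid *items, int nitem,
				   ToyDeadCallback is_dead, void *state, int *nremaining)
{
	ToyTid	   *kept = NULL;
	int			n = 0;

	assert(nitem >= 0 && nremaining != NULL);

	for (int i = 0; i < nitem; i++)
	{
		if (is_dead(&items[i], state))
			continue;

		/* allocate only once we know at least one TID survives */
		if (kept == NULL)
		{
			kept = malloc(sizeof(ToyTid) * (size_t) nitem);
			if (kept == NULL)
				abort();
		}
		kept[n++] = items[i];
	}

	*nremaining = n;
	return kept;
}

/* Example callback: every TID on the block given in 'state' is dead. */
bool
toy_block_is_dead(const ToyTid *tid, void *state)
{
	return tid->block == *(const unsigned *) state;
}
```

Because the array is sized for the worst case up front, the survivors keep their original order, which matters since posting lists are kept in TID order.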
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 7f77ed2..72e52bc 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -30,6 +30,9 @@ static bool _bt_readpage(IndexScanDesc scan, ScanDirection dir,
OffsetNumber offnum);
static void _bt_saveitem(BTScanOpaque so, int itemIndex,
OffsetNumber offnum, IndexTuple itup);
+static void _bt_savepostingitem(BTScanOpaque so, int itemIndex,
+ OffsetNumber offnum, ItemPointer iptr,
+ IndexTuple itup, int i);
static bool _bt_steppage(IndexScanDesc scan, ScanDirection dir);
static bool _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir);
static bool _bt_parallel_readpage(IndexScanDesc scan, BlockNumber blkno,
@@ -497,7 +500,8 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
/* We have low <= mid < high, so mid points at a real slot */
- result = _bt_compare(rel, key, page, mid);
+ result = _bt_compare_posting(rel, key, page, mid,
+ &(insertstate->in_posting_offset));
if (result >= cmpval)
low = mid + 1;
@@ -526,6 +530,55 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
return low;
}
+/*
+ * Compare insertion-type scankey to tuple on a page,
+ * taking into account posting tuples.
+ * If the key of the posting tuple is equal to scankey,
+ * find exact position inside the posting list,
+ * using TID as extra attribute.
+ */
+int32
+_bt_compare_posting(Relation rel,
+ BTScanInsert key,
+ Page page,
+ OffsetNumber offnum,
+ int *in_posting_offset)
+{
+ IndexTuple itup;
+ int result;
+
+ itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
+ result = _bt_compare(rel, key, page, offnum);
+
+ if (BTreeTupleIsPosting(itup) && result == 0)
+ {
+ int low,
+ high,
+ mid,
+ res;
+
+ low = 0;
+ /* "high" is past end of posting list for loop invariant */
+ high = BTreeTupleGetNPosting(itup);
+
+ while (high > low)
+ {
+ mid = low + ((high - low) / 2);
+ res = ItemPointerCompare(key->scantid,
+ BTreeTupleGetPostingN(itup, mid));
+
+ if (res > 0)
+ low = mid + 1;
+ else
+ high = mid;
+ }
+
+ *in_posting_offset = high;
+ }
+
+ return result;
+}
+
/*----------
* _bt_compare() -- Compare insertion-type scankey to tuple on a page.
*
@@ -658,61 +711,120 @@ _bt_compare(Relation rel,
* Use the heap TID attribute and scantid to try to break the tie. The
* rules are the same as any other key attribute -- only the
* representation differs.
+ *
+ * When itup is a posting tuple, the check becomes more complex. It is
+ * possible that the scankey belongs to the tuple's posting list TID
+ * range.
+ *
+ * _bt_compare() is multipurpose, so it just returns 0 to indicate that
+ * the key matches the tuple at this offset.
+ *
+ * Use special _bt_compare_posting() wrapper function to handle this case
+ * and perform recheck for posting tuple, finding exact position of the
+ * scankey.
*/
- heapTid = BTreeTupleGetHeapTID(itup);
- if (key->scantid == NULL)
+ if (!BTreeTupleIsPosting(itup))
{
+ heapTid = BTreeTupleGetHeapTID(itup);
+ if (key->scantid == NULL)
+ {
+ /*
+ * Most searches have a scankey that is considered greater than a
+ * truncated pivot tuple if and when the scankey has equal values
+ * for attributes up to and including the least significant
+ * untruncated attribute in tuple.
+ *
+ * For example, if an index has the minimum two attributes (single
+ * user key attribute, plus heap TID attribute), and a page's high
+ * key is ('foo', -inf), and scankey is ('foo', <omitted>), the
+ * search will not descend to the page to the left. The search
+ * will descend right instead. The truncated attribute in pivot
+ * tuple means that all non-pivot tuples on the page to the left
+ * are strictly < 'foo', so it isn't necessary to descend left. In
+ * other words, search doesn't have to descend left because it
+ * isn't interested in a match that has a heap TID value of -inf.
+ *
+ * However, some searches (pivotsearch searches) actually require
+ * that we descend left when this happens. -inf is treated as a
+ * possible match for omitted scankey attribute(s). This is
+ * needed by page deletion, which must re-find leaf pages that are
+ * targets for deletion using their high keys.
+ *
+ * Note: the heap TID part of the test ensures that scankey is
+ * being compared to a pivot tuple with one or more truncated key
+ * attributes.
+ *
+ * Note: pg_upgrade'd !heapkeyspace indexes must always descend to
+ * the left here, since they have no heap TID attribute (and
+ * cannot have any -inf key values in any case, since truncation
+ * can only remove non-key attributes). !heapkeyspace searches
+ * must always be prepared to deal with matches on both sides of
+ * the pivot once the leaf level is reached.
+ */
+ if (key->heapkeyspace && !key->pivotsearch &&
+ key->keysz == ntupatts && heapTid == NULL)
+ return 1;
+
+ /* All provided scankey arguments found to be equal */
+ return 0;
+ }
+
/*
- * Most searches have a scankey that is considered greater than a
- * truncated pivot tuple if and when the scankey has equal values for
- * attributes up to and including the least significant untruncated
- * attribute in tuple.
- *
- * For example, if an index has the minimum two attributes (single
- * user key attribute, plus heap TID attribute), and a page's high key
- * is ('foo', -inf), and scankey is ('foo', <omitted>), the search
- * will not descend to the page to the left. The search will descend
- * right instead. The truncated attribute in pivot tuple means that
- * all non-pivot tuples on the page to the left are strictly < 'foo',
- * so it isn't necessary to descend left. In other words, search
- * doesn't have to descend left because it isn't interested in a match
- * that has a heap TID value of -inf.
- *
- * However, some searches (pivotsearch searches) actually require that
- * we descend left when this happens. -inf is treated as a possible
- * match for omitted scankey attribute(s). This is needed by page
- * deletion, which must re-find leaf pages that are targets for
- * deletion using their high keys.
- *
- * Note: the heap TID part of the test ensures that scankey is being
- * compared to a pivot tuple with one or more truncated key
- * attributes.
- *
- * Note: pg_upgrade'd !heapkeyspace indexes must always descend to the
- * left here, since they have no heap TID attribute (and cannot have
- * any -inf key values in any case, since truncation can only remove
- * non-key attributes). !heapkeyspace searches must always be
- * prepared to deal with matches on both sides of the pivot once the
- * leaf level is reached.
+ * Treat truncated heap TID as minus infinity, since scankey has a key
+ * attribute value (scantid) that would otherwise be compared directly
*/
- if (key->heapkeyspace && !key->pivotsearch &&
- key->keysz == ntupatts && heapTid == NULL)
+ Assert(key->keysz == IndexRelationGetNumberOfKeyAttributes(rel));
+ if (heapTid == NULL)
return 1;
- /* All provided scankey arguments found to be equal */
- return 0;
+ Assert(ntupatts >= IndexRelationGetNumberOfKeyAttributes(rel));
+ return ItemPointerCompare(key->scantid, heapTid);
}
+ else
+ {
+ heapTid = BTreeTupleGetHeapTID(itup);
+ if (key->scantid != NULL && heapTid != NULL)
+ {
+ int cmp = ItemPointerCompare(key->scantid, heapTid);
- /*
- * Treat truncated heap TID as minus infinity, since scankey has a key
- * attribute value (scantid) that would otherwise be compared directly
- */
- Assert(key->keysz == IndexRelationGetNumberOfKeyAttributes(rel));
- if (heapTid == NULL)
- return 1;
+ if (cmp <= 0)
+ {
+ elog(DEBUG4, "offnum %d Scankey (%u,%u) is less than or equal to posting tuple (%u,%u)",
+ offnum, ItemPointerGetBlockNumberNoCheck(key->scantid),
+ ItemPointerGetOffsetNumberNoCheck(key->scantid),
+ ItemPointerGetBlockNumberNoCheck(heapTid),
+ ItemPointerGetOffsetNumberNoCheck(heapTid));
+ return cmp;
+ }
- Assert(ntupatts >= IndexRelationGetNumberOfKeyAttributes(rel));
- return ItemPointerCompare(key->scantid, heapTid);
+ heapTid = BTreeTupleGetMaxTID(itup);
+ cmp = ItemPointerCompare(key->scantid, heapTid);
+ if (cmp > 0)
+ {
+ elog(DEBUG4, "offnum %d Scankey (%u,%u) is greater than posting tuple (%u,%u)",
+ offnum, ItemPointerGetBlockNumberNoCheck(key->scantid),
+ ItemPointerGetOffsetNumberNoCheck(key->scantid),
+ ItemPointerGetBlockNumberNoCheck(heapTid),
+ ItemPointerGetOffsetNumberNoCheck(heapTid));
+ return cmp;
+ }
+
+ /*
+ * If we got here, scantid falls between the first and last posting
+ * items of the tuple.
+ */
+ elog(DEBUG4, "offnum %d Scankey (%u,%u) is between posting items (%u,%u) and (%u,%u)",
+ offnum, ItemPointerGetBlockNumberNoCheck(key->scantid),
+ ItemPointerGetOffsetNumberNoCheck(key->scantid),
+ ItemPointerGetBlockNumberNoCheck(BTreeTupleGetHeapTID(itup)),
+ ItemPointerGetOffsetNumberNoCheck(BTreeTupleGetHeapTID(itup)),
+ ItemPointerGetBlockNumberNoCheck(heapTid),
+ ItemPointerGetOffsetNumberNoCheck(heapTid));
+ return 0;
+ }
+ }
+
+ return 0;
}
/*
@@ -1449,6 +1561,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
/* initialize tuple workspace to empty */
so->currPos.nextTupleOffset = 0;
+ so->currPos.prevTupleOffset = 0;
/*
* Now that the current page has been made consistent, the macro should be
@@ -1483,8 +1596,22 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
if (_bt_checkkeys(scan, itup, indnatts, dir, &continuescan))
{
/* tuple passes all scan key conditions, so remember it */
- _bt_saveitem(so, itemIndex, offnum, itup);
- itemIndex++;
+ if (!BTreeTupleIsPosting(itup))
+ {
+ _bt_saveitem(so, itemIndex, offnum, itup);
+ itemIndex++;
+ }
+ else
+ {
+ /* Return posting list "logical" tuples */
+ for (int i = 0; i < BTreeTupleGetNPosting(itup); i++)
+ {
+ _bt_savepostingitem(so, itemIndex, offnum,
+ BTreeTupleGetPostingN(itup, i),
+ itup, i);
+ itemIndex++;
+ }
+ }
}
/* When !continuescan, there can't be any more matches, so stop */
if (!continuescan)
@@ -1517,7 +1644,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
if (!continuescan)
so->currPos.moreRight = false;
- Assert(itemIndex <= MaxIndexTuplesPerPage);
+ Assert(itemIndex <= MaxPostingIndexTuplesPerPage);
so->currPos.firstItem = 0;
so->currPos.lastItem = itemIndex - 1;
so->currPos.itemIndex = 0;
@@ -1525,7 +1652,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
else
{
/* load items[] in descending order */
- itemIndex = MaxIndexTuplesPerPage;
+ itemIndex = MaxPostingIndexTuplesPerPage;
offnum = Min(offnum, maxoff);
@@ -1567,8 +1694,23 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
if (passes_quals && tuple_alive)
{
/* tuple passes all scan key conditions, so remember it */
- itemIndex--;
- _bt_saveitem(so, itemIndex, offnum, itup);
+ if (!BTreeTupleIsPosting(itup))
+ {
+ itemIndex--;
+ _bt_saveitem(so, itemIndex, offnum, itup);
+ }
+ else
+ {
+ /* Return posting list "logical" tuples */
+ /* XXX: Maybe this loop should be backwards? */
+ for (int i = 0; i < BTreeTupleGetNPosting(itup); i++)
+ {
+ itemIndex--;
+ _bt_savepostingitem(so, itemIndex, offnum,
+ BTreeTupleGetPostingN(itup, i),
+ itup, i);
+ }
+ }
}
if (!continuescan)
{
@@ -1582,8 +1724,8 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
Assert(itemIndex >= 0);
so->currPos.firstItem = itemIndex;
- so->currPos.lastItem = MaxIndexTuplesPerPage - 1;
- so->currPos.itemIndex = MaxIndexTuplesPerPage - 1;
+ so->currPos.lastItem = MaxPostingIndexTuplesPerPage - 1;
+ so->currPos.itemIndex = MaxPostingIndexTuplesPerPage - 1;
}
return (so->currPos.firstItem <= so->currPos.lastItem);
@@ -1596,6 +1738,8 @@ _bt_saveitem(BTScanOpaque so, int itemIndex,
{
BTScanPosItem *currItem = &so->currPos.items[itemIndex];
+ Assert(!BTreeTupleIsPosting(itup));
+
currItem->heapTid = itup->t_tid;
currItem->indexOffset = offnum;
if (so->currTuples)
@@ -1608,6 +1752,33 @@ _bt_saveitem(BTScanOpaque so, int itemIndex,
}
}
+/* Save an index item into so->currPos.items[itemIndex] for posting tuples. */
+static void
+_bt_savepostingitem(BTScanOpaque so, int itemIndex, OffsetNumber offnum,
+ ItemPointer iptr, IndexTuple itup, int i)
+{
+ BTScanPosItem *currItem = &so->currPos.items[itemIndex];
+
+ currItem->heapTid = *iptr;
+ currItem->indexOffset = offnum;
+
+ if (so->currTuples)
+ {
+ if (i == 0)
+ {
+ /* Save the key; it is the same for all TIDs in the posting list */
+ Size itupsz = BTreeTupleGetPostingOffset(itup);
+
+ currItem->tupleOffset = so->currPos.nextTupleOffset;
+ memcpy(so->currTuples + so->currPos.nextTupleOffset, itup, itupsz);
+ so->currPos.nextTupleOffset += MAXALIGN(itupsz);
+ so->currPos.prevTupleOffset = currItem->tupleOffset;
+ }
+ else
+ currItem->tupleOffset = so->currPos.prevTupleOffset;
+ }
+}
+
/*
* _bt_steppage() -- Step to next page containing valid data for scan
*
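Note on the nbtsearch.c changes above: the posting-tuple branch of the comparison treats the tuple as a TID range, from the first TID in its posting list to the last (BTreeTupleGetMaxTID). The following is only a minimal standalone sketch of that three-way range comparison. Tid, tid_compare and compare_scantid_to_posting are hypothetical stand-ins for illustration, not PostgreSQL APIs, and the posting list is assumed to be sorted in ascending TID order.

```c
#include <assert.h>
#include <stdint.h>

/* Simplified stand-in for ItemPointerData: a (block, offset) pair. */
typedef struct Tid
{
    uint32_t block;
    uint16_t offset;
} Tid;

/* Three-way comparison, ordered by block number and then by offset. */
static int
tid_compare(Tid a, Tid b)
{
    if (a.block != b.block)
        return (a.block < b.block) ? -1 : 1;
    if (a.offset != b.offset)
        return (a.offset < b.offset) ? -1 : 1;
    return 0;
}

/*
 * Compare a scan TID against a posting list sorted in ascending order.
 *
 * Returns the result of comparing against the first TID when scantid is
 * less than or equal to it (-1 or 0), returns 1 when scantid is greater
 * than the last TID, and returns 0 when scantid lies inside the range.
 */
static int
compare_scantid_to_posting(Tid scantid, const Tid *posting, int nposting)
{
    int cmp;

    assert(nposting > 0);

    cmp = tid_compare(scantid, posting[0]);
    if (cmp <= 0)
        return cmp;

    cmp = tid_compare(scantid, posting[nposting - 1]);
    if (cmp > 0)
        return cmp;

    /* scantid is somewhere between the first and the last posting item */
    return 0;
}
```

A return value of 0 here only means "within the range covered by this posting tuple"; it does not say that the exact TID is present in the list.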
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index ab19692..d7207e0 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -288,6 +288,8 @@ static void _bt_sortaddtup(Page page, Size itemsize,
static void _bt_buildadd(BTWriteState *wstate, BTPageState *state,
IndexTuple itup);
static void _bt_uppershutdown(BTWriteState *wstate, BTPageState *state);
+static void _bt_buildadd_posting(BTWriteState *wstate, BTPageState *state,
+ BTCompressState *compressState);
static void _bt_load(BTWriteState *wstate,
BTSpool *btspool, BTSpool *btspool2);
static void _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent,
@@ -963,6 +965,11 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
* Overwrite the old item with new truncated high key directly.
* oitup is already located at the physical beginning of tuple
* space, so this should directly reuse the existing tuple space.
+ *
+ * If the lastleft tuple is a posting tuple, _bt_truncate will
+ * truncate its posting list as well. Note that this applies
+ * only to leaf pages, since internal pages never contain
+ * posting tuples.
*/
ii = PageGetItemId(opage, OffsetNumberPrev(last_off));
lastleft = (IndexTuple) PageGetItem(opage, ii);
@@ -1002,6 +1009,7 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
* the minimum key for the new page.
*/
state->btps_minkey = CopyIndexTuple(oitup);
+ Assert(BTreeTupleIsPivot(state->btps_minkey));
/*
* Set the sibling links for both pages.
@@ -1043,6 +1051,7 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
Assert(state->btps_minkey == NULL);
state->btps_minkey = CopyIndexTuple(itup);
/* _bt_sortaddtup() will perform full truncation later */
+ BTreeTupleClearBtIsPosting(state->btps_minkey);
BTreeTupleSetNAtts(state->btps_minkey, 0);
}
@@ -1128,6 +1137,91 @@ _bt_uppershutdown(BTWriteState *wstate, BTPageState *state)
}
/*
+ * Add new tuple (posting or non-posting) to the page while building index.
+ */
+static void
+_bt_buildadd_posting(BTWriteState *wstate, BTPageState *state,
+ BTCompressState *compressState)
+{
+ IndexTuple to_insert;
+
+ /* Return if there is no tuple to insert */
+ if (state == NULL)
+ return;
+
+ if (compressState->ntuples == 0)
+ to_insert = compressState->itupprev;
+ else
+ {
+ IndexTuple postingtuple;
+
+ /* form a tuple with a posting list */
+ postingtuple = BTreeFormPostingTuple(compressState->itupprev,
+ compressState->ipd,
+ compressState->ntuples);
+ to_insert = postingtuple;
+ pfree(compressState->ipd);
+ }
+
+ _bt_buildadd(wstate, state, to_insert);
+
+ if (compressState->ntuples > 0)
+ pfree(to_insert);
+ compressState->ntuples = 0;
+}
+
+/*
+ * Save item pointer(s) of itup to the posting list in compressState.
+ *
+ * Helper function for _bt_load() and _bt_compress_one_page().
+ *
+ * Note: caller is responsible for size check to ensure that resulting tuple
+ * won't exceed BTMaxItemSize.
+ */
+void
+_bt_add_posting_item(BTCompressState *compressState, IndexTuple itup)
+{
+ int nposting = 0;
+
+ if (compressState->ntuples == 0)
+ {
+ compressState->ipd = palloc0(compressState->maxitemsize);
+
+ if (BTreeTupleIsPosting(compressState->itupprev))
+ {
+ /* if itupprev is posting, add all its TIDs to the posting list */
+ nposting = BTreeTupleGetNPosting(compressState->itupprev);
+ memcpy(compressState->ipd,
+ BTreeTupleGetPosting(compressState->itupprev),
+ sizeof(ItemPointerData) * nposting);
+ compressState->ntuples += nposting;
+ }
+ else
+ {
+ memcpy(compressState->ipd, compressState->itupprev,
+ sizeof(ItemPointerData));
+ compressState->ntuples++;
+ }
+ }
+
+ if (BTreeTupleIsPosting(itup))
+ {
+ /* if tuple is posting, add all its TIDs to the posting list */
+ nposting = BTreeTupleGetNPosting(itup);
+ memcpy(compressState->ipd + compressState->ntuples,
+ BTreeTupleGetPosting(itup),
+ sizeof(ItemPointerData) * nposting);
+ compressState->ntuples += nposting;
+ }
+ else
+ {
+ memcpy(compressState->ipd + compressState->ntuples, itup,
+ sizeof(ItemPointerData));
+ compressState->ntuples++;
+ }
+}
+
+/*
* Read tuples in correct sort order from tuplesort, and load them into
* btree leaves.
*/
@@ -1141,9 +1235,20 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
bool load1;
TupleDesc tupdes = RelationGetDescr(wstate->index);
int i,
- keysz = IndexRelationGetNumberOfKeyAttributes(wstate->index);
+ keysz = IndexRelationGetNumberOfKeyAttributes(wstate->index),
+ natts = IndexRelationGetNumberOfAttributes(wstate->index);
SortSupport sortKeys;
int64 tuples_done = 0;
+ bool use_compression = false;
+ BTCompressState *compressState = NULL;
+
+ /*
+ * Don't use compression for indexes with INCLUDEd columns and unique
+ * indexes.
+ */
+ use_compression = (IndexRelationGetNumberOfKeyAttributes(wstate->index) ==
+ IndexRelationGetNumberOfAttributes(wstate->index) &&
+ !wstate->index->rd_index->indisunique);
if (merge)
{
@@ -1257,19 +1362,89 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
}
else
{
- /* merge is unnecessary */
- while ((itup = tuplesort_getindextuple(btspool->sortstate,
- true)) != NULL)
+ if (!use_compression)
{
- /* When we see first tuple, create first index page */
- if (state == NULL)
- state = _bt_pagestate(wstate, 0);
+ /* merge is unnecessary */
+ while ((itup = tuplesort_getindextuple(btspool->sortstate,
+ true)) != NULL)
+ {
+ /* When we see first tuple, create first index page */
+ if (state == NULL)
+ state = _bt_pagestate(wstate, 0);
- _bt_buildadd(wstate, state, itup);
+ _bt_buildadd(wstate, state, itup);
- /* Report progress */
- pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
- ++tuples_done);
+ /* Report progress */
+ pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
+ ++tuples_done);
+ }
+ }
+ else
+ {
+ /* init compress state needed to build posting tuples */
+ compressState = (BTCompressState *) palloc0(sizeof(BTCompressState));
+ compressState->ipd = NULL;
+ compressState->ntuples = 0;
+ compressState->itupprev = NULL;
+ compressState->maxitemsize = 0;
+ compressState->maxpostingsize = 0;
+
+ while ((itup = tuplesort_getindextuple(btspool->sortstate,
+ true)) != NULL)
+ {
+ /* When we see first tuple, create first index page */
+ if (state == NULL)
+ {
+ state = _bt_pagestate(wstate, 0);
+ compressState->maxitemsize = BTMaxItemSize(state->btps_page);
+ }
+
+ if (compressState->itupprev != NULL)
+ {
+ int n_equal_atts = _bt_keep_natts_fast(wstate->index,
+ compressState->itupprev, itup);
+
+ if (n_equal_atts > natts)
+ {
+ /*
+ * Tuples are equal. Create or update the posting list.
+ *
+ * If the posting list would become too big, insert it
+ * on the page and continue.
+ */
+ if ((compressState->ntuples + 1) * sizeof(ItemPointerData) <
+ compressState->maxpostingsize)
+ _bt_add_posting_item(compressState, itup);
+ else
+ _bt_buildadd_posting(wstate, state,
+ compressState);
+ }
+ else
+ {
+ /*
+ * Tuples are not equal. Insert itupprev into index.
+ * Save current tuple for the next iteration.
+ */
+ _bt_buildadd_posting(wstate, state, compressState);
+ }
+ }
+
+ /*
+ * Save the tuple to compare it with the next one and maybe
+ * unite them into a posting tuple.
+ */
+ if (compressState->itupprev)
+ pfree(compressState->itupprev);
+ compressState->itupprev = CopyIndexTuple(itup);
+
+ /* compute max size of posting list */
+ compressState->maxpostingsize = compressState->maxitemsize -
+ IndexInfoFindDataOffset(compressState->itupprev->t_info) -
+ MAXALIGN(IndexTupleSize(compressState->itupprev));
+ }
+
+ /* Handle the last item */
+ _bt_buildadd_posting(wstate, state, compressState);
}
}
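Note on the _bt_load() loop above: it reads tuples in sorted order, collects the TIDs of consecutive equal keys into one posting list, and starts a new posting tuple when the key changes or the list would exceed its size limit. Below is a simplified standalone sketch of that grouping idea. Entry, Group, MAX_POSTING and group_duplicates are hypothetical names chosen for illustration; keys and TIDs are plain integers, and the cap is a fixed item count rather than the byte-based maxpostingsize used in the patch. Unlike the patch, it appends to the output directly instead of buffering the previous tuple.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_POSTING 4           /* hypothetical cap on TIDs per group */

typedef struct Entry
{
    int key;
    int tid;
} Entry;

typedef struct Group
{
    int key;
    int ntids;
    int tids[MAX_POSTING];
} Group;

/*
 * Collapse a key-sorted input into groups of at most MAX_POSTING TIDs.
 * Returns the number of groups written; out must have room for n groups.
 */
static size_t
group_duplicates(const Entry *in, size_t n, Group *out)
{
    size_t ngroups = 0;

    for (size_t i = 0; i < n; i++)
    {
        Group *cur = (ngroups > 0) ? &out[ngroups - 1] : NULL;

        if (cur != NULL && cur->key == in[i].key && cur->ntids < MAX_POSTING)
        {
            /* Same key and still room: extend the current posting list */
            cur->tids[cur->ntids++] = in[i].tid;
        }
        else
        {
            /* New key, or the current list is full: start a new group */
            cur = &out[ngroups++];
            cur->key = in[i].key;
            cur->ntids = 1;
            cur->tids[0] = in[i].tid;
        }
    }

    return ngroups;
}
```

With five duplicates of one key and a cap of four, the sketch emits a full group followed by a single-TID group, which mirrors the case where a page holds several tuples with the same key.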
diff --git a/src/backend/access/nbtree/nbtsplitloc.c b/src/backend/access/nbtree/nbtsplitloc.c
index 1c1029b..0ead2ea 100644
--- a/src/backend/access/nbtree/nbtsplitloc.c
+++ b/src/backend/access/nbtree/nbtsplitloc.c
@@ -459,6 +459,7 @@ _bt_recsplitloc(FindSplitData *state,
int16 leftfree,
rightfree;
Size firstrightitemsz;
+ Size postingsubhikey = 0;
bool newitemisfirstonright;
/* Is the new item going to be the first item on the right page? */
@@ -466,10 +467,33 @@ _bt_recsplitloc(FindSplitData *state,
&& !newitemonleft);
if (newitemisfirstonright)
+ {
firstrightitemsz = state->newitemsz;
+
+ /* Calculate posting list overhead, if any */
+ if (state->is_leaf && BTreeTupleIsPosting(state->newitem))
+ postingsubhikey = IndexTupleSize(state->newitem) -
+ BTreeTupleGetPostingOffset(state->newitem);
+ }
else
+ {
firstrightitemsz = firstoldonrightsz;
+ /* Calculate posting list overhead, if any */
+ if (state->is_leaf)
+ {
+ ItemId itemid;
+ IndexTuple newhighkey;
+
+ itemid = PageGetItemId(state->page, firstoldonright);
+ newhighkey = (IndexTuple) PageGetItem(state->page, itemid);
+
+ if (BTreeTupleIsPosting(newhighkey))
+ postingsubhikey = IndexTupleSize(newhighkey) -
+ BTreeTupleGetPostingOffset(newhighkey);
+ }
+ }
+
/* Account for all the old tuples */
leftfree = state->leftspace - olddataitemstoleft;
rightfree = state->rightspace -
@@ -492,9 +516,13 @@ _bt_recsplitloc(FindSplitData *state,
* adding a heap TID to the left half's new high key when splitting at the
* leaf level. In practice the new high key will often be smaller and
* will rarely be larger, but conservatively assume the worst case.
+ * Truncation always truncates away any posting list that appears in the
+ * first right tuple, though, so it's safe to subtract that overhead
+ * (while still conservatively assuming that truncation might have to add
+ * back a single heap TID using the pivot tuple heap TID representation).
*/
if (state->is_leaf)
- leftfree -= (int16) (firstrightitemsz +
+ leftfree -= (int16) ((firstrightitemsz - postingsubhikey) +
MAXALIGN(sizeof(ItemPointerData)));
else
leftfree -= (int16) firstrightitemsz;
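Note on the _bt_recsplitloc() change above: for a leaf split, the space reserved for the new high key is the size of the first right tuple minus its posting list, plus room for one MAXALIGN'd heap TID. The arithmetic can be sketched as follows. The constants and the names max_align and leaf_hikey_reserve are assumptions for illustration (8-byte MAXALIGN, 6-byte ItemPointerData), not values read from the headers.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_ALIGNMENT 8         /* assumed MAXALIGN boundary */
#define TID_SIZE 6              /* assumed sizeof(ItemPointerData) */

/* Round len up to the next multiple of MAX_ALIGNMENT. */
static size_t
max_align(size_t len)
{
    return (len + MAX_ALIGNMENT - 1) & ~((size_t) (MAX_ALIGNMENT - 1));
}

/*
 * Worst-case space to reserve on the left page for the new high key.
 *
 * firstright_size is the full size of the first right tuple and
 * posting_offset is where its posting list starts; the posting list is
 * subtracted because truncation always removes it.
 */
static size_t
leaf_hikey_reserve(size_t firstright_size, size_t posting_offset,
                   int is_posting)
{
    size_t posting_overhead = 0;

    if (is_posting)
        posting_overhead = firstright_size - posting_offset;

    return (firstright_size - posting_overhead) + max_align(TID_SIZE);
}
```

For example, a 64-byte posting tuple whose posting list starts at offset 16 reserves 16 + 8 = 24 bytes, the same as a plain 16-byte tuple would.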
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 4c7b2d0..7be2542 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -111,8 +111,12 @@ _bt_mkscankey(Relation rel, IndexTuple itup)
key->nextkey = false;
key->pivotsearch = false;
key->keysz = Min(indnkeyatts, tupnatts);
- key->scantid = key->heapkeyspace && itup ?
- BTreeTupleGetHeapTID(itup) : NULL;
+
+ if (itup && key->heapkeyspace)
+ key->scantid = BTreeTupleGetHeapTID(itup);
+ else
+ key->scantid = NULL;
+
skey = key->scankeys;
for (i = 0; i < indnkeyatts; i++)
{
@@ -1787,7 +1791,9 @@ _bt_killitems(IndexScanDesc scan)
ItemId iid = PageGetItemId(page, offnum);
IndexTuple ituple = (IndexTuple) PageGetItem(page, iid);
- if (ItemPointerEquals(&ituple->t_tid, &kitem->heapTid))
+ /* No microvacuum for posting tuples */
+ if (!BTreeTupleIsPosting(ituple) &&
+ (ItemPointerEquals(&ituple->t_tid, &kitem->heapTid)))
{
/* found the item */
ItemIdMarkDead(iid);
@@ -2145,6 +2151,16 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
pivot = index_truncate_tuple(itupdesc, firstright, keepnatts);
+ if (BTreeTupleIsPosting(firstright))
+ {
+ BTreeTupleClearBtIsPosting(pivot);
+ BTreeTupleSetNAtts(pivot, keepnatts);
+ pivot->t_info &= ~INDEX_SIZE_MASK;
+ pivot->t_info |= BTreeTupleGetPostingOffset(firstright);
+ }
+
+ Assert(!BTreeTupleIsPosting(pivot));
+
/*
* If there is a distinguishing key attribute within new pivot tuple,
* there is no need to add an explicit heap TID attribute
@@ -2161,6 +2177,8 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
* attribute to the new pivot tuple.
*/
Assert(natts != nkeyatts);
+ Assert(!BTreeTupleIsPosting(lastleft));
+ Assert(!BTreeTupleIsPosting(firstright));
newsize = IndexTupleSize(pivot) + MAXALIGN(sizeof(ItemPointerData));
tidpivot = palloc0(newsize);
memcpy(tidpivot, pivot, IndexTupleSize(pivot));
@@ -2168,6 +2186,27 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
pfree(pivot);
pivot = tidpivot;
}
+ else if (BTreeTupleIsPosting(firstright))
+ {
+ /*
+ * No truncation was possible, since key attributes are all equal. But
+ * the tuple is a compressed tuple with a posting list, so we still
+ * must truncate it.
+ *
+ * It's necessary to add a heap TID attribute to the new pivot tuple.
+ */
+ newsize = BTreeTupleGetPostingOffset(firstright) +
+ MAXALIGN(sizeof(ItemPointerData));
+ pivot = palloc0(newsize);
+ memcpy(pivot, firstright, BTreeTupleGetPostingOffset(firstright));
+
+ pivot->t_info &= ~INDEX_SIZE_MASK;
+ pivot->t_info |= newsize;
+ BTreeTupleClearBtIsPosting(pivot);
+ BTreeTupleSetAltHeapTID(pivot);
+
+ Assert(!BTreeTupleIsPosting(pivot));
+ }
else
{
/*
@@ -2205,7 +2244,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
*/
pivotheaptid = (ItemPointer) ((char *) pivot + newsize -
sizeof(ItemPointerData));
- ItemPointerCopy(&lastleft->t_tid, pivotheaptid);
+ ItemPointerCopy(BTreeTupleGetMaxTID(lastleft), pivotheaptid);
/*
* Lehman and Yao require that the downlink to the right page, which is to
@@ -2216,9 +2255,12 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
* tiebreaker.
*/
#ifndef DEBUG_NO_TRUNCATE
- Assert(ItemPointerCompare(&lastleft->t_tid, &firstright->t_tid) < 0);
- Assert(ItemPointerCompare(pivotheaptid, &lastleft->t_tid) >= 0);
- Assert(ItemPointerCompare(pivotheaptid, &firstright->t_tid) < 0);
+ Assert(ItemPointerCompare(BTreeTupleGetMaxTID(lastleft),
+ BTreeTupleGetHeapTID(firstright)) < 0);
+ Assert(ItemPointerCompare(pivotheaptid,
+ BTreeTupleGetHeapTID(lastleft)) >= 0);
+ Assert(ItemPointerCompare(pivotheaptid,
+ BTreeTupleGetHeapTID(firstright)) < 0);
#else
/*
@@ -2231,7 +2273,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
* attribute values along with lastleft's heap TID value when lastleft's
* TID happens to be greater than firstright's TID.
*/
- ItemPointerCopy(&firstright->t_tid, pivotheaptid);
+ ItemPointerCopy(BTreeTupleGetHeapTID(firstright), pivotheaptid);
/*
* Pivot heap TID should never be fully equal to firstright. Note that
@@ -2240,7 +2282,8 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
*/
ItemPointerSetOffsetNumber(pivotheaptid,
OffsetNumberPrev(ItemPointerGetOffsetNumber(pivotheaptid)));
- Assert(ItemPointerCompare(pivotheaptid, &firstright->t_tid) < 0);
+ Assert(ItemPointerCompare(pivotheaptid,
+ BTreeTupleGetHeapTID(firstright)) < 0);
#endif
BTreeTupleSetNAtts(pivot, nkeyatts);
@@ -2330,6 +2373,25 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
* leaving excessive amounts of free space on either side of page split.
* Callers can rely on the fact that attributes considered equal here are
* definitely also equal according to _bt_keep_natts.
+ *
+ * To build a posting tuple we need to ensure that all attributes
+ * of both tuples are equal. Use this function to compare them.
+ * TODO: maybe it's worth renaming the function.
+ *
+ * XXX: Obviously we need infrastructure for making sure it is okay to use
+ * this for posting list stuff. For example, non-deterministic collations
+ * cannot use compression, and will not work with what we have now.
+ *
+ * XXX: Even then, we probably also need to worry about TOAST as a special
+ * case. Don't repeat bugs like the amcheck bug that was fixed in commit
+ * eba775345d23d2c999bbb412ae658b6dab36e3e8. As the test case added in that
+ * commit shows, we need to worry about pg_attribute.attstorage changing in
+ * the underlying table due to an ALTER TABLE (and maybe a few other things
+ * like that). In general, the "TOAST input state" of a TOASTable datum isn't
+ * something that we make many guarantees about today, so even with C
+ * collation text we could in theory get different answers from
+ * _bt_keep_natts_fast() and _bt_keep_natts(). This needs to be nailed down
+ * in some way.
*/
int
_bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
@@ -2415,7 +2477,7 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
* Non-pivot tuples currently never use alternative heap TID
* representation -- even those within heapkeyspace indexes
*/
- if ((itup->t_info & INDEX_ALT_TID_MASK) != 0)
+ if (BTreeTupleIsPivot(itup))
return false;
/*
@@ -2470,7 +2532,7 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
* that to decide if the tuple is a pre-v11 tuple.
*/
return tupnatts == 0 ||
- ((itup->t_info & INDEX_ALT_TID_MASK) == 0 &&
+ (!BTreeTupleIsPivot(itup) &&
ItemPointerGetOffsetNumber(&(itup->t_tid)) == P_HIKEY);
}
else
@@ -2497,7 +2559,7 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
* heapkeyspace index pivot tuples, regardless of whether or not there are
* non-key attributes.
*/
- if ((itup->t_info & INDEX_ALT_TID_MASK) == 0)
+ if (!BTreeTupleIsPivot(itup))
return false;
/*
@@ -2549,6 +2611,8 @@ _bt_check_third_page(Relation rel, Relation heap, bool needheaptidspace,
if (!needheaptidspace && itemsz <= BTMaxItemSizeNoHeapTid(page))
return;
+ /* TODO correct error messages for posting tuples */
+
/*
* Internal page insertions cannot fail here, because that would mean that
* an earlier leaf level insertion that should have failed didn't
@@ -2575,3 +2639,79 @@ _bt_check_third_page(Relation rel, Relation heap, bool needheaptidspace,
"or use full text indexing."),
errtableconstraint(heap, RelationGetRelationName(rel))));
}
+
+/*
+ * Given a basic tuple that contains the key datum, and a posting list
+ * (ipd), build a posting tuple.
+ *
+ * The basic tuple may itself be a posting tuple, but only its key part is
+ * used; all ItemPointers must be passed via ipd.
+ *
+ * If nipd == 1, fall back to building a non-posting tuple. This is
+ * necessary to avoid storage overhead after a posting tuple was vacuumed.
+ */
+IndexTuple
+BTreeFormPostingTuple(IndexTuple tuple, ItemPointerData *ipd, int nipd)
+{
+ uint32 keysize,
+ newsize = 0;
+ IndexTuple itup;
+
+ /* We only need key part of the tuple */
+ if (BTreeTupleIsPosting(tuple))
+ keysize = BTreeTupleGetPostingOffset(tuple);
+ else
+ keysize = IndexTupleSize(tuple);
+
+ Assert(nipd > 0);
+
+ /* Add space needed for posting list */
+ if (nipd > 1)
+ newsize = SHORTALIGN(keysize) + sizeof(ItemPointerData) * nipd;
+ else
+ newsize = keysize;
+
+ newsize = MAXALIGN(newsize);
+ itup = palloc0(newsize);
+ memcpy(itup, tuple, keysize);
+ itup->t_info &= ~INDEX_SIZE_MASK;
+ itup->t_info |= newsize;
+
+ if (nipd > 1)
+ {
+ /* Form posting tuple, fill posting fields */
+
+ /* Set meta info about the posting list */
+ itup->t_info |= INDEX_ALT_TID_MASK;
+ BTreeSetPostingMeta(itup, nipd, SHORTALIGN(keysize));
+
+ /* sort the list to preserve TID order invariant */
+ qsort((void *) ipd, nipd, sizeof(ItemPointerData),
+ (int (*) (const void *, const void *)) ItemPointerCompare);
+
+ /* Copy posting list into the posting tuple */
+ memcpy(BTreeTupleGetPosting(itup), ipd,
+ sizeof(ItemPointerData) * nipd);
+ }
+ else
+ {
+ /* To finish building of a non-posting tuple, copy TID from ipd */
+ itup->t_info &= ~INDEX_ALT_TID_MASK;
+ ItemPointerCopy(ipd, &itup->t_tid);
+ }
+
+ return itup;
+}
+
+/*
+ * Opposite of BTreeFormPostingTuple.
+ * Returns a regular tuple that contains the key; the TID of the new tuple
+ * is the nth TID of the original tuple's posting list.
+ * The result tuple is palloc'd in the caller's context.
+ */
+IndexTuple
+BTreeGetNthTupleOfPosting(IndexTuple tuple, int n)
+{
+ Assert(BTreeTupleIsPosting(tuple));
+ return BTreeFormPostingTuple(tuple, BTreeTupleGetPostingN(tuple, n), 1);
+}
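Note on BTreeFormPostingTuple() above: the size of the resulting tuple is the SHORTALIGN'd key part plus one ItemPointerData per posting item, rounded up with MAXALIGN, or just the key size when there is a single TID. The sketch below reproduces that calculation under assumed constants (2-byte SHORTALIGN, 8-byte MAXALIGN, 6-byte ItemPointerData). The names align_up and posting_tuple_size are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define SHORT_ALIGNMENT 2       /* assumed SHORTALIGN boundary */
#define MAX_ALIGNMENT 8         /* assumed MAXALIGN boundary */
#define TID_SIZE 6              /* assumed sizeof(ItemPointerData) */

/* Round len up to the next multiple of a power-of-two alignment. */
static size_t
align_up(size_t len, size_t alignment)
{
    return (len + alignment - 1) & ~(alignment - 1);
}

/*
 * Size of a tuple built from a key part of keysize bytes and nipd TIDs.
 * With a single TID no posting list is stored, so only the key part
 * (which already embeds one TID in its header) is needed.
 */
static size_t
posting_tuple_size(size_t keysize, int nipd)
{
    size_t size;

    assert(nipd > 0);

    if (nipd > 1)
        size = align_up(keysize, SHORT_ALIGNMENT) + TID_SIZE * (size_t) nipd;
    else
        size = keysize;

    return align_up(size, MAX_ALIGNMENT);
}
```

Under these assumptions, 100 duplicates of a 16-byte tuple take 616 bytes as one posting tuple, compared with 1600 bytes (plus 100 line pointers) as separate tuples.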
diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index dd5315c..2015a5b 100644
--- a/src/backend/access/nbtree/nbtxlog.c
+++ b/src/backend/access/nbtree/nbtxlog.c
@@ -163,6 +163,7 @@ btree_xlog_insert(bool isleaf, bool ismeta, XLogReaderState *record)
Buffer buffer;
Page page;
+
/*
* Insertion to an internal page finishes an incomplete split at the child
* level. Clear the incomplete-split flag in the child. Note: during
@@ -178,9 +179,23 @@ btree_xlog_insert(bool isleaf, bool ismeta, XLogReaderState *record)
{
Size datalen;
char *datapos = XLogRecGetBlockData(record, 0, &datalen);
+ IndexTuple neworigtup = NULL;
+
page = BufferGetPage(buffer);
+ if (xlrec->origtup_off > 0)
+ {
+ IndexTuple origtup = (IndexTuple) PageGetItem(page,
+ PageGetItemId(page, xlrec->offnum));
+ neworigtup = (IndexTuple) (datapos + xlrec->origtup_off);
+
+ Assert(MAXALIGN(IndexTupleSize(origtup)) == MAXALIGN(IndexTupleSize(neworigtup)));
+
+ memcpy(origtup, neworigtup, MAXALIGN(IndexTupleSize(neworigtup)));
+ xlrec->offnum = OffsetNumberNext(xlrec->offnum);
+ }
+
if (PageAddItem(page, (Item) datapos, datalen, xlrec->offnum,
false, false) == InvalidOffsetNumber)
elog(PANIC, "btree_xlog_insert: failed to add item");
@@ -265,9 +280,11 @@ btree_xlog_split(bool onleft, XLogReaderState *record)
BTPageOpaque lopaque = (BTPageOpaque) PageGetSpecialPointer(lpage);
OffsetNumber off;
IndexTuple newitem = NULL,
- left_hikey = NULL;
+ left_hikey = NULL,
+ replaceitem = NULL;
Size newitemsz = 0,
- left_hikeysz = 0;
+ left_hikeysz = 0,
+ replaceitemsz = 0;
Page newlpage;
OffsetNumber leftoff;
@@ -287,6 +304,14 @@ btree_xlog_split(bool onleft, XLogReaderState *record)
datapos += left_hikeysz;
datalen -= left_hikeysz;
+ if (xlrec->replaceitemoff)
+ {
+ replaceitem = (IndexTuple) datapos;
+ replaceitemsz = MAXALIGN(IndexTupleSize(replaceitem));
+ datapos += replaceitemsz;
+ datalen -= replaceitemsz;
+ }
+
Assert(datalen == 0);
newlpage = PageGetTempPageCopySpecial(lpage);
@@ -304,6 +329,15 @@ btree_xlog_split(bool onleft, XLogReaderState *record)
Size itemsz;
IndexTuple item;
+ if (off == xlrec->replaceitemoff)
+ {
+ if (PageAddItem(newlpage, (Item) replaceitem, replaceitemsz, leftoff,
+ false, false) == InvalidOffsetNumber)
+ elog(ERROR, "failed to add new item to left page after split");
+ leftoff = OffsetNumberNext(leftoff);
+ continue;
+ }
+
/* add the new item if it was inserted on left page */
if (onleft && off == xlrec->newitemoff)
{
@@ -386,8 +420,8 @@ btree_xlog_vacuum(XLogReaderState *record)
Buffer buffer;
Page page;
BTPageOpaque opaque;
-#ifdef UNUSED
xl_btree_vacuum *xlrec = (xl_btree_vacuum *) XLogRecGetData(record);
+#ifdef UNUSED
/*
* This section of code is thought to be no longer needed, after analysis
@@ -478,14 +512,34 @@ btree_xlog_vacuum(XLogReaderState *record)
if (len > 0)
{
- OffsetNumber *unused;
- OffsetNumber *unend;
+ if (xlrec->nremaining)
+ {
+ OffsetNumber *remainingoffset;
+ IndexTuple remaining;
+ Size itemsz;
+
+ remainingoffset = (OffsetNumber *)
+ (ptr + xlrec->ndeleted * sizeof(OffsetNumber));
+ remaining = (IndexTuple) ((char *) remainingoffset +
+ xlrec->nremaining * sizeof(OffsetNumber));
+
+ /* Handle posting tuples */
+ for (int i = 0; i < xlrec->nremaining; i++)
+ {
+ PageIndexTupleDelete(page, remainingoffset[i]);
- unused = (OffsetNumber *) ptr;
- unend = (OffsetNumber *) ((char *) ptr + len);
+ itemsz = MAXALIGN(IndexTupleSize(remaining));
+
+ if (PageAddItem(page, (Item) remaining, itemsz, remainingoffset[i],
+ false, false) == InvalidOffsetNumber)
+ elog(PANIC, "btree_xlog_vacuum: failed to add remaining item");
+
+ remaining = (IndexTuple) ((char *) remaining + itemsz);
+ }
+ }
- if ((unend - unused) > 0)
- PageIndexMultiDelete(page, unused, unend - unused);
+ if (xlrec->ndeleted)
+ PageIndexMultiDelete(page, (OffsetNumber *) ptr, xlrec->ndeleted);
}
/*
diff --git a/src/backend/access/rmgrdesc/nbtdesc.c b/src/backend/access/rmgrdesc/nbtdesc.c
index a14eb79..243e464 100644
--- a/src/backend/access/rmgrdesc/nbtdesc.c
+++ b/src/backend/access/rmgrdesc/nbtdesc.c
@@ -31,6 +31,7 @@ btree_desc(StringInfo buf, XLogReaderState *record)
xl_btree_insert *xlrec = (xl_btree_insert *) rec;
appendStringInfo(buf, "off %u", xlrec->offnum);
+ appendStringInfo(buf, "; origtup_off %lu", xlrec->origtup_off);
break;
}
case XLOG_BTREE_SPLIT_L:
@@ -46,8 +47,10 @@ btree_desc(StringInfo buf, XLogReaderState *record)
{
xl_btree_vacuum *xlrec = (xl_btree_vacuum *) rec;
- appendStringInfo(buf, "lastBlockVacuumed %u",
- xlrec->lastBlockVacuumed);
+ appendStringInfo(buf, "lastBlockVacuumed %u; nremaining %u; ndeleted %u",
+ xlrec->lastBlockVacuumed,
+ xlrec->nremaining,
+ xlrec->ndeleted);
break;
}
case XLOG_BTREE_DELETE:
diff --git a/src/include/access/itup.h b/src/include/access/itup.h
index 744ffb6..b10c0d5 100644
--- a/src/include/access/itup.h
+++ b/src/include/access/itup.h
@@ -141,6 +141,10 @@ typedef IndexAttributeBitMapData * IndexAttributeBitMap;
* On such a page, N tuples could take one MAXALIGN quantum less space than
* estimated here, seemingly allowing one more tuple than estimated here.
* But such a page always has at least MAXALIGN special space, so we're safe.
+ *
+ * Note: btree leaf pages may contain posting tuples, which store duplicates
+ * more efficiently, so such pages may hold more tuples than estimated here.
+ * Use MaxPostingIndexTuplesPerPage instead.
*/
#define MaxIndexTuplesPerPage \
((int) ((BLCKSZ - SizeOfPageHeaderData) / \
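Note on the two per-page bounds: the itup.h comment above refers to MaxPostingIndexTuplesPerPage, which the nbtree.h part of this patch defines as the page space left after three minimal tuples, divided by the size of one TID. A rough sketch comparing the two bounds follows. All constants are assumptions for a default build (8 kB block, 24-byte page header, 8-byte index tuple header, 4-byte line pointer, 6-byte TID, 8-byte MAXALIGN); the real values come from the PostgreSQL headers, and the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed values for a default build; the real ones come from the headers. */
#define BLOCK_SIZE 8192         /* BLCKSZ */
#define PAGE_HEADER_SIZE 24     /* SizeOfPageHeaderData */
#define INDEX_TUPLE_HDR 8       /* sizeof(IndexTupleData) */
#define ITEM_ID_SIZE 4          /* sizeof(ItemIdData) */
#define TID_SIZE 6              /* sizeof(ItemPointerData) */
#define MAX_ALIGNMENT 8         /* MAXALIGN boundary */

/* Round len up to the next multiple of MAX_ALIGNMENT. */
static size_t
max_align(size_t len)
{
    return (len + MAX_ALIGNMENT - 1) & ~((size_t) (MAX_ALIGNMENT - 1));
}

/* Bound for ordinary tuples: one minimal tuple plus one line pointer each. */
static int
max_index_tuples_per_page(void)
{
    size_t per_tuple = max_align(INDEX_TUPLE_HDR + 1) + ITEM_ID_SIZE;

    return (int) ((BLOCK_SIZE - PAGE_HEADER_SIZE) / per_tuple);
}

/* Bound with posting lists: three minimal tuples, the rest filled with TIDs. */
static int
max_posting_index_tuples_per_page(void)
{
    size_t three_tuples = 3 * (max_align(INDEX_TUPLE_HDR + 1) + ITEM_ID_SIZE);

    return (int) ((BLOCK_SIZE - PAGE_HEADER_SIZE - three_tuples) / TID_SIZE);
}
```

Under these assumptions the ordinary bound is 408 items per page and the posting bound is 1351, which is why the items[] array sizing and the assertions in _bt_readpage() switch to the larger constant.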
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 52eafe6..d2700fc 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -234,8 +234,7 @@ typedef struct BTMetaPageData
* t_tid | t_info | key values | INCLUDE columns, if any
*
* t_tid points to the heap TID, which is a tiebreaker key column as of
- * BTREE_VERSION 4. Currently, the INDEX_ALT_TID_MASK status bit is never
- * set for non-pivot tuples.
+ * BTREE_VERSION 4.
*
* All other types of index tuples ("pivot" tuples) only have key columns,
* since pivot tuples only exist to represent how the key space is
@@ -252,6 +251,39 @@ typedef struct BTMetaPageData
* omitted rather than truncated, since its representation is different to
* the non-pivot representation.)
*
+ * Non-pivot posting tuple format:
+ * t_tid | t_info | key values | INCLUDE columns, if any | posting_list[]
+ *
+ * In order to store duplicate keys more efficiently, we use a special
+ * tuple format: posting tuples.
+ * posting_list is an array of ItemPointerData.
+ *
+ * This type of compression never applies to system indexes, unique indexes
+ * or indexes with INCLUDEd columns.
+ *
+ * To distinguish posting tuples we use the INDEX_ALT_TID_MASK flag in t_info
+ * and the BT_IS_POSTING flag in t_tid.
+ * These flags redefine the content of the posting tuple's t_tid:
+ * - t_tid.ip_blkid contains the offset of the posting list.
+ * - the t_tid offset field contains the number of posting items in the tuple.
+ *
+ * The 12 least significant offset bits from t_tid are used to represent
+ * the number of posting items in posting tuples, leaving 4 status
+ * bits (BT_RESERVED_OFFSET_MASK bits), 3 of which are reserved for
+ * future use.
+ * BT_N_POSTING_OFFSET_MASK is large enough to store any number of posting
+ * items, which is constrained by BTMaxItemSize.
+ *
+ * If a page contains so many duplicates that they do not fit into one
+ * posting tuple (bounded by BTMaxItemSize and ), the page may contain
+ * several posting tuples with the same key.
+ * A page can also contain both posting and non-posting tuples with the same key.
+ * Currently, posting tuples always contain at least two TIDs in the posting
+ * list.
+ *
+ * Posting tuples always have the same number of attributes as the index has
+ * generally.
+ *
* Pivot tuple format:
*
* t_tid | t_info | key values | [heap TID]
@@ -281,23 +313,144 @@ typedef struct BTMetaPageData
* bits (BT_RESERVED_OFFSET_MASK bits), 3 of which that are reserved for
* future use. BT_N_KEYS_OFFSET_MASK should be large enough to store any
* number of columns/attributes <= INDEX_MAX_KEYS.
+ * The BT_IS_POSTING bit must be unset for pivot tuples, since we use it
+ * to distinguish posting tuples from pivot tuples.
*
* Note well: The macros that deal with the number of attributes in tuples
- * assume that a tuple with INDEX_ALT_TID_MASK set must be a pivot tuple,
- * and that a tuple without INDEX_ALT_TID_MASK set must be a non-pivot
- * tuple (or must have the same number of attributes as the index has
- * generally in the case of !heapkeyspace indexes). They will need to be
- * updated if non-pivot tuples ever get taught to use INDEX_ALT_TID_MASK
- * for something else.
+ * assume that a tuple with INDEX_ALT_TID_MASK set must be a pivot tuple or
+ * non-pivot posting tuple, and that a tuple without INDEX_ALT_TID_MASK set
+ * must be a non-pivot tuple (or must have the same number of attributes as
+ * the index has generally in the case of !heapkeyspace indexes).
*/
#define INDEX_ALT_TID_MASK INDEX_AM_RESERVED_BIT
/* Item pointer offset bits */
#define BT_RESERVED_OFFSET_MASK 0xF000
#define BT_N_KEYS_OFFSET_MASK 0x0FFF
+#define BT_N_POSTING_OFFSET_MASK 0x0FFF
#define BT_HEAP_TID_ATTR 0x1000
+#define BT_IS_POSTING 0x2000
+
+#define BTreeTupleIsPosting(itup) \
+ ( \
+ ((itup)->t_info & INDEX_ALT_TID_MASK && \
+ ((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) != 0))\
+ )
+
+#define BTreeTupleIsPivot(itup) \
+ ( \
+ ((itup)->t_info & INDEX_ALT_TID_MASK && \
+ ((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) == 0))\
+ )
+
+/*
+ * MaxPostingIndexTuplesPerPage is an upper bound on the number of tuples
+ * that can fit on one btree leaf page.
+ *
+ * Btree leaf pages may contain posting tuples, which store duplicates
+ * in a more effective way, so MaxPostingIndexTuplesPerPage is larger than
+ * MaxIndexTuplesPerPage.
+ *
+ * Each leaf page must contain at least three items, so estimate it as
+ * if we have three posting tuples with minimal size keys.
+ */
+#define MaxPostingIndexTuplesPerPage \
+ ((int) ((BLCKSZ - SizeOfPageHeaderData - \
+ 3*((MAXALIGN(sizeof(IndexTupleData) + 1) + sizeof(ItemIdData))) )) / \
+ (sizeof(ItemPointerData)))
+
+/*
+ * Btree-private state needed to build posting tuples.
+ * ipd is a posting list - an array of ItemPointerData.
+ *
+ * Iterating over tuples during index build or applying compression to a
+ * single page, we remember a tuple in itupprev, then compare the next one
+ * with it. If tuples are equal, save their TIDs in the posting list.
+ * ntuples contains the size of the posting list.
+ *
+ * Use maxitemsize and maxpostingsize to ensure that resulting posting tuple
+ * will satisfy BTMaxItemSize.
+ */
+typedef struct BTCompressState
+{
+ Size maxitemsize;
+ Size maxpostingsize;
+ IndexTuple itupprev;
+ int ntuples;
+ ItemPointerData *ipd;
+} BTCompressState;
+
+/* macros to work with posting tuples *BEGIN* */
+#define BTreeTupleSetBtIsPosting(itup) \
+ do { \
+ Assert((itup)->t_info & INDEX_ALT_TID_MASK); \
+ Assert(!((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) != 0)); \
+ ItemPointerSetOffsetNumber(&(itup)->t_tid, \
+ ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) | BT_IS_POSTING); \
+ } while(0)
+
+#define BTreeTupleClearBtIsPosting(itup) \
+ do { \
+ ItemPointerSetOffsetNumber(&(itup)->t_tid, \
+ ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & ~BT_IS_POSTING); \
+ } while(0)
+
+#define BTreeTupleGetNPosting(itup) \
+ ( \
+ AssertMacro(BTreeTupleIsPosting(itup)), \
+ ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_N_POSTING_OFFSET_MASK \
+ )
+
+#define BTreeTupleSetNPosting(itup, n) \
+ do { \
+ ItemPointerSetOffsetNumber(&(itup)->t_tid, (n) & BT_N_POSTING_OFFSET_MASK); \
+ BTreeTupleSetBtIsPosting(itup); \
+ } while(0)
+
+/*
+ * If tuple is posting, t_tid.ip_blkid contains offset of the posting list.
+ * Caller is responsible for checking BTreeTupleIsPosting to ensure that it
+ * will get what is expected.
+ */
+#define BTreeTupleGetPostingOffset(itup) \
+ ( \
+ AssertMacro(BTreeTupleIsPosting(itup)), \
+ ItemPointerGetBlockNumberNoCheck(&((itup)->t_tid)) \
+ )
+#define BTreeTupleSetPostingOffset(itup, offset) \
+ ( \
+ AssertMacro(BTreeTupleIsPosting(itup)), \
+ ItemPointerSetBlockNumber(&((itup)->t_tid), (offset)) \
+ )
+#define BTreeSetPostingMeta(itup, nposting, off) \
+ do { \
+ BTreeTupleSetNPosting(itup, nposting); \
+ BTreeTupleSetPostingOffset(itup, off); \
+ } while(0)
+
+#define BTreeTupleGetPosting(itup) \
+ (ItemPointerData*) ((char*)(itup) + BTreeTupleGetPostingOffset(itup))
+#define BTreeTupleGetPostingN(itup,n) \
+ (ItemPointerData*) (BTreeTupleGetPosting(itup) + (n))
+
+/*
+ * Posting tuples always contain more than one TID. The minimum TID can be
+ * accessed using BTreeTupleGetHeapTID(). The maximum is accessed using
+ * BTreeTupleGetMaxTID().
+ */
+#define BTreeTupleGetMaxTID(itup) \
+ ( \
+ ((itup)->t_info & INDEX_ALT_TID_MASK && \
+ ((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING))) ? \
+ ( \
+ (ItemPointer) (BTreeTupleGetPosting(itup) + (BTreeTupleGetNPosting(itup)-1)) \
+ ) \
+ : \
+ (ItemPointer) &((itup)->t_tid) \
+ )
+/* macros to work with posting tuples *END* */
-/* Get/set downlink block number */
+/* Get/set downlink block number */
#define BTreeInnerTupleGetDownLink(itup) \
ItemPointerGetBlockNumberNoCheck(&((itup)->t_tid))
#define BTreeInnerTupleSetDownLink(itup, blkno) \
@@ -326,7 +479,8 @@ typedef struct BTMetaPageData
*/
#define BTreeTupleGetNAtts(itup, rel) \
( \
- (itup)->t_info & INDEX_ALT_TID_MASK ? \
+ ((itup)->t_info & INDEX_ALT_TID_MASK && \
+ ((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) == 0)) ? \
( \
ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_N_KEYS_OFFSET_MASK \
) \
@@ -335,6 +489,7 @@ typedef struct BTMetaPageData
)
#define BTreeTupleSetNAtts(itup, n) \
do { \
+ Assert(!BTreeTupleIsPosting(itup)); \
(itup)->t_info |= INDEX_ALT_TID_MASK; \
ItemPointerSetOffsetNumber(&(itup)->t_tid, (n) & BT_N_KEYS_OFFSET_MASK); \
} while(0)
@@ -342,6 +497,8 @@ typedef struct BTMetaPageData
/*
* Get tiebreaker heap TID attribute, if any. Macro works with both pivot
* and non-pivot tuples, despite differences in how heap TID is represented.
+ *
+ * For non-pivot posting tuples this returns the first tid from posting list.
*/
#define BTreeTupleGetHeapTID(itup) \
( \
@@ -351,7 +508,10 @@ typedef struct BTMetaPageData
(ItemPointer) (((char *) (itup) + IndexTupleSize(itup)) - \
sizeof(ItemPointerData)) \
) \
- : (itup)->t_info & INDEX_ALT_TID_MASK ? NULL : (ItemPointer) &((itup)->t_tid) \
+ : (itup)->t_info & INDEX_ALT_TID_MASK ? \
+ (((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) != 0) ? \
+ (ItemPointer) BTreeTupleGetPosting(itup) : NULL) \
+ : (ItemPointer) &((itup)->t_tid) \
)
/*
* Set the heap TID attribute for a tuple that uses the INDEX_ALT_TID_MASK
@@ -360,6 +520,7 @@ typedef struct BTMetaPageData
#define BTreeTupleSetAltHeapTID(itup) \
do { \
Assert((itup)->t_info & INDEX_ALT_TID_MASK); \
+ Assert(!((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) != 0)); \
ItemPointerSetOffsetNumber(&(itup)->t_tid, \
ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) | BT_HEAP_TID_ATTR); \
} while(0)
@@ -500,6 +661,12 @@ typedef struct BTInsertStateData
Buffer buf;
/*
+ * if _bt_binsrch_insert() found the location inside existing posting
+ * list, save the position inside the list.
+ */
+ int in_posting_offset;
+
+ /*
* Cache of bounds within the current buffer. Only used for insertions
* where _bt_check_unique is called. See _bt_binsrch_insert and
* _bt_findinsertloc for details.
@@ -566,6 +733,8 @@ typedef struct BTScanPosData
* location in the associated tuple storage workspace.
*/
int nextTupleOffset;
+ /* prevTupleOffset is for posting list handling */
+ int prevTupleOffset;
/*
* The items array is always ordered in index order (ie, increasing
@@ -578,7 +747,7 @@ typedef struct BTScanPosData
int lastItem; /* last valid index in items[] */
int itemIndex; /* current index in items[] */
- BTScanPosItem items[MaxIndexTuplesPerPage]; /* MUST BE LAST */
+ BTScanPosItem items[MaxPostingIndexTuplesPerPage]; /* MUST BE LAST */
} BTScanPosData;
typedef BTScanPosData *BTScanPos;
@@ -732,7 +901,9 @@ extern bool _bt_doinsert(Relation rel, IndexTuple itup,
IndexUniqueCheck checkUnique, Relation heapRel);
extern Buffer _bt_getstackbuf(Relation rel, BTStack stack, BlockNumber child);
extern void _bt_finish_split(Relation rel, Buffer bbuf, BTStack stack);
-
+extern void _bt_replace_and_insert(Buffer buf, Page page,
+ IndexTuple neworigtup, IndexTuple newitup,
+ OffsetNumber newitemoff, bool need_xlog);
/*
* prototypes for functions in nbtsplitloc.c
*/
@@ -762,6 +933,8 @@ extern void _bt_delitems_delete(Relation rel, Buffer buf,
OffsetNumber *itemnos, int nitems, Relation heapRel);
extern void _bt_delitems_vacuum(Relation rel, Buffer buf,
OffsetNumber *itemnos, int nitems,
+ OffsetNumber *remainingoffset,
+ IndexTuple *remaining, int nremaining,
BlockNumber lastBlockVacuumed);
extern int _bt_pagedel(Relation rel, Buffer buf);
@@ -774,6 +947,8 @@ extern Buffer _bt_moveright(Relation rel, BTScanInsert key, Buffer buf,
bool forupdate, BTStack stack, int access, Snapshot snapshot);
extern OffsetNumber _bt_binsrch_insert(Relation rel, BTInsertState insertstate);
extern int32 _bt_compare(Relation rel, BTScanInsert key, Page page, OffsetNumber offnum);
+extern int32 _bt_compare_posting(Relation rel, BTScanInsert key, Page page,
+ OffsetNumber offnum, int *in_posting_offset);
extern bool _bt_first(IndexScanDesc scan, ScanDirection dir);
extern bool _bt_next(IndexScanDesc scan, ScanDirection dir);
extern Buffer _bt_get_endpoint(Relation rel, uint32 level, bool rightmost,
@@ -812,6 +987,9 @@ extern bool _bt_check_natts(Relation rel, bool heapkeyspace, Page page,
OffsetNumber offnum);
extern void _bt_check_third_page(Relation rel, Relation heap,
bool needheaptidspace, Page page, IndexTuple newtup);
+extern IndexTuple BTreeFormPostingTuple(IndexTuple tuple, ItemPointerData *ipd,
+ int nipd);
+extern IndexTuple BTreeGetNthTupleOfPosting(IndexTuple tuple, int n);
/*
* prototypes for functions in nbtvalidate.c
@@ -824,5 +1002,7 @@ extern bool btvalidate(Oid opclassoid);
extern IndexBuildResult *btbuild(Relation heap, Relation index,
struct IndexInfo *indexInfo);
extern void _bt_parallel_build_main(dsm_segment *seg, shm_toc *toc);
+extern void _bt_add_posting_item(BTCompressState *compressState,
+ IndexTuple itup);
#endif /* NBTREE_H */
diff --git a/src/include/access/nbtxlog.h b/src/include/access/nbtxlog.h
index afa614d..f1ef584 100644
--- a/src/include/access/nbtxlog.h
+++ b/src/include/access/nbtxlog.h
@@ -61,16 +61,24 @@ typedef struct xl_btree_metadata
* This data record is used for INSERT_LEAF, INSERT_UPPER, INSERT_META.
* Note that INSERT_META implies it's not a leaf page.
*
- * Backup Blk 0: original page (data contains the inserted tuple)
+ * Backup Blk 0: original page (data contains the inserted tuple);
+ * if origtup_off is not 0, data also contains 'neworigtup' -
+ * tuple to replace original (see comments in _bt_replace_and_insert()).
+ * TODO probably it would be enough to keep just a flag to point out that
+ * data contains 'neworigtup' and compute its offset
+ * as we know it follows the tuple, but I am afraid that
+ * it will break alignment, will it?
* Backup Blk 1: child's left sibling, if INSERT_UPPER or INSERT_META
* Backup Blk 2: xl_btree_metadata, if INSERT_META
+ *
*/
typedef struct xl_btree_insert
{
OffsetNumber offnum;
+ Size origtup_off;
} xl_btree_insert;
-#define SizeOfBtreeInsert (offsetof(xl_btree_insert, offnum) + sizeof(OffsetNumber))
+#define SizeOfBtreeInsert (offsetof(xl_btree_insert, origtup_off) + sizeof(Size))
/*
* On insert with split, we save all the items going into the right sibling
@@ -96,6 +104,12 @@ typedef struct xl_btree_insert
* An IndexTuple representing the high key of the left page must follow with
* either variant.
*
+ * If the split included an insertion into the middle of a posting tuple, and
+ * thus required posting tuple replacement, the record also contains 'neworigtup',
+ * which must replace the original posting tuple at offset replaceitemoff.
+ * TODO further optimization is to add it to xlog only if it remains on the
+ * left page.
+ *
* Backup Blk 1: new right page
*
* The right page's data portion contains the right page's tuples in the form
@@ -113,9 +127,10 @@ typedef struct xl_btree_split
uint32 level; /* tree level of page being split */
OffsetNumber firstright; /* first item moved to right page */
OffsetNumber newitemoff; /* new item's offset (if placed on left page) */
+ OffsetNumber replaceitemoff; /* offset of the posting item to replace with (neworigtup) */
} xl_btree_split;
-#define SizeOfBtreeSplit (offsetof(xl_btree_split, newitemoff) + sizeof(OffsetNumber))
+#define SizeOfBtreeSplit (offsetof(xl_btree_split, replaceitemoff) + sizeof(OffsetNumber))
/*
* This is what we need to know about delete of individual leaf index tuples.
@@ -173,10 +188,19 @@ typedef struct xl_btree_vacuum
{
BlockNumber lastBlockVacuumed;
- /* TARGET OFFSET NUMBERS FOLLOW */
+ /*
+ * This field helps us to find beginning of the remaining tuples from
+ * postings which follow array of offset numbers.
+ */
+ uint32 nremaining;
+ uint32 ndeleted;
+
+ /* REMAINING OFFSET NUMBERS FOLLOW (nremaining values) */
+ /* REMAINING TUPLES TO INSERT FOLLOW (if nremaining > 0) */
+ /* TARGET OFFSET NUMBERS FOLLOW (if any) */
} xl_btree_vacuum;
-#define SizeOfBtreeVacuum (offsetof(xl_btree_vacuum, lastBlockVacuumed) + sizeof(BlockNumber))
+#define SizeOfBtreeVacuum (offsetof(xl_btree_vacuum, ndeleted) + sizeof(uint32))
/*
* This is what we need to know about marking an empty branch for deletion.
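The offset-field encoding described in the nbtree.h comments of the patch above can be sketched in isolation. This is only an illustration: the two mask values are copied from the patch, but the helper names are hypothetical, a plain uint16_t stands in for the t_tid offset field, and the INDEX_ALT_TID_MASK check on t_info that the real macros also perform is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Mask values as they appear in the patch's nbtree.h hunk. */
#define BT_N_POSTING_OFFSET_MASK 0x0FFF
#define BT_IS_POSTING            0x2000

/* Hypothetical helper: store a posting-item count and set the posting flag. */
static inline uint16_t
pack_posting_offset(unsigned nposting)
{
    /* Only 12 bits are available for the count. */
    assert(nposting <= BT_N_POSTING_OFFSET_MASK);
    return (uint16_t) ((nposting & BT_N_POSTING_OFFSET_MASK) | BT_IS_POSTING);
}

/* Hypothetical helper: is the posting status bit set? */
static inline bool
offset_is_posting(uint16_t off)
{
    return (off & BT_IS_POSTING) != 0;
}

/* Hypothetical helper: extract the count from the 12 low bits. */
static inline unsigned
offset_nposting(uint16_t off)
{
    return off & BT_N_POSTING_OFFSET_MASK;
}
```

With these definitions, packing a count of 3 gives 0x2003, while a value such as 0x1005 (BT_HEAP_TID_ATTR set, BT_IS_POSTING clear) is not treated as a posting tuple.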
On Thu, Aug 29, 2019 at 5:13 AM Anastasia Lubennikova
<a.lubennikova@postgrespro.ru> wrote:
Your explanation helped me to understand that this approach can be
extended to
the case of insertion into posting list, that doesn't trigger posting
split,
and that nbtsplitloc indeed doesn't need to know about posting tuples
specific.
The code is much cleaner now.
Fantastic!
Some individual indexes are larger, some are smaller compared to the
expected output.
I agree that v9 might be ever so slightly more space efficient than v5
was, on balance. In any case v9 completely fixes the regression that I
saw in the last version. I have pushed the changes to the test output
for the serial tests that I privately maintain, that I gave you access
to. The MGD test output also looks perfect.
We may find that deduplication is a little too effective, in the sense
that it packs so many tuples on to leaf pages that *concurrent*
inserters will tend to get excessive page splits. We may find that it
makes sense to aim for posting lists that are maybe 96% of
BTMaxItemSize() -- note that BTREE_SINGLEVAL_FILLFACTOR is 96 for this
reason. Concurrent inserters will tend to have heap TIDs that are
slightly out of order, so we want to at least have enough space
remaining on the left half of a "single value mode" split. We may end
up with a design where deduplication anticipates what will be useful
for nbtsplitloc.c.
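As a rough illustration of the 96% idea, the arithmetic might look like the sketch below. The function name and the overhead parameter are hypothetical, and the 6-byte TID size is an assumption matching sizeof(ItemPointerData) on common platforms; none of this is taken from the patch.

```c
#include <assert.h>
#include <stddef.h>

#define SINGLEVAL_FILLFACTOR 96  /* same percentage as BTREE_SINGLEVAL_FILLFACTOR */
#define TID_SIZE             6   /* assumed sizeof(ItemPointerData) */

/*
 * Hypothetical helper: how many heap TIDs fit in a posting list if the
 * posting tuple is capped at 96% of the maximum item size.
 * key_overhead is the tuple header plus key data, in bytes.
 */
static inline size_t
max_posting_tids(size_t max_item_size, size_t key_overhead)
{
    size_t budget = max_item_size * SINGLEVAL_FILLFACTOR / 100;

    if (budget <= key_overhead)
        return 0;
    return (budget - key_overhead) / TID_SIZE;
}
```

For an assumed maximum item size of 2704 bytes and 16 bytes of overhead, this yields 429 TIDs instead of the 448 that would fit without the cap, which leaves some room for slightly out-of-order heap TIDs from concurrent inserters.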
I still think that it's too early to start worrying about problems
like this one -- I feel it will be useful to continue to focus on the
code and the space utilization of the serial test cases for now. We
can look at it at the same time that we think about adding back
something like BT_COMPRESS_THRESHOLD. I am mentioning it now because
it's probably a good time for you to start thinking about it, if you
haven't already (actually, maybe I'm just describing what
BT_COMPRESS_THRESHOLD was supposed to do in the first place). We'll
need to have a good benchmark to assess these questions, and it's not
obvious what that will be. Two possible candidates are TPC-H and
TPC-E. (Of course, I mean running them for real -- not using their
indexes to make sure that the nbtsplitloc.c stuff works well in
isolation.)
Any thoughts on a conventional benchmark that allows us to understand
the patch's impact on both throughput and latency?
BTW, I notice that we often have indexes that are quite a lot smaller
when they were created with retail insertions rather than with CREATE
INDEX/REINDEX. This is not new, but the difference is much larger than
it typically is without the patch. For example, the TPC-E index on
trade.t_ca_id (which is named "i_t_ca_id" or "i_t_ca_id2" in my test)
is 162 MB with CREATE INDEX/REINDEX, and 121 MB with retail insertions
(assuming the insertions use the actual order from the test). I'm not
sure what to do about this, if anything. I mean, the reason that the
retail insertions do better is that they have the nbtsplitloc.c stuff,
and because we don't split the page until it's 100% full and until
deduplication stops helping -- we could apply several rounds of
deduplication before we actually have to split the page. So the
difference that we see here is both logical and surprising.
How do you feel about this CREATE INDEX index-size-is-larger business?
--
Peter Geoghegan
On Thu, Aug 29, 2019 at 5:07 PM Peter Geoghegan <pg@bowt.ie> wrote:
I agree that v9 might be ever so slightly more space efficient than v5
was, on balance.
I see some Valgrind errors on v9, all of which look like the following
two sample errors I go into below.
First one:
==11193== VALGRINDERROR-BEGIN
==11193== Unaddressable byte(s) found during client check request
==11193== at 0x4C0E03: PageAddItemExtended (bufpage.c:332)
==11193== by 0x20F6C3: _bt_split (nbtinsert.c:1643)
==11193== by 0x20F6C3: _bt_insertonpg (nbtinsert.c:1206)
==11193== by 0x21239B: _bt_doinsert (nbtinsert.c:306)
==11193== by 0x2150EE: btinsert (nbtree.c:207)
==11193== by 0x20D63A: index_insert (indexam.c:186)
==11193== by 0x36B7F2: ExecInsertIndexTuples (execIndexing.c:393)
==11193== by 0x391793: ExecInsert (nodeModifyTable.c:593)
==11193== by 0x3924DC: ExecModifyTable (nodeModifyTable.c:2219)
==11193== by 0x37306D: ExecProcNodeFirst (execProcnode.c:445)
==11193== by 0x36C738: ExecProcNode (executor.h:240)
==11193== by 0x36C738: ExecutePlan (execMain.c:1648)
==11193== by 0x36C738: standard_ExecutorRun (execMain.c:365)
==11193== by 0x36C7DD: ExecutorRun (execMain.c:309)
==11193== by 0x4CC41A: ProcessQuery (pquery.c:161)
==11193== by 0x4CC5EB: PortalRunMulti (pquery.c:1283)
==11193== by 0x4CD31C: PortalRun (pquery.c:796)
==11193== by 0x4C8EFC: exec_simple_query (postgres.c:1231)
==11193== by 0x4C9EE0: PostgresMain (postgres.c:4256)
==11193== by 0x453650: BackendRun (postmaster.c:4446)
==11193== by 0x453650: BackendStartup (postmaster.c:4137)
==11193== by 0x453650: ServerLoop (postmaster.c:1704)
==11193== by 0x454CAC: PostmasterMain (postmaster.c:1377)
==11193== by 0x3B85A1: main (main.c:210)
==11193== Address 0x9c11350 is 0 bytes after a recently re-allocated
block of size 8,192 alloc'd
==11193== at 0x4C2FB0F: malloc (in
/usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==11193== by 0x61085A: AllocSetAlloc (aset.c:914)
==11193== by 0x617AD8: palloc (mcxt.c:938)
==11193== by 0x21A829: _bt_mkscankey (nbtutils.c:107)
==11193== by 0x2118F3: _bt_doinsert (nbtinsert.c:93)
==11193== by 0x2150EE: btinsert (nbtree.c:207)
==11193== by 0x20D63A: index_insert (indexam.c:186)
==11193== by 0x36B7F2: ExecInsertIndexTuples (execIndexing.c:393)
==11193== by 0x391793: ExecInsert (nodeModifyTable.c:593)
==11193== by 0x3924DC: ExecModifyTable (nodeModifyTable.c:2219)
==11193== by 0x37306D: ExecProcNodeFirst (execProcnode.c:445)
==11193== by 0x36C738: ExecProcNode (executor.h:240)
==11193== by 0x36C738: ExecutePlan (execMain.c:1648)
==11193== by 0x36C738: standard_ExecutorRun (execMain.c:365)
==11193== by 0x36C7DD: ExecutorRun (execMain.c:309)
==11193== by 0x4CC41A: ProcessQuery (pquery.c:161)
==11193== by 0x4CC5EB: PortalRunMulti (pquery.c:1283)
==11193== by 0x4CD31C: PortalRun (pquery.c:796)
==11193== by 0x4C8EFC: exec_simple_query (postgres.c:1231)
==11193== by 0x4C9EE0: PostgresMain (postgres.c:4256)
==11193== by 0x453650: BackendRun (postmaster.c:4446)
==11193== by 0x453650: BackendStartup (postmaster.c:4137)
==11193== by 0x453650: ServerLoop (postmaster.c:1704)
==11193== by 0x454CAC: PostmasterMain (postmaster.c:1377)
==11193==
==11193== VALGRINDERROR-END
{
<insert_a_suppression_name_here>
Memcheck:User
fun:PageAddItemExtended
fun:_bt_split
fun:_bt_insertonpg
fun:_bt_doinsert
fun:btinsert
fun:index_insert
fun:ExecInsertIndexTuples
fun:ExecInsert
fun:ExecModifyTable
fun:ExecProcNodeFirst
fun:ExecProcNode
fun:ExecutePlan
fun:standard_ExecutorRun
fun:ExecutorRun
fun:ProcessQuery
fun:PortalRunMulti
fun:PortalRun
fun:exec_simple_query
fun:PostgresMain
fun:BackendRun
fun:BackendStartup
fun:ServerLoop
fun:PostmasterMain
fun:main
}
nbtinsert.c:1643 is the first PageAddItem() in _bt_split() -- the
lefthikey call.
Second one:
==11193== VALGRINDERROR-BEGIN
==11193== Invalid read of size 2
==11193== at 0x20FDF5: _bt_insertonpg (nbtinsert.c:1126)
==11193== by 0x21239B: _bt_doinsert (nbtinsert.c:306)
==11193== by 0x2150EE: btinsert (nbtree.c:207)
==11193== by 0x20D63A: index_insert (indexam.c:186)
==11193== by 0x36B7F2: ExecInsertIndexTuples (execIndexing.c:393)
==11193== by 0x391793: ExecInsert (nodeModifyTable.c:593)
==11193== by 0x3924DC: ExecModifyTable (nodeModifyTable.c:2219)
==11193== by 0x37306D: ExecProcNodeFirst (execProcnode.c:445)
==11193== by 0x36C738: ExecProcNode (executor.h:240)
==11193== by 0x36C738: ExecutePlan (execMain.c:1648)
==11193== by 0x36C738: standard_ExecutorRun (execMain.c:365)
==11193== by 0x36C7DD: ExecutorRun (execMain.c:309)
==11193== by 0x4CC41A: ProcessQuery (pquery.c:161)
==11193== by 0x4CC5EB: PortalRunMulti (pquery.c:1283)
==11193== by 0x4CD31C: PortalRun (pquery.c:796)
==11193== by 0x4C8EFC: exec_simple_query (postgres.c:1231)
==11193== by 0x4C9EE0: PostgresMain (postgres.c:4256)
==11193== by 0x453650: BackendRun (postmaster.c:4446)
==11193== by 0x453650: BackendStartup (postmaster.c:4137)
==11193== by 0x453650: ServerLoop (postmaster.c:1704)
==11193== by 0x454CAC: PostmasterMain (postmaster.c:1377)
==11193== by 0x3B85A1: main (main.c:210)
==11193== Address 0x9905b90 is 11,088 bytes inside a recently
re-allocated block of size 524,288 alloc'd
==11193== at 0x4C2FB0F: malloc (in
/usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==11193== by 0x61085A: AllocSetAlloc (aset.c:914)
==11193== by 0x617AD8: palloc (mcxt.c:938)
==11193== by 0x1C5677: CopyIndexTuple (indextuple.c:508)
==11193== by 0x20E887: _bt_compress_one_page (nbtinsert.c:2751)
==11193== by 0x21241E: _bt_findinsertloc (nbtinsert.c:773)
==11193== by 0x21241E: _bt_doinsert (nbtinsert.c:303)
==11193== by 0x2150EE: btinsert (nbtree.c:207)
==11193== by 0x20D63A: index_insert (indexam.c:186)
==11193== by 0x36B7F2: ExecInsertIndexTuples (execIndexing.c:393)
==11193== by 0x391793: ExecInsert (nodeModifyTable.c:593)
==11193== by 0x3924DC: ExecModifyTable (nodeModifyTable.c:2219)
==11193== by 0x37306D: ExecProcNodeFirst (execProcnode.c:445)
==11193== by 0x36C738: ExecProcNode (executor.h:240)
==11193== by 0x36C738: ExecutePlan (execMain.c:1648)
==11193== by 0x36C738: standard_ExecutorRun (execMain.c:365)
==11193== by 0x36C7DD: ExecutorRun (execMain.c:309)
==11193== by 0x4CC41A: ProcessQuery (pquery.c:161)
==11193== by 0x4CC5EB: PortalRunMulti (pquery.c:1283)
==11193== by 0x4CD31C: PortalRun (pquery.c:796)
==11193== by 0x4C8EFC: exec_simple_query (postgres.c:1231)
==11193== by 0x4C9EE0: PostgresMain (postgres.c:4256)
==11193== by 0x453650: BackendRun (postmaster.c:4446)
==11193== by 0x453650: BackendStartup (postmaster.c:4137)
==11193== by 0x453650: ServerLoop (postmaster.c:1704)
==11193==
==11193== VALGRINDERROR-END
{
<insert_a_suppression_name_here>
Memcheck:Addr2
fun:_bt_insertonpg
fun:_bt_doinsert
fun:btinsert
fun:index_insert
fun:ExecInsertIndexTuples
fun:ExecInsert
fun:ExecModifyTable
fun:ExecProcNodeFirst
fun:ExecProcNode
fun:ExecutePlan
fun:standard_ExecutorRun
fun:ExecutorRun
fun:ProcessQuery
fun:PortalRunMulti
fun:PortalRun
fun:exec_simple_query
fun:PostgresMain
fun:BackendRun
fun:BackendStartup
fun:ServerLoop
fun:PostmasterMain
fun:main
}
nbtinsert.c:1126 is this code from _bt_insertonpg():
elog(DEBUG4, "dest before (%u,%u)",
ItemPointerGetBlockNumberNoCheck((ItemPointer) dest),
ItemPointerGetOffsetNumberNoCheck((ItemPointer) dest));
This is probably harmless, but it needs to be fixed.
--
Peter Geoghegan
On Thu, Aug 29, 2019 at 10:10 PM Peter Geoghegan <pg@bowt.ie> wrote:
I see some Valgrind errors on v9, all of which look like the following
two sample errors I go into below.
I've found a fix for these Valgrind issues. It's a matter of making
sure that _bt_truncate() sizes new pivot tuples properly, which is
quite subtle:
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -2155,8 +2155,11 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
{
BTreeTupleClearBtIsPosting(pivot);
BTreeTupleSetNAtts(pivot, keepnatts);
- pivot->t_info &= ~INDEX_SIZE_MASK;
- pivot->t_info |= BTreeTupleGetPostingOffset(firstright);
+ if (keepnatts == natts)
+ {
+ pivot->t_info &= ~INDEX_SIZE_MASK;
+ pivot->t_info |= MAXALIGN(BTreeTupleGetPostingOffset(firstright));
+ }
}
I'm varying how the new pivot tuple is sized here according to whether
or not index_truncate_tuple() just does a CopyIndexTuple(). This very
slightly changes the behavior of the nbtsplitloc.c stuff, but that's
not a concern for me.
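The sizing rule in that hunk can be sketched on its own. This is a simplified model, not the patch's code: the function and macro names are hypothetical, an 8-byte maximum alignment is assumed, and sizes are plain integers rather than fields of t_info.

```c
#include <assert.h>
#include <stddef.h>

#define ALIGNOF_MAX 8  /* assumption: 8-byte maximum alignment */

/* Round len up to the next multiple of ALIGNOF_MAX. */
#define MAXALIGN_SKETCH(len) \
    (((size_t) (len) + (ALIGNOF_MAX - 1)) & ~((size_t) (ALIGNOF_MAX - 1)))

/*
 * Hypothetical helper: choose the size of the new pivot tuple.
 * If no attributes were truncated (keepnatts == natts), the pivot is a
 * plain copy of the posting tuple, so its size is cut back to the aligned
 * posting list offset. Otherwise the size produced by truncation is kept.
 */
static inline size_t
pivot_size(size_t truncated_size, size_t posting_offset,
           int keepnatts, int natts)
{
    if (keepnatts == natts)
        return MAXALIGN_SKETCH(posting_offset);
    return truncated_size;
}
```

In this model a posting list offset of 21 gives a pivot size of 24, while a pivot that already lost attributes keeps whatever size truncation gave it.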
I will post a patch with this and other tweaks next week.
--
Peter Geoghegan
On Sat, Aug 31, 2019 at 1:04 AM Peter Geoghegan <pg@bowt.ie> wrote:
I've found a fix for these Valgrind issues.
Attached is v10, which fixes the Valgrind issue.
Other changes:
* The code now fully embraces the idea that posting list splits
involve "changing the incoming item" in a way that "avoids" having the
new/incoming item overlap with an existing posting list tuple. This
allowed me to cut down on the changes required within nbtinsert.c
considerably.
* Streamlined a lot of the code in nbtsearch.c. I was able to
significantly simplify _bt_compare() and _bt_binsrch_insert().
* Removed the DEBUG4 traces. A lot of these had to go when I
refactored nbtsearch.c code, so I thought I might as well remove the
remaining ones. I hope that you don't mind (go ahead and add them back
where that makes sense).
* A backwards scan will return "logical tuples" in descending order
now. We should do this on general principle, and also because of the
possibility of future external code that expects and takes advantage
of consistent heap TID order.
This change might even have a small performance benefit today, though:
Index scans that visit multiple heap pages but only match on a single
key will only pin each heap page visited once. Visiting the heap pages
in descending order within a B-Tree page full of duplicates, but
ascending order within individual posting lists could result in
unnecessary extra pinning.
* Standardized terminology. We consistently call what the patch adds
"deduplication" rather than "compression".
* Added a new section on the design to the nbtree README. This is
fairly high level, and talks about dynamics that we can't really talk
about anywhere else, such as how nbtsplitloc.c "cooperates" with
deduplication, producing an effect that is greater than the sum of its
parts.
* I also made some changes to the WAL logging for leaf page insertions
and page splits.
I didn't add the optimization that you anticipated in your nbtxlog.h
comments (i.e. only WAL-log a rewritten posting list when it will go
on the left half of the split, just like the new/incoming item thing
we have already). I agree that that's a good idea, and should be added
soon. Actually, I think the whole "new item vs. rewritten posting list
item" thing makes the WAL logging confusing, so this is not really
about performance.
Maybe the easiest way to do this is also the way that performs best.
I'm thinking of this: maybe we could completely avoid WAL-logging the
entire rewritten/split posting list. After all, the contents of the
rewritten posting list are derived from the existing/original posting
list, as well as the new/incoming item. We can make the WAL record
much smaller on average by making standbys repeat a little bit of the
work performed on the primary. Maybe we could WAL-log
"in_posting_offset" itself, and an ItemPointerData (obviously the new
item offset number tells us the offset number of the posting list that
must be replaced/memmoved()'d). Then have the standby repeat some of
the work performed on the primary -- at least the work of swapping a
heap TID could be repeated on standbys, since it's very little extra
work for standbys, but could really reduce the WAL volume. This might
actually be simpler.
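One way to picture the replay-side work is the sketch below. It is my reading of the "swap a heap TID" idea, not code from v10: the Tid struct is a simplified stand-in for ItemPointerData, and the function name is hypothetical. Given the original posting list, the position inside it, and the incoming TID, the new TID is placed into the posting list and the old maximum TID becomes the item that is actually inserted.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Simplified stand-in for ItemPointerData. */
typedef struct
{
    uint32_t block;
    uint16_t offset;
} Tid;

/*
 * Hypothetical helper: rewrite a sorted posting list in place.
 * posting_off is the position at which *newtid belongs (0 < posting_off < n).
 * On return the list again holds n sorted TIDs, and *newtid holds the
 * previous maximum TID, which no longer overlaps with the list.
 */
static inline void
swap_posting(Tid *posting, int n, int posting_off, Tid *newtid)
{
    Tid oldmax;

    assert(posting_off > 0 && posting_off < n);
    oldmax = posting[n - 1];

    /* Shift the tail right by one slot, dropping the old maximum. */
    memmove(&posting[posting_off + 1], &posting[posting_off],
            (size_t) (n - posting_off - 1) * sizeof(Tid));
    posting[posting_off] = *newtid;
    *newtid = oldmax;
}
```

Under this scheme a record would only need the position and the incoming TID, since everything else can be derived from the page contents during replay.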
The WAL logging that I didn't touch in v10 is the most important thing
to improve. I am talking about the WAL-logging that is performed as
part of deduplicating all items on a page, to avoid a page split (i.e.
the WAL-logging within _bt_dedup_one_page()). That still just does a
log_newpage_buffer() in v10, which is pretty inefficient. Much like
the posting list split WAL logging stuff, WAL logging in
_bt_dedup_one_page() can probably be made more efficient by describing
deduplication in terms of logical changes. For example, the WAL
records should consist of metadata that could be read by a human as
"merge the tuples from offset number 15 until offset number 27".
Perhaps this could also share code with the posting list split stuff.
What do you think?
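A logical record of that kind could be as small as a pair of numbers. The sketch below is purely illustrative: the DedupInterval struct, the simplified Item type and the replay function are all hypothetical, and a real page would hold index tuples rather than (key, TID count) pairs.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical WAL payload: "merge nitems tuples starting at baseoff". */
typedef struct
{
    uint16_t baseoff;   /* 1-based offset number of the first tuple */
    uint16_t nitems;    /* number of adjacent tuples to merge */
} DedupInterval;

/* Simplified leaf item: a key and the number of heap TIDs it carries. */
typedef struct
{
    int key;
    int ntids;
} Item;

/* Apply one interval to items[]; returns the new item count. */
static inline int
replay_dedup_interval(Item *items, int nitems, DedupInterval iv)
{
    int first = iv.baseoff - 1;          /* convert to 0-based index */
    int last = first + iv.nitems - 1;

    assert(iv.baseoff >= 1 && iv.nitems >= 2 && last < nitems);
    for (int i = first + 1; i <= last; i++)
    {
        /* Only tuples with equal keys may be merged. */
        assert(items[i].key == items[first].key);
        items[first].ntids += items[i].ntids;
    }
    /* Close the gap left by the merged tuples. */
    memmove(&items[first + 1], &items[last + 1],
            (size_t) (nitems - last - 1) * sizeof(Item));
    return nitems - (iv.nitems - 1);
}
```

In these terms, "merge the tuples from offset number 15 until offset number 27" would be the interval {15, 13}.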
Once we make the WAL-logging within _bt_dedup_one_page() more
efficient, that also makes it fairly easy to make the deduplication
that it performs occur incrementally, maybe even very incrementally. I
can imagine the _bt_dedup_one_page() caller specifying "my new tuple
is 32 bytes, and I'd really like to not have to split the page, so
please at least do enough deduplication to make it fit". Delaying
deduplication increases the amount of time that we have to set the
LP_DEAD bit for remaining items on the page, which might be important.
Also, spreading out the volume of WAL produced by deduplication over
time might be important with certain workloads. We would still
probably do somewhat more work than strictly necessary to avoid a page
split if we were to make _bt_dedup_one_page() incremental like this,
though not by a huge amount.
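The stopping rule for such an incremental pass might look roughly like this. It is a toy model only: keys are plain integers, the function name is hypothetical, and the assumption that merging one duplicate saves a fixed number of bytes is a simplification of what a real page would see.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: walk the sorted keys on a page and merge duplicates
 * into their predecessor until at least `needed` bytes have been freed.
 * per_dup_saving is the assumed number of bytes saved per merged duplicate.
 * Returns the number of bytes freed; the caller splits the page only if
 * the result is still smaller than `needed`.
 */
static inline size_t
dedup_until_fits(const int *keys, int nkeys,
                 size_t per_dup_saving, size_t needed)
{
    size_t freed = 0;

    for (int i = 1; i < nkeys && freed < needed; i++)
    {
        if (keys[i] == keys[i - 1])
            freed += per_dup_saving;
    }
    return freed;
}
```

With keys {1, 1, 1, 2, 2, 3}, a saving of 16 bytes per duplicate and a 32-byte incoming tuple, the pass stops after the first two merges and leaves the remaining duplicates untouched for later.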
OTOH, maybe I am completely wrong about "incremental deduplication"
being a good idea. It seems worth experimenting with, though. It's not
that much more work on top of making the _bt_dedup_one_page()
WAL-logging efficient, which seems like the thing we should focus on
now.
Thoughts?
--
Peter Geoghegan
Attachments:
v10-0002-DEBUG-Add-pageinspect-instrumentation.patch (application/octet-stream)
From 92d9c62d9c92da8e876d07d4335572c8eded0ae8 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Mon, 10 Sep 2018 19:53:51 -0700
Subject: [PATCH v10 2/2] DEBUG: Add pageinspect instrumentation.
Have pageinspect display user-visible attribute values.
This patch is not proposed for inclusion in PostgreSQL; it's included
for the convenience of reviewers.
The following query can be used with this hacked pageinspect, which
visualizes the internal pages:
"""
with recursive index_details as (
select
'my_test_index'::text idx
),
size_in_pages_index as (
select
(pg_relation_size(idx::regclass) / (2^13))::int4 size_pages
from
index_details
),
page_stats as (
select
index_details.*,
stats.*
from
index_details,
size_in_pages_index,
lateral (select i from generate_series(1, size_pages - 1) i) series,
lateral (select * from bt_page_stats(idx, i)) stats),
internal_page_stats as (
select
*
from
page_stats
where
type != 'l'),
meta_stats as (
select
*
from
index_details s,
lateral (select * from bt_metap(s.idx)) meta),
internal_items as (
select
*
from
internal_page_stats
order by
btpo desc),
-- XXX: Note ordering dependency within this CTE, on internal_items
ordered_internal_items(item, blk, level) as (
select
1,
blkno,
btpo
from
internal_items
where
btpo_prev = 0
and btpo = (select level from meta_stats)
union
select
case when level = btpo then o.item + 1 else 1 end,
blkno,
btpo
from
internal_items i,
ordered_internal_items o
where
i.btpo_prev = o.blk or (btpo_prev = 0 and btpo = o.level - 1)
)
select
--idx,
btpo as level,
item as l_item,
blkno,
--btpo_prev,
--btpo_next,
btpo_flags,
type,
live_items,
dead_items,
avg_item_size,
page_size,
free_size,
-- Only non-rightmost pages have high key. Show heap TID for both pivot and non-pivot tuples here.
case when btpo_next != 0 then (select data || coalesce(', (htid)=(''' || htid || ''')', '')
from bt_page_items(idx, blkno) where itemoffset = 1) end as highkey
from
ordered_internal_items o
join internal_items i on o.blk = i.blkno
order by btpo desc, item;
"""
---
contrib/pageinspect/btreefuncs.c | 67 +++++++++++++++----
contrib/pageinspect/expected/btree.out | 3 +-
contrib/pageinspect/pageinspect--1.6--1.7.sql | 22 ++++++
3 files changed, 78 insertions(+), 14 deletions(-)
diff --git a/contrib/pageinspect/btreefuncs.c b/contrib/pageinspect/btreefuncs.c
index 8d27c9b0f6..f95f3ad892 100644
--- a/contrib/pageinspect/btreefuncs.c
+++ b/contrib/pageinspect/btreefuncs.c
@@ -29,6 +29,7 @@
#include "pageinspect.h"
+#include "access/genam.h"
#include "access/nbtree.h"
#include "access/relation.h"
#include "catalog/namespace.h"
@@ -243,6 +244,7 @@ bt_page_stats(PG_FUNCTION_ARGS)
*/
struct user_args
{
+ Relation rel;
Page page;
OffsetNumber offset;
};
@@ -254,9 +256,9 @@ struct user_args
* ------------------------------------------------------
*/
static Datum
-bt_page_print_tuples(FuncCallContext *fctx, Page page, OffsetNumber offset)
+bt_page_print_tuples(FuncCallContext *fctx, Page page, OffsetNumber offset, Relation rel)
{
- char *values[6];
+ char *values[7];
HeapTuple tuple;
ItemId id;
IndexTuple itup;
@@ -265,6 +267,8 @@ bt_page_print_tuples(FuncCallContext *fctx, Page page, OffsetNumber offset)
int dlen;
char *dump;
char *ptr;
+ ItemPointer htid;
+ BTPageOpaque opaque;
id = PageGetItemId(page, offset);
@@ -283,16 +287,52 @@ bt_page_print_tuples(FuncCallContext *fctx, Page page, OffsetNumber offset)
values[j++] = psprintf("%c", IndexTupleHasVarwidths(itup) ? 't' : 'f');
ptr = (char *) itup + IndexInfoFindDataOffset(itup->t_info);
- dlen = IndexTupleSize(itup) - IndexInfoFindDataOffset(itup->t_info);
- dump = palloc0(dlen * 3 + 1);
- values[j] = dump;
- for (off = 0; off < dlen; off++)
+ if (rel)
{
- if (off > 0)
- *dump++ = ' ';
- sprintf(dump, "%02x", *(ptr + off) & 0xff);
- dump += 2;
+ TupleDesc itupdesc = RelationGetDescr(rel);
+ Datum datvalues[INDEX_MAX_KEYS];
+ bool isnull[INDEX_MAX_KEYS];
+ int natts;
+ int indnkeyatts = rel->rd_index->indnkeyatts;
+
+ natts = BTreeTupleGetNAtts(itup, rel);
+
+ itupdesc->natts = Min(indnkeyatts, natts);
+ memset(&isnull, 0xFF, sizeof(isnull));
+ index_deform_tuple(itup, itupdesc, datvalues, isnull);
+ rel->rd_index->indnkeyatts = natts;
+ values[j++] = BuildIndexValueDescription(rel, datvalues, isnull);
+ itupdesc->natts = IndexRelationGetNumberOfAttributes(rel);
+ rel->rd_index->indnkeyatts = indnkeyatts;
}
+ else
+ {
+ dlen = IndexTupleSize(itup) - IndexInfoFindDataOffset(itup->t_info);
+ dump = palloc0(dlen * 3 + 1);
+ values[j++] = dump;
+ for (off = 0; off < dlen; off++)
+ {
+ if (off > 0)
+ *dump++ = ' ';
+ sprintf(dump, "%02x", *(ptr + off) & 0xff);
+ dump += 2;
+ }
+ }
+
+ opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+ if (P_ISLEAF(opaque) && offset >= P_FIRSTDATAKEY(opaque))
+ htid = &itup->t_tid;
+ else if (_bt_heapkeyspace(rel))
+ htid = BTreeTupleGetHeapTID(itup);
+ else
+ htid = NULL;
+
+ if (htid)
+ values[j] = psprintf("(%u,%u)",
+ ItemPointerGetBlockNumberNoCheck(htid),
+ ItemPointerGetOffsetNumberNoCheck(htid));
+ else
+ values[j] = NULL;
tuple = BuildTupleFromCStrings(fctx->attinmeta, values);
@@ -366,11 +406,11 @@ bt_page_items(PG_FUNCTION_ARGS)
uargs = palloc(sizeof(struct user_args));
+ uargs->rel = rel;
uargs->page = palloc(BLCKSZ);
memcpy(uargs->page, BufferGetPage(buffer), BLCKSZ);
UnlockReleaseBuffer(buffer);
- relation_close(rel, AccessShareLock);
uargs->offset = FirstOffsetNumber;
@@ -397,12 +437,13 @@ bt_page_items(PG_FUNCTION_ARGS)
if (fctx->call_cntr < fctx->max_calls)
{
- result = bt_page_print_tuples(fctx, uargs->page, uargs->offset);
+ result = bt_page_print_tuples(fctx, uargs->page, uargs->offset, uargs->rel);
uargs->offset++;
SRF_RETURN_NEXT(fctx, result);
}
else
{
+ relation_close(uargs->rel, AccessShareLock);
pfree(uargs->page);
pfree(uargs);
SRF_RETURN_DONE(fctx);
@@ -482,7 +523,7 @@ bt_page_items_bytea(PG_FUNCTION_ARGS)
if (fctx->call_cntr < fctx->max_calls)
{
- result = bt_page_print_tuples(fctx, uargs->page, uargs->offset);
+ result = bt_page_print_tuples(fctx, uargs->page, uargs->offset, NULL);
uargs->offset++;
SRF_RETURN_NEXT(fctx, result);
}
diff --git a/contrib/pageinspect/expected/btree.out b/contrib/pageinspect/expected/btree.out
index 07c2dcd771..067e73f21a 100644
--- a/contrib/pageinspect/expected/btree.out
+++ b/contrib/pageinspect/expected/btree.out
@@ -40,7 +40,8 @@ ctid | (0,1)
itemlen | 16
nulls | f
vars | f
-data | 01 00 00 00 00 00 00 01
+data | (a)=(72057594037927937)
+htid | (0,1)
SELECT * FROM bt_page_items('test1_a_idx', 2);
ERROR: block number out of range
diff --git a/contrib/pageinspect/pageinspect--1.6--1.7.sql b/contrib/pageinspect/pageinspect--1.6--1.7.sql
index 2433a21af2..9acbad1589 100644
--- a/contrib/pageinspect/pageinspect--1.6--1.7.sql
+++ b/contrib/pageinspect/pageinspect--1.6--1.7.sql
@@ -24,3 +24,25 @@ CREATE FUNCTION bt_metap(IN relname text,
OUT last_cleanup_num_tuples real)
AS 'MODULE_PATHNAME', 'bt_metap'
LANGUAGE C STRICT PARALLEL SAFE;
+
+--
+-- bt_page_items()
+--
+DROP FUNCTION bt_page_items(IN relname text, IN blkno int4,
+ OUT itemoffset smallint,
+ OUT ctid tid,
+ OUT itemlen smallint,
+ OUT nulls bool,
+ OUT vars bool,
+ OUT data text);
+CREATE FUNCTION bt_page_items(IN relname text, IN blkno int4,
+ OUT itemoffset smallint,
+ OUT ctid tid,
+ OUT itemlen smallint,
+ OUT nulls bool,
+ OUT vars bool,
+ OUT data text,
+ OUT htid tid)
+RETURNS SETOF record
+AS 'MODULE_PATHNAME', 'bt_page_items'
+LANGUAGE C STRICT PARALLEL SAFE;
--
2.17.1
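
Reviewer note: the "posting list split" described in the nbtree README changes of the v10-0001 deduplication patch can be hard to picture from prose alone. Below is a minimal standalone sketch of the TID swap, assuming a sorted array of TIDs. `DemoTid`, `demo_tid_compare` and `demo_posting_split` are hypothetical names invented for this illustration and are not part of the patch, which operates on `ItemPointerData` inside an `IndexTuple` copied with `CopyIndexTuple()`.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for PostgreSQL's ItemPointerData (block, offset). */
typedef struct DemoTid
{
	uint32_t	block;
	uint16_t	offset;
} DemoTid;

/* Compare two TIDs: block number first, then offset number. */
static int
demo_tid_compare(const DemoTid *a, const DemoTid *b)
{
	if (a->block != b->block)
		return a->block < b->block ? -1 : 1;
	if (a->offset != b->offset)
		return a->offset < b->offset ? -1 : 1;
	return 0;
}

/*
 * Split a posting list in place (illustration only).
 *
 * posting[0..ntids-1] is sorted ascending.  *newtid falls strictly inside
 * that range, and pos is the index of the first element greater than
 * *newtid, so 0 < pos < ntids.
 *
 * On return, posting[] contains the original *newtid at its sorted
 * position and still has ntids entries, while *newtid holds the former
 * rightmost TID.  The caller would then insert *newtid as an ordinary
 * tuple immediately to the right of the posting list.
 */
static void
demo_posting_split(DemoTid *posting, int ntids, int pos, DemoTid *newtid)
{
	DemoTid		rightmost = posting[ntids - 1];

	assert(pos > 0 && pos < ntids);
	assert(demo_tid_compare(&posting[pos - 1], newtid) < 0);
	assert(demo_tid_compare(newtid, &posting[pos]) < 0);

	/* Shift posting[pos..ntids-2] one slot right; the rightmost TID is lost */
	memmove(&posting[pos + 1], &posting[pos],
			(size_t) (ntids - pos - 1) * sizeof(DemoTid));

	/* Fill the gap with the incoming TID, then hand back the old rightmost */
	posting[pos] = *newtid;
	*newtid = rightmost;
}
```

For example, splitting the list (1,1), (1,3), (1,5), (2,1) with incoming TID (1,4) at pos 2 leaves (1,1), (1,3), (1,4), (1,5) in the list and turns the incoming tuple's TID into (2,1). Because the list keeps the same number of entries, its size is unchanged, which is the property the README relies on to keep page split space accounting simple.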
Attachment: v10-0001-Add-deduplication-to-nbtree.patch (application/octet-stream)
From 6c1bb94b2f9c39af784f2d7ebe461251a63a71ba Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Thu, 29 Aug 2019 14:35:35 -0700
Subject: [PATCH v10 1/2] Add deduplication to nbtree.
---
contrib/amcheck/verify_nbtree.c | 126 ++++++--
src/backend/access/nbtree/README | 70 ++++-
src/backend/access/nbtree/nbtinsert.c | 379 +++++++++++++++++++++++-
src/backend/access/nbtree/nbtpage.c | 53 ++++
src/backend/access/nbtree/nbtree.c | 143 +++++++--
src/backend/access/nbtree/nbtsearch.c | 245 +++++++++++++--
src/backend/access/nbtree/nbtsort.c | 196 +++++++++++-
src/backend/access/nbtree/nbtsplitloc.c | 47 ++-
src/backend/access/nbtree/nbtutils.c | 210 +++++++++++--
src/backend/access/nbtree/nbtxlog.c | 88 +++++-
src/backend/access/rmgrdesc/nbtdesc.c | 10 +-
src/include/access/nbtree.h | 206 ++++++++++++-
src/include/access/nbtxlog.h | 36 ++-
src/tools/valgrind.supp | 21 ++
14 files changed, 1688 insertions(+), 142 deletions(-)
diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index 05e7d678ed..f2ebd215b2 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -924,6 +924,7 @@ bt_target_page_check(BtreeCheckState *state)
size_t tupsize;
BTScanInsert skey;
bool lowersizelimit;
+ ItemPointer scantid;
CHECK_FOR_INTERRUPTS();
@@ -994,29 +995,73 @@ bt_target_page_check(BtreeCheckState *state)
/*
* Readonly callers may optionally verify that non-pivot tuples can
- * each be found by an independent search that starts from the root
+ * each be found by an independent search that starts from the root.
+ * Note that we deliberately don't do individual searches for each
+ * "logical" posting list tuple, since the posting list itself is
+ * validated by other checks.
*/
if (state->rootdescend && P_ISLEAF(topaque) &&
!bt_rootdescend(state, itup))
{
char *itid,
*htid;
+ ItemPointer tid = BTreeTupleGetHeapTID(itup);
itid = psprintf("(%u,%u)", state->targetblock, offset);
htid = psprintf("(%u,%u)",
- ItemPointerGetBlockNumber(&(itup->t_tid)),
- ItemPointerGetOffsetNumber(&(itup->t_tid)));
+ ItemPointerGetBlockNumber(tid),
+ ItemPointerGetOffsetNumber(tid));
ereport(ERROR,
(errcode(ERRCODE_INDEX_CORRUPTED),
errmsg("could not find tuple using search from root page in index \"%s\"",
RelationGetRelationName(state->rel)),
- errdetail_internal("Index tid=%s points to heap tid=%s page lsn=%X/%X.",
+ errdetail_internal("Index tid=%s min heap tid=%s page lsn=%X/%X.",
itid, htid,
(uint32) (state->targetlsn >> 32),
(uint32) state->targetlsn)));
}
+ /*
+ * If tuple is actually a posting list, make sure posting list TIDs
+ * are in order.
+ */
+ if (BTreeTupleIsPosting(itup))
+ {
+ ItemPointerData last;
+ ItemPointer current;
+
+ ItemPointerCopy(BTreeTupleGetHeapTID(itup), &last);
+
+ for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
+ {
+
+ current = BTreeTupleGetPostingN(itup, i);
+
+ if (ItemPointerCompare(current, &last) <= 0)
+ {
+ char *itid,
+ *htid;
+
+ itid = psprintf("(%u,%u)", state->targetblock, offset);
+ htid = psprintf("(%u,%u)",
+ ItemPointerGetBlockNumberNoCheck(current),
+ ItemPointerGetOffsetNumberNoCheck(current));
+
+ ereport(ERROR,
+ (errcode(ERRCODE_INDEX_CORRUPTED),
+ errmsg("posting list heap TIDs out of order in index \"%s\"",
+ RelationGetRelationName(state->rel)),
+ errdetail_internal("Index tid=%s min heap tid=%s page lsn=%X/%X.",
+ itid, htid,
+ (uint32) (state->targetlsn >> 32),
+ (uint32) state->targetlsn)));
+ }
+
+ ItemPointerCopy(current, &last);
+ }
+ }
+
/* Build insertion scankey for current page offset */
skey = bt_mkscankey_pivotsearch(state->rel, itup);
@@ -1074,12 +1119,33 @@ bt_target_page_check(BtreeCheckState *state)
{
IndexTuple norm;
- norm = bt_normalize_tuple(state, itup);
- bloom_add_element(state->filter, (unsigned char *) norm,
- IndexTupleSize(norm));
- /* Be tidy */
- if (norm != itup)
- pfree(norm);
+ if (BTreeTupleIsPosting(itup))
+ {
+ IndexTuple onetup;
+
+ /* Fingerprint all elements of posting tuple one by one */
+ for (int i = 0; i < BTreeTupleGetNPosting(itup); i++)
+ {
+ onetup = BTreeGetNthTupleOfPosting(itup, i);
+
+ norm = bt_normalize_tuple(state, onetup);
+ bloom_add_element(state->filter, (unsigned char *) norm,
+ IndexTupleSize(norm));
+ /* Be tidy */
+ if (norm != onetup)
+ pfree(norm);
+ pfree(onetup);
+ }
+ }
+ else
+ {
+ norm = bt_normalize_tuple(state, itup);
+ bloom_add_element(state->filter, (unsigned char *) norm,
+ IndexTupleSize(norm));
+ /* Be tidy */
+ if (norm != itup)
+ pfree(norm);
+ }
}
/*
@@ -1087,7 +1153,8 @@ bt_target_page_check(BtreeCheckState *state)
*
* If there is a high key (if this is not the rightmost page on its
* entire level), check that high key actually is upper bound on all
- * page items.
+ * page items. If this is a posting list tuple, we'll need to set
+ * scantid to be highest TID in posting list.
*
* We prefer to check all items against high key rather than checking
* just the last and trusting that the operator class obeys the
@@ -1127,6 +1194,9 @@ bt_target_page_check(BtreeCheckState *state)
* tuple. (See also: "Notes About Data Representation" in the nbtree
* README.)
*/
+ scantid = skey->scantid;
+ if (!BTreeTupleIsPivot(itup))
+ skey->scantid = BTreeTupleGetMaxTID(itup);
if (!P_RIGHTMOST(topaque) &&
!(P_ISLEAF(topaque) ? invariant_leq_offset(state, skey, P_HIKEY) :
invariant_l_offset(state, skey, P_HIKEY)))
@@ -1150,6 +1220,7 @@ bt_target_page_check(BtreeCheckState *state)
(uint32) (state->targetlsn >> 32),
(uint32) state->targetlsn)));
}
+ skey->scantid = scantid;
/*
* * Item order check *
@@ -1164,11 +1235,13 @@ bt_target_page_check(BtreeCheckState *state)
*htid,
*nitid,
*nhtid;
+ ItemPointer tid;
itid = psprintf("(%u,%u)", state->targetblock, offset);
+ tid = BTreeTupleGetHeapTID(itup);
htid = psprintf("(%u,%u)",
- ItemPointerGetBlockNumberNoCheck(&(itup->t_tid)),
- ItemPointerGetOffsetNumberNoCheck(&(itup->t_tid)));
+ ItemPointerGetBlockNumberNoCheck(tid),
+ ItemPointerGetOffsetNumberNoCheck(tid));
nitid = psprintf("(%u,%u)", state->targetblock,
OffsetNumberNext(offset));
@@ -1177,9 +1250,11 @@ bt_target_page_check(BtreeCheckState *state)
state->target,
OffsetNumberNext(offset));
itup = (IndexTuple) PageGetItem(state->target, itemid);
+
+ tid = BTreeTupleGetHeapTID(itup);
nhtid = psprintf("(%u,%u)",
- ItemPointerGetBlockNumberNoCheck(&(itup->t_tid)),
- ItemPointerGetOffsetNumberNoCheck(&(itup->t_tid)));
+ ItemPointerGetBlockNumberNoCheck(tid),
+ ItemPointerGetOffsetNumberNoCheck(tid));
ereport(ERROR,
(errcode(ERRCODE_INDEX_CORRUPTED),
@@ -1189,10 +1264,10 @@ bt_target_page_check(BtreeCheckState *state)
"higher index tid=%s (points to %s tid=%s) "
"page lsn=%X/%X.",
itid,
- P_ISLEAF(topaque) ? "heap" : "index",
+ P_ISLEAF(topaque) ? "min heap" : "index",
htid,
nitid,
- P_ISLEAF(topaque) ? "heap" : "index",
+ P_ISLEAF(topaque) ? "min heap" : "index",
nhtid,
(uint32) (state->targetlsn >> 32),
(uint32) state->targetlsn)));
@@ -1953,10 +2028,10 @@ bt_tuple_present_callback(Relation index, HeapTuple htup, Datum *values,
* verification. In particular, it won't try to normalize opclass-equal
* datums with potentially distinct representations (e.g., btree/numeric_ops
* index datums will not get their display scale normalized-away here).
- * Normalization may need to be expanded to handle more cases in the future,
- * though. For example, it's possible that non-pivot tuples could in the
- * future have alternative logically equivalent representations due to using
- * the INDEX_ALT_TID_MASK bit to implement intelligent deduplication.
+ * Caller does normalization for non-pivot tuples that have a posting list,
+ * since dummy CREATE INDEX callback code generates new tuples with the same
+ * normalized representation. Deduplication is performed opportunistically,
+ * and in general there is no guarantee about how or when it will be applied.
*/
static IndexTuple
bt_normalize_tuple(BtreeCheckState *state, IndexTuple itup)
@@ -2087,6 +2162,7 @@ bt_rootdescend(BtreeCheckState *state, IndexTuple itup)
insertstate.itup = itup;
insertstate.itemsz = MAXALIGN(IndexTupleSize(itup));
insertstate.itup_key = key;
+ insertstate.in_posting_offset = 0;
insertstate.bounds_valid = false;
insertstate.buf = lbuf;
@@ -2094,7 +2170,9 @@ bt_rootdescend(BtreeCheckState *state, IndexTuple itup)
offnum = _bt_binsrch_insert(state->rel, &insertstate);
/* Compare first >= matching item on leaf page, if any */
page = BufferGetPage(lbuf);
+ /* Should match on first heap TID when tuple has a posting list */
if (offnum <= PageGetMaxOffsetNumber(page) &&
+ insertstate.in_posting_offset == 0 &&
_bt_compare(state->rel, key, page, offnum) == 0)
exists = true;
_bt_relbuf(state->rel, lbuf);
@@ -2560,14 +2638,16 @@ static inline ItemPointer
BTreeTupleGetHeapTIDCareful(BtreeCheckState *state, IndexTuple itup,
bool nonpivot)
{
- ItemPointer result = BTreeTupleGetHeapTID(itup);
+ ItemPointer result;
BlockNumber targetblock = state->targetblock;
- if (result == NULL && nonpivot)
+ if (BTreeTupleIsPivot(itup) == nonpivot)
ereport(ERROR,
(errcode(ERRCODE_INDEX_CORRUPTED),
errmsg("block %u or its right sibling block or child block in index \"%s\" contains non-pivot tuple that lacks a heap TID",
targetblock, RelationGetRelationName(state->rel))));
+ result = BTreeTupleGetHeapTID(itup);
+
return result;
}
diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index 6db203e75c..2be064153d 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -432,7 +432,10 @@ because we allow LP_DEAD to be set with only a share lock (it's exactly
like a hint bit for a heap tuple), but physically removing tuples requires
exclusive lock. In the current code we try to remove LP_DEAD tuples when
we are otherwise faced with having to split a page to do an insertion (and
-hence have exclusive lock on it already).
+hence have exclusive lock on it already). Deduplication can also prevent
+a page split, but removing LP_DEAD tuples is the preferred approach.
+(Note that posting list tuples never have the LP_DEAD bit set, since each
+"logical" tuple may or may not be "known dead".)
This leaves the index in a state where it has no entry for a dead tuple
that still exists in the heap. This is not a problem for the current
@@ -710,6 +713,71 @@ the fallback strategy assumes that duplicates are mostly inserted in
ascending heap TID order. The page is split in a way that leaves the left
half of the page mostly full, and the right half of the page mostly empty.
+Notes about deduplication
+-------------------------
+
+We deduplicate non-pivot tuples in non-unique indexes to reduce storage
+overhead, and to avoid or at least delay page splits. Deduplication alters
+the physical representation of tuples without changing the logical contents
+of the index, and without adding overhead to read queries. Non-pivot
+tuples are folded together into a single physical tuple with a posting list
+(a simple array of heap TIDs with the standard item pointer format).
+Deduplication is always applied lazily, at the point where it would
+otherwise be necessary to perform a page split. It is attempted only after
+any LP_DEAD items have been removed, as our last line of defense against
+splitting a leaf page. The LP_DEAD bit cannot be set on posting list
+tuples. (Bitmap scans cannot perform LP_DEAD bit setting, and are the
+common case with indexes that contain lots of duplicates, so this downside
+is considered acceptable.)
+
+Large groups of logical duplicates tend to appear together on the same leaf
+page due to the special duplicate logic used when choosing a split point.
+This facilitates lazy/dynamic deduplication. Deduplication can reliably
+merge a large localized group of duplicates before the group can span
+multiple leaf pages. Posting list tuples are subject to the same 1/3 of a
+page restriction as any other tuple.
+
+Lazy deduplication allows the page space accounting used during page splits
+to have absolutely minimal special case logic for posting lists. A posting
+list can be thought of as extra payload that suffix truncation will
+reliably truncate away as needed during page splits, just like non-key
+columns from an INCLUDE index tuple. An incoming tuple (which might cause
+a page split) can always be thought of as a non-posting-list tuple that
+must be inserted alongside existing items, without needing to consider
+deduplication. Most of the time, that's what actually happens: incoming
+tuples are either not duplicates, or are duplicates with a heap TID that
+doesn't overlap with any existing posting list tuple (lazy deduplication
+avoids rewriting posting lists repeatedly when heap TIDs are inserted
+slightly out of order by concurrent inserters). When the incoming tuple
+really does overlap with an existing posting list, a posting list split is
+performed. Posting list splits work in a way that more or less preserves
+the illusion that all incoming tuples do not need to be merged with any
+existing posting list tuple.
+
+Posting list splits work by "overriding" the details of the incoming tuple.
+The heap TID of the incoming tuple is altered to make it match the
+rightmost heap TID from the existing/originally overlapping posting list.
+The offset number that the new/incoming tuple is to be inserted at is
+incremented so that it will be inserted to the right of the existing
+posting list. The insertion (or page split) operation that completes the
+insert does one extra step: an in-place update of the posting list. The
+update changes the posting list such that the "true" heap TID from the
+original incoming tuple is now contained in the posting list. We make
+space in the posting list by removing the heap TID that became the new
+item. The size of the posting list won't change, and so the page split
+space accounting does not need to care about posting lists. Also, space
+utilization is improved and page fragmentation is avoided by keeping
+existing posting lists large.
+
+Currently, posting lists are not compressed. It would be straightforward
+to add GIN-style posting list compression based on varbyte encoding. That
+would probably need to be configurable and not enabled by default, because
+the overhead of decompression would be an obvious downside, especially with
+backwards scans.
+
+TODO: Review whether or not basic deduplication should be enabled by
+default.
+
Notes About Data Representation
-------------------------------
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index b84bf1c3df..f2fe3f77ce 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -47,21 +47,25 @@ static void _bt_insertonpg(Relation rel, BTScanInsert itup_key,
BTStack stack,
IndexTuple itup,
OffsetNumber newitemoff,
+ int in_posting_offset,
bool split_only_page);
static Buffer _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf,
Buffer cbuf, OffsetNumber newitemoff, Size newitemsz,
- IndexTuple newitem);
+ IndexTuple newitem, IndexTuple nposting);
static void _bt_insert_parent(Relation rel, Buffer buf, Buffer rbuf,
BTStack stack, bool is_root, bool is_only);
static bool _bt_pgaddtup(Page page, Size itemsize, IndexTuple itup,
OffsetNumber itup_off);
static void _bt_vacuum_one_page(Relation rel, Buffer buffer, Relation heapRel);
+static void _bt_dedup_one_page(Relation rel, Buffer buffer, Relation heapRel);
+static void insert_itupprev_to_page(Page page, BTDedupState *dedupState);
/*
* _bt_doinsert() -- Handle insertion of a single index tuple in the tree.
*
* This routine is called by the public interface routine, btinsert.
- * By here, itup is filled in, including the TID.
+ * By here, itup is filled in, including the TID. Caller should be
+ * prepared for us to scribble on 'itup'.
*
* If checkUnique is UNIQUE_CHECK_NO or UNIQUE_CHECK_PARTIAL, this
* will allow duplicates. Otherwise (UNIQUE_CHECK_YES or
@@ -123,6 +127,7 @@ _bt_doinsert(Relation rel, IndexTuple itup,
/* PageAddItem will MAXALIGN(), but be consistent */
insertstate.itemsz = MAXALIGN(IndexTupleSize(itup));
insertstate.itup_key = itup_key;
+ insertstate.in_posting_offset = 0;
insertstate.bounds_valid = false;
insertstate.buf = InvalidBuffer;
@@ -300,7 +305,7 @@ top:
newitemoff = _bt_findinsertloc(rel, &insertstate, checkingunique,
stack, heapRel);
_bt_insertonpg(rel, itup_key, insertstate.buf, InvalidBuffer, stack,
- itup, newitemoff, false);
+ itup, newitemoff, insertstate.in_posting_offset, false);
}
else
{
@@ -435,6 +440,7 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
/* okay, we gotta fetch the heap tuple ... */
curitup = (IndexTuple) PageGetItem(page, curitemid);
+ Assert(!BTreeTupleIsPosting(curitup));
htid = curitup->t_tid;
/*
@@ -759,6 +765,15 @@ _bt_findinsertloc(Relation rel,
_bt_vacuum_one_page(rel, insertstate->buf, heapRel);
insertstate->bounds_valid = false;
}
+
+ /*
+ * If the target page is full, try to deduplicate items on page
+ */
+ if (PageGetFreeSpace(page) < insertstate->itemsz && !checkingunique)
+ {
+ _bt_dedup_one_page(rel, insertstate->buf, heapRel);
+ insertstate->bounds_valid = false; /* paranoia */
+ }
}
else
{
@@ -905,10 +920,11 @@ _bt_stepright(Relation rel, BTInsertState insertstate, BTStack stack)
*
* This recursive procedure does the following things:
*
+ * + if necessary, splits an existing posting list on page.
* + if necessary, splits the target page, using 'itup_key' for
* suffix truncation on leaf pages (caller passes NULL for
* non-leaf pages).
- * + inserts the tuple.
+ * + inserts the new tuple (could be from split posting list).
* + if the page was split, pops the parent stack, and finds the
* right place to insert the new child pointer (by walking
* right using information stored in the parent stack).
@@ -918,7 +934,8 @@ _bt_stepright(Relation rel, BTInsertState insertstate, BTStack stack)
*
* On entry, we must have the correct buffer in which to do the
* insertion, and the buffer must be pinned and write-locked. On return,
- * we will have dropped both the pin and the lock on the buffer.
+ * we will have dropped both the pin and the lock on the buffer. Caller
+ * should be prepared for us to scribble on 'itup'.
*
* This routine only performs retail tuple insertions. 'itup' should
* always be either a non-highkey leaf item, or a downlink (new high
@@ -936,11 +953,14 @@ _bt_insertonpg(Relation rel,
BTStack stack,
IndexTuple itup,
OffsetNumber newitemoff,
+ int in_posting_offset,
bool split_only_page)
{
Page page;
BTPageOpaque lpageop;
Size itemsz;
+ IndexTuple nposting = NULL;
+ IndexTuple oposting;
page = BufferGetPage(buf);
lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -954,6 +974,8 @@ _bt_insertonpg(Relation rel,
Assert(P_ISLEAF(lpageop) ||
BTreeTupleGetNAtts(itup, rel) <=
IndexRelationGetNumberOfKeyAttributes(rel));
+ /* retail insertions of posting list tuples are disallowed */
+ Assert(!BTreeTupleIsPosting(itup));
/* The caller should've finished any incomplete splits already. */
if (P_INCOMPLETE_SPLIT(lpageop))
@@ -964,6 +986,70 @@ _bt_insertonpg(Relation rel,
itemsz = MAXALIGN(itemsz); /* be safe, PageAddItem will do this but we
* need to be consistent */
+ /*
+ * Do we need to split an existing posting list item?
+ */
+ if (in_posting_offset != 0)
+ {
+ ItemId itemid = PageGetItemId(page, newitemoff);
+ int nipd;
+ char *replacepos;
+ char *rightpos;
+ Size nbytes;
+
+ /*
+ * The new tuple is a duplicate with a heap TID that falls inside the
+ * range of an existing posting list tuple, so split posting list.
+ *
+ * Posting list splits always replace some existing TID in the posting
+ * list with the new item's heap TID (based on a posting list offset
+ * from caller) by removing rightmost heap TID from posting list. The
+ * new item's heap TID is swapped with that rightmost heap TID, almost
+ * as if the tuple inserted never overlapped with a posting list in
+ * the first place. This allows the insertion and page split code to
+ * have minimal special case handling of posting lists.
+ *
+ * The only extra handling required is to overwrite the original
+ * posting list with nposting, which is guaranteed to be the same size
+ * as the original, keeping the page space accounting simple. This
+ * takes place in either the page insert or page split critical
+ * section.
+ */
+ Assert(P_ISLEAF(lpageop));
+ oposting = (IndexTuple) PageGetItem(page, itemid);
+ Assert(BTreeTupleIsPosting(oposting));
+ nipd = BTreeTupleGetNPosting(oposting);
+ Assert(in_posting_offset < nipd);
+
+ nposting = CopyIndexTuple(oposting);
+ replacepos = (char *) BTreeTupleGetPostingN(nposting, in_posting_offset);
+ rightpos = replacepos + sizeof(ItemPointerData);
+ nbytes = (nipd - in_posting_offset - 1) * sizeof(ItemPointerData);
+
+ /*
+ * Move item pointers in posting list to make a gap for the new item's
+ * heap TID (shift TIDs one place to the right, losing original
+ * rightmost TID).
+ */
+ memmove(rightpos, replacepos, nbytes);
+
+ /*
+ * Fill the gap in the new posting list (nposting) with the new item's
+ * heap TID
+ */
+ ItemPointerCopy(&itup->t_tid, (ItemPointer) replacepos);
+
+ /*
+ * Copy the rightmost TID of the original posting list (oposting, not
+ * nposting) into the new item
+ */
+ ItemPointerCopy(BTreeTupleGetPostingN(oposting, nipd - 1), &itup->t_tid);
+ Assert(ItemPointerCompare(BTreeTupleGetMaxTID(nposting),
+ BTreeTupleGetHeapTID(itup)) < 0);
+ /* Alter new item offset, since effective new item changed */
+ newitemoff = OffsetNumberNext(newitemoff);
+ }
+
/*
* Do we need to split the page to fit the item on it?
*
@@ -996,7 +1082,8 @@ _bt_insertonpg(Relation rel,
BlockNumberIsValid(RelationGetTargetBlock(rel))));
/* split the buffer into left and right halves */
- rbuf = _bt_split(rel, itup_key, buf, cbuf, newitemoff, itemsz, itup);
+ rbuf = _bt_split(rel, itup_key, buf, cbuf, newitemoff, itemsz, itup,
+ nposting);
PredicateLockPageSplit(rel,
BufferGetBlockNumber(buf),
BufferGetBlockNumber(rbuf));
@@ -1075,6 +1162,18 @@ _bt_insertonpg(Relation rel,
elog(PANIC, "failed to add new item to block %u in index \"%s\"",
itup_blkno, RelationGetRelationName(rel));
+ if (nposting)
+ {
+ /*
+ * Handle a posting list split by performing an in-place
+ * update of the existing posting list
+ */
+ Assert(P_ISLEAF(lpageop));
+ Assert(MAXALIGN(IndexTupleSize(oposting)) ==
+ MAXALIGN(IndexTupleSize(nposting)));
+ memcpy(oposting, nposting, MAXALIGN(IndexTupleSize(nposting)));
+ }
+
MarkBufferDirty(buf);
if (BufferIsValid(metabuf))
@@ -1116,6 +1215,9 @@ _bt_insertonpg(Relation rel,
XLogRecPtr recptr;
xlrec.offnum = itup_off;
+ xlrec.postingsz = 0;
+ if (nposting)
+ xlrec.postingsz = MAXALIGN(IndexTupleSize(itup));
XLogBeginInsert();
XLogRegisterData((char *) &xlrec, SizeOfBtreeInsert);
@@ -1153,6 +1255,9 @@ _bt_insertonpg(Relation rel,
XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
XLogRegisterBufData(0, (char *) itup, IndexTupleSize(itup));
+ if (nposting)
+ XLogRegisterBufData(0, (char *) nposting,
+ IndexTupleSize(nposting));
recptr = XLogInsert(RM_BTREE_ID, xlinfo);
@@ -1194,6 +1299,10 @@ _bt_insertonpg(Relation rel,
_bt_getrootheight(rel) >= BTREE_FASTPATH_MIN_LEVEL)
RelationSetTargetBlock(rel, cachedBlock);
}
+
+ /* be tidy */
+ if (nposting)
+ pfree(nposting);
}
/*
@@ -1211,10 +1320,16 @@ _bt_insertonpg(Relation rel,
*
* Returns the new right sibling of buf, pinned and write-locked.
* The pin and lock on buf are maintained.
+ *
+ * nposting is a replacement posting for the posting list at the
+ * offset immediately before the new item's offset. This is needed
+ * when caller performed "posting list split", and corresponds to the
+ * same step for retail insertions that don't split the page.
*/
static Buffer
_bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
- OffsetNumber newitemoff, Size newitemsz, IndexTuple newitem)
+ OffsetNumber newitemoff, Size newitemsz, IndexTuple newitem,
+ IndexTuple nposting)
{
Buffer rbuf;
Page origpage;
@@ -1236,12 +1351,20 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
OffsetNumber firstright;
OffsetNumber maxoff;
OffsetNumber i;
+ OffsetNumber replacepostingoff = InvalidOffsetNumber;
bool newitemonleft,
isleaf;
IndexTuple lefthikey;
int indnatts = IndexRelationGetNumberOfAttributes(rel);
int indnkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
+ /*
+ * Determine offset number of posting list that will be updated in place
+ * as part of split that follows a posting list split
+ */
+ if (nposting != NULL)
+ replacepostingoff = OffsetNumberPrev(newitemoff);
+
/*
* origpage is the original page to be split. leftpage is a temporary
* buffer that receives the left-sibling data, which will be copied back
@@ -1273,6 +1396,13 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
* newitemoff == firstright. In all other cases it's clear which side of
* the split every tuple goes on from context. newitemonleft is usually
* (but not always) redundant information.
+ *
+ * Note: In theory, the split point choice logic should operate against a
+ * version of the page that already replaced the posting list at offset
+ * replacepostingoff with nposting where applicable. We don't bother with
+ * that, though. Both versions of the posting list must be the same size
+ * and have the same key values, so this omission can't affect the split
+ * point chosen in practice.
*/
firstright = _bt_findsplitloc(rel, origpage, newitemoff, newitemsz,
newitem, &newitemonleft);
@@ -1340,6 +1470,9 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
itemid = PageGetItemId(origpage, firstright);
itemsz = ItemIdGetLength(itemid);
item = (IndexTuple) PageGetItem(origpage, itemid);
+ /* Behave as if origpage posting list has already been swapped */
+ if (firstright == replacepostingoff)
+ item = nposting;
}
/*
@@ -1373,6 +1506,9 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
Assert(lastleftoff >= P_FIRSTDATAKEY(oopaque));
itemid = PageGetItemId(origpage, lastleftoff);
lastleft = (IndexTuple) PageGetItem(origpage, itemid);
+ /* Behave as if origpage posting list has already been swapped */
+ if (lastleftoff == replacepostingoff)
+ lastleft = nposting;
}
Assert(lastleft != item);
@@ -1480,8 +1616,23 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
itemsz = ItemIdGetLength(itemid);
item = (IndexTuple) PageGetItem(origpage, itemid);
+ /*
+ * did caller pass new replacement posting list tuple due to posting
+ * list split?
+ */
+ if (i == replacepostingoff)
+ {
+ /*
+ * swap origpage posting list with post-posting-list-split version
+ * from caller
+ */
+ Assert(isleaf);
+ Assert(itemsz == MAXALIGN(IndexTupleSize(nposting)));
+ item = nposting;
+ }
+
/* does new item belong before this one? */
- if (i == newitemoff)
+ else if (i == newitemoff)
{
if (newitemonleft)
{
@@ -1652,6 +1803,7 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
xlrec.level = ropaque->btpo.level;
xlrec.firstright = firstright;
xlrec.newitemoff = newitemoff;
+ xlrec.replacepostingoff = replacepostingoff;
XLogBeginInsert();
XLogRegisterData((char *) &xlrec, SizeOfBtreeSplit);
@@ -1676,6 +1828,10 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
if (newitemonleft)
XLogRegisterBufData(0, (char *) newitem, MAXALIGN(newitemsz));
+ if (replacepostingoff)
+ XLogRegisterBufData(0, (char *) nposting,
+ MAXALIGN(IndexTupleSize(nposting)));
+
/* Log the left page's new high key */
itemid = PageGetItemId(origpage, P_HIKEY);
item = (IndexTuple) PageGetItem(origpage, itemid);
@@ -1834,7 +1990,7 @@ _bt_insert_parent(Relation rel,
/* Recursively insert into the parent */
_bt_insertonpg(rel, NULL, pbuf, buf, stack->bts_parent,
- new_item, stack->bts_offset + 1,
+ new_item, stack->bts_offset + 1, 0,
is_only);
/* be tidy */
@@ -2304,6 +2460,209 @@ _bt_vacuum_one_page(Relation rel, Buffer buffer, Relation heapRel)
* Note: if we didn't find any LP_DEAD items, then the page's
* BTP_HAS_GARBAGE hint bit is falsely set. We do not bother expending a
* separate write to clear it, however. We will clear it when we split
- * the page.
+ * the page (or when deduplication runs).
*/
}
+
+/*
+ * Try to deduplicate items to free some space. If we don't proceed with
+ * deduplication, buffer will contain old state of the page.
+ *
+ * This function should be called after LP_DEAD items were removed by
+ * _bt_vacuum_one_page() to prevent a page split. (It's possible that we'll
+ * have to kill additional LP_DEAD items, but that should be rare.)
+ */
+static void
+_bt_dedup_one_page(Relation rel, Buffer buffer, Relation heapRel)
+{
+ OffsetNumber offnum,
+ minoff,
+ maxoff;
+ Page page = BufferGetPage(buffer);
+ Page newpage;
+ BTPageOpaque oopaque,
+ nopaque;
+ bool deduplicate = false;
+ BTDedupState *dedupState = NULL;
+ int natts = IndexRelationGetNumberOfAttributes(rel);
+ OffsetNumber deletable[MaxOffsetNumber];
+ int ndeletable = 0;
+
+ /*
+ * Don't use deduplication for indexes with INCLUDEd columns and unique
+ * indexes
+ */
+ deduplicate = (IndexRelationGetNumberOfKeyAttributes(rel) ==
+ IndexRelationGetNumberOfAttributes(rel) &&
+ !rel->rd_index->indisunique);
+ if (!deduplicate)
+ return;
+
+ oopaque = (BTPageOpaque) PageGetSpecialPointer(page);
+ /* init deduplication state needed to build posting tuples */
+ dedupState = (BTDedupState *) palloc0(sizeof(BTDedupState));
+ dedupState->ipd = NULL;
+ dedupState->ntuples = 0;
+ dedupState->itupprev = NULL;
+ dedupState->maxitemsize = BTMaxItemSize(page);
+ dedupState->maxpostingsize = 0;
+
+ minoff = P_FIRSTDATAKEY(oopaque);
+ maxoff = PageGetMaxOffsetNumber(page);
+
+ /*
+ * Delete dead tuples if any. We cannot simply skip them in the cycle
+ * below, because it's necessary to generate special Xlog record
+ * containing such tuples to compute latestRemovedXid on a standby server
+ * later.
+ *
+ * This should not affect performance, since it can only happen in a rare
+ * situation when BTP_HAS_GARBAGE flag was not set and _bt_vacuum_one_page
+ * was not called, or _bt_vacuum_one_page didn't remove all dead items.
+ */
+ for (offnum = minoff;
+ offnum <= maxoff;
+ offnum = OffsetNumberNext(offnum))
+ {
+ ItemId itemid = PageGetItemId(page, offnum);
+
+ if (ItemIdIsDead(itemid))
+ deletable[ndeletable++] = offnum;
+ }
+
+ if (ndeletable > 0)
+ _bt_delitems_delete(rel, buffer, deletable, ndeletable, heapRel);
+
+ /*
+ * Scan over all items to see which ones can be deduplicated
+ */
+ minoff = P_FIRSTDATAKEY(oopaque);
+ maxoff = PageGetMaxOffsetNumber(page);
+ newpage = PageGetTempPageCopySpecial(page);
+ nopaque = (BTPageOpaque) PageGetSpecialPointer(newpage);
+
+ /* Make sure that new page won't have garbage flag set */
+ nopaque->btpo_flags &= ~BTP_HAS_GARBAGE;
+
+ /* Copy High Key if any */
+ if (!P_RIGHTMOST(oopaque))
+ {
+ ItemId itemid = PageGetItemId(page, P_HIKEY);
+ Size itemsz = ItemIdGetLength(itemid);
+ IndexTuple item = (IndexTuple) PageGetItem(page, itemid);
+
+ if (PageAddItem(newpage, (Item) item, itemsz, P_HIKEY,
+ false, false) == InvalidOffsetNumber)
+ elog(ERROR, "failed to add highkey during deduplication");
+ }
+
+ /*
+ * Iterate over tuples on the page, try to deduplicate them into posting
+ * lists and insert into new page.
+ */
+ for (offnum = minoff;
+ offnum <= maxoff;
+ offnum = OffsetNumberNext(offnum))
+ {
+ ItemId itemId = PageGetItemId(page, offnum);
+ IndexTuple itup = (IndexTuple) PageGetItem(page, itemId);
+
+ if (dedupState->itupprev != NULL)
+ {
+ if (_bt_keep_natts_fast(rel, dedupState->itupprev, itup) > natts)
+ {
+ int itup_ntuples;
+
+ /*
+ * Tuples are equal.
+ *
+ * If posting list is too big, insert it on page and continue
+ * with this tuple as new pending posting list. Otherwise,
+ * append the tuple to the pending posting list.
+ */
+ itup_ntuples = BTreeTupleIsPosting(itup) ?
+ BTreeTupleGetNPosting(itup) : 1;
+
+ if (dedupState->maxitemsize >
+ MAXALIGN(((IndexTupleSize(dedupState->itupprev)
+ + (dedupState->ntuples + itup_ntuples + 1) * sizeof(ItemPointerData)))))
+ {
+ _bt_add_posting_item(dedupState, itup);
+ }
+ else
+ {
+ insert_itupprev_to_page(newpage, dedupState);
+ }
+ }
+ else
+ {
+ /* Insert pending posting list on page */
+ insert_itupprev_to_page(newpage, dedupState);
+ }
+ }
+
+ /*
+ * Copy the tuple into temp variable itupprev to compare it with the
+ * following tuple and maybe unite them into a posting tuple
+ */
+ if (dedupState->itupprev)
+ pfree(dedupState->itupprev);
+ dedupState->itupprev = CopyIndexTuple(itup);
+
+ Assert(IndexTupleSize(dedupState->itupprev) <= dedupState->maxitemsize);
+ }
+
+ /* Handle the last item. */
+ insert_itupprev_to_page(newpage, dedupState);
+
+ START_CRIT_SECTION();
+
+ PageRestoreTempPage(newpage, page);
+ MarkBufferDirty(buffer);
+
+ /* Log full page write */
+ if (RelationNeedsWAL(rel))
+ {
+ XLogRecPtr recptr;
+
+ recptr = log_newpage_buffer(buffer, true);
+ PageSetLSN(page, recptr);
+ }
+
+ END_CRIT_SECTION();
+}
+
+/*
+ * Add new item to the page, while deduplicating
+ */
+static void
+insert_itupprev_to_page(Page page, BTDedupState *dedupState)
+{
+ IndexTuple to_insert;
+ OffsetNumber offnum = PageGetMaxOffsetNumber(page);
+
+ if (dedupState->ntuples == 0)
+ to_insert = dedupState->itupprev;
+ else
+ {
+ IndexTuple postingtuple;
+
+ /* form a tuple with a posting list */
+ postingtuple = BTreeFormPostingTuple(dedupState->itupprev,
+ dedupState->ipd,
+ dedupState->ntuples);
+ to_insert = postingtuple;
+ pfree(dedupState->ipd);
+ }
+
+ /* Add the new item into the page */
+ offnum = OffsetNumberNext(offnum);
+
+ if (PageAddItem(page, (Item) to_insert, IndexTupleSize(to_insert),
+ offnum, false, false) == InvalidOffsetNumber)
+ elog(ERROR, "deduplication failed to add tuple to page");
+
+ if (dedupState->ntuples > 0)
+ pfree(to_insert);
+ dedupState->ntuples = 0;
+}
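
The grouping step performed by _bt_dedup_one_page() and
insert_itupprev_to_page() above can be sketched in isolation. This is only
an illustration under simplifying assumptions: SimpleItem, SimpleGroup and
group_duplicates() are hypothetical names that do not exist in the patch,
keys are plain integers, and the size check against BTMaxItemSize is
replaced by a simple cap on the number of TIDs per group.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for an index tuple and a pending posting list */
typedef struct
{
	int			key;			/* index key (assumed to be a plain int) */
	unsigned	tid;			/* heap TID (assumed to be a plain integer) */
} SimpleItem;

typedef struct
{
	int			key;			/* key shared by every TID in the group */
	size_t		first;			/* index of the group's first item */
	size_t		ntids;			/* number of TIDs merged into the group */
} SimpleGroup;

/*
 * Walk a key-sorted array once and merge runs of equal keys into groups of
 * at most max_tids entries.  "out" must have room for nitems groups.
 * Returns the number of groups written.
 */
static size_t
group_duplicates(const SimpleItem *items, size_t nitems,
				 size_t max_tids, SimpleGroup *out)
{
	size_t		ngroups = 0;

	assert(max_tids >= 1);

	for (size_t i = 0; i < nitems; i++)
	{
		if (ngroups > 0 &&
			out[ngroups - 1].key == items[i].key &&
			out[ngroups - 1].ntids < max_tids)
		{
			/* Same key and still room: extend the pending group */
			out[ngroups - 1].ntids++;
		}
		else
		{
			/* Key changed or pending group is full: start a new group */
			out[ngroups].key = items[i].key;
			out[ngroups].first = i;
			out[ngroups].ntids = 1;
			ngroups++;
		}
	}

	return ngroups;
}
```

As in the patch, a run of duplicates that does not fit into one group is
simply continued in the next one, so a page may still hold several posting
tuples with the same key.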
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 18c6de21c1..55344a7d78 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -983,14 +983,52 @@ _bt_page_recyclable(Page page)
void
_bt_delitems_vacuum(Relation rel, Buffer buf,
OffsetNumber *itemnos, int nitems,
+ OffsetNumber *remainingoffset,
+ IndexTuple *remaining, int nremaining,
BlockNumber lastBlockVacuumed)
{
Page page = BufferGetPage(buf);
BTPageOpaque opaque;
+ Size itemsz;
+ Size remaining_sz = 0;
+ char *remaining_buf = NULL;
+
+ /* XLOG stuff: build a buffer holding the remaining (updated) tuples */
+ if (nremaining && RelationNeedsWAL(rel))
+ {
+ Size offset = 0;
+
+ for (int i = 0; i < nremaining; i++)
+ remaining_sz += MAXALIGN(IndexTupleSize(remaining[i]));
+
+ remaining_buf = palloc0(remaining_sz);
+ for (int i = 0; i < nremaining; i++)
+ {
+ itemsz = IndexTupleSize(remaining[i]);
+ memcpy(remaining_buf + offset, (char *) remaining[i], itemsz);
+ offset += MAXALIGN(itemsz);
+ }
+ Assert(offset == remaining_sz);
+ }
/* No ereport(ERROR) until changes are logged */
START_CRIT_SECTION();
+ /* Handle posting tuples here */
+ for (int i = 0; i < nremaining; i++)
+ {
+ /* First, delete the old tuple. */
+ PageIndexTupleDelete(page, remainingoffset[i]);
+
+ itemsz = IndexTupleSize(remaining[i]);
+ itemsz = MAXALIGN(itemsz);
+
+ /* Add tuple with remaining ItemPointers to the page. */
+ if (PageAddItem(page, (Item) remaining[i], itemsz, remainingoffset[i],
+ false, false) == InvalidOffsetNumber)
+ elog(ERROR, "failed to rewrite posting list item in index while doing vacuum");
+ }
+
/* Fix the page */
if (nitems > 0)
PageIndexMultiDelete(page, itemnos, nitems);
@@ -1020,6 +1058,8 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
xl_btree_vacuum xlrec_vacuum;
xlrec_vacuum.lastBlockVacuumed = lastBlockVacuumed;
+ xlrec_vacuum.nremaining = nremaining;
+ xlrec_vacuum.ndeleted = nitems;
XLogBeginInsert();
XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
@@ -1033,6 +1073,19 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
if (nitems > 0)
XLogRegisterBufData(0, (char *) itemnos, nitems * sizeof(OffsetNumber));
+ /*
+ * Save the offsets and the remaining tuples themselves. It's
+ * important to restore them in the correct order: replay must first
+ * handle the remaining tuples, and only then the other deleted items.
+ */
+ if (nremaining > 0)
+ {
+ Assert(remaining_buf != NULL);
+ XLogRegisterBufData(0, (char *) remainingoffset,
+ nremaining * sizeof(OffsetNumber));
+ XLogRegisterBufData(0, remaining_buf, remaining_sz);
+ }
+
recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_VACUUM);
PageSetLSN(page, recptr);
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 4cfd5289ad..ea7ff6a5f9 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -97,6 +97,8 @@ static void btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
BTCycleId cycleid, TransactionId *oldestBtpoXact);
static void btvacuumpage(BTVacState *vstate, BlockNumber blkno,
BlockNumber orig_blkno);
+static ItemPointer btreevacuumPosting(BTVacState *vstate, IndexTuple itup,
+ int *nremaining);
/*
@@ -1069,7 +1071,8 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
RBM_NORMAL, info->strategy);
LockBufferForCleanup(buf);
_bt_checkpage(rel, buf);
- _bt_delitems_vacuum(rel, buf, NULL, 0, vstate.lastBlockVacuumed);
+ _bt_delitems_vacuum(rel, buf, NULL, 0, NULL, NULL, 0,
+ vstate.lastBlockVacuumed);
_bt_relbuf(rel, buf);
}
@@ -1193,6 +1196,9 @@ restart:
OffsetNumber offnum,
minoff,
maxoff;
+ IndexTuple remaining[MaxOffsetNumber];
+ OffsetNumber remainingoffset[MaxOffsetNumber];
+ int nremaining;
/*
* Trade in the initial read lock for a super-exclusive write lock on
@@ -1229,6 +1235,7 @@ restart:
* callback function.
*/
ndeletable = 0;
+ nremaining = 0;
minoff = P_FIRSTDATAKEY(opaque);
maxoff = PageGetMaxOffsetNumber(page);
if (callback)
@@ -1242,31 +1249,79 @@ restart:
itup = (IndexTuple) PageGetItem(page,
PageGetItemId(page, offnum));
- htup = &(itup->t_tid);
- /*
- * During Hot Standby we currently assume that
- * XLOG_BTREE_VACUUM records do not produce conflicts. That is
- * only true as long as the callback function depends only
- * upon whether the index tuple refers to heap tuples removed
- * in the initial heap scan. When vacuum starts it derives a
- * value of OldestXmin. Backends taking later snapshots could
- * have a RecentGlobalXmin with a later xid than the vacuum's
- * OldestXmin, so it is possible that row versions deleted
- * after OldestXmin could be marked as killed by other
- * backends. The callback function *could* look at the index
- * tuple state in isolation and decide to delete the index
- * tuple, though currently it does not. If it ever did, we
- * would need to reconsider whether XLOG_BTREE_VACUUM records
- * should cause conflicts. If they did cause conflicts they
- * would be fairly harsh conflicts, since we haven't yet
- * worked out a way to pass a useful value for
- * latestRemovedXid on the XLOG_BTREE_VACUUM records. This
- * applies to *any* type of index that marks index tuples as
- * killed.
- */
- if (callback(htup, callback_state))
- deletable[ndeletable++] = offnum;
+ if (BTreeTupleIsPosting(itup))
+ {
+ int nnewipd = 0;
+ ItemPointer newipd = NULL;
+
+ newipd = btreevacuumPosting(vstate, itup, &nnewipd);
+
+ if (nnewipd == 0)
+ {
+ /*
+ * All TIDs from posting list must be deleted, we can
+ * delete whole tuple in a regular way.
+ */
+ deletable[ndeletable++] = offnum;
+ }
+ else if (nnewipd == BTreeTupleGetNPosting(itup))
+ {
+ /*
+ * All TIDs from posting tuple must remain. Do
+ * nothing, just cleanup.
+ */
+ pfree(newipd);
+ }
+ else if (nnewipd < BTreeTupleGetNPosting(itup))
+ {
+ /* Some TIDs from posting tuple must remain. */
+ Assert(nnewipd > 0);
+ Assert(newipd != NULL);
+
+ /*
+ * Form new tuple that contains only remaining TIDs.
+ * Remember this tuple and the offset of the old tuple
+ * to update it in place.
+ */
+ remainingoffset[nremaining] = offnum;
+ remaining[nremaining] =
+ BTreeFormPostingTuple(itup, newipd, nnewipd);
+ nremaining++;
+ pfree(newipd);
+
+ Assert(IndexTupleSize(itup) <= BTMaxItemSize(page));
+ }
+ }
+ else
+ {
+ htup = &(itup->t_tid);
+
+ /*
+ * During Hot Standby we currently assume that
+ * XLOG_BTREE_VACUUM records do not produce conflicts.
+ * That is only true as long as the callback function
+ * depends only upon whether the index tuple refers to
+ * heap tuples removed in the initial heap scan. When
+ * vacuum starts it derives a value of OldestXmin.
+ * Backends taking later snapshots could have a
+ * RecentGlobalXmin with a later xid than the vacuum's
+ * OldestXmin, so it is possible that row versions deleted
+ * after OldestXmin could be marked as killed by other
+ * backends. The callback function *could* look at the
+ * index tuple state in isolation and decide to delete the
+ * index tuple, though currently it does not. If it ever
+ * did, we would need to reconsider whether
+ * XLOG_BTREE_VACUUM records should cause conflicts. If
+ * they did cause conflicts they would be fairly harsh
+ * conflicts, since we haven't yet worked out a way to
+ * pass a useful value for latestRemovedXid on the
+ * XLOG_BTREE_VACUUM records. This applies to *any* type
+ * of index that marks index tuples as killed.
+ */
+ if (callback(htup, callback_state))
+ deletable[ndeletable++] = offnum;
+ }
}
}
@@ -1274,7 +1329,7 @@ restart:
* Apply any needed deletes. We issue just one _bt_delitems_vacuum()
* call per page, so as to minimize WAL traffic.
*/
- if (ndeletable > 0)
+ if (ndeletable > 0 || nremaining > 0)
{
/*
* Notice that the issued XLOG_BTREE_VACUUM WAL record includes
@@ -1291,6 +1346,7 @@ restart:
* that.
*/
_bt_delitems_vacuum(rel, buf, deletable, ndeletable,
+ remainingoffset, remaining, nremaining,
vstate->lastBlockVacuumed);
/*
@@ -1375,6 +1431,41 @@ restart:
}
}
+/*
+ * btreevacuumPosting() -- vacuums a posting tuple.
+ *
+ * Returns new palloc'd posting list with remaining items.
+ * Posting list size is returned via nremaining.
+ *
+ * If all items are dead,
+ * nremaining is 0 and resulting posting list is NULL.
+ */
+static ItemPointer
+btreevacuumPosting(BTVacState *vstate, IndexTuple itup, int *nremaining)
+{
+ int remaining = 0;
+ int nitem = BTreeTupleGetNPosting(itup);
+ ItemPointer tmpitems = NULL,
+ items = BTreeTupleGetPosting(itup);
+
+ /*
+ * Check each tuple in the posting list, save alive tuples into tmpitems
+ */
+ for (int i = 0; i < nitem; i++)
+ {
+ if (vstate->callback(items + i, vstate->callback_state))
+ continue;
+
+ if (tmpitems == NULL)
+ tmpitems = palloc(sizeof(ItemPointerData) * nitem);
+
+ tmpitems[remaining++] = items[i];
+ }
+
+ *nremaining = remaining;
+ return tmpitems;
+}
+
/*
* btcanreturn() -- Check whether btree indexes support index-only scans.
*
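
The contract of btreevacuumPosting() above (a freshly allocated array of
surviving TIDs, or NULL with a count of zero when every TID is dead) can be
sketched outside PostgreSQL as follows. All names here are hypothetical:
TIDs are modelled as plain unsigned integers, malloc() stands in for
palloc(), and even_is_dead() is a made-up callback used only to exercise
the function.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical callback type: returns true when the TID should be removed */
typedef bool (*dead_callback) (unsigned tid, void *state);

/*
 * Copy every TID for which is_dead() returns false into a newly allocated
 * array.  The array is allocated lazily, so the result is NULL and
 * *nremaining is 0 when no TID survives.  The caller frees the result.
 */
static unsigned *
filter_posting_list(const unsigned *tids, int ntids,
					dead_callback is_dead, void *state, int *nremaining)
{
	unsigned   *kept = NULL;
	int			nkept = 0;

	assert(ntids >= 0);

	for (int i = 0; i < ntids; i++)
	{
		if (is_dead(tids[i], state))
			continue;

		if (kept == NULL)
		{
			kept = malloc(sizeof(unsigned) * (size_t) ntids);
			if (kept == NULL)
				abort();
		}

		kept[nkept++] = tids[i];
	}

	*nremaining = nkept;
	return kept;
}

/* Illustrative callback only: treats even TIDs as dead */
static bool
even_is_dead(unsigned tid, void *state)
{
	(void) state;
	return (tid % 2) == 0;
}
```

The caller then distinguishes three cases by comparing the returned count
with the original one: nothing survives (delete the whole tuple), everything
survives (leave the tuple alone), or only some TIDs survive (rewrite the
tuple in place), which mirrors the branches in btvacuumpage().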
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 7f77ed24c5..fb976cad92 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -26,10 +26,18 @@
static void _bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp);
static OffsetNumber _bt_binsrch(Relation rel, BTScanInsert key, Buffer buf);
+static int _bt_binsrch_posting(BTScanInsert key, Page page,
+ OffsetNumber offnum);
static bool _bt_readpage(IndexScanDesc scan, ScanDirection dir,
OffsetNumber offnum);
static void _bt_saveitem(BTScanOpaque so, int itemIndex,
OffsetNumber offnum, IndexTuple itup);
+static void _bt_setuppostingitems(BTScanOpaque so, int itemIndex,
+ OffsetNumber offnum, ItemPointer iptr,
+ IndexTuple itup);
+static inline void _bt_savepostingitem(BTScanOpaque so, int itemIndex,
+ OffsetNumber offnum, ItemPointer iptr,
+ IndexTuple itup);
static bool _bt_steppage(IndexScanDesc scan, ScanDirection dir);
static bool _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir);
static bool _bt_parallel_readpage(IndexScanDesc scan, BlockNumber blkno,
@@ -347,12 +355,13 @@ _bt_binsrch(Relation rel,
int32 result,
cmpval;
- /* Requesting nextkey semantics while using scantid seems nonsensical */
- Assert(!key->nextkey || key->scantid == NULL);
-
page = BufferGetPage(buf);
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+ /* Requesting nextkey semantics while using scantid seems nonsensical */
+ Assert(!key->nextkey || key->scantid == NULL);
+ /* scantid-set callers must use _bt_binsrch_insert() on leaf pages */
+ Assert(!P_ISLEAF(opaque) || key->scantid == NULL);
low = P_FIRSTDATAKEY(opaque);
high = PageGetMaxOffsetNumber(page);
@@ -432,7 +441,10 @@ _bt_binsrch(Relation rel,
* low) makes bounds invalid.
*
* Caller is responsible for invalidating bounds when it modifies the page
- * before calling here a second time.
+ * before calling here a second time, and for dealing with posting list
+ * tuple matches (callers can use insertstate's in_posting_offset field to
+ * determine which existing heap TID will need to be replaced by their
+ * scantid/new heap TID).
*/
OffsetNumber
_bt_binsrch_insert(Relation rel, BTInsertState insertstate)
@@ -507,6 +519,17 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
if (result != 0)
stricthigh = high;
}
+
+ /*
+ * If tuple at offset located by binary search is a posting list whose
+ * TID range overlaps with caller's scantid, perform posting list
+ * binary search to set in_posting_offset for caller. Caller must
+ * split the posting list when in_posting_offset is set. This should
+ * happen infrequently.
+ */
+ if (unlikely(result == 0 && key->scantid != NULL))
+ insertstate->in_posting_offset =
+ _bt_binsrch_posting(key, page, mid);
}
/*
@@ -526,6 +549,60 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
return low;
}
+/*----------
+ * _bt_binsrch_posting() -- posting list binary search.
+ *
+ * Returns offset into posting list where caller's scantid belongs.
+ *----------
+ */
+static int
+_bt_binsrch_posting(BTScanInsert key,
+ Page page,
+ OffsetNumber offnum)
+{
+ IndexTuple itup;
+ int low,
+ high,
+ mid,
+ res;
+
+ /*
+ * If this isn't a posting tuple, then the index must be corrupt (if it is
+ * an ordinary non-pivot tuple then there must be an existing tuple with a
+ * heap TID that equals inserter's new heap TID/scantid). Defensively
+ * check that tuple is a posting list tuple whose posting list range
+ * includes caller's scantid.
+ *
+ * (This is also needed because contrib/amcheck's rootdescend option needs
+ * to be able to relocate a non-pivot tuple using _bt_binsrch_insert().)
+ */
+ Assert(P_ISLEAF((BTPageOpaque) PageGetSpecialPointer(page)));
+ Assert(!key->nextkey);
+ Assert(key->scantid != NULL);
+ itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
+ if (!BTreeTupleIsPosting(itup))
+ return 0;
+
+ /* "high" is past end of posting list for loop invariant */
+ low = 0;
+ high = BTreeTupleGetNPosting(itup);
+ Assert(high >= 2);
+
+ while (high > low)
+ {
+ mid = low + ((high - low) / 2);
+ res = ItemPointerCompare(key->scantid,
+ BTreeTupleGetPostingN(itup, mid));
+
+ if (res >= 1)
+ low = mid + 1;
+ else
+ high = mid;
+ }
+
+ return low;
+}
+
/*----------
* _bt_compare() -- Compare insertion-type scankey to tuple on a page.
*
@@ -535,9 +612,18 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
* <0 if scankey < tuple at offnum;
* 0 if scankey == tuple at offnum;
* >0 if scankey > tuple at offnum.
- * NULLs in the keys are treated as sortable values. Therefore
- * "equality" does not necessarily mean that the item should be
- * returned to the caller as a matching key!
+ *
+ * NULLs in the keys are treated as sortable values. Therefore
+ * "equality" does not necessarily mean that the item should be returned
+ * to the caller as a matching key. Similarly, an insertion scankey
+ * with its scantid set is treated as equal to a posting tuple whose TID
+ * range overlaps with their scantid. There generally won't be a
+ * matching TID in the posting tuple, which caller must handle
+ * themselves (e.g., by splitting the posting list tuple).
+ *
+ * It is generally guaranteed that any possible scankey with scantid set
+ * will have zero or one tuples in the index that are considered equal
+ * here.
*
* CRUCIAL NOTE: on a non-leaf page, the first data key is assumed to be
* "minus infinity": this routine will always claim it is less than the
@@ -561,6 +647,7 @@ _bt_compare(Relation rel,
ScanKey scankey;
int ncmpkey;
int ntupatts;
+ int32 result;
Assert(_bt_check_natts(rel, key->heapkeyspace, page, offnum));
Assert(key->keysz <= IndexRelationGetNumberOfKeyAttributes(rel));
@@ -595,7 +682,6 @@ _bt_compare(Relation rel,
{
Datum datum;
bool isNull;
- int32 result;
datum = index_getattr(itup, scankey->sk_attno, itupdesc, &isNull);
@@ -711,8 +797,24 @@ _bt_compare(Relation rel,
if (heapTid == NULL)
return 1;
+ /*
+ * scankey must be treated as equal to a posting list tuple if its scantid
+ * value falls within the range of the posting list. In all other cases
+ * there can only be a single heap TID value, which is compared directly
+ * as a simple scalar value.
+ */
Assert(ntupatts >= IndexRelationGetNumberOfKeyAttributes(rel));
- return ItemPointerCompare(key->scantid, heapTid);
+ result = ItemPointerCompare(key->scantid, heapTid);
+ if (!BTreeTupleIsPosting(itup) || result <= 0)
+ return result;
+ else
+ {
+ result = ItemPointerCompare(key->scantid, BTreeTupleGetMaxTID(itup));
+ if (result > 0)
+ return 1;
+ }
+
+ return 0;
}
/*
@@ -1449,6 +1551,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
/* initialize tuple workspace to empty */
so->currPos.nextTupleOffset = 0;
+ so->currPos.postingTupleOffset = 0;
/*
* Now that the current page has been made consistent, the macro should be
@@ -1483,8 +1586,30 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
if (_bt_checkkeys(scan, itup, indnatts, dir, &continuescan))
{
/* tuple passes all scan key conditions, so remember it */
- _bt_saveitem(so, itemIndex, offnum, itup);
- itemIndex++;
+ if (!BTreeTupleIsPosting(itup))
+ {
+ _bt_saveitem(so, itemIndex, offnum, itup);
+ itemIndex++;
+ }
+ else
+ {
+ /*
+ * Setup state to return posting list, and save first
+ * "logical" tuple
+ */
+ _bt_setuppostingitems(so, itemIndex, offnum,
+ BTreeTupleGetPostingN(itup, 0),
+ itup);
+ itemIndex++;
+ /* Save additional posting list "logical" tuples */
+ for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
+ {
+ _bt_savepostingitem(so, itemIndex, offnum,
+ BTreeTupleGetPostingN(itup, i),
+ itup);
+ itemIndex++;
+ }
+ }
}
/* When !continuescan, there can't be any more matches, so stop */
if (!continuescan)
@@ -1517,7 +1642,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
if (!continuescan)
so->currPos.moreRight = false;
- Assert(itemIndex <= MaxIndexTuplesPerPage);
+ Assert(itemIndex <= MaxPostingIndexTuplesPerPage);
so->currPos.firstItem = 0;
so->currPos.lastItem = itemIndex - 1;
so->currPos.itemIndex = 0;
@@ -1525,7 +1650,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
else
{
/* load items[] in descending order */
- itemIndex = MaxIndexTuplesPerPage;
+ itemIndex = MaxPostingIndexTuplesPerPage;
offnum = Min(offnum, maxoff);
@@ -1567,8 +1692,37 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
if (passes_quals && tuple_alive)
{
/* tuple passes all scan key conditions, so remember it */
- itemIndex--;
- _bt_saveitem(so, itemIndex, offnum, itup);
+ if (!BTreeTupleIsPosting(itup))
+ {
+ itemIndex--;
+ _bt_saveitem(so, itemIndex, offnum, itup);
+ }
+ else
+ {
+ int i = BTreeTupleGetNPosting(itup) - 1;
+
+ /*
+ * Setup state to return posting list, and save last
+ * "logical" tuple from posting list (since it's the first
+ * that will be returned to scan).
+ */
+ itemIndex--;
+ _bt_setuppostingitems(so, itemIndex, offnum,
+ BTreeTupleGetPostingN(itup, i--),
+ itup);
+
+ /*
+ * Return posting list "logical" tuples -- do this in
+ * descending order, to match overall scan order
+ */
+ for (; i >= 0; i--)
+ {
+ itemIndex--;
+ _bt_savepostingitem(so, itemIndex, offnum,
+ BTreeTupleGetPostingN(itup, i),
+ itup);
+ }
+ }
}
if (!continuescan)
{
@@ -1582,8 +1736,8 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
Assert(itemIndex >= 0);
so->currPos.firstItem = itemIndex;
- so->currPos.lastItem = MaxIndexTuplesPerPage - 1;
- so->currPos.itemIndex = MaxIndexTuplesPerPage - 1;
+ so->currPos.lastItem = MaxPostingIndexTuplesPerPage - 1;
+ so->currPos.itemIndex = MaxPostingIndexTuplesPerPage - 1;
}
return (so->currPos.firstItem <= so->currPos.lastItem);
@@ -1596,6 +1750,8 @@ _bt_saveitem(BTScanOpaque so, int itemIndex,
{
BTScanPosItem *currItem = &so->currPos.items[itemIndex];
+ Assert(!BTreeTupleIsPosting(itup));
+
currItem->heapTid = itup->t_tid;
currItem->indexOffset = offnum;
if (so->currTuples)
@@ -1608,6 +1764,61 @@ _bt_saveitem(BTScanOpaque so, int itemIndex,
}
}
+/*
+ * Setup state to save posting items from a single posting list tuple. Saves
+ * the logical tuple that will be returned to scan first in passing.
+ *
+ * Saves an index item into so->currPos.items[itemIndex] for logical tuple
+ * that is returned to scan first. Second or subsequent heap TID for posting
+ * list should be saved by calling _bt_savepostingitem().
+ */
+static void
+_bt_setuppostingitems(BTScanOpaque so, int itemIndex, OffsetNumber offnum,
+ ItemPointer iptr, IndexTuple itup)
+{
+ BTScanPosItem *currItem = &so->currPos.items[itemIndex];
+
+ currItem->heapTid = *iptr;
+ currItem->indexOffset = offnum;
+
+ if (so->currTuples)
+ {
+ /* Save a truncated version of the IndexTuple */
+ Size itupsz = BTreeTupleGetPostingOffset(itup);
+
+ itupsz = MAXALIGN(itupsz);
+ currItem->tupleOffset = so->currPos.nextTupleOffset;
+ memcpy(so->currTuples + so->currPos.nextTupleOffset, itup, itupsz);
+ so->currPos.nextTupleOffset += itupsz;
+ so->currPos.postingTupleOffset = currItem->tupleOffset;
+ }
+}
+
+/*
+ * Save an index item into so->currPos.items[itemIndex] for posting tuple.
+ *
+ * Assumes that _bt_setuppostingitems() has already been called for current
+ * posting list tuple.
+ */
+static inline void
+_bt_savepostingitem(BTScanOpaque so, int itemIndex, OffsetNumber offnum,
+ ItemPointer iptr, IndexTuple itup)
+{
+ BTScanPosItem *currItem = &so->currPos.items[itemIndex];
+
+ currItem->heapTid = *iptr;
+ currItem->indexOffset = offnum;
+
+ if (so->currTuples)
+ {
+ /*
+ * Have index-only scans return the same truncated IndexTuple for
+ * every logical tuple that originates from the same posting list
+ */
+ currItem->tupleOffset = so->currPos.postingTupleOffset;
+ }
+}
+
/*
* _bt_steppage() -- Step to next page containing valid data for scan
*
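
The two comparisons this file adds, the range test in _bt_compare() and the
lower-bound search in _bt_binsrch_posting(), can be sketched on a plain
sorted array. This is a simplified model, not the patch's code: TIDs are
assumed to be unsigned integers rather than ItemPointerData, and both
function names below are hypothetical.

```c
#include <assert.h>

/*
 * Three-way comparison of a target TID against a sorted posting list.
 * Returns -1 if the target sorts before the first TID, 1 if it sorts after
 * the last TID, and 0 if it falls inside the [first, last] range, even when
 * no element is exactly equal to it.
 */
static int
compare_to_posting_range(const unsigned *tids, int n, unsigned target)
{
	assert(n >= 1);

	if (target < tids[0])
		return -1;
	if (target > tids[n - 1])
		return 1;
	return 0;
}

/*
 * Lower-bound binary search: returns the index of the first TID that is
 * greater than or equal to the target, i.e. the position where the target
 * belongs.  The result is in the range [0, n].
 */
static int
posting_lower_bound(const unsigned *tids, int n, unsigned target)
{
	int			low = 0;
	int			high = n;		/* one past the end, for the loop invariant */

	while (high > low)
	{
		int			mid = low + (high - low) / 2;

		if (target > tids[mid])
			low = mid + 1;
		else
			high = mid;
	}

	return low;
}
```

A result of 0 from the range comparison only says that the target overlaps
the posting list; the lower-bound offset then tells the caller where the
posting list would have to be split to make room for the new TID.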
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index ab19692006..b2a2039a3d 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -288,6 +288,8 @@ static void _bt_sortaddtup(Page page, Size itemsize,
static void _bt_buildadd(BTWriteState *wstate, BTPageState *state,
IndexTuple itup);
static void _bt_uppershutdown(BTWriteState *wstate, BTPageState *state);
+static void _bt_buildadd_posting(BTWriteState *wstate, BTPageState *state,
+ BTDedupState *dedupState);
static void _bt_load(BTWriteState *wstate,
BTSpool *btspool, BTSpool *btspool2);
static void _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent,
@@ -963,6 +965,11 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
* Overwrite the old item with new truncated high key directly.
* oitup is already located at the physical beginning of tuple
* space, so this should directly reuse the existing tuple space.
+ *
+ * If lastleft tuple was a posting tuple, we'll truncate its
+ * posting list in _bt_truncate as well. Note that it is also
+ * applicable only to leaf pages, since internal pages never
+ * contain posting tuples.
*/
ii = PageGetItemId(opage, OffsetNumberPrev(last_off));
lastleft = (IndexTuple) PageGetItem(opage, ii);
@@ -1002,6 +1009,7 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
* the minimum key for the new page.
*/
state->btps_minkey = CopyIndexTuple(oitup);
+ Assert(BTreeTupleIsPivot(state->btps_minkey));
/*
* Set the sibling links for both pages.
@@ -1043,6 +1051,7 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
Assert(state->btps_minkey == NULL);
state->btps_minkey = CopyIndexTuple(itup);
/* _bt_sortaddtup() will perform full truncation later */
+ BTreeTupleClearBtIsPosting(state->btps_minkey);
BTreeTupleSetNAtts(state->btps_minkey, 0);
}
@@ -1127,6 +1136,91 @@ _bt_uppershutdown(BTWriteState *wstate, BTPageState *state)
_bt_blwritepage(wstate, metapage, BTREE_METAPAGE);
}
+/*
+ * Add new tuple (posting or non-posting) to the page while building index.
+ */
+static void
+_bt_buildadd_posting(BTWriteState *wstate, BTPageState *state,
+ BTDedupState *dedupState)
+{
+ IndexTuple to_insert;
+
+ /* Return, if there is no tuple to insert */
+ if (state == NULL)
+ return;
+
+ if (dedupState->ntuples == 0)
+ to_insert = dedupState->itupprev;
+ else
+ {
+ IndexTuple postingtuple;
+
+ /* form a tuple with a posting list */
+ postingtuple = BTreeFormPostingTuple(dedupState->itupprev,
+ dedupState->ipd,
+ dedupState->ntuples);
+ to_insert = postingtuple;
+ pfree(dedupState->ipd);
+ }
+
+ _bt_buildadd(wstate, state, to_insert);
+
+ if (dedupState->ntuples > 0)
+ pfree(to_insert);
+ dedupState->ntuples = 0;
+}
+
+/*
+ * Save item pointer(s) of itup to the posting list in dedupState.
+ *
+ * Helper function for _bt_load() and _bt_dedup_one_page().
+ *
+ * Note: caller is responsible for size check to ensure that resulting tuple
+ * won't exceed BTMaxItemSize.
+ */
+void
+_bt_add_posting_item(BTDedupState *dedupState, IndexTuple itup)
+{
+ int nposting = 0;
+
+ if (dedupState->ntuples == 0)
+ {
+ dedupState->ipd = palloc0(dedupState->maxitemsize);
+
+ if (BTreeTupleIsPosting(dedupState->itupprev))
+ {
+ /* if itupprev is posting, add all its TIDs to the posting list */
+ nposting = BTreeTupleGetNPosting(dedupState->itupprev);
+ memcpy(dedupState->ipd,
+ BTreeTupleGetPosting(dedupState->itupprev),
+ sizeof(ItemPointerData) * nposting);
+ dedupState->ntuples += nposting;
+ }
+ else
+ {
+ memcpy(dedupState->ipd, dedupState->itupprev,
+ sizeof(ItemPointerData));
+ dedupState->ntuples++;
+ }
+ }
+
+ if (BTreeTupleIsPosting(itup))
+ {
+ /* if tuple is posting, add all its TIDs to the posting list */
+ nposting = BTreeTupleGetNPosting(itup);
+ memcpy(dedupState->ipd + dedupState->ntuples,
+ BTreeTupleGetPosting(itup),
+ sizeof(ItemPointerData) * nposting);
+ dedupState->ntuples += nposting;
+ }
+ else
+ {
+ memcpy(dedupState->ipd + dedupState->ntuples, itup,
+ sizeof(ItemPointerData));
+ dedupState->ntuples++;
+ }
+}
+
/*
* Read tuples in correct sort order from tuplesort, and load them into
* btree leaves.
@@ -1141,9 +1235,20 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
bool load1;
TupleDesc tupdes = RelationGetDescr(wstate->index);
int i,
- keysz = IndexRelationGetNumberOfKeyAttributes(wstate->index);
+ keysz = IndexRelationGetNumberOfKeyAttributes(wstate->index),
+ natts = IndexRelationGetNumberOfAttributes(wstate->index);
SortSupport sortKeys;
int64 tuples_done = 0;
+ bool deduplicate = false;
+ BTDedupState *dedupState = NULL;
+
+ /*
+ * Don't use deduplication for indexes with INCLUDEd columns or for unique
+ * indexes.
+ */
+ deduplicate = (keysz == natts &&
+ !wstate->index->rd_index->indisunique);
if (merge)
{
@@ -1257,19 +1362,88 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
}
else
{
- /* merge is unnecessary */
- while ((itup = tuplesort_getindextuple(btspool->sortstate,
- true)) != NULL)
+ if (!deduplicate)
{
- /* When we see first tuple, create first index page */
- if (state == NULL)
- state = _bt_pagestate(wstate, 0);
+ /* merge is unnecessary */
+ while ((itup = tuplesort_getindextuple(btspool->sortstate,
+ true)) != NULL)
+ {
+ /* When we see first tuple, create first index page */
+ if (state == NULL)
+ state = _bt_pagestate(wstate, 0);
- _bt_buildadd(wstate, state, itup);
+ _bt_buildadd(wstate, state, itup);
- /* Report progress */
- pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
- ++tuples_done);
+ /* Report progress */
+ pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
+ ++tuples_done);
+ }
+ }
+ else
+ {
+ /* init deduplication state needed to build posting tuples */
+ dedupState = (BTDedupState *) palloc0(sizeof(BTDedupState));
+ dedupState->ipd = NULL;
+ dedupState->ntuples = 0;
+ dedupState->itupprev = NULL;
+ dedupState->maxitemsize = 0;
+ dedupState->maxpostingsize = 0;
+
+ while ((itup = tuplesort_getindextuple(btspool->sortstate,
+ true)) != NULL)
+ {
+ /* When we see first tuple, create first index page */
+ if (state == NULL)
+ {
+ state = _bt_pagestate(wstate, 0);
+ dedupState->maxitemsize = BTMaxItemSize(state->btps_page);
+ }
+
+ if (dedupState->itupprev != NULL)
+ {
+ int n_equal_atts = _bt_keep_natts_fast(wstate->index,
+ dedupState->itupprev, itup);
+
+ if (n_equal_atts > natts)
+ {
+ /*
+ * Tuples are equal. Create or update the posting list.
+ *
+ * If the posting list would become too big, insert it on
+ * the page instead and continue.
+ */
+ if ((dedupState->ntuples + 1) * sizeof(ItemPointerData) <
+ dedupState->maxpostingsize)
+ _bt_add_posting_item(dedupState, itup);
+ else
+ _bt_buildadd_posting(wstate, state, dedupState);
+ }
+ else
+ {
+ /*
+ * Tuples are not equal. Insert itupprev into index.
+ * Save current tuple for the next iteration.
+ */
+ _bt_buildadd_posting(wstate, state, dedupState);
+ }
+ }
+
+ /*
+ * Save the tuple to compare it with the next one and maybe
+ * unite them into a posting tuple.
+ */
+ if (dedupState->itupprev)
+ pfree(dedupState->itupprev);
+ dedupState->itupprev = CopyIndexTuple(itup);
+
+ /* compute max size of posting list */
+ dedupState->maxpostingsize = dedupState->maxitemsize -
+ IndexInfoFindDataOffset(dedupState->itupprev->t_info) -
+ MAXALIGN(IndexTupleSize(dedupState->itupprev));
+ }
+
+ /* Handle the last item */
+ _bt_buildadd_posting(wstate, state, dedupState);
}
}
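
The grouping that the _bt_load() change above performs can be modelled outside PostgreSQL. The following is a minimal sketch, not the patch's code: SortedEntry, EmittedTuple and dedup_sorted_run are hypothetical names, keys are plain ints, and the limit is expressed as a TID count rather than the byte limit (maxpostingsize) used by the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration; not the patch's real types. */
typedef struct
{
	int			key;	/* index key, simplified to an int */
	unsigned	tid;	/* heap TID, simplified to an integer */
} SortedEntry;

typedef struct
{
	int			key;	/* key shared by all TIDs in this tuple */
	size_t		ntids;	/* 1 = plain tuple, >1 = posting tuple */
} EmittedTuple;

/*
 * Walk entries already sorted by (key, tid).  Equal keys are folded into the
 * pending tuple until it holds max_tids_per_tuple TIDs; after that, or when
 * the key changes, a new tuple is started.  "out" must have room for
 * nentries elements.  Returns the number of tuples emitted.
 */
static size_t
dedup_sorted_run(const SortedEntry *in, size_t nentries,
				 size_t max_tids_per_tuple, EmittedTuple *out)
{
	size_t		nout = 0;

	assert(max_tids_per_tuple >= 1);

	for (size_t i = 0; i < nentries; i++)
	{
		if (nout > 0 &&
			out[nout - 1].key == in[i].key &&
			out[nout - 1].ntids < max_tids_per_tuple)
		{
			/* same key and still room: extend the pending posting list */
			out[nout - 1].ntids++;
		}
		else
		{
			/* key changed or posting list is full: start a new tuple */
			out[nout].key = in[i].key;
			out[nout].ntids = 1;
			nout++;
		}
	}

	return nout;
}
```

In this model, six entries with keys 1,1,1,2,3,3 and a limit of two TIDs per tuple produce four tuples, which mirrors how a page can end up holding several tuples with the same key once a posting list is full.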
diff --git a/src/backend/access/nbtree/nbtsplitloc.c b/src/backend/access/nbtree/nbtsplitloc.c
index 1c1029b6c4..54cecc85c5 100644
--- a/src/backend/access/nbtree/nbtsplitloc.c
+++ b/src/backend/access/nbtree/nbtsplitloc.c
@@ -183,6 +183,9 @@ _bt_findsplitloc(Relation rel,
state.minfirstrightsz = SIZE_MAX;
state.newitemoff = newitemoff;
+ /* newitem cannot be a posting list item */
+ Assert(!BTreeTupleIsPosting(newitem));
+
/*
* maxsplits should never exceed maxoff because there will be at most as
* many candidate split points as there are points _between_ tuples, once
@@ -459,17 +462,52 @@ _bt_recsplitloc(FindSplitData *state,
int16 leftfree,
rightfree;
Size firstrightitemsz;
+ Size postingsubhikey = 0;
bool newitemisfirstonright;
/* Is the new item going to be the first item on the right page? */
newitemisfirstonright = (firstoldonright == state->newitemoff
&& !newitemonleft);
+ /*
+ * FIXME: Accessing every single tuple like this adds cycles to cases that
+ * cannot possibly benefit (i.e. cases where we know that there cannot be
+ * posting lists). Maybe we should add a way to not bother when we are
+ * certain that this is the case.
+ *
+ * We could either have _bt_split() pass us a flag, or invent a page flag
+ * that indicates that the page might have posting lists, as an
+ * optimization. There is no shortage of btpo_flags bits for stuff like
+ * this.
+ */
if (newitemisfirstonright)
+ {
firstrightitemsz = state->newitemsz;
+
+ /* Calculate posting list overhead, if any */
+ if (state->is_leaf && BTreeTupleIsPosting(state->newitem))
+ postingsubhikey = IndexTupleSize(state->newitem) -
+ BTreeTupleGetPostingOffset(state->newitem);
+ }
else
+ {
firstrightitemsz = firstoldonrightsz;
+ /* Calculate posting list overhead, if any */
+ if (state->is_leaf)
+ {
+ ItemId itemid;
+ IndexTuple newhighkey;
+
+ itemid = PageGetItemId(state->page, firstoldonright);
+ newhighkey = (IndexTuple) PageGetItem(state->page, itemid);
+
+ if (BTreeTupleIsPosting(newhighkey))
+ postingsubhikey = IndexTupleSize(newhighkey) -
+ BTreeTupleGetPostingOffset(newhighkey);
+ }
+ }
+
/* Account for all the old tuples */
leftfree = state->leftspace - olddataitemstoleft;
rightfree = state->rightspace -
@@ -492,9 +530,13 @@ _bt_recsplitloc(FindSplitData *state,
* adding a heap TID to the left half's new high key when splitting at the
* leaf level. In practice the new high key will often be smaller and
* will rarely be larger, but conservatively assume the worst case.
+ * Truncation always truncates away any posting list that appears in the
+ * first right tuple, though, so it's safe to subtract that overhead
+ * (while still conservatively assuming that truncation might have to add
+ * back a single heap TID using the pivot tuple heap TID representation).
*/
if (state->is_leaf)
- leftfree -= (int16) (firstrightitemsz +
+ leftfree -= (int16) ((firstrightitemsz - postingsubhikey) +
MAXALIGN(sizeof(ItemPointerData)));
else
leftfree -= (int16) firstrightitemsz;
@@ -691,7 +733,8 @@ _bt_afternewitemoff(FindSplitData *state, OffsetNumber maxoff,
itemid = PageGetItemId(state->page, OffsetNumberPrev(state->newitemoff));
tup = (IndexTuple) PageGetItem(state->page, itemid);
/* Do cheaper test first */
- if (!_bt_adjacenthtid(&tup->t_tid, &state->newitem->t_tid))
+ if (BTreeTupleIsPosting(tup) ||
+ !_bt_adjacenthtid(&tup->t_tid, &state->newitem->t_tid))
return false;
/* Check same conditions as rightmost item case, too */
keepnatts = _bt_keep_natts_fast(state->rel, tup, state->newitem);
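
The leftfree adjustment in _bt_recsplitloc() above is plain arithmetic, and a small worked example may help reviewers check it. This is a sketch under stated assumptions: 8-byte MAXALIGN and a 6-byte heap TID (typical for 64-bit builds), with SKETCH_MAXALIGN, SKETCH_TID_SIZE and worst_case_hikey_size being hypothetical names that do not exist in the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed values for illustration: 8-byte MAXALIGN, 6-byte ItemPointerData. */
#define SKETCH_MAXALIGN(len)	(((size_t) (len) + 7) & ~(size_t) 7)
#define SKETCH_TID_SIZE			6

/*
 * Worst-case space the new high key takes on the left page at the leaf
 * level: the first right tuple without its posting list (the patch calls
 * the subtracted part postingsubhikey), plus room for one heap TID in the
 * pivot representation.
 */
static size_t
worst_case_hikey_size(size_t firstright_size, size_t posting_list_size)
{
	assert(posting_list_size <= firstright_size);

	return (firstright_size - posting_list_size) +
		SKETCH_MAXALIGN(SKETCH_TID_SIZE);
}
```

With these assumptions, a 16-byte plain tuple and a 64-byte posting tuple whose posting list occupies 48 bytes both yield the same 24-byte worst case, which is the point of subtracting the posting list overhead.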
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 9b172c1a19..13c767164d 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -97,8 +97,6 @@ _bt_mkscankey(Relation rel, IndexTuple itup)
indoption = rel->rd_indoption;
tupnatts = itup ? BTreeTupleGetNAtts(itup, rel) : 0;
- Assert(tupnatts <= IndexRelationGetNumberOfAttributes(rel));
-
/*
* We'll execute search using scan key constructed on key columns.
* Truncated attributes and non-key attributes are omitted from the final
@@ -110,9 +108,20 @@ _bt_mkscankey(Relation rel, IndexTuple itup)
key->anynullkeys = false; /* initial assumption */
key->nextkey = false;
key->pivotsearch = false;
+ key->scantid = NULL;
key->keysz = Min(indnkeyatts, tupnatts);
- key->scantid = key->heapkeyspace && itup ?
- BTreeTupleGetHeapTID(itup) : NULL;
+
+ Assert(tupnatts <= IndexRelationGetNumberOfAttributes(rel));
+ Assert(!itup || !BTreeTupleIsPosting(itup) || key->heapkeyspace);
+
+ /*
+ * When caller passes a tuple with a heap TID, use it to set scantid.
+ * Note that this handles posting list tuples by setting scantid to the
+ * lowest heap TID in the posting list.
+ */
+ if (itup && key->heapkeyspace)
+ key->scantid = BTreeTupleGetHeapTID(itup);
+
skey = key->scankeys;
for (i = 0; i < indnkeyatts; i++)
{
@@ -1787,7 +1796,9 @@ _bt_killitems(IndexScanDesc scan)
ItemId iid = PageGetItemId(page, offnum);
IndexTuple ituple = (IndexTuple) PageGetItem(page, iid);
- if (ItemPointerEquals(&ituple->t_tid, &kitem->heapTid))
+ /* Never mark line pointers for posting list tuples */
+ if (!BTreeTupleIsPosting(ituple) &&
+ (ItemPointerEquals(&ituple->t_tid, &kitem->heapTid)))
{
/* found the item */
ItemIdMarkDead(iid);
@@ -2145,6 +2156,24 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
pivot = index_truncate_tuple(itupdesc, firstright, keepnatts);
+ if (BTreeTupleIsPosting(firstright))
+ {
+ BTreeTupleClearBtIsPosting(pivot);
+ BTreeTupleSetNAtts(pivot, keepnatts);
+ if (keepnatts == natts)
+ {
+ /*
+ * index_truncate_tuple() just returned a copy of the
+ * original, so make sure that the size of the new pivot tuple
+ * doesn't have posting list overhead
+ */
+ pivot->t_info &= ~INDEX_SIZE_MASK;
+ pivot->t_info |= MAXALIGN(BTreeTupleGetPostingOffset(firstright));
+ }
+ }
+
+ Assert(!BTreeTupleIsPosting(pivot));
+
/*
* If there is a distinguishing key attribute within new pivot tuple,
* there is no need to add an explicit heap TID attribute
@@ -2161,6 +2190,8 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
* attribute to the new pivot tuple.
*/
Assert(natts != nkeyatts);
+ Assert(!BTreeTupleIsPosting(lastleft));
+ Assert(!BTreeTupleIsPosting(firstright));
newsize = IndexTupleSize(pivot) + MAXALIGN(sizeof(ItemPointerData));
tidpivot = palloc0(newsize);
memcpy(tidpivot, pivot, IndexTupleSize(pivot));
@@ -2168,6 +2199,26 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
pfree(pivot);
pivot = tidpivot;
}
+ else if (BTreeTupleIsPosting(firstright))
+ {
+ /*
+ * No truncation was possible, since key attributes are all equal. We
+ * can always truncate away a posting list, though.
+ *
+ * It's necessary to add a heap TID attribute to the new pivot tuple.
+ */
+ newsize = MAXALIGN(BTreeTupleGetPostingOffset(firstright)) +
+ MAXALIGN(sizeof(ItemPointerData));
+ pivot = palloc0(newsize);
+ memcpy(pivot, firstright, BTreeTupleGetPostingOffset(firstright));
+
+ pivot->t_info &= ~INDEX_SIZE_MASK;
+ pivot->t_info |= newsize;
+ BTreeTupleClearBtIsPosting(pivot);
+ BTreeTupleSetAltHeapTID(pivot);
+
+ Assert(!BTreeTupleIsPosting(pivot));
+ }
else
{
/*
@@ -2175,7 +2226,8 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
* It's necessary to add a heap TID attribute to the new pivot tuple.
*/
Assert(natts == nkeyatts);
- newsize = IndexTupleSize(firstright) + MAXALIGN(sizeof(ItemPointerData));
+ newsize = MAXALIGN(IndexTupleSize(firstright)) +
+ MAXALIGN(sizeof(ItemPointerData));
pivot = palloc0(newsize);
memcpy(pivot, firstright, IndexTupleSize(firstright));
}
@@ -2205,7 +2257,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
*/
pivotheaptid = (ItemPointer) ((char *) pivot + newsize -
sizeof(ItemPointerData));
- ItemPointerCopy(&lastleft->t_tid, pivotheaptid);
+ ItemPointerCopy(BTreeTupleGetMaxTID(lastleft), pivotheaptid);
/*
* Lehman and Yao require that the downlink to the right page, which is to
@@ -2216,9 +2268,12 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
* tiebreaker.
*/
#ifndef DEBUG_NO_TRUNCATE
- Assert(ItemPointerCompare(&lastleft->t_tid, &firstright->t_tid) < 0);
- Assert(ItemPointerCompare(pivotheaptid, &lastleft->t_tid) >= 0);
- Assert(ItemPointerCompare(pivotheaptid, &firstright->t_tid) < 0);
+ Assert(ItemPointerCompare(BTreeTupleGetMaxTID(lastleft),
+ BTreeTupleGetHeapTID(firstright)) < 0);
+ Assert(ItemPointerCompare(pivotheaptid,
+ BTreeTupleGetHeapTID(lastleft)) >= 0);
+ Assert(ItemPointerCompare(pivotheaptid,
+ BTreeTupleGetHeapTID(firstright)) < 0);
#else
/*
@@ -2231,7 +2286,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
* attribute values along with lastleft's heap TID value when lastleft's
* TID happens to be greater than firstright's TID.
*/
- ItemPointerCopy(&firstright->t_tid, pivotheaptid);
+ ItemPointerCopy(BTreeTupleGetHeapTID(firstright), pivotheaptid);
/*
* Pivot heap TID should never be fully equal to firstright. Note that
@@ -2240,7 +2295,8 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
*/
ItemPointerSetOffsetNumber(pivotheaptid,
OffsetNumberPrev(ItemPointerGetOffsetNumber(pivotheaptid)));
- Assert(ItemPointerCompare(pivotheaptid, &firstright->t_tid) < 0);
+ Assert(ItemPointerCompare(pivotheaptid,
+ BTreeTupleGetHeapTID(firstright)) < 0);
#endif
BTreeTupleSetNAtts(pivot, nkeyatts);
@@ -2330,6 +2386,18 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
* leaving excessive amounts of free space on either side of page split.
* Callers can rely on the fact that attributes considered equal here are
* definitely also equal according to _bt_keep_natts.
+ *
+ * When an index only uses opclasses where equality is "precise", this
+ * function is guaranteed to give the same result as _bt_keep_natts(). This
+ * makes it safe to use this function to determine whether or not two tuples
+ * can be folded together into a single posting tuple. Posting list
+ * deduplication cannot be used with nondeterministic collations for this
+ * reason.
+ *
+ * FIXME: Actually invent the needed "equality-is-precise" opclass
+ * infrastructure. See dedicated -hackers thread:
+ *
+ * https://postgr.es/m/CAH2-Wzn3Ee49Gmxb7V1VJ3-AC8fWn-Fr8pfWQebHe8rYRxt5OQ@mail.gmail.com
*/
int
_bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
@@ -2354,8 +2422,38 @@ _bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
if (isNull1 != isNull2)
break;
+ /*
+ * XXX: The ideal outcome from the point of view of the posting list
+ * patch is that the definition of an opclass with "precise equality"
+ * becomes: "equality operator function must give exactly the same
+ * answer as datum_image_eq() would, provided that we aren't using a
+ * nondeterministic collation". (Nondeterministic collations are
+ * clearly not compatible with deduplication.)
+ *
+ * This will be a lot faster than actually using the authoritative
+ * insertion scankey in some cases. This approach also seems more
+ * elegant, since suffix truncation gets to follow exactly the same
+ * definition of "equal" as posting list deduplication -- there is a
+ * subtle interplay between deduplication and suffix truncation, and
+ * it would be nice to know for sure that they have exactly the same
+ * idea about what equality is.
+ *
+ * This ideal outcome still avoids problems with TOAST. We cannot
+ * repeat bugs like the amcheck bug that was fixed in bugfix commit
+ * eba775345d23d2c999bbb412ae658b6dab36e3e8. datum_image_eq()
+ * considers binary equality, though only _after_ each datum is
+ * decompressed.
+ *
+ * If this ideal solution isn't possible, then we can fall back on
+ * defining "precise equality" as: "type's output function must
+ * produce identical textual output for any two datums that compare
+ * equal when using a safe/equality-is-precise operator class (unless
+ * using a nondeterministic collation)". That would mean that we'd
+ * have to make deduplication call _bt_keep_natts() instead (or some
+ * other function that uses authoritative insertion scankey).
+ */
if (!isNull1 &&
- !datumIsEqual(datum1, datum2, att->attbyval, att->attlen))
+ !datum_image_eq(datum1, datum2, att->attbyval, att->attlen))
break;
keepnatts++;
@@ -2415,7 +2513,7 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
* Non-pivot tuples currently never use alternative heap TID
* representation -- even those within heapkeyspace indexes
*/
- if ((itup->t_info & INDEX_ALT_TID_MASK) != 0)
+ if (BTreeTupleIsPivot(itup))
return false;
/*
@@ -2470,7 +2568,7 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
* that to decide if the tuple is a pre-v11 tuple.
*/
return tupnatts == 0 ||
- ((itup->t_info & INDEX_ALT_TID_MASK) == 0 &&
+ (!BTreeTupleIsPivot(itup) &&
ItemPointerGetOffsetNumber(&(itup->t_tid)) == P_HIKEY);
}
else
@@ -2497,7 +2595,7 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
* heapkeyspace index pivot tuples, regardless of whether or not there are
* non-key attributes.
*/
- if ((itup->t_info & INDEX_ALT_TID_MASK) == 0)
+ if (!BTreeTupleIsPivot(itup))
return false;
/*
@@ -2567,11 +2665,87 @@ _bt_check_third_page(Relation rel, Relation heap, bool needheaptidspace,
BTMaxItemSizeNoHeapTid(page),
RelationGetRelationName(rel)),
errdetail("Index row references tuple (%u,%u) in relation \"%s\".",
- ItemPointerGetBlockNumber(&newtup->t_tid),
- ItemPointerGetOffsetNumber(&newtup->t_tid),
+ ItemPointerGetBlockNumber(BTreeTupleGetHeapTID(newtup)),
+ ItemPointerGetOffsetNumber(BTreeTupleGetHeapTID(newtup)),
RelationGetRelationName(heap)),
errhint("Values larger than 1/3 of a buffer page cannot be indexed.\n"
"Consider a function index of an MD5 hash of the value, "
"or use full text indexing."),
errtableconstraint(heap, RelationGetRelationName(rel))));
}
+
+/*
+ * Given a basic tuple that contains the key datum, and a posting list,
+ * build a posting tuple.
+ *
+ * The basic tuple can itself be a posting tuple, but only its key part is
+ * used; all ItemPointers must be passed via ipd.
+ *
+ * If nipd == 1, fall back to building a non-posting tuple.
+ * This is necessary to avoid storage overhead after a posting tuple has been
+ * vacuumed.
+ */
+IndexTuple
+BTreeFormPostingTuple(IndexTuple tuple, ItemPointerData *ipd, int nipd)
+{
+ uint32 keysize,
+ newsize = 0;
+ IndexTuple itup;
+
+ /* We only need key part of the tuple */
+ if (BTreeTupleIsPosting(tuple))
+ keysize = BTreeTupleGetPostingOffset(tuple);
+ else
+ keysize = IndexTupleSize(tuple);
+
+ Assert(nipd > 0);
+
+ /* Add space needed for posting list */
+ if (nipd > 1)
+ newsize = SHORTALIGN(keysize) + sizeof(ItemPointerData) * nipd;
+ else
+ newsize = keysize;
+
+ newsize = MAXALIGN(newsize);
+ itup = palloc0(newsize);
+ memcpy(itup, tuple, keysize);
+ itup->t_info &= ~INDEX_SIZE_MASK;
+ itup->t_info |= newsize;
+
+ if (nipd > 1)
+ {
+ /* Form posting tuple, fill posting fields */
+
+ /* Set meta info about the posting list */
+ itup->t_info |= INDEX_ALT_TID_MASK;
+ BTreeSetPostingMeta(itup, nipd, SHORTALIGN(keysize));
+
+ /* sort the list to preserve TID order invariant */
+ qsort((void *) ipd, nipd, sizeof(ItemPointerData),
+ (int (*) (const void *, const void *)) ItemPointerCompare);
+
+ /* Copy posting list into the posting tuple */
+ memcpy(BTreeTupleGetPosting(itup), ipd,
+ sizeof(ItemPointerData) * nipd);
+ }
+ else
+ {
+ /* To finish building of a non-posting tuple, copy TID from ipd */
+ itup->t_info &= ~INDEX_ALT_TID_MASK;
+ ItemPointerCopy(ipd, &itup->t_tid);
+ }
+
+ return itup;
+}
+
+/*
+ * Opposite of BTreeFormPostingTuple.
+ * Returns a regular tuple that contains the key; the TID of the new tuple
+ * is the nth TID of the original tuple's posting list.
+ * The result tuple is palloc'd in the caller's context.
+ */
+IndexTuple
+BTreeGetNthTupleOfPosting(IndexTuple tuple, int n)
+{
+ Assert(BTreeTupleIsPosting(tuple));
+ return BTreeFormPostingTuple(tuple, BTreeTupleGetPostingN(tuple, n), 1);
+}
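
The size computation in BTreeFormPostingTuple() above can be checked by hand. The following is a rough sketch of that arithmetic only, assuming 8-byte MAXALIGN, 2-byte SHORTALIGN and a 6-byte ItemPointerData; the SKETCH_* macros and posting_tuple_size are hypothetical names, not part of the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed alignment and TID size, for illustration only. */
#define SKETCH_SHORTALIGN(len)	(((size_t) (len) + 1) & ~(size_t) 1)
#define SKETCH_MAXALIGN(len)	(((size_t) (len) + 7) & ~(size_t) 7)
#define SKETCH_TID_SIZE			6

/*
 * Size of the tuple built for a key part of keysize bytes and nipd TIDs.
 * With a single TID no posting list is stored, because the TID lives in
 * the tuple header instead.
 */
static size_t
posting_tuple_size(size_t keysize, int nipd)
{
	size_t		newsize;

	assert(nipd > 0);

	if (nipd > 1)
		newsize = SKETCH_SHORTALIGN(keysize) +
			(size_t) nipd * SKETCH_TID_SIZE;
	else
		newsize = keysize;

	return SKETCH_MAXALIGN(newsize);
}
```

Under these assumptions, a 16-byte tuple carrying three TIDs becomes a 40-byte posting tuple, compared with 48 bytes for three separate 16-byte tuples (before counting their three line pointers).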
diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index dd5315c1aa..d4d7c09ff0 100644
--- a/src/backend/access/nbtree/nbtxlog.c
+++ b/src/backend/access/nbtree/nbtxlog.c
@@ -178,12 +178,34 @@ btree_xlog_insert(bool isleaf, bool ismeta, XLogReaderState *record)
{
Size datalen;
char *datapos = XLogRecGetBlockData(record, 0, &datalen);
+ IndexTuple nposting = NULL;
page = BufferGetPage(buffer);
- if (PageAddItem(page, (Item) datapos, datalen, xlrec->offnum,
- false, false) == InvalidOffsetNumber)
- elog(PANIC, "btree_xlog_insert: failed to add item");
+ if (xlrec->postingsz > 0)
+ {
+ IndexTuple oposting;
+
+ Assert(isleaf);
+
+ /* oposting must be at offset before new item */
+ oposting = (IndexTuple) PageGetItem(page,
+ PageGetItemId(page, OffsetNumberPrev(xlrec->offnum)));
+ if (PageAddItem(page, (Item) datapos, xlrec->postingsz,
+ xlrec->offnum, false, false) == InvalidOffsetNumber)
+ elog(PANIC, "btree_xlog_insert: failed to add item");
+ nposting = (IndexTuple) (datapos + xlrec->postingsz);
+
+ Assert(MAXALIGN(IndexTupleSize(oposting)) ==
+ MAXALIGN(IndexTupleSize(nposting)));
+ memcpy(oposting, nposting, MAXALIGN(IndexTupleSize(nposting)));
+ }
+ else
+ {
+ if (PageAddItem(page, (Item) datapos, datalen, xlrec->offnum,
+ false, false) == InvalidOffsetNumber)
+ elog(PANIC, "btree_xlog_insert: failed to add item");
+ }
PageSetLSN(page, lsn);
MarkBufferDirty(buffer);
@@ -265,9 +287,11 @@ btree_xlog_split(bool onleft, XLogReaderState *record)
BTPageOpaque lopaque = (BTPageOpaque) PageGetSpecialPointer(lpage);
OffsetNumber off;
IndexTuple newitem = NULL,
- left_hikey = NULL;
+ left_hikey = NULL,
+ nposting = NULL;
Size newitemsz = 0,
- left_hikeysz = 0;
+ left_hikeysz = 0,
+ npostingsz = 0;
Page newlpage;
OffsetNumber leftoff;
@@ -281,6 +305,17 @@ btree_xlog_split(bool onleft, XLogReaderState *record)
datalen -= newitemsz;
}
+ if (xlrec->replacepostingoff)
+ {
+ Assert(xlrec->replacepostingoff ==
+ OffsetNumberPrev(xlrec->newitemoff));
+
+ nposting = (IndexTuple) datapos;
+ npostingsz = MAXALIGN(IndexTupleSize(nposting));
+ datapos += npostingsz;
+ datalen -= npostingsz;
+ }
+
/* Extract left hikey and its size (assuming 16-bit alignment) */
left_hikey = (IndexTuple) datapos;
left_hikeysz = MAXALIGN(IndexTupleSize(left_hikey));
@@ -304,6 +339,15 @@ btree_xlog_split(bool onleft, XLogReaderState *record)
Size itemsz;
IndexTuple item;
+ if (off == xlrec->replacepostingoff)
+ {
+ if (PageAddItem(newlpage, (Item) nposting, npostingsz,
+ leftoff, false, false) == InvalidOffsetNumber)
+ elog(ERROR, "failed to add new item to left page after split");
+ leftoff = OffsetNumberNext(leftoff);
+ continue;
+ }
+
/* add the new item if it was inserted on left page */
if (onleft && off == xlrec->newitemoff)
{
@@ -386,8 +430,8 @@ btree_xlog_vacuum(XLogReaderState *record)
Buffer buffer;
Page page;
BTPageOpaque opaque;
-#ifdef UNUSED
xl_btree_vacuum *xlrec = (xl_btree_vacuum *) XLogRecGetData(record);
+#ifdef UNUSED
/*
* This section of code is thought to be no longer needed, after analysis
@@ -478,14 +522,34 @@ btree_xlog_vacuum(XLogReaderState *record)
if (len > 0)
{
- OffsetNumber *unused;
- OffsetNumber *unend;
+ if (xlrec->nremaining)
+ {
+ OffsetNumber *remainingoffset;
+ IndexTuple remaining;
+ Size itemsz;
- unused = (OffsetNumber *) ptr;
- unend = (OffsetNumber *) ((char *) ptr + len);
+ remainingoffset = (OffsetNumber *)
+ (ptr + xlrec->ndeleted * sizeof(OffsetNumber));
+ remaining = (IndexTuple) ((char *) remainingoffset +
+ xlrec->nremaining * sizeof(OffsetNumber));
- if ((unend - unused) > 0)
- PageIndexMultiDelete(page, unused, unend - unused);
+ /* Handle posting tuples */
+ for (int i = 0; i < xlrec->nremaining; i++)
+ {
+ PageIndexTupleDelete(page, remainingoffset[i]);
+
+ itemsz = MAXALIGN(IndexTupleSize(remaining));
+
+ if (PageAddItem(page, (Item) remaining, itemsz, remainingoffset[i],
+ false, false) == InvalidOffsetNumber)
+ elog(PANIC, "btree_xlog_vacuum: failed to add remaining item");
+
+ remaining = (IndexTuple) ((char *) remaining + itemsz);
+ }
+ }
+
+ if (xlrec->ndeleted)
+ PageIndexMultiDelete(page, (OffsetNumber *) ptr, xlrec->ndeleted);
}
/*
diff --git a/src/backend/access/rmgrdesc/nbtdesc.c b/src/backend/access/rmgrdesc/nbtdesc.c
index a14eb792ec..6f71b13199 100644
--- a/src/backend/access/rmgrdesc/nbtdesc.c
+++ b/src/backend/access/rmgrdesc/nbtdesc.c
@@ -30,7 +30,8 @@ btree_desc(StringInfo buf, XLogReaderState *record)
{
xl_btree_insert *xlrec = (xl_btree_insert *) rec;
- appendStringInfo(buf, "off %u", xlrec->offnum);
+ appendStringInfo(buf, "off %u; postingsz %u",
+ xlrec->offnum, xlrec->postingsz);
break;
}
case XLOG_BTREE_SPLIT_L:
@@ -38,6 +39,7 @@ btree_desc(StringInfo buf, XLogReaderState *record)
{
xl_btree_split *xlrec = (xl_btree_split *) rec;
+ /* FIXME: even master doesn't have newitemoff */
appendStringInfo(buf, "level %u, firstright %d",
xlrec->level, xlrec->firstright);
break;
@@ -46,8 +48,10 @@ btree_desc(StringInfo buf, XLogReaderState *record)
{
xl_btree_vacuum *xlrec = (xl_btree_vacuum *) rec;
- appendStringInfo(buf, "lastBlockVacuumed %u",
- xlrec->lastBlockVacuumed);
+ appendStringInfo(buf, "lastBlockVacuumed %u; nremaining %u; ndeleted %u",
+ xlrec->lastBlockVacuumed,
+ xlrec->nremaining,
+ xlrec->ndeleted);
break;
}
case XLOG_BTREE_DELETE:
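
The ndeleted and nremaining fields printed for the vacuum record above determine how btree_xlog_vacuum() in this patch walks the block data. As a hedged sketch of that layout only (VacuumPayloadLayout and vacuum_payload_layout are hypothetical names, and OffsetNumber is assumed to be a 16-bit integer):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Byte offsets, relative to the start of the block data, of each part. */
typedef struct
{
	size_t		deleted_off;	/* array of ndeleted offsets */
	size_t		remaining_off;	/* array of nremaining offsets */
	size_t		tuples_off;		/* replacement tuples, back to back */
} VacuumPayloadLayout;

/*
 * Compute where each part starts: deleted offsets first, then the offsets
 * of posting tuples to replace, then the replacement tuples themselves.
 */
static VacuumPayloadLayout
vacuum_payload_layout(unsigned ndeleted, unsigned nremaining)
{
	VacuumPayloadLayout layout;

	layout.deleted_off = 0;
	layout.remaining_off = (size_t) ndeleted * sizeof(uint16_t);
	layout.tuples_off = layout.remaining_off +
		(size_t) nremaining * sizeof(uint16_t);

	return layout;
}
```

For example, with three deleted items and two rewritten posting tuples, the replacement tuples start 10 bytes into the block data in this model.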
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 52eafe6b00..a3dec41f0a 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -234,8 +234,7 @@ typedef struct BTMetaPageData
* t_tid | t_info | key values | INCLUDE columns, if any
*
* t_tid points to the heap TID, which is a tiebreaker key column as of
- * BTREE_VERSION 4. Currently, the INDEX_ALT_TID_MASK status bit is never
- * set for non-pivot tuples.
+ * BTREE_VERSION 4.
*
* All other types of index tuples ("pivot" tuples) only have key columns,
* since pivot tuples only exist to represent how the key space is
@@ -252,6 +251,38 @@ typedef struct BTMetaPageData
* omitted rather than truncated, since its representation is different to
* the non-pivot representation.)
*
+ * Non-pivot posting tuple format:
+ * t_tid | t_info | key values | INCLUDE columns, if any | posting_list[]
+ *
+ * To store duplicate keys more efficiently, we use a special tuple format:
+ * posting tuples. posting_list is an array of ItemPointerData.
+ *
+ * Deduplication never applies to unique indexes or indexes with INCLUDEd
+ * columns.
+ *
+ * To distinguish posting tuples, we use the INDEX_ALT_TID_MASK flag in t_info
+ * and the BT_IS_POSTING flag in t_tid.
+ * These flags redefine the content of the posting tuple's t_tid:
+ * - t_tid.ip_blkid contains the offset of the posting list.
+ * - the t_tid offset field contains the number of posting items this tuple
+ * contains.
+ *
+ * The 12 least significant offset bits from t_tid are used to represent
+ * the number of posting items in posting tuples, leaving 4 status
+ * bits (BT_RESERVED_OFFSET_MASK bits), 3 of which are reserved for
+ * future use.
+ * BT_N_POSTING_OFFSET_MASK is large enough to store any number of posting
+ * items, which is constrained by BTMaxItemSize.
+ *
+ * If a page contains so many duplicates that they do not fit into one posting
+ * tuple (bounded by BTMaxItemSize and ), the page may contain several posting
+ * tuples with the same key.
+ * A page can also contain both posting and non-posting tuples with the same key.
+ * Currently, posting tuples always contain at least two TIDs in the posting
+ * list.
+ *
+ * Posting tuples always have the same number of attributes as the index has
+ * generally.
+ *
* Pivot tuple format:
*
* t_tid | t_info | key values | [heap TID]
@@ -281,23 +312,144 @@ typedef struct BTMetaPageData
* bits (BT_RESERVED_OFFSET_MASK bits), 3 of which that are reserved for
* future use. BT_N_KEYS_OFFSET_MASK should be large enough to store any
* number of columns/attributes <= INDEX_MAX_KEYS.
+ * The BT_IS_POSTING bit must be unset for pivot tuples, since we use it
+ * to distinguish posting tuples from pivot tuples.
*
* Note well: The macros that deal with the number of attributes in tuples
- * assume that a tuple with INDEX_ALT_TID_MASK set must be a pivot tuple,
- * and that a tuple without INDEX_ALT_TID_MASK set must be a non-pivot
- * tuple (or must have the same number of attributes as the index has
- * generally in the case of !heapkeyspace indexes). They will need to be
- * updated if non-pivot tuples ever get taught to use INDEX_ALT_TID_MASK
- * for something else.
+ * assume that a tuple with INDEX_ALT_TID_MASK set must be a pivot tuple or
+ * non-pivot posting tuple, and that a tuple without INDEX_ALT_TID_MASK set
+ * must be a non-pivot tuple (or must have the same number of attributes as
+ * the index has generally in the case of !heapkeyspace indexes).
*/
#define INDEX_ALT_TID_MASK INDEX_AM_RESERVED_BIT
/* Item pointer offset bits */
#define BT_RESERVED_OFFSET_MASK 0xF000
#define BT_N_KEYS_OFFSET_MASK 0x0FFF
+#define BT_N_POSTING_OFFSET_MASK 0x0FFF
#define BT_HEAP_TID_ATTR 0x1000
+#define BT_IS_POSTING 0x2000
-/* Get/set downlink block number */
+#define BTreeTupleIsPosting(itup) \
+ ( \
+ ((itup)->t_info & INDEX_ALT_TID_MASK && \
+ ((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) != 0))\
+ )
+
+#define BTreeTupleIsPivot(itup) \
+ ( \
+ ((itup)->t_info & INDEX_ALT_TID_MASK && \
+ ((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) == 0))\
+ )
+
+/*
+ * MaxPostingIndexTuplesPerPage is an upper bound on the number of tuples
+ * that can fit on one btree leaf page.
+ *
+ * Btree leaf pages may contain posting tuples, which store duplicates
+ * more efficiently, so MaxPostingIndexTuplesPerPage is larger than
+ * MaxIndexTuplesPerPage.
+ *
+ * Each leaf page must contain at least three items, so estimate it as
+ * if we have three posting tuples with minimal size keys.
+ */
+#define MaxPostingIndexTuplesPerPage \
+ ((int) ((BLCKSZ - SizeOfPageHeaderData - \
+ 3*((MAXALIGN(sizeof(IndexTupleData) + 1) + sizeof(ItemIdData))) )) / \
+ (sizeof(ItemPointerData)))
+
+/*
+ * Btree-private state needed to build posting tuples.
+ * ipd is a posting list - an array of ItemPointerData.
+ *
+ * Iterating over tuples during index build or applying deduplication to a
+ * single page, we remember a tuple in itupprev, then compare the next one
+ * with it. If tuples are equal, save their TIDs in the posting list.
+ * ntuples contains the size of the posting list.
+ *
+ * Use maxitemsize and maxpostingsize to ensure that resulting posting tuple
+ * will satisfy BTMaxItemSize.
+ */
+typedef struct BTDedupState
+{
+ Size maxitemsize;
+ Size maxpostingsize;
+ IndexTuple itupprev;
+ int ntuples;
+ ItemPointerData *ipd;
+} BTDedupState;
+
+/* macros to work with posting tuples *BEGIN* */
+#define BTreeTupleSetBtIsPosting(itup) \
+ do { \
+ Assert((itup)->t_info & INDEX_ALT_TID_MASK); \
+ Assert(!((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) != 0)); \
+ ItemPointerSetOffsetNumber(&(itup)->t_tid, \
+ ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) | BT_IS_POSTING); \
+ } while(0)
+
+#define BTreeTupleClearBtIsPosting(itup) \
+ do { \
+ ItemPointerSetOffsetNumber(&(itup)->t_tid, \
+ ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & ~BT_IS_POSTING); \
+ } while(0)
+
+#define BTreeTupleGetNPosting(itup) \
+ ( \
+ AssertMacro(BTreeTupleIsPosting(itup)), \
+ ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_N_POSTING_OFFSET_MASK \
+ )
+
+#define BTreeTupleSetNPosting(itup, n) \
+ do { \
+ ItemPointerSetOffsetNumber(&(itup)->t_tid, (n) & BT_N_POSTING_OFFSET_MASK); \
+ BTreeTupleSetBtIsPosting(itup); \
+ } while(0)
+
+/*
+ * If tuple is posting, t_tid.ip_blkid contains offset of the posting list.
+ * Caller is responsible for checking BTreeTupleIsPosting to ensure that it
+ * will get what is expected.
+ */
+#define BTreeTupleGetPostingOffset(itup) \
+ ( \
+ AssertMacro(BTreeTupleIsPosting(itup)), \
+ ItemPointerGetBlockNumberNoCheck(&((itup)->t_tid)) \
+ )
+#define BTreeTupleSetPostingOffset(itup, offset) \
+ ( \
+ AssertMacro(BTreeTupleIsPosting(itup)), \
+ ItemPointerSetBlockNumber(&((itup)->t_tid), (offset)) \
+ )
+#define BTreeSetPostingMeta(itup, nposting, off) \
+ do { \
+ BTreeTupleSetNPosting(itup, nposting); \
+ BTreeTupleSetPostingOffset(itup, off); \
+ } while(0)
+
+#define BTreeTupleGetPosting(itup) \
+ (ItemPointer) ((char*) (itup) + BTreeTupleGetPostingOffset(itup))
+#define BTreeTupleGetPostingN(itup,n) \
+ (BTreeTupleGetPosting(itup) + (n))
+
+/*
+ * Posting tuples always contain more than one TID. The minimum TID can be
+ * accessed using BTreeTupleGetHeapTID(). The maximum is accessed using
+ * BTreeTupleGetMaxTID().
+ */
+#define BTreeTupleGetMaxTID(itup) \
+ ( \
+ ((itup)->t_info & INDEX_ALT_TID_MASK && \
+ ((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING))) ? \
+ ( \
+ (ItemPointer) (BTreeTupleGetPosting(itup) + (BTreeTupleGetNPosting(itup)-1)) \
+ ) \
+ : \
+ (ItemPointer) &((itup)->t_tid) \
+ )
+/* macros to work with posting tuples *END* */
+
+/* Get/set downlink block number */
#define BTreeInnerTupleGetDownLink(itup) \
ItemPointerGetBlockNumberNoCheck(&((itup)->t_tid))
#define BTreeInnerTupleSetDownLink(itup, blkno) \
@@ -326,7 +478,8 @@ typedef struct BTMetaPageData
*/
#define BTreeTupleGetNAtts(itup, rel) \
( \
- (itup)->t_info & INDEX_ALT_TID_MASK ? \
+ ((itup)->t_info & INDEX_ALT_TID_MASK && \
+ ((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) == 0)) ? \
( \
ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_N_KEYS_OFFSET_MASK \
) \
@@ -335,6 +488,7 @@ typedef struct BTMetaPageData
)
#define BTreeTupleSetNAtts(itup, n) \
do { \
+ Assert(!BTreeTupleIsPosting(itup)); \
(itup)->t_info |= INDEX_ALT_TID_MASK; \
ItemPointerSetOffsetNumber(&(itup)->t_tid, (n) & BT_N_KEYS_OFFSET_MASK); \
} while(0)
@@ -342,6 +496,8 @@ typedef struct BTMetaPageData
/*
* Get tiebreaker heap TID attribute, if any. Macro works with both pivot
* and non-pivot tuples, despite differences in how heap TID is represented.
+ *
+ * For non-pivot posting tuples this returns the first tid from posting list.
*/
#define BTreeTupleGetHeapTID(itup) \
( \
@@ -351,7 +507,10 @@ typedef struct BTMetaPageData
(ItemPointer) (((char *) (itup) + IndexTupleSize(itup)) - \
sizeof(ItemPointerData)) \
) \
- : (itup)->t_info & INDEX_ALT_TID_MASK ? NULL : (ItemPointer) &((itup)->t_tid) \
+ : (itup)->t_info & INDEX_ALT_TID_MASK ? \
+ (((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) != 0) ? \
+ (ItemPointer) BTreeTupleGetPosting(itup) : NULL) \
+ : (ItemPointer) &((itup)->t_tid) \
)
/*
* Set the heap TID attribute for a tuple that uses the INDEX_ALT_TID_MASK
@@ -360,6 +519,7 @@ typedef struct BTMetaPageData
#define BTreeTupleSetAltHeapTID(itup) \
do { \
Assert((itup)->t_info & INDEX_ALT_TID_MASK); \
+ Assert(!((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) != 0)); \
ItemPointerSetOffsetNumber(&(itup)->t_tid, \
ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) | BT_HEAP_TID_ATTR); \
} while(0)
@@ -499,6 +659,12 @@ typedef struct BTInsertStateData
/* Buffer containing leaf page we're likely to insert itup on */
Buffer buf;
+ /*
+ * if _bt_binsrch_insert() found the location inside existing posting
+ * list, save the position inside the list.
+ */
+ int in_posting_offset;
+
/*
* Cache of bounds within the current buffer. Only used for insertions
* where _bt_check_unique is called. See _bt_binsrch_insert and
@@ -534,7 +700,9 @@ typedef BTInsertStateData *BTInsertState;
* If we are doing an index-only scan, we save the entire IndexTuple for each
* matched item, otherwise only its heap TID and offset. The IndexTuples go
* into a separate workspace array; each BTScanPosItem stores its tuple's
- * offset within that array.
+ * offset within that array. Posting list tuples store a version of the
+ * tuple that does not include the posting list, allowing the same key to be
+ * returned for each logical tuple associated with the posting list.
*/
typedef struct BTScanPosItem /* what we remember about each match */
@@ -563,9 +731,13 @@ typedef struct BTScanPosData
/*
* If we are doing an index-only scan, nextTupleOffset is the first free
- * location in the associated tuple storage workspace.
+ * location in the associated tuple storage workspace. Posting list
+ * tuples need postingTupleOffset to store the current location of the
+ * tuple that is returned multiple times (once per heap TID in posting
+ * list).
*/
int nextTupleOffset;
+ int postingTupleOffset;
/*
* The items array is always ordered in index order (ie, increasing
@@ -578,7 +750,7 @@ typedef struct BTScanPosData
int lastItem; /* last valid index in items[] */
int itemIndex; /* current index in items[] */
- BTScanPosItem items[MaxIndexTuplesPerPage]; /* MUST BE LAST */
+ BTScanPosItem items[MaxPostingIndexTuplesPerPage]; /* MUST BE LAST */
} BTScanPosData;
typedef BTScanPosData *BTScanPos;
@@ -762,6 +934,8 @@ extern void _bt_delitems_delete(Relation rel, Buffer buf,
OffsetNumber *itemnos, int nitems, Relation heapRel);
extern void _bt_delitems_vacuum(Relation rel, Buffer buf,
OffsetNumber *itemnos, int nitems,
+ OffsetNumber *remainingoffset,
+ IndexTuple *remaining, int nremaining,
BlockNumber lastBlockVacuumed);
extern int _bt_pagedel(Relation rel, Buffer buf);
@@ -812,6 +986,9 @@ extern bool _bt_check_natts(Relation rel, bool heapkeyspace, Page page,
OffsetNumber offnum);
extern void _bt_check_third_page(Relation rel, Relation heap,
bool needheaptidspace, Page page, IndexTuple newtup);
+extern IndexTuple BTreeFormPostingTuple(IndexTuple tuple, ItemPointerData *ipd,
+ int nipd);
+extern IndexTuple BTreeGetNthTupleOfPosting(IndexTuple tuple, int n);
/*
* prototypes for functions in nbtvalidate.c
@@ -824,5 +1001,6 @@ extern bool btvalidate(Oid opclassoid);
extern IndexBuildResult *btbuild(Relation heap, Relation index,
struct IndexInfo *indexInfo);
extern void _bt_parallel_build_main(dsm_segment *seg, shm_toc *toc);
+extern void _bt_add_posting_item(BTDedupState *dedupState, IndexTuple itup);
#endif /* NBTREE_H */
diff --git a/src/include/access/nbtxlog.h b/src/include/access/nbtxlog.h
index afa614da25..daa931377f 100644
--- a/src/include/access/nbtxlog.h
+++ b/src/include/access/nbtxlog.h
@@ -61,16 +61,26 @@ typedef struct xl_btree_metadata
* This data record is used for INSERT_LEAF, INSERT_UPPER, INSERT_META.
* Note that INSERT_META implies it's not a leaf page.
*
- * Backup Blk 0: original page (data contains the inserted tuple)
+ * Backup Blk 0: original page (data contains the inserted tuple);
+ * if postingsz is not 0, data also contains 'nposting' -
+ * tuple to replace original.
+ *
+ * TODO probably it would be enough to keep just a flag to point
+ * out that data contains 'nposting' and compute its offset as
+ * we know it follows the tuple, but I am afraid that it will
+ * break alignment, will it?
+ *
* Backup Blk 1: child's left sibling, if INSERT_UPPER or INSERT_META
* Backup Blk 2: xl_btree_metadata, if INSERT_META
+ *
*/
typedef struct xl_btree_insert
{
OffsetNumber offnum;
+ uint32 postingsz;
} xl_btree_insert;
-#define SizeOfBtreeInsert (offsetof(xl_btree_insert, offnum) + sizeof(OffsetNumber))
+#define SizeOfBtreeInsert (offsetof(xl_btree_insert, postingsz) + sizeof(uint32))
/*
* On insert with split, we save all the items going into the right sibling
@@ -96,6 +106,12 @@ typedef struct xl_btree_insert
* An IndexTuple representing the high key of the left page must follow with
* either variant.
*
+ * If the split included an insertion into the middle of a posting tuple, and
+ * thus required posting tuple replacement, it also contains 'nposting', which
+ * must replace the original posting tuple at offset replacepostingoff.
+ * TODO: a further optimization is to add it to xlog only if it remains on
+ * the left page.
+ *
* Backup Blk 1: new right page
*
* The right page's data portion contains the right page's tuples in the form
@@ -113,9 +129,10 @@ typedef struct xl_btree_split
uint32 level; /* tree level of page being split */
OffsetNumber firstright; /* first item moved to right page */
OffsetNumber newitemoff; /* new item's offset (if placed on left page) */
+ OffsetNumber replacepostingoff; /* offset of the posting item to replace */
} xl_btree_split;
-#define SizeOfBtreeSplit (offsetof(xl_btree_split, newitemoff) + sizeof(OffsetNumber))
+#define SizeOfBtreeSplit (offsetof(xl_btree_split, replacepostingoff) + sizeof(OffsetNumber))
/*
* This is what we need to know about delete of individual leaf index tuples.
@@ -173,10 +190,19 @@ typedef struct xl_btree_vacuum
{
BlockNumber lastBlockVacuumed;
- /* TARGET OFFSET NUMBERS FOLLOW */
+ /*
+ * This field helps us to find beginning of the remaining tuples from
+ * postings which follow array of offset numbers.
+ */
+ uint32 nremaining;
+ uint32 ndeleted;
+
+ /* REMAINING OFFSET NUMBERS FOLLOW (nremaining values) */
+ /* REMAINING TUPLES TO INSERT FOLLOW (if nremaining > 0) */
+ /* TARGET OFFSET NUMBERS FOLLOW (if any) */
} xl_btree_vacuum;
-#define SizeOfBtreeVacuum (offsetof(xl_btree_vacuum, lastBlockVacuumed) + sizeof(BlockNumber))
+#define SizeOfBtreeVacuum (offsetof(xl_btree_vacuum, ndeleted) + sizeof(uint32))
/*
* This is what we need to know about marking an empty branch for deletion.
diff --git a/src/tools/valgrind.supp b/src/tools/valgrind.supp
index ec47a228ae..71a03e39d3 100644
--- a/src/tools/valgrind.supp
+++ b/src/tools/valgrind.supp
@@ -212,3 +212,24 @@
Memcheck:Cond
fun:PyObject_Realloc
}
+
+# Temporarily work around bug in datum_image_eq's handling of the cstring
+# (typLen == -2) case. datumIsEqual() is not affected, but also doesn't handle
+# TOAST'ed values correctly.
+#
+# FIXME: Remove both suppressions when bug is fixed on master branch
+{
+ temporary_workaround_1
+ Memcheck:Addr1
+ fun:bcmp
+ fun:datum_image_eq
+ fun:_bt_keep_natts_fast
+}
+
+{
+ temporary_workaround_8
+ Memcheck:Addr8
+ fun:bcmp
+ fun:datum_image_eq
+ fun:_bt_keep_natts_fast
+}
--
2.17.1
On Mon, Sep 2, 2019 at 6:53 PM Peter Geoghegan <pg@bowt.ie> wrote:
Attached is v10, which fixes the Valgrind issue.
Attached is v11, which makes the kill_prior_tuple optimization work
with posting list tuples. The only catch is that it can only work when
all "logical tuples" within a posting list are known-dead, since of
course there is only one LP_DEAD bit available for each posting list.
The hardest part of this kill_prior_tuple work was writing the new
_bt_killitems() code, which I'm still not 100% happy with. Still, it
seems to work well -- new pageinspect LP_DEAD status info was added to
the second patch to verify that we're setting LP_DEAD bits as needed
for posting list tuples. I also had to add a new nbtree-specific,
posting-list-aware version of index_compute_xid_horizon_for_tuples()
-- _bt_compute_xid_horizon_for_tuples(). Finally, it was necessary to
avoid splitting a posting list with the LP_DEAD bit set. I took a
naive approach to avoiding that problem, adding code to
_bt_findinsertloc() to prevent it. Posting list splits are generally
assumed to be rare, so the fact that this is slightly inefficient
should be fine IMV.
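To make the "one LP_DEAD bit per posting list" rule above concrete, here is a minimal standalone sketch. It is not the _bt_killitems() code from the patch: SketchPostingTuple and sketch_try_mark_dead() are hypothetical names made up for illustration, and the struct is not the nbtree on-disk layout. It only shows the all-or-nothing decision, assuming the caller already knows which heap TIDs were reported dead.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical stand-in for a posting list tuple: one physical index tuple
 * that represents nposting "logical" tuples.  Not the real nbtree layout.
 */
typedef struct SketchPostingTuple
{
	int			nposting;	/* number of heap TIDs in the posting list */
	bool		lp_dead;	/* the single LP_DEAD bit shared by all TIDs */
} SketchPostingTuple;

/*
 * Set the shared LP_DEAD bit only if every logical tuple is known dead.
 * known_dead[i] is true when the i-th heap TID was reported dead.
 * Returns true if the bit was set, false if it was left untouched.
 */
static bool
sketch_try_mark_dead(SketchPostingTuple *tup, const bool *known_dead)
{
	assert(tup != NULL && known_dead != NULL);
	assert(tup->nposting > 0);

	for (int i = 0; i < tup->nposting; i++)
	{
		if (!known_dead[i])
			return false;	/* at least one live TID: leave the bit alone */
	}

	tup->lp_dead = true;
	return true;
}
```

With a three-TID posting list, passing {true, false, true} leaves lp_dead unset, while {true, true, true} sets it. A plain (non-posting) tuple is just the nposting == 1 case of the same check.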
I also refactored deduplication itself in anticipation of making the
WAL logging more efficient, and incremental. So, the structure of the
code within _bt_dedup_one_page() was simplified, without really
changing it very much (I think). I also fixed a bug in
_bt_dedup_one_page(). The check for dead items was broken in previous
versions, because the loop examined the high key tuple in every
iteration.
Making _bt_dedup_one_page() more efficient and incremental is still
the most important open item for the patch.
--
Peter Geoghegan
Attachments:
v11-0001-Add-deduplication-to-nbtree.patchapplication/octet-stream; name=v11-0001-Add-deduplication-to-nbtree.patchDownload
From c07e06ff1ee2a0c595cdf773546c69940db73dd6 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Thu, 29 Aug 2019 14:35:35 -0700
Subject: [PATCH v11 1/2] Add deduplication to nbtree.
---
contrib/amcheck/verify_nbtree.c | 128 +++++--
src/backend/access/nbtree/README | 76 +++-
src/backend/access/nbtree/nbtinsert.c | 462 +++++++++++++++++++++++-
src/backend/access/nbtree/nbtpage.c | 148 +++++++-
src/backend/access/nbtree/nbtree.c | 147 ++++++--
src/backend/access/nbtree/nbtsearch.c | 247 ++++++++++++-
src/backend/access/nbtree/nbtsort.c | 219 ++++++++++-
src/backend/access/nbtree/nbtsplitloc.c | 47 ++-
src/backend/access/nbtree/nbtutils.c | 264 ++++++++++++--
src/backend/access/nbtree/nbtxlog.c | 88 ++++-
src/backend/access/rmgrdesc/nbtdesc.c | 10 +-
src/include/access/nbtree.h | 242 +++++++++++--
src/include/access/nbtxlog.h | 36 +-
src/tools/valgrind.supp | 21 ++
14 files changed, 1957 insertions(+), 178 deletions(-)
diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index 05e7d678ed..399743d4d6 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -924,6 +924,7 @@ bt_target_page_check(BtreeCheckState *state)
size_t tupsize;
BTScanInsert skey;
bool lowersizelimit;
+ ItemPointer scantid;
CHECK_FOR_INTERRUPTS();
@@ -994,29 +995,73 @@ bt_target_page_check(BtreeCheckState *state)
/*
* Readonly callers may optionally verify that non-pivot tuples can
- * each be found by an independent search that starts from the root
+ * each be found by an independent search that starts from the root.
+ * Note that we deliberately don't do individual searches for each
+ * "logical" posting list tuple, since the posting list itself is
+ * validated by other checks.
*/
if (state->rootdescend && P_ISLEAF(topaque) &&
!bt_rootdescend(state, itup))
{
char *itid,
*htid;
+ ItemPointer tid = BTreeTupleGetHeapTID(itup);
itid = psprintf("(%u,%u)", state->targetblock, offset);
htid = psprintf("(%u,%u)",
- ItemPointerGetBlockNumber(&(itup->t_tid)),
- ItemPointerGetOffsetNumber(&(itup->t_tid)));
+ ItemPointerGetBlockNumber(tid),
+ ItemPointerGetOffsetNumber(tid));
ereport(ERROR,
(errcode(ERRCODE_INDEX_CORRUPTED),
errmsg("could not find tuple using search from root page in index \"%s\"",
RelationGetRelationName(state->rel)),
- errdetail_internal("Index tid=%s points to heap tid=%s page lsn=%X/%X.",
+ errdetail_internal("Index tid=%s min heap tid=%s page lsn=%X/%X.",
itid, htid,
(uint32) (state->targetlsn >> 32),
(uint32) state->targetlsn)));
}
+ /*
+ * If tuple is actually a posting list, make sure posting list TIDs
+ * are in order.
+ */
+ if (BTreeTupleIsPosting(itup))
+ {
+ ItemPointerData last;
+ ItemPointer current;
+
+ ItemPointerCopy(BTreeTupleGetHeapTID(itup), &last);
+
+ for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
+ {
+
+ current = BTreeTupleGetPostingN(itup, i);
+
+ if (ItemPointerCompare(current, &last) <= 0)
+ {
+ char *itid,
+ *htid;
+
+ itid = psprintf("(%u,%u)", state->targetblock, offset);
+ htid = psprintf("(%u,%u)",
+ ItemPointerGetBlockNumberNoCheck(current),
+ ItemPointerGetOffsetNumberNoCheck(current));
+
+ ereport(ERROR,
+ (errcode(ERRCODE_INDEX_CORRUPTED),
+ errmsg("posting list heap TIDs out of order in index \"%s\"",
+ RelationGetRelationName(state->rel)),
+ errdetail_internal("Index tid=%s min heap tid=%s page lsn=%X/%X.",
+ itid, htid,
+ (uint32) (state->targetlsn >> 32),
+ (uint32) state->targetlsn)));
+ }
+
+ ItemPointerCopy(current, &last);
+ }
+ }
+
/* Build insertion scankey for current page offset */
skey = bt_mkscankey_pivotsearch(state->rel, itup);
@@ -1074,12 +1119,33 @@ bt_target_page_check(BtreeCheckState *state)
{
IndexTuple norm;
- norm = bt_normalize_tuple(state, itup);
- bloom_add_element(state->filter, (unsigned char *) norm,
- IndexTupleSize(norm));
- /* Be tidy */
- if (norm != itup)
- pfree(norm);
+ if (BTreeTupleIsPosting(itup))
+ {
+ IndexTuple onetup;
+
+ /* Fingerprint all elements of posting tuple one by one */
+ for (int i = 0; i < BTreeTupleGetNPosting(itup); i++)
+ {
+ onetup = BTreeGetNthTupleOfPosting(itup, i);
+
+ norm = bt_normalize_tuple(state, onetup);
+ bloom_add_element(state->filter, (unsigned char *) norm,
+ IndexTupleSize(norm));
+ /* Be tidy */
+ if (norm != onetup)
+ pfree(norm);
+ pfree(onetup);
+ }
+ }
+ else
+ {
+ norm = bt_normalize_tuple(state, itup);
+ bloom_add_element(state->filter, (unsigned char *) norm,
+ IndexTupleSize(norm));
+ /* Be tidy */
+ if (norm != itup)
+ pfree(norm);
+ }
}
/*
@@ -1087,7 +1153,8 @@ bt_target_page_check(BtreeCheckState *state)
*
* If there is a high key (if this is not the rightmost page on its
* entire level), check that high key actually is upper bound on all
- * page items.
+ * page items. If this is a posting list tuple, we'll need to set
+ * scantid to be highest TID in posting list.
*
* We prefer to check all items against high key rather than checking
* just the last and trusting that the operator class obeys the
@@ -1127,6 +1194,9 @@ bt_target_page_check(BtreeCheckState *state)
* tuple. (See also: "Notes About Data Representation" in the nbtree
* README.)
*/
+ scantid = skey->scantid;
+ if (state->heapkeyspace && !BTreeTupleIsPivot(itup))
+ skey->scantid = BTreeTupleGetMaxTID(itup);
if (!P_RIGHTMOST(topaque) &&
!(P_ISLEAF(topaque) ? invariant_leq_offset(state, skey, P_HIKEY) :
invariant_l_offset(state, skey, P_HIKEY)))
@@ -1150,6 +1220,7 @@ bt_target_page_check(BtreeCheckState *state)
(uint32) (state->targetlsn >> 32),
(uint32) state->targetlsn)));
}
+ skey->scantid = scantid;
/*
* * Item order check *
@@ -1164,11 +1235,13 @@ bt_target_page_check(BtreeCheckState *state)
*htid,
*nitid,
*nhtid;
+ ItemPointer tid;
itid = psprintf("(%u,%u)", state->targetblock, offset);
+ tid = BTreeTupleGetHeapTID(itup);
htid = psprintf("(%u,%u)",
- ItemPointerGetBlockNumberNoCheck(&(itup->t_tid)),
- ItemPointerGetOffsetNumberNoCheck(&(itup->t_tid)));
+ ItemPointerGetBlockNumberNoCheck(tid),
+ ItemPointerGetOffsetNumberNoCheck(tid));
nitid = psprintf("(%u,%u)", state->targetblock,
OffsetNumberNext(offset));
@@ -1177,9 +1250,11 @@ bt_target_page_check(BtreeCheckState *state)
state->target,
OffsetNumberNext(offset));
itup = (IndexTuple) PageGetItem(state->target, itemid);
+
+ tid = BTreeTupleGetHeapTID(itup);
nhtid = psprintf("(%u,%u)",
- ItemPointerGetBlockNumberNoCheck(&(itup->t_tid)),
- ItemPointerGetOffsetNumberNoCheck(&(itup->t_tid)));
+ ItemPointerGetBlockNumberNoCheck(tid),
+ ItemPointerGetOffsetNumberNoCheck(tid));
ereport(ERROR,
(errcode(ERRCODE_INDEX_CORRUPTED),
@@ -1189,10 +1264,10 @@ bt_target_page_check(BtreeCheckState *state)
"higher index tid=%s (points to %s tid=%s) "
"page lsn=%X/%X.",
itid,
- P_ISLEAF(topaque) ? "heap" : "index",
+ P_ISLEAF(topaque) ? "min heap" : "index",
htid,
nitid,
- P_ISLEAF(topaque) ? "heap" : "index",
+ P_ISLEAF(topaque) ? "min heap" : "index",
nhtid,
(uint32) (state->targetlsn >> 32),
(uint32) state->targetlsn)));
@@ -1953,10 +2028,10 @@ bt_tuple_present_callback(Relation index, HeapTuple htup, Datum *values,
* verification. In particular, it won't try to normalize opclass-equal
* datums with potentially distinct representations (e.g., btree/numeric_ops
* index datums will not get their display scale normalized-away here).
- * Normalization may need to be expanded to handle more cases in the future,
- * though. For example, it's possible that non-pivot tuples could in the
- * future have alternative logically equivalent representations due to using
- * the INDEX_ALT_TID_MASK bit to implement intelligent deduplication.
+ * Caller does normalization for non-pivot tuples that have a posting list,
+ * since dummy CREATE INDEX callback code generates new tuples with the same
+ * normalized representation. Deduplication is performed opportunistically,
+ * and in general there is no guarantee about how or when it will be applied.
*/
static IndexTuple
bt_normalize_tuple(BtreeCheckState *state, IndexTuple itup)
@@ -2087,6 +2162,7 @@ bt_rootdescend(BtreeCheckState *state, IndexTuple itup)
insertstate.itup = itup;
insertstate.itemsz = MAXALIGN(IndexTupleSize(itup));
insertstate.itup_key = key;
+ insertstate.in_posting_offset = 0;
insertstate.bounds_valid = false;
insertstate.buf = lbuf;
@@ -2094,7 +2170,9 @@ bt_rootdescend(BtreeCheckState *state, IndexTuple itup)
offnum = _bt_binsrch_insert(state->rel, &insertstate);
/* Compare first >= matching item on leaf page, if any */
page = BufferGetPage(lbuf);
+ /* Should match on first heap TID when tuple has a posting list */
if (offnum <= PageGetMaxOffsetNumber(page) &&
+ insertstate.in_posting_offset <= 0 &&
_bt_compare(state->rel, key, page, offnum) == 0)
exists = true;
_bt_relbuf(state->rel, lbuf);
@@ -2560,14 +2638,18 @@ static inline ItemPointer
BTreeTupleGetHeapTIDCareful(BtreeCheckState *state, IndexTuple itup,
bool nonpivot)
{
- ItemPointer result = BTreeTupleGetHeapTID(itup);
+ ItemPointer result;
BlockNumber targetblock = state->targetblock;
- if (result == NULL && nonpivot)
+ /* Shouldn't be called with a !heapkeyspace index */
+ Assert(state->heapkeyspace);
+ if (BTreeTupleIsPivot(itup) == nonpivot)
ereport(ERROR,
(errcode(ERRCODE_INDEX_CORRUPTED),
errmsg("block %u or its right sibling block or child block in index \"%s\" contains non-pivot tuple that lacks a heap TID",
targetblock, RelationGetRelationName(state->rel))));
+ result = BTreeTupleGetHeapTID(itup);
+
return result;
}
diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index 6db203e75c..50ec9ef48c 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -432,7 +432,10 @@ because we allow LP_DEAD to be set with only a share lock (it's exactly
like a hint bit for a heap tuple), but physically removing tuples requires
exclusive lock. In the current code we try to remove LP_DEAD tuples when
we are otherwise faced with having to split a page to do an insertion (and
-hence have exclusive lock on it already).
+hence have exclusive lock on it already). Deduplication can also prevent
+a page split, but removing LP_DEAD tuples is the preferred approach.
+(Note that posting list tuples can only have their LP_DEAD bit set when
+every "logical" tuple represented within the posting list is known dead.)
This leaves the index in a state where it has no entry for a dead tuple
that still exists in the heap. This is not a problem for the current
@@ -710,6 +713,77 @@ the fallback strategy assumes that duplicates are mostly inserted in
ascending heap TID order. The page is split in a way that leaves the left
half of the page mostly full, and the right half of the page mostly empty.
+Notes about deduplication
+-------------------------
+
+We deduplicate non-pivot tuples in non-unique indexes to reduce storage
+overhead, and to avoid or at least delay page splits. Deduplication alters
+the physical representation of tuples without changing the logical contents
+of the index, and without adding overhead to read queries. Non-pivot
+tuples are folded together into a single physical tuple with a posting list
+(a simple array of heap TIDs with the standard item pointer format).
+Deduplication is always applied lazily, at the point where it would
+otherwise be necessary to perform a page split. It occurs only after
+LP_DEAD items have been removed, as our last line of defense against
+splitting a leaf page. We can set the LP_DEAD bit with posting list
+tuples, though only when all table tuples are known dead. (Bitmap scans
+cannot perform LP_DEAD bit setting, and are the common case with indexes
+that contain lots of duplicates, so this downside is considered
+acceptable.)
+
+Large groups of logical duplicates tend to appear together on the same leaf
+page due to the special duplicate logic used when choosing a split point.
+This facilitates lazy/dynamic deduplication. Deduplication can reliably
+deduplicate a large localized group of duplicates before it can span
+multiple leaf pages. Posting list tuples are subject to the same 1/3 of a
+page restriction as any other tuple.
+
+Lazy deduplication allows the page space accounting used during page splits
+to have absolutely minimal special case logic for posting lists. A posting
+list can be thought of as extra payload that suffix truncation will
+reliably truncate away as needed during page splits, just like non-key
+columns from an INCLUDE index tuple. An incoming tuple (which might cause
+a page split) can always be thought of as a non-posting-list tuple that
+must be inserted alongside existing items, without needing to consider
+deduplication. Most of the time, that's what actually happens: incoming
+tuples are either not duplicates, or are duplicates with a heap TID that
+doesn't overlap with any existing posting list tuple (lazy deduplication
+avoids rewriting posting lists repeatedly when heap TIDs are inserted
+slightly out of order by concurrent inserters). When the incoming tuple
+really does overlap with an existing posting list, a posting list split is
+performed. Posting list splits work in a way that more or less preserves
+the illusion that all incoming tuples do not need to be merged with any
+existing posting list tuple.
+
+Posting list splits work by "overriding" the details of the incoming tuple.
+The heap TID of the incoming tuple is altered to make it match the
+rightmost heap TID from the existing/originally overlapping posting list.
+The offset number that the new/incoming tuple is to be inserted at is
+incremented so that it will be inserted to the right of the existing
+posting list. The insertion (or page split) operation that completes the
+insert does one extra step: an in-place update of the posting list. The
+update changes the posting list such that the "true" heap TID from the
+original incoming tuple is now contained in the posting list. We make
+space in the posting list by removing the heap TID that became the new
+item. The size of the posting list won't change, and so the page split
+space accounting does not need to care about posting lists. Also, overall
+space utilization is improved by keeping existing posting lists large.
+
+The representation of posting lists is identical to the posting lists used
+by GIN, so it would be straightforward to apply GIN's varbyte encoding
+compression scheme to individual posting lists. Posting list compression
+would break the assumptions made by posting list splits about page space
+accounting, though, so it's not clear how compression could be integrated
+with nbtree. Besides, posting list compression does not offer a compelling
+trade-off for nbtree, since in general nbtree is optimized for consistent
+performance with many concurrent readers and writers. A major goal of
+nbtree's lazy approach to deduplication is to limit the performance impact
+of deduplication with random updates. Even concurrent append-only inserts
+of the same key value will tend to have inserts of individual index tuples
+in an order that doesn't quite match heap TID order. In general, delaying
+deduplication avoids many unnecessary posting list splits, and minimizes
+page level fragmentation.
+
Notes About Data Representation
-------------------------------
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index b84bf1c3df..bef5958465 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -47,21 +47,26 @@ static void _bt_insertonpg(Relation rel, BTScanInsert itup_key,
BTStack stack,
IndexTuple itup,
OffsetNumber newitemoff,
+ int in_posting_offset,
bool split_only_page);
static Buffer _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf,
Buffer cbuf, OffsetNumber newitemoff, Size newitemsz,
- IndexTuple newitem);
+ IndexTuple newitem, IndexTuple nposting);
static void _bt_insert_parent(Relation rel, Buffer buf, Buffer rbuf,
BTStack stack, bool is_root, bool is_only);
static bool _bt_pgaddtup(Page page, Size itemsize, IndexTuple itup,
OffsetNumber itup_off);
static void _bt_vacuum_one_page(Relation rel, Buffer buffer, Relation heapRel);
+static void _bt_dedup_one_page(Relation rel, Buffer buffer, Relation heapRel,
+ Size itemsz);
+static void _bt_dedup_insert(Page page, BTDedupState *dedupState);
/*
* _bt_doinsert() -- Handle insertion of a single index tuple in the tree.
*
* This routine is called by the public interface routine, btinsert.
- * By here, itup is filled in, including the TID.
+ * By here, itup is filled in, including the TID. Caller should be
+ * prepared for us to scribble on 'itup'.
*
* If checkUnique is UNIQUE_CHECK_NO or UNIQUE_CHECK_PARTIAL, this
* will allow duplicates. Otherwise (UNIQUE_CHECK_YES or
@@ -123,6 +128,7 @@ _bt_doinsert(Relation rel, IndexTuple itup,
/* PageAddItem will MAXALIGN(), but be consistent */
insertstate.itemsz = MAXALIGN(IndexTupleSize(itup));
insertstate.itup_key = itup_key;
+ insertstate.in_posting_offset = 0;
insertstate.bounds_valid = false;
insertstate.buf = InvalidBuffer;
@@ -300,7 +306,7 @@ top:
newitemoff = _bt_findinsertloc(rel, &insertstate, checkingunique,
stack, heapRel);
_bt_insertonpg(rel, itup_key, insertstate.buf, InvalidBuffer, stack,
- itup, newitemoff, false);
+ itup, newitemoff, insertstate.in_posting_offset, false);
}
else
{
@@ -435,6 +441,7 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
/* okay, we gotta fetch the heap tuple ... */
curitup = (IndexTuple) PageGetItem(page, curitemid);
+ Assert(!BTreeTupleIsPosting(curitup));
htid = curitup->t_tid;
/*
@@ -689,6 +696,7 @@ _bt_findinsertloc(Relation rel,
BTScanInsert itup_key = insertstate->itup_key;
Page page = BufferGetPage(insertstate->buf);
BTPageOpaque lpageop;
+ OffsetNumber location;
lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -751,13 +759,23 @@ _bt_findinsertloc(Relation rel,
/*
* If the target page is full, see if we can obtain enough space by
- * erasing LP_DEAD items
+ * erasing LP_DEAD items. If that doesn't work out, and if the index
+ * isn't a unique index, try deduplication.
*/
- if (PageGetFreeSpace(page) < insertstate->itemsz &&
- P_HAS_GARBAGE(lpageop))
+ if (PageGetFreeSpace(page) < insertstate->itemsz)
{
- _bt_vacuum_one_page(rel, insertstate->buf, heapRel);
- insertstate->bounds_valid = false;
+ if (P_HAS_GARBAGE(lpageop))
+ {
+ _bt_vacuum_one_page(rel, insertstate->buf, heapRel);
+ insertstate->bounds_valid = false;
+ }
+
+ if (!checkingunique && PageGetFreeSpace(page) < insertstate->itemsz)
+ {
+ _bt_dedup_one_page(rel, insertstate->buf, heapRel,
+ insertstate->itemsz);
+ insertstate->bounds_valid = false; /* paranoia */
+ }
}
}
else
@@ -839,7 +857,31 @@ _bt_findinsertloc(Relation rel,
Assert(P_RIGHTMOST(lpageop) ||
_bt_compare(rel, itup_key, page, P_HIKEY) <= 0);
- return _bt_binsrch_insert(rel, insertstate);
+ location = _bt_binsrch_insert(rel, insertstate);
+
+ /*
+ * Insertion is not prepared for the case where an LP_DEAD posting list
+ * tuple must be split. In the unlikely event that this happens, call
+ * _bt_dedup_one_page() to force it to kill all LP_DEAD items.
+ */
+ if (unlikely(insertstate->in_posting_offset == -1))
+ {
+ _bt_dedup_one_page(rel, insertstate->buf, heapRel, 0);
+ Assert(!P_HAS_GARBAGE(lpageop));
+
+ /* Must reset insertstate ahead of new _bt_binsrch_insert() call */
+ insertstate->bounds_valid = false;
+ insertstate->in_posting_offset = 0;
+ location = _bt_binsrch_insert(rel, insertstate);
+
+ /*
+ * Might still have to split some other posting list now, but that
+ * should never be LP_DEAD
+ */
+ Assert(insertstate->in_posting_offset >= 0);
+ }
+
+ return location;
}
/*
@@ -905,10 +947,12 @@ _bt_stepright(Relation rel, BTInsertState insertstate, BTStack stack)
*
* This recursive procedure does the following things:
*
+ * + if necessary, splits an existing posting list on page.
+ * This is only needed when 'in_posting_offset' is non-zero.
* + if necessary, splits the target page, using 'itup_key' for
* suffix truncation on leaf pages (caller passes NULL for
* non-leaf pages).
- * + inserts the tuple.
+ * + inserts the new tuple (could be from split posting list).
* + if the page was split, pops the parent stack, and finds the
* right place to insert the new child pointer (by walking
* right using information stored in the parent stack).
@@ -918,7 +962,8 @@ _bt_stepright(Relation rel, BTInsertState insertstate, BTStack stack)
*
* On entry, we must have the correct buffer in which to do the
* insertion, and the buffer must be pinned and write-locked. On return,
- * we will have dropped both the pin and the lock on the buffer.
+ * we will have dropped both the pin and the lock on the buffer. Caller
+ * should be prepared for us to scribble on 'itup'.
*
* This routine only performs retail tuple insertions. 'itup' should
* always be either a non-highkey leaf item, or a downlink (new high
@@ -936,11 +981,14 @@ _bt_insertonpg(Relation rel,
BTStack stack,
IndexTuple itup,
OffsetNumber newitemoff,
+ int in_posting_offset,
bool split_only_page)
{
Page page;
BTPageOpaque lpageop;
Size itemsz;
+ IndexTuple nposting = NULL;
+ IndexTuple oposting;
page = BufferGetPage(buf);
lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -954,6 +1002,8 @@ _bt_insertonpg(Relation rel,
Assert(P_ISLEAF(lpageop) ||
BTreeTupleGetNAtts(itup, rel) <=
IndexRelationGetNumberOfKeyAttributes(rel));
+ /* retail insertions of posting list tuples are disallowed */
+ Assert(!BTreeTupleIsPosting(itup));
/* The caller should've finished any incomplete splits already. */
if (P_INCOMPLETE_SPLIT(lpageop))
@@ -964,6 +1014,72 @@ _bt_insertonpg(Relation rel,
itemsz = MAXALIGN(itemsz); /* be safe, PageAddItem will do this but we
* need to be consistent */
+ /*
+ * Do we need to split an existing posting list item?
+ */
+ if (in_posting_offset != 0)
+ {
+ ItemId itemid = PageGetItemId(page, newitemoff);
+ int nipd;
+ char *replacepos;
+ char *rightpos;
+ Size nbytes;
+
+ /*
+ * The new tuple is a duplicate with a heap TID that falls inside the
+ * range of an existing posting list tuple, so split posting list.
+ *
+ * Posting list splits always replace some existing TID in the posting
+ * list with the new item's heap TID (based on a posting list offset
+ * from caller) by removing rightmost heap TID from posting list. The
+ * new item's heap TID is swapped with that rightmost heap TID, almost
+ * as if the tuple inserted never overlapped with a posting list in
+ * the first place. This allows the insertion and page split code to
+ * have minimal special case handling of posting lists.
+ *
+ * The only extra handling required is to overwrite the original
+ * posting list with nposting, which is guaranteed to be the same size
+ * as the original, keeping the page space accounting simple. This
+ * takes place in either the page insert or page split critical
+ * section.
+ */
+ Assert(P_ISLEAF(lpageop));
+ Assert(!ItemIdIsDead(itemid));
+ Assert(in_posting_offset > 0);
+ oposting = (IndexTuple) PageGetItem(page, itemid);
+ Assert(BTreeTupleIsPosting(oposting));
+ nipd = BTreeTupleGetNPosting(oposting);
+ Assert(in_posting_offset < nipd);
+
+ nposting = CopyIndexTuple(oposting);
+ replacepos = (char *) BTreeTupleGetPostingN(nposting, in_posting_offset);
+ rightpos = replacepos + sizeof(ItemPointerData);
+ nbytes = (nipd - in_posting_offset - 1) * sizeof(ItemPointerData);
+
+ /*
+ * Move item pointers in posting list to make a gap for the new item's
+ * heap TID (shift TIDs one place to the right, losing original
+ * rightmost TID).
+ */
+ memmove(rightpos, replacepos, nbytes);
+
+ /*
+ * Fill the gap at in_posting_offset in nposting with the new item's
+ * heap TID
+ */
+ ItemPointerCopy(&itup->t_tid, (ItemPointer) replacepos);
+
+ /*
+ * Copy the original posting list's rightmost TID (taken from oposting,
+ * not nposting) into the new item
+ */
+ ItemPointerCopy(BTreeTupleGetPostingN(oposting, nipd - 1), &itup->t_tid);
+ Assert(ItemPointerCompare(BTreeTupleGetMaxTID(nposting),
+ BTreeTupleGetHeapTID(itup)) < 0);
+ /* Alter new item offset, since effective new item changed */
+ newitemoff = OffsetNumberNext(newitemoff);
+ }
+
/*
* Do we need to split the page to fit the item on it?
*
@@ -996,7 +1112,8 @@ _bt_insertonpg(Relation rel,
BlockNumberIsValid(RelationGetTargetBlock(rel))));
/* split the buffer into left and right halves */
- rbuf = _bt_split(rel, itup_key, buf, cbuf, newitemoff, itemsz, itup);
+ rbuf = _bt_split(rel, itup_key, buf, cbuf, newitemoff, itemsz, itup,
+ nposting);
PredicateLockPageSplit(rel,
BufferGetBlockNumber(buf),
BufferGetBlockNumber(rbuf));
@@ -1075,6 +1192,18 @@ _bt_insertonpg(Relation rel,
elog(PANIC, "failed to add new item to block %u in index \"%s\"",
itup_blkno, RelationGetRelationName(rel));
+ if (nposting)
+ {
+ /*
+ * Handle a posting list split by performing an in-place update of
+ * the existing posting list
+ */
+ Assert(P_ISLEAF(lpageop));
+ Assert(MAXALIGN(IndexTupleSize(oposting)) ==
+ MAXALIGN(IndexTupleSize(nposting)));
+ memcpy(oposting, nposting, MAXALIGN(IndexTupleSize(nposting)));
+ }
+
MarkBufferDirty(buf);
if (BufferIsValid(metabuf))
@@ -1116,6 +1245,9 @@ _bt_insertonpg(Relation rel,
XLogRecPtr recptr;
xlrec.offnum = itup_off;
+ xlrec.postingsz = 0;
+ if (nposting)
+ xlrec.postingsz = MAXALIGN(IndexTupleSize(itup));
XLogBeginInsert();
XLogRegisterData((char *) &xlrec, SizeOfBtreeInsert);
@@ -1153,6 +1285,9 @@ _bt_insertonpg(Relation rel,
XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
XLogRegisterBufData(0, (char *) itup, IndexTupleSize(itup));
+ if (nposting)
+ XLogRegisterBufData(0, (char *) nposting,
+ IndexTupleSize(nposting));
recptr = XLogInsert(RM_BTREE_ID, xlinfo);
@@ -1194,6 +1329,10 @@ _bt_insertonpg(Relation rel,
_bt_getrootheight(rel) >= BTREE_FASTPATH_MIN_LEVEL)
RelationSetTargetBlock(rel, cachedBlock);
}
+
+ /* be tidy */
+ if (nposting)
+ pfree(nposting);
}
/*
@@ -1211,10 +1350,16 @@ _bt_insertonpg(Relation rel,
*
* Returns the new right sibling of buf, pinned and write-locked.
* The pin and lock on buf are maintained.
+ *
+ * nposting is a replacement posting list tuple for the posting list at
+ * the offset immediately before the new item's offset. This is needed
+ * when caller performed "posting list split", and corresponds to the
+ * same step for retail insertions that don't split the page.
*/
static Buffer
_bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
- OffsetNumber newitemoff, Size newitemsz, IndexTuple newitem)
+ OffsetNumber newitemoff, Size newitemsz, IndexTuple newitem,
+ IndexTuple nposting)
{
Buffer rbuf;
Page origpage;
@@ -1236,12 +1381,20 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
OffsetNumber firstright;
OffsetNumber maxoff;
OffsetNumber i;
+ OffsetNumber replacepostingoff = InvalidOffsetNumber;
bool newitemonleft,
isleaf;
IndexTuple lefthikey;
int indnatts = IndexRelationGetNumberOfAttributes(rel);
int indnkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
+ /*
+ * Determine offset number of posting list that will be updated in place
+ * as part of split that follows a posting list split
+ */
+ if (nposting != NULL)
+ replacepostingoff = OffsetNumberPrev(newitemoff);
+
/*
* origpage is the original page to be split. leftpage is a temporary
* buffer that receives the left-sibling data, which will be copied back
@@ -1273,6 +1426,13 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
* newitemoff == firstright. In all other cases it's clear which side of
* the split every tuple goes on from context. newitemonleft is usually
* (but not always) redundant information.
+ *
+ * Note: In theory, the split point choice logic should operate against a
+ * version of the page that already replaced the posting list at offset
+ * replacepostingoff with nposting where applicable. We don't bother with
+ * that, though. Both versions of the posting list must be the same size
+ * and have the same key values, so this omission can't affect the split
+ * point chosen in practice.
*/
firstright = _bt_findsplitloc(rel, origpage, newitemoff, newitemsz,
newitem, &newitemonleft);
@@ -1340,6 +1500,9 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
itemid = PageGetItemId(origpage, firstright);
itemsz = ItemIdGetLength(itemid);
item = (IndexTuple) PageGetItem(origpage, itemid);
+ /* Behave as if origpage posting list has already been swapped */
+ if (firstright == replacepostingoff)
+ item = nposting;
}
/*
@@ -1373,6 +1536,9 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
Assert(lastleftoff >= P_FIRSTDATAKEY(oopaque));
itemid = PageGetItemId(origpage, lastleftoff);
lastleft = (IndexTuple) PageGetItem(origpage, itemid);
+ /* Behave as if origpage posting list has already been swapped */
+ if (lastleftoff == replacepostingoff)
+ lastleft = nposting;
}
Assert(lastleft != item);
@@ -1480,8 +1646,23 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
itemsz = ItemIdGetLength(itemid);
item = (IndexTuple) PageGetItem(origpage, itemid);
+ /*
+ * did caller pass new replacement posting list tuple due to posting
+ * list split?
+ */
+ if (i == replacepostingoff)
+ {
+ /*
+ * swap origpage posting list with post-posting-list-split version
+ * from caller
+ */
+ Assert(isleaf);
+ Assert(itemsz == MAXALIGN(IndexTupleSize(nposting)));
+ item = nposting;
+ }
+
/* does new item belong before this one? */
- if (i == newitemoff)
+ else if (i == newitemoff)
{
if (newitemonleft)
{
@@ -1652,6 +1833,7 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
xlrec.level = ropaque->btpo.level;
xlrec.firstright = firstright;
xlrec.newitemoff = newitemoff;
+ xlrec.replacepostingoff = replacepostingoff;
XLogBeginInsert();
XLogRegisterData((char *) &xlrec, SizeOfBtreeSplit);
@@ -1676,6 +1858,10 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
if (newitemonleft)
XLogRegisterBufData(0, (char *) newitem, MAXALIGN(newitemsz));
+ if (replacepostingoff)
+ XLogRegisterBufData(0, (char *) nposting,
+ MAXALIGN(IndexTupleSize(nposting)));
+
/* Log the left page's new high key */
itemid = PageGetItemId(origpage, P_HIKEY);
item = (IndexTuple) PageGetItem(origpage, itemid);
@@ -1834,7 +2020,7 @@ _bt_insert_parent(Relation rel,
/* Recursively insert into the parent */
_bt_insertonpg(rel, NULL, pbuf, buf, stack->bts_parent,
- new_item, stack->bts_offset + 1,
+ new_item, stack->bts_offset + 1, 0,
is_only);
/* be tidy */
@@ -2304,6 +2490,250 @@ _bt_vacuum_one_page(Relation rel, Buffer buffer, Relation heapRel)
* Note: if we didn't find any LP_DEAD items, then the page's
* BTP_HAS_GARBAGE hint bit is falsely set. We do not bother expending a
* separate write to clear it, however. We will clear it when we split
- * the page.
+ * the page (or when deduplication runs).
*/
}
+
+/*
+ * Try to deduplicate items to free some space. If we don't proceed with
+ * deduplication, buffer will contain old state of the page.
+ *
+ * 'itemsz' is the size of the inserter caller's incoming/new tuple, not
+ * including line pointer overhead. This is the amount of space we'll need to
+ * free in order to let caller avoid splitting the page.
+ *
+ * This function should be called after LP_DEAD items were removed by
+ * _bt_vacuum_one_page() to prevent a page split. (It's possible that we'll
+ * have to kill additional LP_DEAD items, but that should be rare.)
+ */
+static void
+_bt_dedup_one_page(Relation rel, Buffer buffer, Relation heapRel, Size itemsz)
+{
+ OffsetNumber offnum,
+ minoff,
+ maxoff;
+ Page page = BufferGetPage(buffer);
+ Page newpage;
+ BTPageOpaque oopaque,
+ nopaque;
+ bool deduplicate = false;
+ BTDedupState *dedupState = NULL;
+ int natts = IndexRelationGetNumberOfAttributes(rel);
+ OffsetNumber deletable[MaxOffsetNumber];
+ int ndeletable = 0;
+
+ /*
+ * Don't use deduplication for indexes with INCLUDEd columns or for
+ * unique indexes
+ */
+ deduplicate = (IndexRelationGetNumberOfKeyAttributes(rel) ==
+ IndexRelationGetNumberOfAttributes(rel) &&
+ !rel->rd_index->indisunique);
+ if (!deduplicate)
+ return;
+
+ oopaque = (BTPageOpaque) PageGetSpecialPointer(page);
+ /* init deduplication state needed to build posting tuples */
+ dedupState = (BTDedupState *) palloc0(sizeof(BTDedupState));
+ dedupState->ipd = NULL;
+ dedupState->ntuples = 0;
+ dedupState->itupprev = NULL;
+ dedupState->maxitemsize = BTMaxItemSize(page);
+ dedupState->maxpostingsize = 0;
+
+ minoff = P_FIRSTDATAKEY(oopaque);
+ maxoff = PageGetMaxOffsetNumber(page);
+
+ /*
+ * Delete dead tuples, if any. We cannot simply skip them in the loop
+ * below, because we must generate a special WAL record containing such
+ * tuples so that latestRemovedXid can be computed on a standby server
+ * later.
+ *
+ * This should not affect performance, since it can only happen in the
+ * rare case where BTP_HAS_GARBAGE was not set and _bt_vacuum_one_page was
+ * not called, or _bt_vacuum_one_page didn't remove all dead items.
+ */
+ for (offnum = minoff;
+ offnum <= maxoff;
+ offnum = OffsetNumberNext(offnum))
+ {
+ ItemId itemid = PageGetItemId(page, offnum);
+
+ if (ItemIdIsDead(itemid))
+ deletable[ndeletable++] = offnum;
+ }
+
+ if (ndeletable > 0)
+ {
+ /*
+ * Skip deduplication in the rare case where deleting the LP_DEAD
+ * items encountered here frees enough space for caller to avoid a
+ * page split
+ */
+ _bt_delitems_delete(rel, buffer, deletable, ndeletable, heapRel);
+ if (PageGetFreeSpace(page) >= itemsz)
+ {
+ pfree(dedupState);
+ return;
+ }
+
+ /* Continue with deduplication */
+ minoff = P_FIRSTDATAKEY(oopaque);
+ maxoff = PageGetMaxOffsetNumber(page);
+ }
+
+ /*
+ * Scan over all items to see which ones can be deduplicated
+ */
+ newpage = PageGetTempPageCopySpecial(page);
+ nopaque = (BTPageOpaque) PageGetSpecialPointer(newpage);
+
+ /* Make sure that new page won't have garbage flag set */
+ nopaque->btpo_flags &= ~BTP_HAS_GARBAGE;
+
+ /* Copy High Key if any */
+ if (!P_RIGHTMOST(oopaque))
+ {
+ ItemId hitemid = PageGetItemId(page, P_HIKEY);
+ Size hitemsz = ItemIdGetLength(hitemid);
+ IndexTuple hitem = (IndexTuple) PageGetItem(page, hitemid);
+
+ if (PageAddItem(newpage, (Item) hitem, hitemsz, P_HIKEY,
+ false, false) == InvalidOffsetNumber)
+ elog(ERROR, "failed to add highkey during deduplication");
+ }
+
+ /*
+ * Iterate over tuples on the page, try to deduplicate them into posting
+ * lists and insert into new page.
+ */
+ for (offnum = minoff;
+ offnum <= maxoff;
+ offnum = OffsetNumberNext(offnum))
+ {
+ ItemId itemid = PageGetItemId(page, offnum);
+ IndexTuple itup = (IndexTuple) PageGetItem(page, itemid);
+
+ Assert(!ItemIdIsDead(itemid));
+
+ if (dedupState->itupprev == NULL)
+ {
+ /* Just set up base/first item in first iteration */
+ Assert(offnum == minoff);
+ dedupState->itupprev = CopyIndexTuple(itup);
+ continue;
+ }
+
+ if (deduplicate &&
+ _bt_keep_natts_fast(rel, dedupState->itupprev, itup) > natts)
+ {
+ int itup_ntuples;
+ Size projpostingsz;
+
+ /*
+ * Tuples are equal.
+ *
+ * If posting list does not exceed tuple size limit then append
+ * the tuple to the pending posting list. Otherwise, insert it on
+ * page and continue with this tuple as new pending posting list.
+ */
+ itup_ntuples = BTreeTupleIsPosting(itup) ?
+ BTreeTupleGetNPosting(itup) : 1;
+
+ /*
+ * Project size of new posting list that would result from merging
+ * current tup with pending posting list (could just be prev item
+ * that's "pending").
+ *
+ * This accounting looks odd, but it's correct because ...
+ */
+ projpostingsz = MAXALIGN(IndexTupleSize(dedupState->itupprev) +
+ (dedupState->ntuples + itup_ntuples + 1) *
+ sizeof(ItemPointerData));
+
+ if (projpostingsz <= dedupState->maxitemsize)
+ _bt_stash_item_tid(dedupState, itup);
+ else
+ _bt_dedup_insert(newpage, dedupState);
+ }
+ else
+ {
+ /*
+ * Tuples are not equal, or we're done deduplicating this page.
+ *
+ * Insert pending posting list on page. This could just be a
+ * regular tuple.
+ */
+ _bt_dedup_insert(newpage, dedupState);
+ }
+
+ pfree(dedupState->itupprev);
+ dedupState->itupprev = CopyIndexTuple(itup);
+ }
+
+ /* Handle the last item */
+ _bt_dedup_insert(newpage, dedupState);
+
+ START_CRIT_SECTION();
+
+ PageRestoreTempPage(newpage, page);
+ MarkBufferDirty(buffer);
+
+ /* Log full page write */
+ if (RelationNeedsWAL(rel))
+ {
+ XLogRecPtr recptr;
+
+ recptr = log_newpage_buffer(buffer, true);
+ PageSetLSN(page, recptr);
+ }
+
+ END_CRIT_SECTION();
+
+ /* be tidy */
+ pfree(dedupState);
+}
+
+/*
+ * Add new posting tuple item to the page based on itupprev and saved list of
+ * heap TIDs.
+ */
+static void
+_bt_dedup_insert(Page page, BTDedupState *dedupState)
+{
+ IndexTuple to_insert;
+ OffsetNumber offnum = PageGetMaxOffsetNumber(page);
+
+ if (dedupState->ntuples == 0)
+ {
+ /*
+ * Use original itupprev, which may or may not be a posting list
+ * already from some earlier dedup attempt
+ */
+ to_insert = dedupState->itupprev;
+ }
+ else
+ {
+ IndexTuple postingtuple;
+
+ /* form a tuple with a posting list */
+ postingtuple = BTreeFormPostingTuple(dedupState->itupprev,
+ dedupState->ipd,
+ dedupState->ntuples);
+ to_insert = postingtuple;
+ pfree(dedupState->ipd);
+ }
+
+ Assert(IndexTupleSize(dedupState->itupprev) <= dedupState->maxitemsize);
+ /* Add the new item into the page */
+ offnum = OffsetNumberNext(offnum);
+
+ if (PageAddItem(page, (Item) to_insert, IndexTupleSize(to_insert),
+ offnum, false, false) == InvalidOffsetNumber)
+ elog(ERROR, "deduplication failed to add tuple to page");
+
+ if (dedupState->ntuples > 0)
+ pfree(to_insert);
+ dedupState->ntuples = 0;
+}
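The posting list split performed in _bt_insertonpg() above can be illustrated outside the backend. What follows is only a minimal standalone sketch, not code from the patch: DemoTid, demo_tid_cmp() and demo_posting_split() are hypothetical names standing in for ItemPointerData, ItemPointerCompare() and the inline memmove() logic, and the sketch works on a plain array instead of a copied index tuple.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for ItemPointerData: a (block, offset) heap TID. */
typedef struct DemoTid
{
    uint32_t    block;
    uint16_t    offset;
} DemoTid;

/* Three-way comparison, in the spirit of ItemPointerCompare(). */
static int
demo_tid_cmp(const DemoTid *a, const DemoTid *b)
{
    if (a->block != b->block)
        return (a->block < b->block) ? -1 : 1;
    if (a->offset != b->offset)
        return (a->offset < b->offset) ? -1 : 1;
    return 0;
}

/*
 * Split a sorted posting list of ntids entries at posting_off.
 *
 * *newtid is stored at posting_off, the TIDs from posting_off onwards are
 * shifted one place to the right, and the original rightmost TID falls off
 * the end.  That rightmost TID is handed back through *newtid, so the caller
 * can insert it as an ordinary tuple immediately after the posting list.
 * The posting list keeps its original length.
 */
static void
demo_posting_split(DemoTid *tids, int ntids, int posting_off, DemoTid *newtid)
{
    DemoTid     rightmost;

    assert(posting_off > 0 && posting_off < ntids);

    /* Remember the rightmost TID before it is overwritten by the shift */
    rightmost = tids[ntids - 1];

    /* Open a gap at posting_off, losing the rightmost TID */
    memmove(&tids[posting_off + 1], &tids[posting_off],
            (size_t) (ntids - posting_off - 1) * sizeof(DemoTid));

    /* Fill the gap with the incoming TID, then swap the old rightmost out */
    tids[posting_off] = *newtid;
    *newtid = rightmost;

    /* The new item must now sort after everything left in the list */
    assert(demo_tid_cmp(&tids[ntids - 1], newtid) < 0);
}
```

Because the list length does not change, the replacement tuple has the same size as the original, which appears to be why the patch can overwrite the old posting list in place without touching the page's space accounting.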
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 268f869a36..5314bbe2a9 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -24,6 +24,7 @@
#include "access/nbtree.h"
#include "access/nbtxlog.h"
+#include "access/tableam.h"
#include "access/transam.h"
#include "access/xlog.h"
#include "access/xloginsert.h"
@@ -42,6 +43,11 @@ static bool _bt_lock_branch_parent(Relation rel, BlockNumber child,
BlockNumber *target, BlockNumber *rightsib);
static void _bt_log_reuse_page(Relation rel, BlockNumber blkno,
TransactionId latestRemovedXid);
+static TransactionId _bt_compute_xid_horizon_for_tuples(Relation rel,
+ Relation heapRel,
+ Buffer buf,
+ OffsetNumber *itemnos,
+ int nitems);
/*
* _bt_initmetapage() -- Fill a page buffer with a correct metapage image
@@ -983,14 +989,52 @@ _bt_page_recyclable(Page page)
void
_bt_delitems_vacuum(Relation rel, Buffer buf,
OffsetNumber *itemnos, int nitems,
+ OffsetNumber *remainingoffset,
+ IndexTuple *remaining, int nremaining,
BlockNumber lastBlockVacuumed)
{
Page page = BufferGetPage(buf);
BTPageOpaque opaque;
+ Size itemsz;
+ Size remaining_sz = 0;
+ char *remaining_buf = NULL;
+
+ /* XLOG stuff: build a buffer holding the remaining posting tuples */
+ if (nremaining && RelationNeedsWAL(rel))
+ {
+ Size offset = 0;
+
+ for (int i = 0; i < nremaining; i++)
+ remaining_sz += MAXALIGN(IndexTupleSize(remaining[i]));
+
+ remaining_buf = palloc0(remaining_sz);
+ for (int i = 0; i < nremaining; i++)
+ {
+ itemsz = IndexTupleSize(remaining[i]);
+ memcpy(remaining_buf + offset, (char *) remaining[i], itemsz);
+ offset += MAXALIGN(itemsz);
+ }
+ Assert(offset == remaining_sz);
+ }
/* No ereport(ERROR) until changes are logged */
START_CRIT_SECTION();
+ /* Handle posting tuples here */
+ for (int i = 0; i < nremaining; i++)
+ {
+ /* First, delete the old tuple. */
+ PageIndexTupleDelete(page, remainingoffset[i]);
+
+ itemsz = IndexTupleSize(remaining[i]);
+ itemsz = MAXALIGN(itemsz);
+
+ /* Add tuple with remaining ItemPointers to the page. */
+ if (PageAddItem(page, (Item) remaining[i], itemsz, remainingoffset[i],
+ false, false) == InvalidOffsetNumber)
+ elog(PANIC, "failed to rewrite posting list item in index while doing vacuum");
+ }
+
/* Fix the page */
if (nitems > 0)
PageIndexMultiDelete(page, itemnos, nitems);
@@ -1020,6 +1064,8 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
xl_btree_vacuum xlrec_vacuum;
xlrec_vacuum.lastBlockVacuumed = lastBlockVacuumed;
+ xlrec_vacuum.nremaining = nremaining;
+ xlrec_vacuum.ndeleted = nitems;
XLogBeginInsert();
XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
@@ -1033,6 +1079,19 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
if (nitems > 0)
XLogRegisterBufData(0, (char *) itemnos, nitems * sizeof(OffsetNumber));
+ /*
+ * Also log the offsets of the remaining tuples and the tuples
+ * themselves. Replay order matters: redo must first handle the
+ * remaining tuples, and only then the other deleted items.
+ */
+ if (nremaining > 0)
+ {
+ Assert(remaining_buf != NULL);
+ XLogRegisterBufData(0, (char *) remainingoffset,
+ nremaining * sizeof(OffsetNumber));
+ XLogRegisterBufData(0, remaining_buf, remaining_sz);
+ }
+
recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_VACUUM);
PageSetLSN(page, recptr);
@@ -1041,6 +1100,91 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
END_CRIT_SECTION();
}
+/*
+ * Get the latestRemovedXid from the table entries pointed at by the index
+ * tuples being deleted.
+ *
+ * This is a version of index_compute_xid_horizon_for_tuples() specialized to
+ * nbtree, which can handle posting lists.
+ */
+static TransactionId
+_bt_compute_xid_horizon_for_tuples(Relation rel, Relation heapRel,
+ Buffer buf, OffsetNumber *itemnos,
+ int nitems)
+{
+ ItemPointerData *ttids;
+ TransactionId latestRemovedXid = InvalidTransactionId;
+ Page page = BufferGetPage(buf);
+ int arraynitems;
+ int finalnitems;
+
+ /*
+ * Initial size of array can fit everything when it turns out that there
+ * are no posting lists
+ */
+ arraynitems = nitems;
+ ttids = (ItemPointerData *) palloc(sizeof(ItemPointerData) * arraynitems);
+
+ finalnitems = 0;
+ /* identify what the index tuples about to be deleted point to */
+ for (int i = 0; i < nitems; i++)
+ {
+ ItemId itemid;
+ IndexTuple itup;
+
+ itemid = PageGetItemId(page, itemnos[i]);
+ itup = (IndexTuple) PageGetItem(page, itemid);
+
+ Assert(ItemIdIsDead(itemid));
+
+ if (!BTreeTupleIsPosting(itup))
+ {
+ /* Make sure that we have space for additional heap TID */
+ if (finalnitems + 1 > arraynitems)
+ {
+ arraynitems = arraynitems * 2;
+ ttids = (ItemPointerData *)
+ repalloc(ttids, sizeof(ItemPointerData) * arraynitems);
+ }
+
+ Assert(ItemPointerIsValid(&itup->t_tid));
+ ItemPointerCopy(&itup->t_tid, &ttids[finalnitems]);
+ finalnitems++;
+ }
+ else
+ {
+ int nposting = BTreeTupleGetNPosting(itup);
+
+ /* Make sure that we have space for additional heap TIDs */
+ if (finalnitems + nposting > arraynitems)
+ {
+ arraynitems = Max(arraynitems * 2, finalnitems + nposting);
+ ttids = (ItemPointerData *)
+ repalloc(ttids, sizeof(ItemPointerData) * arraynitems);
+ }
+
+ for (int j = 0; j < nposting; j++)
+ {
+ ItemPointer htid = BTreeTupleGetPostingN(itup, j);
+
+ Assert(ItemPointerIsValid(htid));
+ ItemPointerCopy(htid, &ttids[finalnitems]);
+ finalnitems++;
+ }
+ }
+ }
+
+ Assert(finalnitems >= nitems);
+
+ /* determine the actual xid horizon */
+ latestRemovedXid =
+ table_compute_xid_horizon_for_tuples(heapRel, ttids, finalnitems);
+
+ pfree(ttids);
+
+ return latestRemovedXid;
+}
+
/*
* Delete item(s) from a btree page during single-page cleanup.
*
@@ -1067,8 +1211,8 @@ _bt_delitems_delete(Relation rel, Buffer buf,
if (XLogStandbyInfoActive() && RelationNeedsWAL(rel))
latestRemovedXid =
- index_compute_xid_horizon_for_tuples(rel, heapRel, buf,
- itemnos, nitems);
+ _bt_compute_xid_horizon_for_tuples(rel, heapRel, buf,
+ itemnos, nitems);
/* No ereport(ERROR) until changes are logged */
START_CRIT_SECTION();
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 4cfd5289ad..67595319d7 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -97,6 +97,8 @@ static void btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
BTCycleId cycleid, TransactionId *oldestBtpoXact);
static void btvacuumpage(BTVacState *vstate, BlockNumber blkno,
BlockNumber orig_blkno);
+static ItemPointer btreevacuumPosting(BTVacState *vstate, IndexTuple itup,
+ int *nremaining);
/*
@@ -263,8 +265,8 @@ btgettuple(IndexScanDesc scan, ScanDirection dir)
*/
if (so->killedItems == NULL)
so->killedItems = (int *)
- palloc(MaxIndexTuplesPerPage * sizeof(int));
- if (so->numKilled < MaxIndexTuplesPerPage)
+ palloc(MaxPostingIndexTuplesPerPage * sizeof(int));
+ if (so->numKilled < MaxPostingIndexTuplesPerPage)
so->killedItems[so->numKilled++] = so->currPos.itemIndex;
}
@@ -1069,7 +1071,8 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
RBM_NORMAL, info->strategy);
LockBufferForCleanup(buf);
_bt_checkpage(rel, buf);
- _bt_delitems_vacuum(rel, buf, NULL, 0, vstate.lastBlockVacuumed);
+ _bt_delitems_vacuum(rel, buf, NULL, 0, NULL, NULL, 0,
+ vstate.lastBlockVacuumed);
_bt_relbuf(rel, buf);
}
@@ -1193,6 +1196,9 @@ restart:
OffsetNumber offnum,
minoff,
maxoff;
+ IndexTuple remaining[MaxOffsetNumber];
+ OffsetNumber remainingoffset[MaxOffsetNumber];
+ int nremaining;
/*
* Trade in the initial read lock for a super-exclusive write lock on
@@ -1229,6 +1235,7 @@ restart:
* callback function.
*/
ndeletable = 0;
+ nremaining = 0;
minoff = P_FIRSTDATAKEY(opaque);
maxoff = PageGetMaxOffsetNumber(page);
if (callback)
@@ -1242,31 +1249,79 @@ restart:
itup = (IndexTuple) PageGetItem(page,
PageGetItemId(page, offnum));
- htup = &(itup->t_tid);
- /*
- * During Hot Standby we currently assume that
- * XLOG_BTREE_VACUUM records do not produce conflicts. That is
- * only true as long as the callback function depends only
- * upon whether the index tuple refers to heap tuples removed
- * in the initial heap scan. When vacuum starts it derives a
- * value of OldestXmin. Backends taking later snapshots could
- * have a RecentGlobalXmin with a later xid than the vacuum's
- * OldestXmin, so it is possible that row versions deleted
- * after OldestXmin could be marked as killed by other
- * backends. The callback function *could* look at the index
- * tuple state in isolation and decide to delete the index
- * tuple, though currently it does not. If it ever did, we
- * would need to reconsider whether XLOG_BTREE_VACUUM records
- * should cause conflicts. If they did cause conflicts they
- * would be fairly harsh conflicts, since we haven't yet
- * worked out a way to pass a useful value for
- * latestRemovedXid on the XLOG_BTREE_VACUUM records. This
- * applies to *any* type of index that marks index tuples as
- * killed.
- */
- if (callback(htup, callback_state))
- deletable[ndeletable++] = offnum;
+ if (BTreeTupleIsPosting(itup))
+ {
+ int nnewipd = 0;
+ ItemPointer newipd = NULL;
+
+ newipd = btreevacuumPosting(vstate, itup, &nnewipd);
+
+ if (nnewipd == 0)
+ {
+ /*
+ * All TIDs from posting list must be deleted; we can
+ * delete the whole tuple in the regular way.
+ */
+ deletable[ndeletable++] = offnum;
+ }
+ else if (nnewipd == BTreeTupleGetNPosting(itup))
+ {
+ /*
+ * All TIDs from posting tuple must remain. Do
+ * nothing, just clean up.
+ */
+ pfree(newipd);
+ }
+ else if (nnewipd < BTreeTupleGetNPosting(itup))
+ {
+ /* Some TIDs from posting tuple must remain. */
+ Assert(nnewipd > 0);
+ Assert(newipd != NULL);
+
+ /*
+ * Form new tuple that contains only remaining TIDs.
+ * Remember this tuple and the offset of the old tuple
+ * to update it in place.
+ */
+ remainingoffset[nremaining] = offnum;
+ remaining[nremaining] =
+ BTreeFormPostingTuple(itup, newipd, nnewipd);
+ nremaining++;
+ pfree(newipd);
+
+ Assert(IndexTupleSize(itup) <= BTMaxItemSize(page));
+ }
+ }
+ else
+ {
+ htup = &(itup->t_tid);
+
+ /*
+ * During Hot Standby we currently assume that
+ * XLOG_BTREE_VACUUM records do not produce conflicts.
+ * That is only true as long as the callback function
+ * depends only upon whether the index tuple refers to
+ * heap tuples removed in the initial heap scan. When
+ * vacuum starts it derives a value of OldestXmin.
+ * Backends taking later snapshots could have a
+ * RecentGlobalXmin with a later xid than the vacuum's
+ * OldestXmin, so it is possible that row versions deleted
+ * after OldestXmin could be marked as killed by other
+ * backends. The callback function *could* look at the
+ * index tuple state in isolation and decide to delete the
+ * index tuple, though currently it does not. If it ever
+ * did, we would need to reconsider whether
+ * XLOG_BTREE_VACUUM records should cause conflicts. If
+ * they did cause conflicts they would be fairly harsh
+ * conflicts, since we haven't yet worked out a way to
+ * pass a useful value for latestRemovedXid on the
+ * XLOG_BTREE_VACUUM records. This applies to *any* type
+ * of index that marks index tuples as killed.
+ */
+ if (callback(htup, callback_state))
+ deletable[ndeletable++] = offnum;
+ }
}
}
@@ -1274,7 +1329,7 @@ restart:
* Apply any needed deletes. We issue just one _bt_delitems_vacuum()
* call per page, so as to minimize WAL traffic.
*/
- if (ndeletable > 0)
+ if (ndeletable > 0 || nremaining > 0)
{
/*
* Notice that the issued XLOG_BTREE_VACUUM WAL record includes
@@ -1291,6 +1346,7 @@ restart:
* that.
*/
_bt_delitems_vacuum(rel, buf, deletable, ndeletable,
+ remainingoffset, remaining, nremaining,
vstate->lastBlockVacuumed);
/*
@@ -1375,6 +1431,41 @@ restart:
}
}
+/*
+ * btreevacuumPosting() -- vacuums a posting tuple.
+ *
+ * Returns new palloc'd posting list with remaining items.
+ * Posting list size is returned via nremaining.
+ *
+ * If all items are dead,
+ * nremaining is 0 and resulting posting list is NULL.
+ */
+static ItemPointer
+btreevacuumPosting(BTVacState *vstate, IndexTuple itup, int *nremaining)
+{
+ int remaining = 0;
+ int nitem = BTreeTupleGetNPosting(itup);
+ ItemPointer tmpitems = NULL,
+ items = BTreeTupleGetPosting(itup);
+
+ /*
+ * Check each TID in the posting list, saving live TIDs into tmpitems
+ */
+ for (int i = 0; i < nitem; i++)
+ {
+ if (vstate->callback(items + i, vstate->callback_state))
+ continue;
+
+ if (tmpitems == NULL)
+ tmpitems = palloc(sizeof(ItemPointerData) * nitem);
+
+ tmpitems[remaining++] = items[i];
+ }
+
+ *nremaining = remaining;
+ return tmpitems;
+}
+
/*
* btcanreturn() -- Check whether btree indexes support index-only scans.
*
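The way btvacuumpage() and btreevacuumPosting() above treat a posting tuple can be sketched without any backend dependencies. This is only an illustration under assumptions: DemoTid, demo_dead_cb, demo_on_block(), demo_vacuum_posting() and demo_classify() are hypothetical names, malloc() stands in for palloc(), and the callback plays the role of the bulk-delete callback that reports whether a heap TID is dead.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for ItemPointerData: a (block, offset) heap TID. */
typedef struct DemoTid
{
    uint32_t    block;
    uint16_t    offset;
} DemoTid;

/* Hypothetical callback type: returns true when the heap TID is dead. */
typedef bool (*demo_dead_cb) (const DemoTid *tid, void *state);

/* Example callback: every TID on the block number in *state is dead. */
static bool
demo_on_block(const DemoTid *tid, void *state)
{
    return tid->block == *(const uint32_t *) state;
}

/*
 * Return a newly allocated array holding the live TIDs of a posting list,
 * with their count in *nremaining.  Returns NULL (and 0) when every TID is
 * dead; the array is only allocated once the first live TID is seen.
 */
static DemoTid *
demo_vacuum_posting(const DemoTid *items, int nitems,
                    demo_dead_cb is_dead, void *state, int *nremaining)
{
    DemoTid    *live = NULL;
    int         n = 0;

    for (int i = 0; i < nitems; i++)
    {
        if (is_dead(&items[i], state))
            continue;

        if (live == NULL)
        {
            live = malloc(sizeof(DemoTid) * (size_t) nitems);
            if (live == NULL)
                abort();
        }
        live[n++] = items[i];
    }

    *nremaining = n;
    return live;
}

/* The three outcomes the page-level vacuum code distinguishes. */
typedef enum
{
    DEMO_DELETE_TUPLE,          /* no TIDs left: delete the whole tuple */
    DEMO_KEEP_TUPLE,            /* nothing dead: leave the tuple alone */
    DEMO_REWRITE_TUPLE          /* some TIDs left: rebuild a smaller tuple */
} DemoVacuumAction;

static DemoVacuumAction
demo_classify(int nitems, int nremaining)
{
    if (nremaining == 0)
        return DEMO_DELETE_TUPLE;
    if (nremaining == nitems)
        return DEMO_KEEP_TUPLE;
    return DEMO_REWRITE_TUPLE;
}
```

Only the rewrite case needs the extra "remaining" arrays that the patch passes to _bt_delitems_vacuum(); the other two cases reduce to the pre-existing delete-or-skip behaviour.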
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 8e512461a0..c78c8e67b5 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -26,10 +26,18 @@
static void _bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp);
static OffsetNumber _bt_binsrch(Relation rel, BTScanInsert key, Buffer buf);
+static int _bt_binsrch_posting(BTScanInsert key, Page page,
+ OffsetNumber offnum);
static bool _bt_readpage(IndexScanDesc scan, ScanDirection dir,
OffsetNumber offnum);
static void _bt_saveitem(BTScanOpaque so, int itemIndex,
OffsetNumber offnum, IndexTuple itup);
+static void _bt_setuppostingitems(BTScanOpaque so, int itemIndex,
+ OffsetNumber offnum, ItemPointer iptr,
+ IndexTuple itup);
+static inline void _bt_savepostingitem(BTScanOpaque so, int itemIndex,
+ OffsetNumber offnum, ItemPointer iptr,
+ IndexTuple itup);
static bool _bt_steppage(IndexScanDesc scan, ScanDirection dir);
static bool _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir);
static bool _bt_parallel_readpage(IndexScanDesc scan, BlockNumber blkno,
@@ -434,7 +442,10 @@ _bt_binsrch(Relation rel,
* low) makes bounds invalid.
*
* Caller is responsible for invalidating bounds when it modifies the page
- * before calling here a second time.
+ * before calling here a second time, and for dealing with posting list
+ * tuple matches (callers can use insertstate's in_posting_offset field to
+ * determine which existing heap TID will need to be replaced by their
+ * scantid/new heap TID).
*/
OffsetNumber
_bt_binsrch_insert(Relation rel, BTInsertState insertstate)
@@ -453,6 +464,7 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
Assert(P_ISLEAF(opaque));
Assert(!key->nextkey);
+ Assert(insertstate->in_posting_offset == 0);
if (!insertstate->bounds_valid)
{
@@ -509,6 +521,17 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
if (result != 0)
stricthigh = high;
}
+
+ /*
+ * If tuple at offset located by binary search is a posting list whose
+ * TID range overlaps with caller's scantid, perform posting list
+ * binary search to set in_posting_offset for caller. Caller must
+ * split the posting list when in_posting_offset is set. This should
+ * happen infrequently.
+ */
+ if (unlikely(result == 0 && key->scantid != NULL))
+ insertstate->in_posting_offset =
+ _bt_binsrch_posting(key, page, mid);
}
/*
@@ -528,6 +551,68 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
return low;
}
+/*----------
+ * _bt_binsrch_posting() -- posting list binary search.
+ *
+ * Returns offset into posting list where caller's scantid belongs.
+ *----------
+ */
+static int
+_bt_binsrch_posting(BTScanInsert key, Page page, OffsetNumber offnum)
+{
+ IndexTuple itup;
+ ItemId itemid;
+ int low,
+ high,
+ mid,
+ res;
+
+ /*
+ * If this isn't a posting tuple, then the index must be corrupt (if it is
+ * an ordinary non-pivot tuple then there must be an existing tuple with a
+ * heap TID that equals inserter's new heap TID/scantid). Defensively
+ * check that tuple is a posting list tuple whose posting list range
+ * includes caller's scantid.
+ *
+ * (This is also needed because contrib/amcheck's rootdescend option needs
+ * to be able to relocate a non-pivot tuple using _bt_binsrch_insert().)
+ */
+ Assert(P_ISLEAF((BTPageOpaque) PageGetSpecialPointer(page)));
+ Assert(!key->nextkey);
+ Assert(key->scantid != NULL);
+ itemid = PageGetItemId(page, offnum);
+ itup = (IndexTuple) PageGetItem(page, itemid);
+ if (!BTreeTupleIsPosting(itup))
+ return 0;
+
+ /*
+ * In the unlikely event that posting list tuple has LP_DEAD bit set,
+ * signal to caller that it should kill the item and restart its binary
+ * search.
+ */
+ if (ItemIdIsDead(itemid))
+ return -1;
+
+ /* "high" is past end of posting list for loop invariant */
+ low = 0;
+ high = BTreeTupleGetNPosting(itup);
+ Assert(high >= 2);
+
+ while (high > low)
+ {
+ mid = low + ((high - low) / 2);
+ res = ItemPointerCompare(key->scantid,
+ BTreeTupleGetPostingN(itup, mid));
+
+ if (res >= 1)
+ low = mid + 1;
+ else
+ high = mid;
+ }
+
+ return low;
+}
+
/*----------
* _bt_compare() -- Compare insertion-type scankey to tuple on a page.
*
@@ -537,9 +622,18 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
* <0 if scankey < tuple at offnum;
* 0 if scankey == tuple at offnum;
* >0 if scankey > tuple at offnum.
- * NULLs in the keys are treated as sortable values. Therefore
- * "equality" does not necessarily mean that the item should be
- * returned to the caller as a matching key!
+ *
+ * NULLs in the keys are treated as sortable values. Therefore
+ * "equality" does not necessarily mean that the item should be returned
+ * to the caller as a matching key. Similarly, an insertion scankey
+ * with its scantid set is treated as equal to a posting tuple whose TID
+ * range overlaps with that scantid. There generally won't be a
+ * matching TID in the posting tuple, which the caller must handle
+ * itself (e.g., by splitting the posting list tuple).
+ *
+ * It is generally guaranteed that any possible scankey with scantid set
+ * will have zero or one tuples in the index that are considered equal
+ * here.
*
* CRUCIAL NOTE: on a non-leaf page, the first data key is assumed to be
* "minus infinity": this routine will always claim it is less than the
@@ -563,6 +657,7 @@ _bt_compare(Relation rel,
ScanKey scankey;
int ncmpkey;
int ntupatts;
+ int32 result;
Assert(_bt_check_natts(rel, key->heapkeyspace, page, offnum));
Assert(key->keysz <= IndexRelationGetNumberOfKeyAttributes(rel));
@@ -597,7 +692,6 @@ _bt_compare(Relation rel,
{
Datum datum;
bool isNull;
- int32 result;
datum = index_getattr(itup, scankey->sk_attno, itupdesc, &isNull);
@@ -713,8 +807,24 @@ _bt_compare(Relation rel,
if (heapTid == NULL)
return 1;
+ /*
+ * scankey must be treated as equal to a posting list tuple if its scantid
+ * value falls within the range of the posting list. In all other cases
+ * there can only be a single heap TID value, which is compared directly
+ * as a simple scalar value.
+ */
Assert(ntupatts >= IndexRelationGetNumberOfKeyAttributes(rel));
- return ItemPointerCompare(key->scantid, heapTid);
+ result = ItemPointerCompare(key->scantid, heapTid);
+ if (!BTreeTupleIsPosting(itup) || result <= 0)
+ return result;
+ else
+ {
+ result = ItemPointerCompare(key->scantid, BTreeTupleGetMaxTID(itup));
+ if (result > 0)
+ return 1;
+ }
+
+ return 0;
}
/*
@@ -1451,6 +1561,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
/* initialize tuple workspace to empty */
so->currPos.nextTupleOffset = 0;
+ so->currPos.postingTupleOffset = 0;
/*
* Now that the current page has been made consistent, the macro should be
@@ -1485,8 +1596,30 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
if (_bt_checkkeys(scan, itup, indnatts, dir, &continuescan))
{
/* tuple passes all scan key conditions, so remember it */
- _bt_saveitem(so, itemIndex, offnum, itup);
- itemIndex++;
+ if (!BTreeTupleIsPosting(itup))
+ {
+ _bt_saveitem(so, itemIndex, offnum, itup);
+ itemIndex++;
+ }
+ else
+ {
+ /*
+ * Setup state to return posting list, and save first
+ * "logical" tuple
+ */
+ _bt_setuppostingitems(so, itemIndex, offnum,
+ BTreeTupleGetPostingN(itup, 0),
+ itup);
+ itemIndex++;
+ /* Save additional posting list "logical" tuples */
+ for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
+ {
+ _bt_savepostingitem(so, itemIndex, offnum,
+ BTreeTupleGetPostingN(itup, i),
+ itup);
+ itemIndex++;
+ }
+ }
}
/* When !continuescan, there can't be any more matches, so stop */
if (!continuescan)
@@ -1519,7 +1652,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
if (!continuescan)
so->currPos.moreRight = false;
- Assert(itemIndex <= MaxIndexTuplesPerPage);
+ Assert(itemIndex <= MaxPostingIndexTuplesPerPage);
so->currPos.firstItem = 0;
so->currPos.lastItem = itemIndex - 1;
so->currPos.itemIndex = 0;
@@ -1527,7 +1660,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
else
{
/* load items[] in descending order */
- itemIndex = MaxIndexTuplesPerPage;
+ itemIndex = MaxPostingIndexTuplesPerPage;
offnum = Min(offnum, maxoff);
@@ -1569,8 +1702,37 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
if (passes_quals && tuple_alive)
{
/* tuple passes all scan key conditions, so remember it */
- itemIndex--;
- _bt_saveitem(so, itemIndex, offnum, itup);
+ if (!BTreeTupleIsPosting(itup))
+ {
+ itemIndex--;
+ _bt_saveitem(so, itemIndex, offnum, itup);
+ }
+ else
+ {
+ int i = BTreeTupleGetNPosting(itup) - 1;
+
+ /*
+ * Setup state to return posting list, and save last
+ * "logical" tuple from posting list (since it's the first
+ * that will be returned to scan).
+ */
+ itemIndex--;
+ _bt_setuppostingitems(so, itemIndex, offnum,
+ BTreeTupleGetPostingN(itup, i--),
+ itup);
+
+ /*
+ * Return posting list "logical" tuples -- do this in
+ * descending order, to match overall scan order
+ */
+ for (; i >= 0; i--)
+ {
+ itemIndex--;
+ _bt_savepostingitem(so, itemIndex, offnum,
+ BTreeTupleGetPostingN(itup, i),
+ itup);
+ }
+ }
}
if (!continuescan)
{
@@ -1584,8 +1746,8 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
Assert(itemIndex >= 0);
so->currPos.firstItem = itemIndex;
- so->currPos.lastItem = MaxIndexTuplesPerPage - 1;
- so->currPos.itemIndex = MaxIndexTuplesPerPage - 1;
+ so->currPos.lastItem = MaxPostingIndexTuplesPerPage - 1;
+ so->currPos.itemIndex = MaxPostingIndexTuplesPerPage - 1;
}
return (so->currPos.firstItem <= so->currPos.lastItem);
@@ -1598,6 +1760,8 @@ _bt_saveitem(BTScanOpaque so, int itemIndex,
{
BTScanPosItem *currItem = &so->currPos.items[itemIndex];
+ Assert(!BTreeTupleIsPosting(itup));
+
currItem->heapTid = itup->t_tid;
currItem->indexOffset = offnum;
if (so->currTuples)
@@ -1610,6 +1774,61 @@ _bt_saveitem(BTScanOpaque so, int itemIndex,
}
}
+/*
+ * Set up state to save posting items from a single posting list tuple.
+ *
+ * In passing, saves an index item into so->currPos.items[itemIndex] for the
+ * logical tuple that is returned to the scan first. The second and
+ * subsequent heap TIDs of the posting list should be saved by calling
+ * _bt_savepostingitem().
+ */
+static void
+_bt_setuppostingitems(BTScanOpaque so, int itemIndex, OffsetNumber offnum,
+ ItemPointer iptr, IndexTuple itup)
+{
+ BTScanPosItem *currItem = &so->currPos.items[itemIndex];
+
+ currItem->heapTid = *iptr;
+ currItem->indexOffset = offnum;
+
+ if (so->currTuples)
+ {
+ /* Save a truncated version of the IndexTuple */
+ Size itupsz = BTreeTupleGetPostingOffset(itup);
+
+ itupsz = MAXALIGN(itupsz);
+ currItem->tupleOffset = so->currPos.nextTupleOffset;
+ memcpy(so->currTuples + so->currPos.nextTupleOffset, itup, itupsz);
+ so->currPos.nextTupleOffset += itupsz;
+ so->currPos.postingTupleOffset = currItem->tupleOffset;
+ }
+}
+
+/*
+ * Save an index item into so->currPos.items[itemIndex] for posting tuple.
+ *
+ * Assumes that _bt_setuppostingitems() has already been called for current
+ * posting list tuple.
+ */
+static inline void
+_bt_savepostingitem(BTScanOpaque so, int itemIndex, OffsetNumber offnum,
+ ItemPointer iptr, IndexTuple itup)
+{
+ BTScanPosItem *currItem = &so->currPos.items[itemIndex];
+
+ currItem->heapTid = *iptr;
+ currItem->indexOffset = offnum;
+
+ if (so->currTuples)
+ {
+ /*
+ * Have index-only scans return the same truncated IndexTuple for
+ * every logical tuple that originates from the same posting list
+ */
+ currItem->tupleOffset = so->currPos.postingTupleOffset;
+ }
+}
+
/*
* _bt_steppage() -- Step to next page containing valid data for scan
*
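A note for readers of the nbtsearch.c changes above: the two new search rules can be tried outside the backend with a small standalone sketch. Everything named demo_* below is hypothetical and is not part of the patch. DemoTid is only a stand-in for ItemPointerData, and the posting list is a plain sorted array rather than an on-page tuple. demo_posting_search() mirrors the lower-bound loop of _bt_binsrch_posting(), and demo_compare_posting() mirrors the range-overlap rule added to the tail of _bt_compare().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for ItemPointerData: a (block, offset) heap address. */
typedef struct DemoTid
{
    uint32_t block;
    uint16_t offset;
} DemoTid;

/* Three-way comparison, ordered by block number and then by offset. */
static int
demo_tid_compare(const DemoTid *a, const DemoTid *b)
{
    if (a->block != b->block)
        return (a->block < b->block) ? -1 : 1;
    if (a->offset != b->offset)
        return (a->offset < b->offset) ? -1 : 1;
    return 0;
}

/*
 * Lower-bound search over a sorted posting list: returns the first index
 * whose TID is >= scantid, i.e. the slot where scantid would be inserted.
 */
static int
demo_posting_search(const DemoTid *posting, int nposting,
                    const DemoTid *scantid)
{
    int low = 0;
    int high = nposting;    /* one past the end, for the loop invariant */

    assert(posting != NULL && scantid != NULL && nposting >= 0);

    while (high > low)
    {
        int mid = low + (high - low) / 2;

        if (demo_tid_compare(scantid, &posting[mid]) > 0)
            low = mid + 1;
        else
            high = mid;
    }

    return low;
}

/*
 * Range rule: negative if scantid sorts before the whole list, positive if
 * it sorts after the whole list, and 0 if it falls within [min TID, max TID].
 */
static int
demo_compare_posting(const DemoTid *posting, int nposting,
                     const DemoTid *scantid)
{
    int result;

    assert(posting != NULL && scantid != NULL && nposting >= 1);

    result = demo_tid_compare(scantid, &posting[0]);
    if (result <= 0)
        return result;
    if (demo_tid_compare(scantid, &posting[nposting - 1]) > 0)
        return 1;
    return 0;
}
```

In this sketch a result of 0 from demo_compare_posting() does not mean that the TID is present in the list, only that it falls inside the list's range. That is the case in which the patch expects the caller to split the posting list at the offset found by the binary search.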
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index ab19692006..a2484f3e3b 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -288,6 +288,8 @@ static void _bt_sortaddtup(Page page, Size itemsize,
static void _bt_buildadd(BTWriteState *wstate, BTPageState *state,
IndexTuple itup);
static void _bt_uppershutdown(BTWriteState *wstate, BTPageState *state);
+static void _bt_buildadd_posting(BTWriteState *wstate, BTPageState *state,
+ BTDedupState *dedupState);
static void _bt_load(BTWriteState *wstate,
BTSpool *btspool, BTSpool *btspool2);
static void _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent,
@@ -830,6 +832,8 @@ _bt_sortaddtup(Page page,
* the high key is to be truncated, offset 1 is deleted, and we insert
* the truncated high key at offset 1.
*
+ * Note that itup may be a posting list tuple.
+ *
* 'last' pointer indicates the last offset added to the page.
*----------
*/
@@ -963,6 +967,11 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
* Overwrite the old item with new truncated high key directly.
* oitup is already located at the physical beginning of tuple
* space, so this should directly reuse the existing tuple space.
+ *
+ * If lastleft tuple was a posting tuple, _bt_truncate will truncate
+ * its posting list as well. Note that this applies only to leaf
+ * pages, since internal pages never contain posting tuples.
+ * Pivot tuples therefore never carry a posting list.
*/
ii = PageGetItemId(opage, OffsetNumberPrev(last_off));
lastleft = (IndexTuple) PageGetItem(opage, ii);
@@ -1002,6 +1011,7 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
* the minimum key for the new page.
*/
state->btps_minkey = CopyIndexTuple(oitup);
+ Assert(BTreeTupleIsPivot(state->btps_minkey));
/*
* Set the sibling links for both pages.
@@ -1043,6 +1053,7 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
Assert(state->btps_minkey == NULL);
state->btps_minkey = CopyIndexTuple(itup);
/* _bt_sortaddtup() will perform full truncation later */
+ BTreeTupleClearBtIsPosting(state->btps_minkey);
BTreeTupleSetNAtts(state->btps_minkey, 0);
}
@@ -1127,6 +1138,112 @@ _bt_uppershutdown(BTWriteState *wstate, BTPageState *state)
_bt_blwritepage(wstate, metapage, BTREE_METAPAGE);
}
+/*
+ * Add new tuple (posting or non-posting) to the page while building index.
+ */
+static void
+_bt_buildadd_posting(BTWriteState *wstate, BTPageState *state,
+ BTDedupState *dedupState)
+{
+ IndexTuple to_insert;
+
+ /* Return, if there is no tuple to insert */
+ if (state == NULL)
+ return;
+
+ if (dedupState->ntuples == 0)
+ to_insert = dedupState->itupprev;
+ else
+ {
+ IndexTuple postingtuple;
+
+ /* form a tuple with a posting list */
+ postingtuple = BTreeFormPostingTuple(dedupState->itupprev,
+ dedupState->ipd,
+ dedupState->ntuples);
+ to_insert = postingtuple;
+ pfree(dedupState->ipd);
+ }
+
+ _bt_buildadd(wstate, state, to_insert);
+
+ if (dedupState->ntuples > 0)
+ pfree(to_insert);
+ dedupState->ntuples = 0;
+}
+
+/*
+ * Save item pointer(s) of itup to the posting list in dedupState.
+ *
+ * 'itup' is current tuple on page, which comes immediately after equal
+ * 'itupprev' tuple stashed in dedup state at the point we're called.
+ *
+ * Helper function for _bt_load() and _bt_dedup_one_page(), called when it
+ * becomes clear that pending itupprev item will be part of a new/pending
+ * posting list, or when a pending/new posting list will contain a new heap
+ * TID from itup.
+ *
+ * Note: caller is responsible for the BTMaxItemSize() check.
+ */
+void
+_bt_stash_item_tid(BTDedupState *dedupState, IndexTuple itup)
+{
+ int nposting = 0;
+
+ if (dedupState->ntuples == 0)
+ {
+ dedupState->ipd = palloc0(dedupState->maxitemsize);
+
+ /*
+ * itupprev hasn't had its posting list TIDs copied into ipd yet (must
+ * have been first on page and/or in new posting list?). Do so now.
+ *
+ * This is delayed because it wasn't initially clear whether or not
+ * itupprev would be merged with the next tuple, or stay as-is. By
+ * now caller compared it against itup and found that it was equal, so
+ * we can go ahead and add its TIDs.
+ */
+ if (!BTreeTupleIsPosting(dedupState->itupprev))
+ {
+ memcpy(dedupState->ipd, dedupState->itupprev,
+ sizeof(ItemPointerData));
+ dedupState->ntuples++;
+ }
+ else
+ {
+ /* if itupprev is posting, add all its TIDs to the posting list */
+ nposting = BTreeTupleGetNPosting(dedupState->itupprev);
+ memcpy(dedupState->ipd,
+ BTreeTupleGetPosting(dedupState->itupprev),
+ sizeof(ItemPointerData) * nposting);
+ dedupState->ntuples += nposting;
+ }
+ }
+
+ /*
+ * Add current tup to ipd for pending posting list for new version of
+ * page.
+ */
+ if (!BTreeTupleIsPosting(itup))
+ {
+ memcpy(dedupState->ipd + dedupState->ntuples, itup,
+ sizeof(ItemPointerData));
+ dedupState->ntuples++;
+ }
+ else
+ {
+ /*
+ * if tuple is posting, add all its TIDs to the pending list that will
+ * become new posting list later on
+ */
+ nposting = BTreeTupleGetNPosting(itup);
+ memcpy(dedupState->ipd + dedupState->ntuples,
+ BTreeTupleGetPosting(itup),
+ sizeof(ItemPointerData) * nposting);
+ dedupState->ntuples += nposting;
+ }
+}
+
/*
* Read tuples in correct sort order from tuplesort, and load them into
* btree leaves.
@@ -1141,9 +1258,20 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
bool load1;
TupleDesc tupdes = RelationGetDescr(wstate->index);
int i,
- keysz = IndexRelationGetNumberOfKeyAttributes(wstate->index);
+ keysz = IndexRelationGetNumberOfKeyAttributes(wstate->index),
+ natts = IndexRelationGetNumberOfAttributes(wstate->index);
SortSupport sortKeys;
int64 tuples_done = 0;
+ bool deduplicate = false;
+ BTDedupState *dedupState = NULL;
+
+ /*
+ * Don't use deduplication for indexes with INCLUDEd columns or for
+ * unique indexes
+ */
+ deduplicate = (IndexRelationGetNumberOfKeyAttributes(wstate->index) ==
+ IndexRelationGetNumberOfAttributes(wstate->index) &&
+ !wstate->index->rd_index->indisunique);
if (merge)
{
@@ -1257,19 +1385,88 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
}
else
{
- /* merge is unnecessary */
- while ((itup = tuplesort_getindextuple(btspool->sortstate,
- true)) != NULL)
+ if (!deduplicate)
{
- /* When we see first tuple, create first index page */
- if (state == NULL)
- state = _bt_pagestate(wstate, 0);
+ /* merge is unnecessary */
+ while ((itup = tuplesort_getindextuple(btspool->sortstate,
+ true)) != NULL)
+ {
+ /* When we see first tuple, create first index page */
+ if (state == NULL)
+ state = _bt_pagestate(wstate, 0);
- _bt_buildadd(wstate, state, itup);
+ _bt_buildadd(wstate, state, itup);
- /* Report progress */
- pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
- ++tuples_done);
+ /* Report progress */
+ pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
+ ++tuples_done);
+ }
+ }
+ else
+ {
+ /* init deduplication state needed to build posting tuples */
+ dedupState = (BTDedupState *) palloc0(sizeof(BTDedupState));
+ dedupState->ipd = NULL;
+ dedupState->ntuples = 0;
+ dedupState->itupprev = NULL;
+ dedupState->maxitemsize = 0;
+ dedupState->maxpostingsize = 0;
+
+ while ((itup = tuplesort_getindextuple(btspool->sortstate,
+ true)) != NULL)
+ {
+ /* When we see first tuple, create first index page */
+ if (state == NULL)
+ {
+ state = _bt_pagestate(wstate, 0);
+ dedupState->maxitemsize = BTMaxItemSize(state->btps_page);
+ }
+
+ if (dedupState->itupprev != NULL)
+ {
+ int n_equal_atts = _bt_keep_natts_fast(wstate->index,
+ dedupState->itupprev, itup);
+
+ if (n_equal_atts > natts)
+ {
+ /*
+ * Tuples are equal. Create or update posting.
+ *
+ * Otherwise, if the posting list is too big, insert it on
+ * the page and continue.
+ */
+ if ((dedupState->ntuples + 1) * sizeof(ItemPointerData) <
+ dedupState->maxpostingsize)
+ _bt_stash_item_tid(dedupState, itup);
+ else
+ _bt_buildadd_posting(wstate, state, dedupState);
+ }
+ else
+ {
+ /*
+ * Tuples are not equal. Insert itupprev into index.
+ * Save current tuple for the next iteration.
+ */
+ _bt_buildadd_posting(wstate, state, dedupState);
+ }
+ }
+
+ /*
+ * Save the tuple to compare it with the next one and maybe
+ * unite them into a posting tuple.
+ */
+ if (dedupState->itupprev)
+ pfree(dedupState->itupprev);
+ dedupState->itupprev = CopyIndexTuple(itup);
+
+ /* compute max size of posting list */
+ dedupState->maxpostingsize = dedupState->maxitemsize -
+ IndexInfoFindDataOffset(dedupState->itupprev->t_info) -
+ MAXALIGN(IndexTupleSize(dedupState->itupprev));
+ }
+
+ /* Handle the last item */
+ _bt_buildadd_posting(wstate, state, dedupState);
}
}
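A note for readers of the nbtsort.c changes above: the grouping done by the deduplicating branch of _bt_load() can be approximated by the standalone sketch below. All demo_* names are hypothetical. The sketch assumes integer keys, and it caps a pending posting list by a TID count (max_tids), whereas the patch bounds it by bytes through maxpostingsize. It also records groups as index ranges instead of building real posting tuples.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical simplified index entry: an integer key plus one heap TID. */
typedef struct DemoItem
{
    int      key;
    unsigned tid;
} DemoItem;

/*
 * One output tuple: a key and the range [first, first + ntids) of input
 * items whose TIDs it absorbs.
 */
typedef struct DemoGroup
{
    int    key;
    size_t first;
    size_t ntids;
} DemoGroup;

/*
 * Walk items (sorted by key, then by tid) and merge runs of equal keys.
 * A run is cut once it already holds max_tids entries, so a long run of
 * duplicates becomes several groups.  "out" must have room for n groups.
 * Returns the number of groups written.
 */
static size_t
demo_dedup_sorted(const DemoItem *items, size_t n, size_t max_tids,
                  DemoGroup *out)
{
    size_t ngroups = 0;

    assert(max_tids >= 1);
    assert(n == 0 || (items != NULL && out != NULL));

    for (size_t i = 0; i < n; i++)
    {
        DemoGroup *pending = (ngroups > 0) ? &out[ngroups - 1] : NULL;

        /* Equal key and still room: extend the pending group. */
        if (pending != NULL && pending->key == items[i].key &&
            pending->ntids < max_tids)
        {
            pending->ntids++;
            continue;
        }

        /* Different key, or pending group is full: start a new group. */
        out[ngroups].key = items[i].key;
        out[ngroups].first = i;
        out[ngroups].ntids = 1;
        ngroups++;
    }

    return ngroups;
}
```

A group with ntids == 1 corresponds to an ordinary tuple that is written as-is, and a group with more entries corresponds to a tuple that would be formed with a posting list.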
diff --git a/src/backend/access/nbtree/nbtsplitloc.c b/src/backend/access/nbtree/nbtsplitloc.c
index 1c1029b6c4..54cecc85c5 100644
--- a/src/backend/access/nbtree/nbtsplitloc.c
+++ b/src/backend/access/nbtree/nbtsplitloc.c
@@ -183,6 +183,9 @@ _bt_findsplitloc(Relation rel,
state.minfirstrightsz = SIZE_MAX;
state.newitemoff = newitemoff;
+ /* newitem cannot be a posting list item */
+ Assert(!BTreeTupleIsPosting(newitem));
+
/*
* maxsplits should never exceed maxoff because there will be at most as
* many candidate split points as there are points _between_ tuples, once
@@ -459,17 +462,52 @@ _bt_recsplitloc(FindSplitData *state,
int16 leftfree,
rightfree;
Size firstrightitemsz;
+ Size postingsubhikey = 0;
bool newitemisfirstonright;
/* Is the new item going to be the first item on the right page? */
newitemisfirstonright = (firstoldonright == state->newitemoff
&& !newitemonleft);
+ /*
+ * FIXME: Accessing every single tuple like this adds cycles to cases that
+ * cannot possibly benefit (i.e. cases where we know that there cannot be
+ * posting lists). Maybe we should add a way to not bother when we are
+ * certain that this is the case.
+ *
+ * We could either have _bt_split() pass us a flag, or invent a page flag
+ * that indicates that the page might have posting lists, as an
+ * optimization. There is no shortage of btpo_flags bits for stuff like
+ * this.
+ */
if (newitemisfirstonright)
+ {
firstrightitemsz = state->newitemsz;
+
+ /* Calculate posting list overhead, if any */
+ if (state->is_leaf && BTreeTupleIsPosting(state->newitem))
+ postingsubhikey = IndexTupleSize(state->newitem) -
+ BTreeTupleGetPostingOffset(state->newitem);
+ }
else
+ {
firstrightitemsz = firstoldonrightsz;
+ /* Calculate posting list overhead, if any */
+ if (state->is_leaf)
+ {
+ ItemId itemid;
+ IndexTuple newhighkey;
+
+ itemid = PageGetItemId(state->page, firstoldonright);
+ newhighkey = (IndexTuple) PageGetItem(state->page, itemid);
+
+ if (BTreeTupleIsPosting(newhighkey))
+ postingsubhikey = IndexTupleSize(newhighkey) -
+ BTreeTupleGetPostingOffset(newhighkey);
+ }
+ }
+
/* Account for all the old tuples */
leftfree = state->leftspace - olddataitemstoleft;
rightfree = state->rightspace -
@@ -492,9 +530,13 @@ _bt_recsplitloc(FindSplitData *state,
* adding a heap TID to the left half's new high key when splitting at the
* leaf level. In practice the new high key will often be smaller and
* will rarely be larger, but conservatively assume the worst case.
+ * Truncation always truncates away any posting list that appears in the
+ * first right tuple, though, so it's safe to subtract that overhead
+ * (while still conservatively assuming that truncation might have to add
+ * back a single heap TID using the pivot tuple heap TID representation).
*/
if (state->is_leaf)
- leftfree -= (int16) (firstrightitemsz +
+ leftfree -= (int16) ((firstrightitemsz - postingsubhikey) +
MAXALIGN(sizeof(ItemPointerData)));
else
leftfree -= (int16) firstrightitemsz;
@@ -691,7 +733,8 @@ _bt_afternewitemoff(FindSplitData *state, OffsetNumber maxoff,
itemid = PageGetItemId(state->page, OffsetNumberPrev(state->newitemoff));
tup = (IndexTuple) PageGetItem(state->page, itemid);
/* Do cheaper test first */
- if (!_bt_adjacenthtid(&tup->t_tid, &state->newitem->t_tid))
+ if (BTreeTupleIsPosting(tup) ||
+ !_bt_adjacenthtid(&tup->t_tid, &state->newitem->t_tid))
return false;
/* Check same conditions as rightmost item case, too */
keepnatts = _bt_keep_natts_fast(state->rel, tup, state->newitem);
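A note for readers of the nbtsplitloc.c changes above: the adjusted leaf-level charge in _bt_recsplitloc() is plain arithmetic and can be checked with a standalone sketch. The demo_* names are hypothetical, and the 8-byte maximum alignment and 6-byte heap TID are assumptions made for the example only. The function subtracts the posting list portion of the first right tuple (what the patch calls postingsubhikey) and then adds back room for a single aligned heap TID.

```c
#include <assert.h>
#include <stddef.h>

/* Assumptions for this sketch: 8-byte maximum alignment, 6-byte heap TID. */
#define DEMO_MAXALIGN(len)  (((size_t) (len) + 7) & ~(size_t) 7)
#define DEMO_TID_SIZE       6

/*
 * Worst-case space charged against the left page for its new high key.
 *
 * tuple_size     - full size of the first right tuple
 * posting_offset - byte offset at which its posting list starts, or 0 when
 *                  the tuple is not a posting list tuple
 */
static size_t
demo_leaf_highkey_charge(size_t tuple_size, size_t posting_offset)
{
    size_t posting_overhead = 0;

    assert(posting_offset <= tuple_size);

    /* The posting list is always truncated away, so do not charge for it. */
    if (posting_offset > 0)
        posting_overhead = tuple_size - posting_offset;

    /* Conservatively assume truncation must add back one heap TID. */
    return (tuple_size - posting_overhead) + DEMO_MAXALIGN(DEMO_TID_SIZE);
}
```

Under these assumptions, a 64-byte posting tuple whose posting list starts at byte 16 is charged the same as a 16-byte plain tuple, since only the key portion can survive in the high key.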
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 4c7b2d0966..e3d7f4ff0e 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -97,8 +97,6 @@ _bt_mkscankey(Relation rel, IndexTuple itup)
indoption = rel->rd_indoption;
tupnatts = itup ? BTreeTupleGetNAtts(itup, rel) : 0;
- Assert(tupnatts <= IndexRelationGetNumberOfAttributes(rel));
-
/*
* We'll execute search using scan key constructed on key columns.
* Truncated attributes and non-key attributes are omitted from the final
@@ -110,9 +108,20 @@ _bt_mkscankey(Relation rel, IndexTuple itup)
key->anynullkeys = false; /* initial assumption */
key->nextkey = false;
key->pivotsearch = false;
+ key->scantid = NULL;
key->keysz = Min(indnkeyatts, tupnatts);
- key->scantid = key->heapkeyspace && itup ?
- BTreeTupleGetHeapTID(itup) : NULL;
+
+ Assert(tupnatts <= IndexRelationGetNumberOfAttributes(rel));
+ Assert(!itup || !BTreeTupleIsPosting(itup) || key->heapkeyspace);
+
+ /*
+ * When caller passes a tuple with a heap TID, use it to set scantid. Note
+ * that this handles posting list tuples by setting scantid to the lowest
+ * heap TID in the posting list.
+ */
+ if (itup && key->heapkeyspace)
+ key->scantid = BTreeTupleGetHeapTID(itup);
+
skey = key->scankeys;
for (i = 0; i < indnkeyatts; i++)
{
@@ -1786,10 +1795,35 @@ _bt_killitems(IndexScanDesc scan)
{
ItemId iid = PageGetItemId(page, offnum);
IndexTuple ituple = (IndexTuple) PageGetItem(page, iid);
+ bool killtuple = false;
- if (ItemPointerEquals(&ituple->t_tid, &kitem->heapTid))
+ if (BTreeTupleIsPosting(ituple))
{
- /* found the item */
+ int pi = i + 1;
+ int nposting = BTreeTupleGetNPosting(ituple);
+ int j;
+
+ for (j = 0; j < nposting; j++)
+ {
+ ItemPointer item = BTreeTupleGetPostingN(ituple, j);
+
+ if (!ItemPointerEquals(item, &kitem->heapTid))
+ break; /* out of posting list loop */
+
+ /* Read-ahead to later kitems */
+ if (pi < numKilled)
+ kitem = &so->currPos.items[so->killedItems[pi++]];
+ }
+
+ if (j == nposting)
+ killtuple = true;
+ }
+ else if (ItemPointerEquals(&ituple->t_tid, &kitem->heapTid))
+ killtuple = true;
+
+ if (killtuple)
+ {
+ /* found the item/all posting list items */
ItemIdMarkDead(iid);
killedsomething = true;
break; /* out of inner search loop */
@@ -2145,6 +2179,24 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
pivot = index_truncate_tuple(itupdesc, firstright, keepnatts);
+ if (BTreeTupleIsPosting(firstright))
+ {
+ BTreeTupleClearBtIsPosting(pivot);
+ BTreeTupleSetNAtts(pivot, keepnatts);
+ if (keepnatts == natts)
+ {
+ /*
+ * index_truncate_tuple() just returned a copy of the
+ * original, so make sure that the size of the new pivot tuple
+ * doesn't have posting list overhead
+ */
+ pivot->t_info &= ~INDEX_SIZE_MASK;
+ pivot->t_info |= MAXALIGN(BTreeTupleGetPostingOffset(firstright));
+ }
+ }
+
+ Assert(!BTreeTupleIsPosting(pivot));
+
/*
* If there is a distinguishing key attribute within new pivot tuple,
* there is no need to add an explicit heap TID attribute
@@ -2161,6 +2213,8 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
* attribute to the new pivot tuple.
*/
Assert(natts != nkeyatts);
+ Assert(!BTreeTupleIsPosting(lastleft) &&
+ !BTreeTupleIsPosting(firstright));
newsize = IndexTupleSize(pivot) + MAXALIGN(sizeof(ItemPointerData));
tidpivot = palloc0(newsize);
memcpy(tidpivot, pivot, IndexTupleSize(pivot));
@@ -2168,6 +2222,24 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
pfree(pivot);
pivot = tidpivot;
}
+ else if (BTreeTupleIsPosting(firstright))
+ {
+ /*
+ * No truncation was possible, since key attributes are all equal. We
+ * can always truncate away a posting list, though.
+ *
+ * It's necessary to add a heap TID attribute to the new pivot tuple.
+ */
+ newsize = MAXALIGN(BTreeTupleGetPostingOffset(firstright)) +
+ MAXALIGN(sizeof(ItemPointerData));
+ pivot = palloc0(newsize);
+ memcpy(pivot, firstright, BTreeTupleGetPostingOffset(firstright));
+
+ pivot->t_info &= ~INDEX_SIZE_MASK;
+ pivot->t_info |= newsize;
+ BTreeTupleClearBtIsPosting(pivot);
+ BTreeTupleSetAltHeapTID(pivot);
+ }
else
{
/*
@@ -2175,7 +2247,8 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
* It's necessary to add a heap TID attribute to the new pivot tuple.
*/
Assert(natts == nkeyatts);
- newsize = IndexTupleSize(firstright) + MAXALIGN(sizeof(ItemPointerData));
+ newsize = MAXALIGN(IndexTupleSize(firstright)) +
+ MAXALIGN(sizeof(ItemPointerData));
pivot = palloc0(newsize);
memcpy(pivot, firstright, IndexTupleSize(firstright));
}
@@ -2193,6 +2266,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
* nbtree (e.g., there is no pg_attribute entry).
*/
Assert(itup_key->heapkeyspace);
+ Assert(!BTreeTupleIsPosting(pivot));
pivot->t_info &= ~INDEX_SIZE_MASK;
pivot->t_info |= newsize;
@@ -2205,7 +2279,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
*/
pivotheaptid = (ItemPointer) ((char *) pivot + newsize -
sizeof(ItemPointerData));
- ItemPointerCopy(&lastleft->t_tid, pivotheaptid);
+ ItemPointerCopy(BTreeTupleGetMaxTID(lastleft), pivotheaptid);
/*
* Lehman and Yao require that the downlink to the right page, which is to
@@ -2216,9 +2290,12 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
* tiebreaker.
*/
#ifndef DEBUG_NO_TRUNCATE
- Assert(ItemPointerCompare(&lastleft->t_tid, &firstright->t_tid) < 0);
- Assert(ItemPointerCompare(pivotheaptid, &lastleft->t_tid) >= 0);
- Assert(ItemPointerCompare(pivotheaptid, &firstright->t_tid) < 0);
+ Assert(ItemPointerCompare(BTreeTupleGetMaxTID(lastleft),
+ BTreeTupleGetHeapTID(firstright)) < 0);
+ Assert(ItemPointerCompare(pivotheaptid,
+ BTreeTupleGetHeapTID(lastleft)) >= 0);
+ Assert(ItemPointerCompare(pivotheaptid,
+ BTreeTupleGetHeapTID(firstright)) < 0);
#else
/*
@@ -2231,7 +2308,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
* attribute values along with lastleft's heap TID value when lastleft's
* TID happens to be greater than firstright's TID.
*/
- ItemPointerCopy(&firstright->t_tid, pivotheaptid);
+ ItemPointerCopy(BTreeTupleGetHeapTID(firstright), pivotheaptid);
/*
* Pivot heap TID should never be fully equal to firstright. Note that
@@ -2240,7 +2317,8 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
*/
ItemPointerSetOffsetNumber(pivotheaptid,
OffsetNumberPrev(ItemPointerGetOffsetNumber(pivotheaptid)));
- Assert(ItemPointerCompare(pivotheaptid, &firstright->t_tid) < 0);
+ Assert(ItemPointerCompare(pivotheaptid,
+ BTreeTupleGetHeapTID(firstright)) < 0);
#endif
BTreeTupleSetNAtts(pivot, nkeyatts);
@@ -2321,15 +2399,25 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
* The approach taken here usually provides the same answer as _bt_keep_natts
* will (for the same pair of tuples from a heapkeyspace index), since the
* majority of btree opclasses can never indicate that two datums are equal
- * unless they're bitwise equal (once detoasted). Similarly, result may
- * differ from the _bt_keep_natts result when either tuple has TOASTed datums,
- * though this is barely possible in practice.
+ * unless they're bitwise equal after detoasting.
*
* These issues must be acceptable to callers, typically because they're only
* concerned about making suffix truncation as effective as possible without
* leaving excessive amounts of free space on either side of page split.
* Callers can rely on the fact that attributes considered equal here are
* definitely also equal according to _bt_keep_natts.
+ *
+ * When an index only uses opclasses where equality is "precise", this
+ * function is guaranteed to give the same result as _bt_keep_natts(). This
+ * makes it safe to use this function to determine whether or not two tuples
+ * can be folded together into a single posting tuple. Posting list
+ * deduplication cannot be used with nondeterministic collations for this
+ * reason.
+ *
+ * FIXME: Actually invent the needed "equality-is-precise" opclass
+ * infrastructure. See dedicated -hackers thread:
+ *
+ * https://postgr.es/m/CAH2-Wzn3Ee49Gmxb7V1VJ3-AC8fWn-Fr8pfWQebHe8rYRxt5OQ@mail.gmail.com
*/
int
_bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
@@ -2354,8 +2442,38 @@ _bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
if (isNull1 != isNull2)
break;
+ /*
+ * XXX: The ideal outcome from the point of view of the posting list
+ * patch is that the definition of an opclass with "precise equality"
+ * becomes: "equality operator function must give exactly the same
+ * answer as datum_image_eq() would, provided that we aren't using a
+ * nondeterministic collation". (Nondeterministic collations are
+ * clearly not compatible with deduplication.)
+ *
+ * This will be a lot faster than actually using the authoritative
+ * insertion scankey in some cases. This approach also seems more
+ * elegant, since suffix truncation gets to follow exactly the same
+ * definition of "equal" as posting list deduplication -- there is a
+ * subtle interplay between deduplication and suffix truncation, and
+ * it would be nice to know for sure that they have exactly the same
+ * idea about what equality is.
+ *
+ * This ideal outcome still avoids problems with TOAST. We cannot
+ * repeat bugs like the amcheck bug that was fixed in bugfix commit
+ * eba775345d23d2c999bbb412ae658b6dab36e3e8. datum_image_eq()
+ * considers binary equality, though only _after_ each datum is
+ * decompressed.
+ *
+ * If this ideal solution isn't possible, then we can fall back on
+ * defining "precise equality" as: "type's output function must
+ * produce identical textual output for any two datums that compare
+ * equal when using a safe/equality-is-precise operator class (unless
+ * using a nondeterministic collation)". That would mean that we'd
+ * have to make deduplication call _bt_keep_natts() instead (or some
+ * other function that uses authoritative insertion scankey).
+ */
if (!isNull1 &&
- !datumIsEqual(datum1, datum2, att->attbyval, att->attlen))
+ !datum_image_eq(datum1, datum2, att->attbyval, att->attlen))
break;
keepnatts++;
@@ -2407,22 +2525,30 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
tupnatts = BTreeTupleGetNAtts(itup, rel);
+ /* !heapkeyspace indexes do not support deduplication */
+ if (!heapkeyspace && BTreeTupleIsPosting(itup))
+ return false;
+
+ /* INCLUDE indexes do not support deduplication */
+ if (natts != nkeyatts && BTreeTupleIsPosting(itup))
+ return false;
+
if (P_ISLEAF(opaque))
{
if (offnum >= P_FIRSTDATAKEY(opaque))
{
/*
- * Non-pivot tuples currently never use alternative heap TID
- * representation -- even those within heapkeyspace indexes
+ * Non-pivot tuple should never be explicitly marked as a pivot
+ * tuple
*/
- if ((itup->t_info & INDEX_ALT_TID_MASK) != 0)
+ if (BTreeTupleIsPivot(itup))
return false;
/*
* Leaf tuples that are not the page high key (non-pivot tuples)
* should never be truncated. (Note that tupnatts must have been
- * inferred, rather than coming from an explicit on-disk
- * representation.)
+ * inferred, even with a posting list tuple, because only pivot
+ * tuples store tupnatts directly.)
*/
return tupnatts == natts;
}
@@ -2466,12 +2592,12 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
* non-zero, or when there is no explicit representation and the
* tuple is evidently not a pre-pg_upgrade tuple.
*
- * Prior to v11, downlinks always had P_HIKEY as their offset. Use
- * that to decide if the tuple is a pre-v11 tuple.
+ * Prior to v11, downlinks always had P_HIKEY as their offset.
+ * Accept that as an alternative indication of a valid
+ * !heapkeyspace negative infinity tuple.
*/
return tupnatts == 0 ||
- ((itup->t_info & INDEX_ALT_TID_MASK) == 0 &&
- ItemPointerGetOffsetNumber(&(itup->t_tid)) == P_HIKEY);
+ ItemPointerGetOffsetNumber(&(itup->t_tid)) == P_HIKEY;
}
else
{
@@ -2497,7 +2623,11 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
* heapkeyspace index pivot tuples, regardless of whether or not there are
* non-key attributes.
*/
- if ((itup->t_info & INDEX_ALT_TID_MASK) == 0)
+ if (!BTreeTupleIsPivot(itup))
+ return false;
+
+ /* Pivot tuple should not use posting list representation (redundant) */
+ if (BTreeTupleIsPosting(itup))
return false;
/*
@@ -2567,11 +2697,87 @@ _bt_check_third_page(Relation rel, Relation heap, bool needheaptidspace,
BTMaxItemSizeNoHeapTid(page),
RelationGetRelationName(rel)),
errdetail("Index row references tuple (%u,%u) in relation \"%s\".",
- ItemPointerGetBlockNumber(&newtup->t_tid),
- ItemPointerGetOffsetNumber(&newtup->t_tid),
+ ItemPointerGetBlockNumber(BTreeTupleGetHeapTID(newtup)),
+ ItemPointerGetOffsetNumber(BTreeTupleGetHeapTID(newtup)),
RelationGetRelationName(heap)),
errhint("Values larger than 1/3 of a buffer page cannot be indexed.\n"
"Consider a function index of an MD5 hash of the value, "
"or use full text indexing."),
errtableconstraint(heap, RelationGetRelationName(rel))));
}
+
+/*
+ * Given a basic tuple that contains the key datum and a posting list,
+ * build a posting tuple.
+ *
+ * The basic tuple can itself be a posting tuple, but we only use its key
+ * part; all ItemPointers must be passed via ipd.
+ *
+ * If nipd == 1, fall back to building a non-posting tuple.
+ * This is necessary to avoid storage overhead after a posting tuple was vacuumed.
+ */
+IndexTuple
+BTreeFormPostingTuple(IndexTuple tuple, ItemPointerData *ipd, int nipd)
+{
+ uint32 keysize,
+ newsize = 0;
+ IndexTuple itup;
+
+ /* We only need key part of the tuple */
+ if (BTreeTupleIsPosting(tuple))
+ keysize = BTreeTupleGetPostingOffset(tuple);
+ else
+ keysize = IndexTupleSize(tuple);
+
+ Assert(nipd > 0);
+
+ /* Add space needed for posting list */
+ if (nipd > 1)
+ newsize = SHORTALIGN(keysize) + sizeof(ItemPointerData) * nipd;
+ else
+ newsize = keysize;
+
+ newsize = MAXALIGN(newsize);
+ itup = palloc0(newsize);
+ memcpy(itup, tuple, keysize);
+ itup->t_info &= ~INDEX_SIZE_MASK;
+ itup->t_info |= newsize;
+
+ if (nipd > 1)
+ {
+ /* Form posting tuple, fill posting fields */
+
+ /* Set meta info about the posting list */
+ itup->t_info |= INDEX_ALT_TID_MASK;
+ BTreeSetPostingMeta(itup, nipd, SHORTALIGN(keysize));
+
+ /* sort the list to preserve TID order invariant */
+ qsort((void *) ipd, nipd, sizeof(ItemPointerData),
+ (int (*) (const void *, const void *)) ItemPointerCompare);
+
+ /* Copy posting list into the posting tuple */
+ memcpy(BTreeTupleGetPosting(itup), ipd,
+ sizeof(ItemPointerData) * nipd);
+ }
+ else
+ {
+ /* To finish building of a non-posting tuple, copy TID from ipd */
+ itup->t_info &= ~INDEX_ALT_TID_MASK;
+ ItemPointerCopy(ipd, &itup->t_tid);
+ }
+
+ return itup;
+}
+
+/*
+ * Opposite of BTreeFormPostingTuple.
+ * Returns a regular (non-posting) tuple that contains the key; its TID is
+ * the nth TID of the original tuple's posting list.
+ * The result tuple is palloc'd in the caller's context.
+ */
+IndexTuple
+BTreeGetNthTupleOfPosting(IndexTuple tuple, int n)
+{
+ Assert(BTreeTupleIsPosting(tuple));
+ return BTreeFormPostingTuple(tuple, BTreeTupleGetPostingN(tuple, n), 1);
+}
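Aside for readers of the hunk above (not part of the patch): the size computation in BTreeFormPostingTuple can be sketched in isolation. This is a minimal sketch, assuming 8-byte maximum alignment and a 6-byte ItemPointerData; the DEMO_* macros and demo_posting_tuple_size() are hypothetical names that only mirror SHORTALIGN, MAXALIGN and the patch's arithmetic.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for PostgreSQL's SHORTALIGN/MAXALIGN macros. */
#define DEMO_SHORTALIGN(len) (((size_t) (len) + 1) & ~(size_t) 1)
#define DEMO_MAXALIGN(len)   (((size_t) (len) + 7) & ~(size_t) 7)	/* assumes 8-byte alignment */

/* sizeof(ItemPointerData): 4-byte block number plus 2-byte offset number. */
#define DEMO_TID_SIZE 6

/*
 * Mirror of the sizing logic in BTreeFormPostingTuple: with more than one
 * TID, the posting list starts at the SHORTALIGN'd end of the key part;
 * with exactly one TID, the tuple keeps its plain (non-posting) size.
 */
static size_t
demo_posting_tuple_size(size_t keysize, int nipd)
{
	size_t		newsize;

	if (nipd > 1)
		newsize = DEMO_SHORTALIGN(keysize) + DEMO_TID_SIZE * (size_t) nipd;
	else
		newsize = keysize;

	return DEMO_MAXALIGN(newsize);
}
```

For a 16-byte key part, three TIDs need 16 + 18 = 34 bytes, which rounds up to 40, while a single TID leaves the tuple at 16 bytes.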
diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index dd5315c1aa..d4d7c09ff0 100644
--- a/src/backend/access/nbtree/nbtxlog.c
+++ b/src/backend/access/nbtree/nbtxlog.c
@@ -178,12 +178,34 @@ btree_xlog_insert(bool isleaf, bool ismeta, XLogReaderState *record)
{
Size datalen;
char *datapos = XLogRecGetBlockData(record, 0, &datalen);
+ IndexTuple nposting = NULL;
page = BufferGetPage(buffer);
- if (PageAddItem(page, (Item) datapos, datalen, xlrec->offnum,
- false, false) == InvalidOffsetNumber)
- elog(PANIC, "btree_xlog_insert: failed to add item");
+ if (xlrec->postingsz > 0)
+ {
+ IndexTuple oposting;
+
+ Assert(isleaf);
+
+ /* oposting must be at offset before new item */
+ oposting = (IndexTuple) PageGetItem(page,
+ PageGetItemId(page, OffsetNumberPrev(xlrec->offnum)));
+ if (PageAddItem(page, (Item) datapos, xlrec->postingsz,
+ xlrec->offnum, false, false) == InvalidOffsetNumber)
+ elog(PANIC, "btree_xlog_insert: failed to add item");
+ nposting = (IndexTuple) (datapos + xlrec->postingsz);
+
+ Assert(MAXALIGN(IndexTupleSize(oposting)) ==
+ MAXALIGN(IndexTupleSize(nposting)));
+ memcpy(oposting, nposting, MAXALIGN(IndexTupleSize(nposting)));
+ }
+ else
+ {
+ if (PageAddItem(page, (Item) datapos, datalen, xlrec->offnum,
+ false, false) == InvalidOffsetNumber)
+ elog(PANIC, "btree_xlog_insert: failed to add item");
+ }
PageSetLSN(page, lsn);
MarkBufferDirty(buffer);
@@ -265,9 +287,11 @@ btree_xlog_split(bool onleft, XLogReaderState *record)
BTPageOpaque lopaque = (BTPageOpaque) PageGetSpecialPointer(lpage);
OffsetNumber off;
IndexTuple newitem = NULL,
- left_hikey = NULL;
+ left_hikey = NULL,
+ nposting = NULL;
Size newitemsz = 0,
- left_hikeysz = 0;
+ left_hikeysz = 0,
+ npostingsz = 0;
Page newlpage;
OffsetNumber leftoff;
@@ -281,6 +305,17 @@ btree_xlog_split(bool onleft, XLogReaderState *record)
datalen -= newitemsz;
}
+ if (xlrec->replacepostingoff)
+ {
+ Assert(xlrec->replacepostingoff ==
+ OffsetNumberPrev(xlrec->newitemoff));
+
+ nposting = (IndexTuple) datapos;
+ npostingsz = MAXALIGN(IndexTupleSize(nposting));
+ datapos += npostingsz;
+ datalen -= npostingsz;
+ }
+
/* Extract left hikey and its size (assuming 16-bit alignment) */
left_hikey = (IndexTuple) datapos;
left_hikeysz = MAXALIGN(IndexTupleSize(left_hikey));
@@ -304,6 +339,15 @@ btree_xlog_split(bool onleft, XLogReaderState *record)
Size itemsz;
IndexTuple item;
+ if (off == xlrec->replacepostingoff)
+ {
+ if (PageAddItem(newlpage, (Item) nposting, npostingsz,
+ leftoff, false, false) == InvalidOffsetNumber)
+ elog(ERROR, "failed to add new item to left page after split");
+ leftoff = OffsetNumberNext(leftoff);
+ continue;
+ }
+
/* add the new item if it was inserted on left page */
if (onleft && off == xlrec->newitemoff)
{
@@ -386,8 +430,8 @@ btree_xlog_vacuum(XLogReaderState *record)
Buffer buffer;
Page page;
BTPageOpaque opaque;
-#ifdef UNUSED
xl_btree_vacuum *xlrec = (xl_btree_vacuum *) XLogRecGetData(record);
+#ifdef UNUSED
/*
* This section of code is thought to be no longer needed, after analysis
@@ -478,14 +522,34 @@ btree_xlog_vacuum(XLogReaderState *record)
if (len > 0)
{
- OffsetNumber *unused;
- OffsetNumber *unend;
+ if (xlrec->nremaining)
+ {
+ OffsetNumber *remainingoffset;
+ IndexTuple remaining;
+ Size itemsz;
- unused = (OffsetNumber *) ptr;
- unend = (OffsetNumber *) ((char *) ptr + len);
+ remainingoffset = (OffsetNumber *)
+ (ptr + xlrec->ndeleted * sizeof(OffsetNumber));
+ remaining = (IndexTuple) ((char *) remainingoffset +
+ xlrec->nremaining * sizeof(OffsetNumber));
- if ((unend - unused) > 0)
- PageIndexMultiDelete(page, unused, unend - unused);
+ /* Handle posting tuples */
+ for (int i = 0; i < xlrec->nremaining; i++)
+ {
+ PageIndexTupleDelete(page, remainingoffset[i]);
+
+ itemsz = MAXALIGN(IndexTupleSize(remaining));
+
+ if (PageAddItem(page, (Item) remaining, itemsz, remainingoffset[i],
+ false, false) == InvalidOffsetNumber)
+ elog(PANIC, "btree_xlog_vacuum: failed to add remaining item");
+
+ remaining = (IndexTuple) ((char *) remaining + itemsz);
+ }
+ }
+
+ if (xlrec->ndeleted)
+ PageIndexMultiDelete(page, (OffsetNumber *) ptr, xlrec->ndeleted);
}
/*
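Aside for readers of the btree_xlog_vacuum hunk above (not part of the patch): the redo code locates three consecutive regions in the block data, namely the deleted offsets, then the remaining offsets, then the remaining tuples. The sketch below only reproduces that pointer arithmetic; DemoOffsetNumber and both helper functions are hypothetical names, and OffsetNumber is assumed to be a 16-bit integer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint16_t DemoOffsetNumber;	/* stand-in for OffsetNumber */

/* Byte position of the remaining-offsets array: right after the deleted offsets. */
static size_t
demo_remaining_offsets_start(uint32_t ndeleted)
{
	return (size_t) ndeleted * sizeof(DemoOffsetNumber);
}

/* Byte position of the first remaining tuple: right after both offset arrays. */
static size_t
demo_remaining_tuples_start(uint32_t ndeleted, uint32_t nremaining)
{
	return demo_remaining_offsets_start(ndeleted) +
		(size_t) nremaining * sizeof(DemoOffsetNumber);
}
```

With ndeleted = 3 and nremaining = 2, the remaining offsets begin at byte 6 and the first remaining tuple at byte 10 of the block data.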
diff --git a/src/backend/access/rmgrdesc/nbtdesc.c b/src/backend/access/rmgrdesc/nbtdesc.c
index a14eb792ec..6f71b13199 100644
--- a/src/backend/access/rmgrdesc/nbtdesc.c
+++ b/src/backend/access/rmgrdesc/nbtdesc.c
@@ -30,7 +30,8 @@ btree_desc(StringInfo buf, XLogReaderState *record)
{
xl_btree_insert *xlrec = (xl_btree_insert *) rec;
- appendStringInfo(buf, "off %u", xlrec->offnum);
+ appendStringInfo(buf, "off %u; postingsz %u",
+ xlrec->offnum, xlrec->postingsz);
break;
}
case XLOG_BTREE_SPLIT_L:
@@ -38,6 +39,7 @@ btree_desc(StringInfo buf, XLogReaderState *record)
{
xl_btree_split *xlrec = (xl_btree_split *) rec;
+ /* FIXME: even master doesn't have newitemoff */
appendStringInfo(buf, "level %u, firstright %d",
xlrec->level, xlrec->firstright);
break;
@@ -46,8 +48,10 @@ btree_desc(StringInfo buf, XLogReaderState *record)
{
xl_btree_vacuum *xlrec = (xl_btree_vacuum *) rec;
- appendStringInfo(buf, "lastBlockVacuumed %u",
- xlrec->lastBlockVacuumed);
+ appendStringInfo(buf, "lastBlockVacuumed %u; nremaining %u; ndeleted %u",
+ xlrec->lastBlockVacuumed,
+ xlrec->nremaining,
+ xlrec->ndeleted);
break;
}
case XLOG_BTREE_DELETE:
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 52eafe6b00..3aa09744e0 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -234,8 +234,7 @@ typedef struct BTMetaPageData
* t_tid | t_info | key values | INCLUDE columns, if any
*
* t_tid points to the heap TID, which is a tiebreaker key column as of
- * BTREE_VERSION 4. Currently, the INDEX_ALT_TID_MASK status bit is never
- * set for non-pivot tuples.
+ * BTREE_VERSION 4.
*
* All other types of index tuples ("pivot" tuples) only have key columns,
* since pivot tuples only exist to represent how the key space is
@@ -252,6 +251,38 @@ typedef struct BTMetaPageData
* omitted rather than truncated, since its representation is different to
* the non-pivot representation.)
*
+ * Non-pivot posting tuple format:
+ * t_tid | t_info | key values | INCLUDE columns, if any | posting_list[]
+ *
+ * To store duplicate keys more efficiently, we use a special tuple format:
+ * posting tuples. posting_list is an array of ItemPointerData.
+ *
+ * Deduplication never applies to unique indexes or indexes with INCLUDEd
+ * columns.
+ *
+ * To distinguish posting tuples we use the INDEX_ALT_TID_MASK flag in t_info
+ * and the BT_IS_POSTING flag in t_tid.
+ * These flags redefine the content of the posting tuple's tid:
+ * - t_tid.ip_blkid contains the offset of the posting list.
+ * - t_tid offset field contains the number of posting items this tuple contains
+ *
+ * The 12 least significant offset bits from t_tid are used to represent
+ * the number of posting items in posting tuples, leaving 4 status
+ * bits (BT_RESERVED_OFFSET_MASK bits), 3 of which are reserved for
+ * future use.
+ * BT_N_POSTING_OFFSET_MASK is large enough to store any number of posting
+ * items, which is constrained by BTMaxItemSize.
+ *
+ * If a page contains so many duplicates that they do not fit into one posting
+ * tuple (bounded by BTMaxItemSize and ), the page may contain several posting
+ * tuples with the same key.
+ * A page can also contain both posting and non-posting tuples with the same key.
+ * Currently, posting tuples always contain at least two TIDs in the posting
+ * list.
+ *
+ * Posting tuples always have the same number of attributes as the index has
+ * generally.
+ *
* Pivot tuple format:
*
* t_tid | t_info | key values | [heap TID]
@@ -281,23 +312,118 @@ typedef struct BTMetaPageData
* bits (BT_RESERVED_OFFSET_MASK bits), 3 of which that are reserved for
* future use. BT_N_KEYS_OFFSET_MASK should be large enough to store any
* number of columns/attributes <= INDEX_MAX_KEYS.
+ * BT_IS_POSTING bit must be unset for pivot tuples, since we use it
+ * to distinguish posting tuples from pivot tuples.
*
* Note well: The macros that deal with the number of attributes in tuples
- * assume that a tuple with INDEX_ALT_TID_MASK set must be a pivot tuple,
- * and that a tuple without INDEX_ALT_TID_MASK set must be a non-pivot
- * tuple (or must have the same number of attributes as the index has
- * generally in the case of !heapkeyspace indexes). They will need to be
- * updated if non-pivot tuples ever get taught to use INDEX_ALT_TID_MASK
- * for something else.
+ * assume that a tuple with INDEX_ALT_TID_MASK set must be a pivot tuple or
+ * non-pivot posting tuple, and that a tuple without INDEX_ALT_TID_MASK set
+ * must be a non-pivot tuple (or must have the same number of attributes as
+ * the index has generally in the case of !heapkeyspace indexes).
*/
#define INDEX_ALT_TID_MASK INDEX_AM_RESERVED_BIT
/* Item pointer offset bits */
#define BT_RESERVED_OFFSET_MASK 0xF000
#define BT_N_KEYS_OFFSET_MASK 0x0FFF
+#define BT_N_POSTING_OFFSET_MASK 0x0FFF
#define BT_HEAP_TID_ATTR 0x1000
+#define BT_IS_POSTING 0x2000
-/* Get/set downlink block number */
+/*
+ * MaxPostingIndexTuplesPerPage is an upper bound on the number of tuples
+ * that can fit on one btree leaf page.
+ *
+ * Btree leaf pages may contain posting tuples, which store duplicates
+ * more efficiently, so MaxPostingIndexTuplesPerPage is larger than
+ * MaxIndexTuplesPerPage.
+ *
+ * Each leaf page must contain at least three items, so estimate it as
+ * if we have three posting tuples with minimal size keys.
+ */
+#define MaxPostingIndexTuplesPerPage \
+ ((int) ((BLCKSZ - SizeOfPageHeaderData - \
+ 3*((MAXALIGN(sizeof(IndexTupleData) + 1) + sizeof(ItemIdData))) )) / \
+ (sizeof(ItemPointerData)))
+
+/*
+ * Btree-private state needed to build posting tuples.
+ * ipd is a posting list - an array of ItemPointerData.
+ *
+ * While iterating over tuples during index build, or while applying
+ * deduplication to a single page, we remember a tuple in itupprev, then
+ * compare the next one with it. If the tuples are equal, we save their TIDs
+ * in the posting list. ntuples contains the size of the posting list.
+ *
+ * Use maxitemsize and maxpostingsize to ensure that resulting posting tuple
+ * will satisfy BTMaxItemSize.
+ */
+typedef struct BTDedupState
+{
+ Size maxitemsize;
+ Size maxpostingsize;
+ IndexTuple itupprev;
+ int ntuples;
+ ItemPointerData *ipd;
+} BTDedupState;
+
+/*
+ * N.B.: BTreeTupleIsPivot() should only be used in code that deals with
+ * heapkeyspace indexes specifically. BTreeTupleIsPosting() works with all
+ * nbtree indexes, though.
+ */
+#define BTreeTupleIsPivot(itup) \
+ ( \
+ ((itup)->t_info & INDEX_ALT_TID_MASK && \
+ ((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) == 0))\
+ )
+#define BTreeTupleIsPosting(itup) \
+ ( \
+ ((itup)->t_info & INDEX_ALT_TID_MASK && \
+ ((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) != 0))\
+ )
+
+#define BTreeTupleClearBtIsPosting(itup) \
+ do { \
+ ItemPointerSetOffsetNumber(&(itup)->t_tid, \
+ ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & ~BT_IS_POSTING); \
+ } while(0)
+
+#define BTreeTupleGetNPosting(itup) \
+ ( \
+ AssertMacro(BTreeTupleIsPosting(itup)), \
+ ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_N_POSTING_OFFSET_MASK \
+ )
+#define BTreeTupleSetNPosting(itup, n) \
+ do { \
+ ItemPointerSetOffsetNumber(&(itup)->t_tid, (n) & BT_N_POSTING_OFFSET_MASK); \
+ Assert((itup)->t_info & INDEX_ALT_TID_MASK); \
+ Assert(!((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) != 0)); \
+ ItemPointerSetOffsetNumber(&(itup)->t_tid, \
+ ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) | BT_IS_POSTING); \
+ } while(0)
+
+/*
+ * If tuple is posting, t_tid.ip_blkid contains offset of the posting list
+ */
+#define BTreeTupleGetPostingOffset(itup) \
+ ( \
+ AssertMacro(BTreeTupleIsPosting(itup)), \
+ ItemPointerGetBlockNumberNoCheck(&((itup)->t_tid)) \
+ )
+#define BTreeSetPostingMeta(itup, nposting, off) \
+ do { \
+ BTreeTupleSetNPosting(itup, nposting); \
+ Assert(BTreeTupleIsPosting(itup)); \
+ ItemPointerSetBlockNumber(&((itup)->t_tid), (off)); \
+ } while(0)
+
+#define BTreeTupleGetPosting(itup) \
+ (ItemPointer) ((char*) (itup) + BTreeTupleGetPostingOffset(itup))
+#define BTreeTupleGetPostingN(itup,n) \
+ (BTreeTupleGetPosting(itup) + (n))
+
+/* Get/set downlink block number */
#define BTreeInnerTupleGetDownLink(itup) \
ItemPointerGetBlockNumberNoCheck(&((itup)->t_tid))
#define BTreeInnerTupleSetDownLink(itup, blkno) \
@@ -326,40 +452,73 @@ typedef struct BTMetaPageData
*/
#define BTreeTupleGetNAtts(itup, rel) \
( \
- (itup)->t_info & INDEX_ALT_TID_MASK ? \
+ ((itup)->t_info & INDEX_ALT_TID_MASK && \
+ ((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) == 0)) ? \
( \
ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_N_KEYS_OFFSET_MASK \
) \
: \
IndexRelationGetNumberOfAttributes(rel) \
)
-#define BTreeTupleSetNAtts(itup, n) \
- do { \
- (itup)->t_info |= INDEX_ALT_TID_MASK; \
- ItemPointerSetOffsetNumber(&(itup)->t_tid, (n) & BT_N_KEYS_OFFSET_MASK); \
- } while(0)
+
+static inline void
+BTreeTupleSetNAtts(IndexTuple itup, int n)
+{
+ Assert(!BTreeTupleIsPosting(itup));
+ itup->t_info |= INDEX_ALT_TID_MASK;
+ ItemPointerSetOffsetNumber(&itup->t_tid, n & BT_N_KEYS_OFFSET_MASK);
+}
/*
- * Get tiebreaker heap TID attribute, if any. Macro works with both pivot
- * and non-pivot tuples, despite differences in how heap TID is represented.
+ * Get tiebreaker heap TID attribute, if any. Works with both pivot and
+ * non-pivot tuples, despite differences in how heap TID is represented.
+ *
+ * This returns the first/lowest heap TID in the case of a posting list tuple.
*/
-#define BTreeTupleGetHeapTID(itup) \
- ( \
- (itup)->t_info & INDEX_ALT_TID_MASK && \
- (ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_HEAP_TID_ATTR) != 0 ? \
- ( \
- (ItemPointer) (((char *) (itup) + IndexTupleSize(itup)) - \
- sizeof(ItemPointerData)) \
- ) \
- : (itup)->t_info & INDEX_ALT_TID_MASK ? NULL : (ItemPointer) &((itup)->t_tid) \
- )
+static inline ItemPointer
+BTreeTupleGetHeapTID(IndexTuple itup)
+{
+ if (BTreeTupleIsPivot(itup))
+ {
+ /* Pivot tuple heap TID representation? */
+ if ((ItemPointerGetOffsetNumberNoCheck(&itup->t_tid) &
+ BT_HEAP_TID_ATTR) != 0)
+ return (ItemPointer) ((char *) itup + IndexTupleSize(itup) -
+ sizeof(ItemPointerData));
+
+ /* Heap TID attribute was truncated */
+ return NULL;
+ }
+ else if (BTreeTupleIsPosting(itup))
+ return BTreeTupleGetPosting(itup);
+
+ return &(itup->t_tid);
+}
+
+/*
+ * Get maximum heap TID attribute, which could be the only TID in the case of
+ * a non-pivot tuple that does not have a posting list. Works with
+ * non-pivot tuples only.
+ */
+static inline ItemPointer
+BTreeTupleGetMaxTID(IndexTuple itup)
+{
+ Assert(!BTreeTupleIsPivot(itup));
+
+ if (BTreeTupleIsPosting(itup))
+ return (ItemPointer) (BTreeTupleGetPosting(itup) +
+ (BTreeTupleGetNPosting(itup) - 1));
+
+ return &(itup->t_tid);
+}
+
/*
* Set the heap TID attribute for a tuple that uses the INDEX_ALT_TID_MASK
- * representation (currently limited to pivot tuples)
+ * representation
*/
#define BTreeTupleSetAltHeapTID(itup) \
do { \
- Assert((itup)->t_info & INDEX_ALT_TID_MASK); \
+ Assert(BTreeTupleIsPivot(itup)); \
ItemPointerSetOffsetNumber(&(itup)->t_tid, \
ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) | BT_HEAP_TID_ATTR); \
} while(0)
@@ -499,6 +658,13 @@ typedef struct BTInsertStateData
/* Buffer containing leaf page we're likely to insert itup on */
Buffer buf;
+ /*
+ * if _bt_binsrch_insert() found the location inside existing posting
+ * list, save the position inside the list. This will be -1 in rare cases
+ * where the overlapping posting list is LP_DEAD.
+ */
+ int in_posting_offset;
+
/*
* Cache of bounds within the current buffer. Only used for insertions
* where _bt_check_unique is called. See _bt_binsrch_insert and
@@ -534,7 +700,9 @@ typedef BTInsertStateData *BTInsertState;
* If we are doing an index-only scan, we save the entire IndexTuple for each
* matched item, otherwise only its heap TID and offset. The IndexTuples go
* into a separate workspace array; each BTScanPosItem stores its tuple's
- * offset within that array.
+ * offset within that array. Posting list tuples store a version of the
+ * tuple that does not include the posting list, allowing the same key to be
+ * returned for each logical tuple associated with the posting list.
*/
typedef struct BTScanPosItem /* what we remember about each match */
@@ -563,9 +731,13 @@ typedef struct BTScanPosData
/*
* If we are doing an index-only scan, nextTupleOffset is the first free
- * location in the associated tuple storage workspace.
+ * location in the associated tuple storage workspace. Posting list
+ * tuples need postingTupleOffset to store the current location of the
+ * tuple that is returned multiple times (once per heap TID in posting
+ * list).
*/
int nextTupleOffset;
+ int postingTupleOffset;
/*
* The items array is always ordered in index order (ie, increasing
@@ -578,7 +750,7 @@ typedef struct BTScanPosData
int lastItem; /* last valid index in items[] */
int itemIndex; /* current index in items[] */
- BTScanPosItem items[MaxIndexTuplesPerPage]; /* MUST BE LAST */
+ BTScanPosItem items[MaxPostingIndexTuplesPerPage]; /* MUST BE LAST */
} BTScanPosData;
typedef BTScanPosData *BTScanPos;
@@ -762,6 +934,8 @@ extern void _bt_delitems_delete(Relation rel, Buffer buf,
OffsetNumber *itemnos, int nitems, Relation heapRel);
extern void _bt_delitems_vacuum(Relation rel, Buffer buf,
OffsetNumber *itemnos, int nitems,
+ OffsetNumber *remainingoffset,
+ IndexTuple *remaining, int nremaining,
BlockNumber lastBlockVacuumed);
extern int _bt_pagedel(Relation rel, Buffer buf);
@@ -812,6 +986,9 @@ extern bool _bt_check_natts(Relation rel, bool heapkeyspace, Page page,
OffsetNumber offnum);
extern void _bt_check_third_page(Relation rel, Relation heap,
bool needheaptidspace, Page page, IndexTuple newtup);
+extern IndexTuple BTreeFormPostingTuple(IndexTuple tuple, ItemPointerData *ipd,
+ int nipd);
+extern IndexTuple BTreeGetNthTupleOfPosting(IndexTuple tuple, int n);
/*
* prototypes for functions in nbtvalidate.c
@@ -824,5 +1001,6 @@ extern bool btvalidate(Oid opclassoid);
extern IndexBuildResult *btbuild(Relation heap, Relation index,
struct IndexInfo *indexInfo);
extern void _bt_parallel_build_main(dsm_segment *seg, shm_toc *toc);
+extern void _bt_stash_item_tid(BTDedupState *dedupState, IndexTuple itup);
#endif /* NBTREE_H */
diff --git a/src/include/access/nbtxlog.h b/src/include/access/nbtxlog.h
index afa614da25..daa931377f 100644
--- a/src/include/access/nbtxlog.h
+++ b/src/include/access/nbtxlog.h
@@ -61,16 +61,26 @@ typedef struct xl_btree_metadata
* This data record is used for INSERT_LEAF, INSERT_UPPER, INSERT_META.
* Note that INSERT_META implies it's not a leaf page.
*
- * Backup Blk 0: original page (data contains the inserted tuple)
+ * Backup Blk 0: original page (data contains the inserted tuple);
+ * if postingsz is not 0, data also contains 'nposting' -
+ * tuple to replace original.
+ *
+ * TODO probably it would be enough to keep just a flag to point
+ * out that data contains 'nposting' and compute its offset as
+ * we know it follows the tuple, but I am afraid that it will
+ * break alignment, will it?
+ *
* Backup Blk 1: child's left sibling, if INSERT_UPPER or INSERT_META
* Backup Blk 2: xl_btree_metadata, if INSERT_META
+ *
*/
typedef struct xl_btree_insert
{
OffsetNumber offnum;
+ uint32 postingsz;
} xl_btree_insert;
-#define SizeOfBtreeInsert (offsetof(xl_btree_insert, offnum) + sizeof(OffsetNumber))
+#define SizeOfBtreeInsert (offsetof(xl_btree_insert, postingsz) + sizeof(uint32))
/*
* On insert with split, we save all the items going into the right sibling
@@ -96,6 +106,12 @@ typedef struct xl_btree_insert
* An IndexTuple representing the high key of the left page must follow with
* either variant.
*
+ * If the split included an insertion into the middle of a posting tuple, and
+ * thus required posting tuple replacement, the record also contains
+ * 'nposting', which must replace the original posting tuple at offset
+ * replacepostingoff.
+ * TODO: a further optimization is to add it to xlog only if it remains on the
+ * left page.
+ *
* Backup Blk 1: new right page
*
* The right page's data portion contains the right page's tuples in the form
@@ -113,9 +129,10 @@ typedef struct xl_btree_split
uint32 level; /* tree level of page being split */
OffsetNumber firstright; /* first item moved to right page */
OffsetNumber newitemoff; /* new item's offset (if placed on left page) */
+ OffsetNumber replacepostingoff; /* offset of the posting item to replace */
} xl_btree_split;
-#define SizeOfBtreeSplit (offsetof(xl_btree_split, newitemoff) + sizeof(OffsetNumber))
+#define SizeOfBtreeSplit (offsetof(xl_btree_split, replacepostingoff) + sizeof(OffsetNumber))
/*
* This is what we need to know about delete of individual leaf index tuples.
@@ -173,10 +190,19 @@ typedef struct xl_btree_vacuum
{
BlockNumber lastBlockVacuumed;
- /* TARGET OFFSET NUMBERS FOLLOW */
+ /*
+ * These counts let redo find the remaining posting tuples, which follow
+ * the two arrays of offset numbers.
+ */
+ uint32 nremaining;
+ uint32 ndeleted;
+
+ /* TARGET (DELETED) OFFSET NUMBERS FOLLOW (ndeleted values) */
+ /* REMAINING OFFSET NUMBERS FOLLOW (nremaining values) */
+ /* REMAINING TUPLES TO INSERT FOLLOW (if nremaining > 0) */
} xl_btree_vacuum;
-#define SizeOfBtreeVacuum (offsetof(xl_btree_vacuum, lastBlockVacuumed) + sizeof(BlockNumber))
+#define SizeOfBtreeVacuum (offsetof(xl_btree_vacuum, ndeleted) + sizeof(uint32))
/*
* This is what we need to know about marking an empty branch for deletion.
diff --git a/src/tools/valgrind.supp b/src/tools/valgrind.supp
index ec47a228ae..71a03e39d3 100644
--- a/src/tools/valgrind.supp
+++ b/src/tools/valgrind.supp
@@ -212,3 +212,24 @@
Memcheck:Cond
fun:PyObject_Realloc
}
+
+# Temporarily work around bug in datum_image_eq's handling of the cstring
+# (typLen == -2) case. datumIsEqual() is not affected, but also doesn't handle
+# TOAST'ed values correctly.
+#
+# FIXME: Remove both suppressions when bug is fixed on master branch
+{
+ temporary_workaround_1
+ Memcheck:Addr1
+ fun:bcmp
+ fun:datum_image_eq
+ fun:_bt_keep_natts_fast
+}
+
+{
+ temporary_workaround_8
+ Memcheck:Addr8
+ fun:bcmp
+ fun:datum_image_eq
+ fun:_bt_keep_natts_fast
+}
--
2.17.1
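
Aside for readers of the nbtree.h changes above (not part of the patch): the way posting tuples reuse t_tid can be sketched with a small stand-alone model. This is a minimal sketch under stated assumptions: the three mask values are copied from the patch, while DemoTuple and the demo_* functions are hypothetical stand-ins for IndexTupleData and the BTreeTuple* macros.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Mask values as defined in the patch's nbtree.h hunk. */
#define BT_N_POSTING_OFFSET_MASK	0x0FFF
#define BT_HEAP_TID_ATTR			0x1000
#define BT_IS_POSTING				0x2000

/* Hypothetical model of the fields a posting tuple repurposes. */
typedef struct DemoTuple
{
	bool		alt_tid;	/* models INDEX_ALT_TID_MASK in t_info */
	uint16_t	offset;		/* models the 16-bit offset field of t_tid */
	uint32_t	blkno;		/* models the block number field of t_tid */
} DemoTuple;

/* Like BTreeSetPostingMeta: store the TID count and the posting list offset. */
static void
demo_set_posting_meta(DemoTuple *t, unsigned nposting, uint32_t posting_off)
{
	t->alt_tid = true;
	t->offset = (uint16_t) ((nposting & BT_N_POSTING_OFFSET_MASK) | BT_IS_POSTING);
	t->blkno = posting_off;
}

/* Like BTreeTupleIsPosting: alt-TID bit set and BT_IS_POSTING set. */
static bool
demo_is_posting(const DemoTuple *t)
{
	return t->alt_tid && (t->offset & BT_IS_POSTING) != 0;
}

/* Like BTreeTupleIsPivot: alt-TID bit set and BT_IS_POSTING clear. */
static bool
demo_is_pivot(const DemoTuple *t)
{
	return t->alt_tid && (t->offset & BT_IS_POSTING) == 0;
}

/* Like BTreeTupleGetNPosting: the low 12 bits hold the number of TIDs. */
static unsigned
demo_nposting(const DemoTuple *t)
{
	return t->offset & BT_N_POSTING_OFFSET_MASK;
}
```

After demo_set_posting_meta(&t, 5, 16), the model reports a posting tuple with five TIDs whose posting list starts 16 bytes into the tuple; a tuple with the alt-TID bit set but BT_IS_POSTING clear is classified as a pivot tuple instead.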
Attachment: v11-0002-DEBUG-Add-pageinspect-instrumentation.patch (application/octet-stream)
From 3e6bd467c0a784962af6c1b00ac5563765901a6d Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Mon, 10 Sep 2018 19:53:51 -0700
Subject: [PATCH v11 2/2] DEBUG: Add pageinspect instrumentation.
Have pageinspect display user-visible attribute values, heap TID, max
heap TID, and the number of TIDs in a tuple (can be > 1 in the case of
posting list tuples). Also adds a column that shows whether or not the
LP_DEAD bit has been set.
This patch is not proposed for inclusion in PostgreSQL; it's included
for the convenience of reviewers.
The following query can be used with this hacked pageinspect, which
visualizes the internal pages:
"""
with recursive index_details as (
select
'my_test_index'::text idx
),
size_in_pages_index as (
select
(pg_relation_size(idx::regclass) / (2^13))::int4 size_pages
from
index_details
),
page_stats as (
select
index_details.*,
stats.*
from
index_details,
size_in_pages_index,
lateral (select i from generate_series(1, size_pages - 1) i) series,
lateral (select * from bt_page_stats(idx, i)) stats),
internal_page_stats as (
select
*
from
page_stats
where
type != 'l'),
meta_stats as (
select
*
from
index_details s,
lateral (select * from bt_metap(s.idx)) meta),
internal_items as (
select
*
from
internal_page_stats
order by
btpo desc),
-- XXX: Note ordering dependency within this CTE, on internal_items
ordered_internal_items(item, blk, level) as (
select
1,
blkno,
btpo
from
internal_items
where
btpo_prev = 0
and btpo = (select level from meta_stats)
union
select
case when level = btpo then o.item + 1 else 1 end,
blkno,
btpo
from
internal_items i,
ordered_internal_items o
where
i.btpo_prev = o.blk or (btpo_prev = 0 and btpo = o.level - 1)
)
select
--idx,
btpo as level,
item as l_item,
blkno,
--btpo_prev,
--btpo_next,
btpo_flags,
type,
live_items,
dead_items,
avg_item_size,
page_size,
free_size,
-- Only non-rightmost pages have high key. Show heap TID for both pivot and non-pivot tuples here.
case when btpo_next != 0 then (select data || coalesce(', (htid)=(''' || htid || ''')', '')
from bt_page_items(idx, blkno) where itemoffset = 1) end as highkey
from
ordered_internal_items o
join internal_items i on o.blk = i.blkno
order by btpo desc, item;
"""
---
contrib/pageinspect/btreefuncs.c | 91 ++++++++++++++++---
contrib/pageinspect/expected/btree.out | 6 +-
contrib/pageinspect/pageinspect--1.6--1.7.sql | 25 +++++
3 files changed, 108 insertions(+), 14 deletions(-)
diff --git a/contrib/pageinspect/btreefuncs.c b/contrib/pageinspect/btreefuncs.c
index 8d27c9b0f6..b3ea978117 100644
--- a/contrib/pageinspect/btreefuncs.c
+++ b/contrib/pageinspect/btreefuncs.c
@@ -29,6 +29,7 @@
#include "pageinspect.h"
+#include "access/genam.h"
#include "access/nbtree.h"
#include "access/relation.h"
#include "catalog/namespace.h"
@@ -243,6 +244,7 @@ bt_page_stats(PG_FUNCTION_ARGS)
*/
struct user_args
{
+ Relation rel;
Page page;
OffsetNumber offset;
};
@@ -254,9 +256,9 @@ struct user_args
* ------------------------------------------------------
*/
static Datum
-bt_page_print_tuples(FuncCallContext *fctx, Page page, OffsetNumber offset)
+bt_page_print_tuples(FuncCallContext *fctx, Page page, OffsetNumber offset, Relation rel)
{
- char *values[6];
+ char *values[10];
HeapTuple tuple;
ItemId id;
IndexTuple itup;
@@ -265,6 +267,7 @@ bt_page_print_tuples(FuncCallContext *fctx, Page page, OffsetNumber offset)
int dlen;
char *dump;
char *ptr;
+ ItemPointer min_htid, max_htid;
id = PageGetItemId(page, offset);
@@ -283,16 +286,77 @@ bt_page_print_tuples(FuncCallContext *fctx, Page page, OffsetNumber offset)
values[j++] = psprintf("%c", IndexTupleHasVarwidths(itup) ? 't' : 'f');
ptr = (char *) itup + IndexInfoFindDataOffset(itup->t_info);
- dlen = IndexTupleSize(itup) - IndexInfoFindDataOffset(itup->t_info);
- dump = palloc0(dlen * 3 + 1);
- values[j] = dump;
- for (off = 0; off < dlen; off++)
+ if (rel)
{
- if (off > 0)
- *dump++ = ' ';
- sprintf(dump, "%02x", *(ptr + off) & 0xff);
- dump += 2;
+ TupleDesc itupdesc = RelationGetDescr(rel);
+ Datum datvalues[INDEX_MAX_KEYS];
+ bool isnull[INDEX_MAX_KEYS];
+ int natts;
+ int indnkeyatts = rel->rd_index->indnkeyatts;
+
+ natts = BTreeTupleGetNAtts(itup, rel);
+
+ itupdesc->natts = Min(indnkeyatts, natts);
+ memset(&isnull, 0xFF, sizeof(isnull));
+ index_deform_tuple(itup, itupdesc, datvalues, isnull);
+ rel->rd_index->indnkeyatts = natts;
+ values[j++] = BuildIndexValueDescription(rel, datvalues, isnull);
+ itupdesc->natts = IndexRelationGetNumberOfAttributes(rel);
+ rel->rd_index->indnkeyatts = indnkeyatts;
}
+ else
+ {
+ dlen = IndexTupleSize(itup) - IndexInfoFindDataOffset(itup->t_info);
+ dump = palloc0(dlen * 3 + 1);
+ values[j++] = dump;
+ for (off = 0; off < dlen; off++)
+ {
+ if (off > 0)
+ *dump++ = ' ';
+ sprintf(dump, "%02x", *(ptr + off) & 0xff);
+ dump += 2;
+ }
+ }
+
+ if (rel && !_bt_heapkeyspace(rel))
+ {
+ min_htid = NULL;
+ max_htid = NULL;
+ }
+ else
+ {
+ min_htid = BTreeTupleGetHeapTID(itup);
+ if (BTreeTupleIsPosting(itup))
+ max_htid = BTreeTupleGetMaxTID(itup);
+ else
+ max_htid = NULL;
+ }
+
+ if (min_htid)
+ values[j++] = psprintf("(%u,%u)",
+ ItemPointerGetBlockNumberNoCheck(min_htid),
+ ItemPointerGetOffsetNumberNoCheck(min_htid));
+ else
+ values[j++] = NULL;
+
+ if (max_htid)
+ values[j++] = psprintf("(%u,%u)",
+ ItemPointerGetBlockNumberNoCheck(max_htid),
+ ItemPointerGetOffsetNumberNoCheck(max_htid));
+ else
+ values[j++] = NULL;
+
+ if (min_htid == NULL)
+ values[j++] = psprintf("0");
+ else if (!BTreeTupleIsPosting(itup))
+ values[j++] = psprintf("1");
+ else
+ values[j++] = psprintf("%d", (int) BTreeTupleGetNPosting(itup));
+
+ if (!ItemIdIsDead(id))
+ values[j++] = psprintf("f");
+ else
+ values[j++] = psprintf("t");
tuple = BuildTupleFromCStrings(fctx->attinmeta, values);
@@ -366,11 +430,11 @@ bt_page_items(PG_FUNCTION_ARGS)
uargs = palloc(sizeof(struct user_args));
+ uargs->rel = rel;
uargs->page = palloc(BLCKSZ);
memcpy(uargs->page, BufferGetPage(buffer), BLCKSZ);
UnlockReleaseBuffer(buffer);
- relation_close(rel, AccessShareLock);
uargs->offset = FirstOffsetNumber;
@@ -397,12 +461,13 @@ bt_page_items(PG_FUNCTION_ARGS)
if (fctx->call_cntr < fctx->max_calls)
{
- result = bt_page_print_tuples(fctx, uargs->page, uargs->offset);
+ result = bt_page_print_tuples(fctx, uargs->page, uargs->offset, uargs->rel);
uargs->offset++;
SRF_RETURN_NEXT(fctx, result);
}
else
{
+ relation_close(uargs->rel, AccessShareLock);
pfree(uargs->page);
pfree(uargs);
SRF_RETURN_DONE(fctx);
@@ -482,7 +547,7 @@ bt_page_items_bytea(PG_FUNCTION_ARGS)
if (fctx->call_cntr < fctx->max_calls)
{
- result = bt_page_print_tuples(fctx, uargs->page, uargs->offset);
+ result = bt_page_print_tuples(fctx, uargs->page, uargs->offset, NULL);
uargs->offset++;
SRF_RETURN_NEXT(fctx, result);
}
diff --git a/contrib/pageinspect/expected/btree.out b/contrib/pageinspect/expected/btree.out
index 07c2dcd771..0f6dccaadc 100644
--- a/contrib/pageinspect/expected/btree.out
+++ b/contrib/pageinspect/expected/btree.out
@@ -40,7 +40,11 @@ ctid | (0,1)
itemlen | 16
nulls | f
vars | f
-data | 01 00 00 00 00 00 00 01
+data | (a)=(72057594037927937)
+htid | (0,1)
+max_htid |
+nheap_tids | 1
+isdead | f
SELECT * FROM bt_page_items('test1_a_idx', 2);
ERROR: block number out of range
diff --git a/contrib/pageinspect/pageinspect--1.6--1.7.sql b/contrib/pageinspect/pageinspect--1.6--1.7.sql
index 2433a21af2..00473da938 100644
--- a/contrib/pageinspect/pageinspect--1.6--1.7.sql
+++ b/contrib/pageinspect/pageinspect--1.6--1.7.sql
@@ -24,3 +24,28 @@ CREATE FUNCTION bt_metap(IN relname text,
OUT last_cleanup_num_tuples real)
AS 'MODULE_PATHNAME', 'bt_metap'
LANGUAGE C STRICT PARALLEL SAFE;
+
+--
+-- bt_page_items()
+--
+DROP FUNCTION bt_page_items(IN relname text, IN blkno int4,
+ OUT itemoffset smallint,
+ OUT ctid tid,
+ OUT itemlen smallint,
+ OUT nulls bool,
+ OUT vars bool,
+ OUT data text);
+CREATE FUNCTION bt_page_items(IN relname text, IN blkno int4,
+ OUT itemoffset smallint,
+ OUT ctid tid,
+ OUT itemlen smallint,
+ OUT nulls bool,
+ OUT vars bool,
+ OUT data text,
+ OUT htid tid,
+ OUT max_htid tid,
+ OUT nheap_tids int4,
+ OUT isdead boolean)
+RETURNS SETOF record
+AS 'MODULE_PATHNAME', 'bt_page_items'
+LANGUAGE C STRICT PARALLEL SAFE;
--
2.17.1
09.09.2019 22:54, Peter Geoghegan wrote:
Attached is v11, which makes the kill_prior_tuple optimization work
with posting list tuples. The only catch is that it can only work when
all "logical tuples" within a posting list are known-dead, since of
course there is only one LP_DEAD bit available for each posting list.
The hardest part of this kill_prior_tuple work was writing the new
_bt_killitems() code, which I'm still not 100% happy with. Still, it
seems to work well -- new pageinspect LP_DEAD status info was added to
the second patch to verify that we're setting LP_DEAD bits as needed
for posting list tuples. I also had to add a new nbtree-specific,
posting-list-aware version of index_compute_xid_horizon_for_tuples()
-- _bt_compute_xid_horizon_for_tuples(). Finally, it was necessary to
avoid splitting a posting list with the LP_DEAD bit set. I took a
naive approach to avoiding that problem, adding code to
_bt_findinsertloc() to prevent it. Posting list splits are generally
assumed to be rare, so the fact that this is slightly inefficient
should be fine IMV.
I also refactored deduplication itself in anticipation of making the
WAL logging more efficient, and incremental. So, the structure of the
code within _bt_dedup_one_page() was simplified, without really
changing it very much (I think). I also fixed a bug in
_bt_dedup_one_page(). The check for dead items was broken in previous
versions, because the loop examined the high key tuple in every
iteration.
Making _bt_dedup_one_page() more efficient and incremental is still
the most important open item for the patch.
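The all-or-nothing LP_DEAD rule described above can be sketched in a few lines of C. This is only an illustration on a simplified in-memory model, not the code from the patch: `PostingSketch`, `posting_can_set_lp_dead()` and the `known_dead` flags are hypothetical names, and the real logic lives in _bt_killitems().

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a posting list tuple: one flag per heap TID */
typedef struct PostingSketch
{
    size_t      ntids;       /* number of "logical tuples" in the list */
    const bool *known_dead;  /* known_dead[i] is true if TID i is known dead */
} PostingSketch;

/*
 * A posting list has a single LP_DEAD bit, so the bit may be set only
 * when every logical tuple in the list is known dead.
 */
static bool
posting_can_set_lp_dead(const PostingSketch *p)
{
    if (p->ntids == 0)
        return false;           /* nothing to kill */

    for (size_t i = 0; i < p->ntids; i++)
    {
        if (!p->known_dead[i])
            return false;       /* one live TID keeps the whole list alive */
    }
    return true;
}
```

A list with even one TID not known dead keeps its LP_DEAD bit clear, which is why the optimization helps less for large posting lists.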
Hi, thank you for the fixes and improvements.
I reviewed them and everything looks good except the idea of not
splitting dead posting tuples.
According to the comments on scan->ignore_killed_tuples in genam.c:107,
it may lead to incorrect tuple order on a replica.
I'm not sure whether it leads to any real problem, though, or whether it
will be resolved by subsequent visibility checks. Anyway, it's worth adding
more comments in _bt_killitems() explaining why it's safe.
Attached is v12, which contains WAL optimizations for posting split and
page deduplication. Changes from the prior version:
* xl_btree_split record doesn't contain the posting tuple anymore; instead
it keeps the 'in_posting offset' and repeats the logic of _bt_insertonpg(),
as you proposed upthread.
* I introduced a new xlog record, XLOG_BTREE_DEDUP_PAGE, which contains
info about groups of tuples deduplicated into posting tuples. In principle,
it is possible to fit it into some existing record, but I preferred to keep
things clear.
I haven't measured how these changes affect WAL size yet.
Do you have any suggestions on how to automate testing of new WAL records?
Is there any suitable place in the regression tests?
* I also noticed that _bt_dedup_one_page() can be optimized to return early
when no tuples were deduplicated. I wonder if we can introduce an internal
statistic to tune deduplication? That would be returning to the idea of
BT_COMPRESS_THRESHOLD, which can help to avoid extra work for pages that
have very few duplicates or pages that are already full of posting lists.
To be honest, I don't believe that incremental deduplication can really
improve anything, because no matter how many items were compressed we still
rewrite all items from the original page to the new one, so why not do our
best? What do we save by this incremental approach?
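To make the "groups of tuples" and the early-return idea from the list above concrete, here is a minimal sketch in C. It assumes a simplified model in which a leaf page is just a sorted array of integer keys. `DedupGroupSketch` and `find_dedup_groups()` are hypothetical names used for illustration only; they are not the XLOG_BTREE_DEDUP_PAGE record layout or the _bt_dedup_one_page() code from the patch.

```c
#include <stddef.h>

/* Hypothetical description of one run of equal keys that would be merged */
typedef struct DedupGroupSketch
{
    size_t first;   /* index of the first item in the run */
    size_t nitems;  /* number of items in the run, always >= 2 */
} DedupGroupSketch;

/*
 * Scan keys[0..nkeys), which must be sorted, and record every run of two
 * or more equal keys in groups[]. The caller must provide room for
 * nkeys / 2 entries. Returns the number of groups found; a result of 0
 * means there is nothing to deduplicate, so the caller could return early
 * without rewriting the page.
 */
static size_t
find_dedup_groups(const int *keys, size_t nkeys, DedupGroupSketch *groups)
{
    size_t ngroups = 0;
    size_t start = 0;

    while (start < nkeys)
    {
        size_t end = start + 1;

        /* Extend the run while the next key equals the run's first key */
        while (end < nkeys && keys[end] == keys[start])
            end++;

        if (end - start >= 2)
        {
            groups[ngroups].first = start;
            groups[ngroups].nitems = end - start;
            ngroups++;
        }
        start = end;
    }
    return ngroups;
}
```

In this model, a threshold in the spirit of BT_COMPRESS_THRESHOLD could be expressed as a check on the group count or on the total number of items covered by groups before doing any page rewrite.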
--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachments:
v12-0001-Add-deduplication-to-nbtree.patch (text/x-patch)
diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index 05e7d67..399743d 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -924,6 +924,7 @@ bt_target_page_check(BtreeCheckState *state)
size_t tupsize;
BTScanInsert skey;
bool lowersizelimit;
+ ItemPointer scantid;
CHECK_FOR_INTERRUPTS();
@@ -994,29 +995,73 @@ bt_target_page_check(BtreeCheckState *state)
/*
* Readonly callers may optionally verify that non-pivot tuples can
- * each be found by an independent search that starts from the root
+ * each be found by an independent search that starts from the root.
+ * Note that we deliberately don't do individual searches for each
+ * "logical" posting list tuple, since the posting list itself is
+ * validated by other checks.
*/
if (state->rootdescend && P_ISLEAF(topaque) &&
!bt_rootdescend(state, itup))
{
char *itid,
*htid;
+ ItemPointer tid = BTreeTupleGetHeapTID(itup);
itid = psprintf("(%u,%u)", state->targetblock, offset);
htid = psprintf("(%u,%u)",
- ItemPointerGetBlockNumber(&(itup->t_tid)),
- ItemPointerGetOffsetNumber(&(itup->t_tid)));
+ ItemPointerGetBlockNumber(tid),
+ ItemPointerGetOffsetNumber(tid));
ereport(ERROR,
(errcode(ERRCODE_INDEX_CORRUPTED),
errmsg("could not find tuple using search from root page in index \"%s\"",
RelationGetRelationName(state->rel)),
- errdetail_internal("Index tid=%s points to heap tid=%s page lsn=%X/%X.",
+ errdetail_internal("Index tid=%s min heap tid=%s page lsn=%X/%X.",
itid, htid,
(uint32) (state->targetlsn >> 32),
(uint32) state->targetlsn)));
}
+ /*
+ * If tuple is actually a posting list, make sure posting list TIDs
+ * are in order.
+ */
+ if (BTreeTupleIsPosting(itup))
+ {
+ ItemPointerData last;
+ ItemPointer current;
+
+ ItemPointerCopy(BTreeTupleGetHeapTID(itup), &last);
+
+ for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
+ {
+
+ current = BTreeTupleGetPostingN(itup, i);
+
+ if (ItemPointerCompare(current, &last) <= 0)
+ {
+ char *itid,
+ *htid;
+
+ itid = psprintf("(%u,%u)", state->targetblock, offset);
+ htid = psprintf("(%u,%u)",
+ ItemPointerGetBlockNumberNoCheck(current),
+ ItemPointerGetOffsetNumberNoCheck(current));
+
+ ereport(ERROR,
+ (errcode(ERRCODE_INDEX_CORRUPTED),
+ errmsg("posting list heap TIDs out of order in index \"%s\"",
+ RelationGetRelationName(state->rel)),
+ errdetail_internal("Index tid=%s min heap tid=%s page lsn=%X/%X.",
+ itid, htid,
+ (uint32) (state->targetlsn >> 32),
+ (uint32) state->targetlsn)));
+ }
+
+ ItemPointerCopy(current, &last);
+ }
+ }
+
/* Build insertion scankey for current page offset */
skey = bt_mkscankey_pivotsearch(state->rel, itup);
@@ -1074,12 +1119,33 @@ bt_target_page_check(BtreeCheckState *state)
{
IndexTuple norm;
- norm = bt_normalize_tuple(state, itup);
- bloom_add_element(state->filter, (unsigned char *) norm,
- IndexTupleSize(norm));
- /* Be tidy */
- if (norm != itup)
- pfree(norm);
+ if (BTreeTupleIsPosting(itup))
+ {
+ IndexTuple onetup;
+
+ /* Fingerprint all elements of posting tuple one by one */
+ for (int i = 0; i < BTreeTupleGetNPosting(itup); i++)
+ {
+ onetup = BTreeGetNthTupleOfPosting(itup, i);
+
+ norm = bt_normalize_tuple(state, onetup);
+ bloom_add_element(state->filter, (unsigned char *) norm,
+ IndexTupleSize(norm));
+ /* Be tidy */
+ if (norm != onetup)
+ pfree(norm);
+ pfree(onetup);
+ }
+ }
+ else
+ {
+ norm = bt_normalize_tuple(state, itup);
+ bloom_add_element(state->filter, (unsigned char *) norm,
+ IndexTupleSize(norm));
+ /* Be tidy */
+ if (norm != itup)
+ pfree(norm);
+ }
}
/*
@@ -1087,7 +1153,8 @@ bt_target_page_check(BtreeCheckState *state)
*
* If there is a high key (if this is not the rightmost page on its
* entire level), check that high key actually is upper bound on all
- * page items.
+ * page items. If this is a posting list tuple, we'll need to set
+ * scantid to be highest TID in posting list.
*
* We prefer to check all items against high key rather than checking
* just the last and trusting that the operator class obeys the
@@ -1127,6 +1194,9 @@ bt_target_page_check(BtreeCheckState *state)
* tuple. (See also: "Notes About Data Representation" in the nbtree
* README.)
*/
+ scantid = skey->scantid;
+ if (state->heapkeyspace && !BTreeTupleIsPivot(itup))
+ skey->scantid = BTreeTupleGetMaxTID(itup);
if (!P_RIGHTMOST(topaque) &&
!(P_ISLEAF(topaque) ? invariant_leq_offset(state, skey, P_HIKEY) :
invariant_l_offset(state, skey, P_HIKEY)))
@@ -1150,6 +1220,7 @@ bt_target_page_check(BtreeCheckState *state)
(uint32) (state->targetlsn >> 32),
(uint32) state->targetlsn)));
}
+ skey->scantid = scantid;
/*
* * Item order check *
@@ -1164,11 +1235,13 @@ bt_target_page_check(BtreeCheckState *state)
*htid,
*nitid,
*nhtid;
+ ItemPointer tid;
itid = psprintf("(%u,%u)", state->targetblock, offset);
+ tid = BTreeTupleGetHeapTID(itup);
htid = psprintf("(%u,%u)",
- ItemPointerGetBlockNumberNoCheck(&(itup->t_tid)),
- ItemPointerGetOffsetNumberNoCheck(&(itup->t_tid)));
+ ItemPointerGetBlockNumberNoCheck(tid),
+ ItemPointerGetOffsetNumberNoCheck(tid));
nitid = psprintf("(%u,%u)", state->targetblock,
OffsetNumberNext(offset));
@@ -1177,9 +1250,11 @@ bt_target_page_check(BtreeCheckState *state)
state->target,
OffsetNumberNext(offset));
itup = (IndexTuple) PageGetItem(state->target, itemid);
+
+ tid = BTreeTupleGetHeapTID(itup);
nhtid = psprintf("(%u,%u)",
- ItemPointerGetBlockNumberNoCheck(&(itup->t_tid)),
- ItemPointerGetOffsetNumberNoCheck(&(itup->t_tid)));
+ ItemPointerGetBlockNumberNoCheck(tid),
+ ItemPointerGetOffsetNumberNoCheck(tid));
ereport(ERROR,
(errcode(ERRCODE_INDEX_CORRUPTED),
@@ -1189,10 +1264,10 @@ bt_target_page_check(BtreeCheckState *state)
"higher index tid=%s (points to %s tid=%s) "
"page lsn=%X/%X.",
itid,
- P_ISLEAF(topaque) ? "heap" : "index",
+ P_ISLEAF(topaque) ? "min heap" : "index",
htid,
nitid,
- P_ISLEAF(topaque) ? "heap" : "index",
+ P_ISLEAF(topaque) ? "min heap" : "index",
nhtid,
(uint32) (state->targetlsn >> 32),
(uint32) state->targetlsn)));
@@ -1953,10 +2028,10 @@ bt_tuple_present_callback(Relation index, HeapTuple htup, Datum *values,
* verification. In particular, it won't try to normalize opclass-equal
* datums with potentially distinct representations (e.g., btree/numeric_ops
* index datums will not get their display scale normalized-away here).
- * Normalization may need to be expanded to handle more cases in the future,
- * though. For example, it's possible that non-pivot tuples could in the
- * future have alternative logically equivalent representations due to using
- * the INDEX_ALT_TID_MASK bit to implement intelligent deduplication.
+ * Caller does normalization for non-pivot tuples that have a posting list,
+ * since dummy CREATE INDEX callback code generates new tuples with the same
+ * normalized representation. Deduplication is performed opportunistically,
+ * and in general there is no guarantee about how or when it will be applied.
*/
static IndexTuple
bt_normalize_tuple(BtreeCheckState *state, IndexTuple itup)
@@ -2087,6 +2162,7 @@ bt_rootdescend(BtreeCheckState *state, IndexTuple itup)
insertstate.itup = itup;
insertstate.itemsz = MAXALIGN(IndexTupleSize(itup));
insertstate.itup_key = key;
+ insertstate.in_posting_offset = 0;
insertstate.bounds_valid = false;
insertstate.buf = lbuf;
@@ -2094,7 +2170,9 @@ bt_rootdescend(BtreeCheckState *state, IndexTuple itup)
offnum = _bt_binsrch_insert(state->rel, &insertstate);
/* Compare first >= matching item on leaf page, if any */
page = BufferGetPage(lbuf);
+ /* Should match on first heap TID when tuple has a posting list */
if (offnum <= PageGetMaxOffsetNumber(page) &&
+ insertstate.in_posting_offset <= 0 &&
_bt_compare(state->rel, key, page, offnum) == 0)
exists = true;
_bt_relbuf(state->rel, lbuf);
@@ -2560,14 +2638,18 @@ static inline ItemPointer
BTreeTupleGetHeapTIDCareful(BtreeCheckState *state, IndexTuple itup,
bool nonpivot)
{
- ItemPointer result = BTreeTupleGetHeapTID(itup);
+ ItemPointer result;
BlockNumber targetblock = state->targetblock;
- if (result == NULL && nonpivot)
+ /* Shouldn't be called with a !heapkeyspace index */
+ Assert(state->heapkeyspace);
+ if (BTreeTupleIsPivot(itup) == nonpivot)
ereport(ERROR,
(errcode(ERRCODE_INDEX_CORRUPTED),
errmsg("block %u or its right sibling block or child block in index \"%s\" contains non-pivot tuple that lacks a heap TID",
targetblock, RelationGetRelationName(state->rel))));
+ result = BTreeTupleGetHeapTID(itup);
+
return result;
}
diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index 6db203e..50ec9ef 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -432,7 +432,10 @@ because we allow LP_DEAD to be set with only a share lock (it's exactly
like a hint bit for a heap tuple), but physically removing tuples requires
exclusive lock. In the current code we try to remove LP_DEAD tuples when
we are otherwise faced with having to split a page to do an insertion (and
-hence have exclusive lock on it already).
+hence have exclusive lock on it already). Deduplication can also prevent
+a page split, but removing LP_DEAD tuples is the preferred approach.
+(Note that posting list tuples can only have their LP_DEAD bit set when
+every "logical" tuple represented within the posting list is known dead.)
This leaves the index in a state where it has no entry for a dead tuple
that still exists in the heap. This is not a problem for the current
@@ -710,6 +713,77 @@ the fallback strategy assumes that duplicates are mostly inserted in
ascending heap TID order. The page is split in a way that leaves the left
half of the page mostly full, and the right half of the page mostly empty.
+Notes about deduplication
+-------------------------
+
+We deduplicate non-pivot tuples in non-unique indexes to reduce storage
+overhead, and to avoid or at least delay page splits. Deduplication alters
+the physical representation of tuples without changing the logical contents
+of the index, and without adding overhead to read queries. Non-pivot
+tuples are folded together into a single physical tuple with a posting list
+(a simple array of heap TIDs with the standard item pointer format).
+Deduplication is always applied lazily, at the point where it would
+otherwise be necessary to perform a page split. It occurs only when
+LP_DEAD items have been removed, as our last line of defense against
+splitting a leaf page. We can set the LP_DEAD bit with posting list
+tuples, though only when all table tuples are known dead. (Bitmap scans
+cannot perform LP_DEAD bit setting, and are the common case with indexes
+that contain lots of duplicates, so this downside is considered
+acceptable.)
+
+Large groups of logical duplicates tend to appear together on the same leaf
+page due to the special duplicate logic used when choosing a split point.
+This facilitates lazy/dynamic deduplication. Deduplication can reliably
+deduplicate a large localized group of duplicates before it can span
+multiple leaf pages. Posting list tuples are subject to the same 1/3 of a
+page restriction as any other tuple.
+
+Lazy deduplication allows the page space accounting used during page splits
+to have absolutely minimal special case logic for posting lists. A posting
+list can be thought of as extra payload that suffix truncation will
+reliably truncate away as needed during page splits, just like non-key
+columns from an INCLUDE index tuple. An incoming tuple (which might cause
+a page split) can always be thought of as a non-posting-list tuple that
+must be inserted alongside existing items, without needing to consider
+deduplication. Most of the time, that's what actually happens: incoming
+tuples are either not duplicates, or are duplicates with a heap TID that
+doesn't overlap with any existing posting list tuple (lazy deduplication
+avoids rewriting posting lists repeatedly when heap TIDs are inserted
+slightly out of order by concurrent inserters). When the incoming tuple
+really does overlap with an existing posting list, a posting list split is
+performed. Posting list splits work in a way that more or less preserves
+the illusion that all incoming tuples do not need to be merged with any
+existing posting list tuple.
+
+Posting list splits work by "overriding" the details of the incoming tuple.
+The heap TID of the incoming tuple is altered to make it match the
+rightmost heap TID from the existing/originally overlapping posting list.
+The offset number that the new/incoming tuple is to be inserted at is
+incremented so that it will be inserted to the right of the existing
+posting list. The insertion (or page split) operation that completes the
+insert does one extra step: an in-place update of the posting list. The
+update changes the posting list such that the "true" heap TID from the
+original incoming tuple is now contained in the posting list. We make
+space in the posting list by removing the heap TID that became the new
+item. The size of the posting list won't change, and so the page split
+space accounting does not need to care about posting lists. Also, overall
+space utilization is improved by keeping existing posting lists large.
+
+The representation of posting lists is identical to the posting lists used
+by GIN, so it would be straightforward to apply GIN's varbyte encoding
+compression scheme to individual posting lists. Posting list compression
+would break the assumptions made by posting list splits about page space
+accounting, though, so it's not clear how compression could be integrated
+with nbtree. Besides, posting list compression does not offer a compelling
+trade-off for nbtree, since in general nbtree is optimized for consistent
+performance with many concurrent readers and writers. A major goal of
+nbtree's lazy approach to deduplication is to limit the performance impact
+of deduplication with random updates. Even concurrent append-only inserts
+of the same key value will tend to have inserts of individual index tuples
+in an order that doesn't quite match heap TID order. In general, delaying
+deduplication avoids many unnecessary posting list splits, and minimizes
+page level fragmentation.
+
Notes About Data Representation
-------------------------------
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index b84bf1c..8fb17d6 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -47,21 +47,26 @@ static void _bt_insertonpg(Relation rel, BTScanInsert itup_key,
BTStack stack,
IndexTuple itup,
OffsetNumber newitemoff,
+ int in_posting_offset,
bool split_only_page);
static Buffer _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf,
Buffer cbuf, OffsetNumber newitemoff, Size newitemsz,
- IndexTuple newitem);
+ IndexTuple newitem, IndexTuple nposting,
+ OffsetNumber in_posting_offset);
static void _bt_insert_parent(Relation rel, Buffer buf, Buffer rbuf,
BTStack stack, bool is_root, bool is_only);
static bool _bt_pgaddtup(Page page, Size itemsize, IndexTuple itup,
OffsetNumber itup_off);
static void _bt_vacuum_one_page(Relation rel, Buffer buffer, Relation heapRel);
+static void _bt_dedup_one_page(Relation rel, Buffer buffer, Relation heapRel,
+ Size itemsz);
/*
* _bt_doinsert() -- Handle insertion of a single index tuple in the tree.
*
* This routine is called by the public interface routine, btinsert.
- * By here, itup is filled in, including the TID.
+ * By here, itup is filled in, including the TID. Caller should be
+ * prepared for us to scribble on 'itup'.
*
* If checkUnique is UNIQUE_CHECK_NO or UNIQUE_CHECK_PARTIAL, this
* will allow duplicates. Otherwise (UNIQUE_CHECK_YES or
@@ -123,6 +128,7 @@ _bt_doinsert(Relation rel, IndexTuple itup,
/* PageAddItem will MAXALIGN(), but be consistent */
insertstate.itemsz = MAXALIGN(IndexTupleSize(itup));
insertstate.itup_key = itup_key;
+ insertstate.in_posting_offset = 0;
insertstate.bounds_valid = false;
insertstate.buf = InvalidBuffer;
@@ -300,7 +306,7 @@ top:
newitemoff = _bt_findinsertloc(rel, &insertstate, checkingunique,
stack, heapRel);
_bt_insertonpg(rel, itup_key, insertstate.buf, InvalidBuffer, stack,
- itup, newitemoff, false);
+ itup, newitemoff, insertstate.in_posting_offset, false);
}
else
{
@@ -435,6 +441,7 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
/* okay, we gotta fetch the heap tuple ... */
curitup = (IndexTuple) PageGetItem(page, curitemid);
+ Assert(!BTreeTupleIsPosting(curitup));
htid = curitup->t_tid;
/*
@@ -689,6 +696,7 @@ _bt_findinsertloc(Relation rel,
BTScanInsert itup_key = insertstate->itup_key;
Page page = BufferGetPage(insertstate->buf);
BTPageOpaque lpageop;
+ OffsetNumber location;
lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -751,13 +759,23 @@ _bt_findinsertloc(Relation rel,
/*
* If the target page is full, see if we can obtain enough space by
- * erasing LP_DEAD items
+ * erasing LP_DEAD items. If that doesn't work out, and if the index
+ * isn't a unique index, try deduplication.
*/
- if (PageGetFreeSpace(page) < insertstate->itemsz &&
- P_HAS_GARBAGE(lpageop))
+ if (PageGetFreeSpace(page) < insertstate->itemsz)
{
- _bt_vacuum_one_page(rel, insertstate->buf, heapRel);
- insertstate->bounds_valid = false;
+ if (P_HAS_GARBAGE(lpageop))
+ {
+ _bt_vacuum_one_page(rel, insertstate->buf, heapRel);
+ insertstate->bounds_valid = false;
+ }
+
+ if (!checkingunique && PageGetFreeSpace(page) < insertstate->itemsz)
+ {
+ _bt_dedup_one_page(rel, insertstate->buf, heapRel,
+ insertstate->itemsz);
+ insertstate->bounds_valid = false; /* paranoia */
+ }
}
}
else
@@ -839,7 +857,31 @@ _bt_findinsertloc(Relation rel,
Assert(P_RIGHTMOST(lpageop) ||
_bt_compare(rel, itup_key, page, P_HIKEY) <= 0);
- return _bt_binsrch_insert(rel, insertstate);
+ location = _bt_binsrch_insert(rel, insertstate);
+
+ /*
+ * Insertion is not prepared for the case where an LP_DEAD posting list
+ * tuple must be split. In the unlikely event that this happens, call
+ * _bt_dedup_one_page() to force it to kill all LP_DEAD items.
+ */
+ if (unlikely(insertstate->in_posting_offset == -1))
+ {
+ _bt_dedup_one_page(rel, insertstate->buf, heapRel, 0);
+ Assert(!P_HAS_GARBAGE(lpageop));
+
+ /* Must reset insertstate ahead of new _bt_binsrch_insert() call */
+ insertstate->bounds_valid = false;
+ insertstate->in_posting_offset = 0;
+ location = _bt_binsrch_insert(rel, insertstate);
+
+ /*
+ * Might still have to split some other posting list now, but that
+ * should never be LP_DEAD
+ */
+ Assert(insertstate->in_posting_offset >= 0);
+ }
+
+ return location;
}
/*
@@ -900,15 +942,65 @@ _bt_stepright(Relation rel, BTInsertState insertstate, BTStack stack)
insertstate->bounds_valid = false;
}
+/*
+ * If the new tuple 'itup' is a duplicate with a heap TID that falls inside
+ * the range of an existing posting list tuple 'oposting', generate new
+ * posting tuple to replace original one and update new tuple so that
+ * its heap TID contains the rightmost heap TID of original posting tuple.
+ */
+IndexTuple
+_bt_form_newposting(IndexTuple itup, IndexTuple oposting,
+ OffsetNumber in_posting_offset)
+{
+ int nipd;
+ char *replacepos;
+ char *rightpos;
+ Size nbytes;
+ IndexTuple nposting;
+
+ Assert(BTreeTupleIsPosting(oposting));
+ nipd = BTreeTupleGetNPosting(oposting);
+ Assert(in_posting_offset < nipd);
+
+ nposting = CopyIndexTuple(oposting);
+ replacepos = (char *) BTreeTupleGetPostingN(nposting, in_posting_offset);
+ rightpos = replacepos + sizeof(ItemPointerData);
+ nbytes = (nipd - in_posting_offset - 1) * sizeof(ItemPointerData);
+
+ /*
+ * Move item pointers in posting list to make a gap for the new item's
+ * heap TID (shift TIDs one place to the right, losing original
+ * rightmost TID).
+ */
+ memmove(rightpos, replacepos, nbytes);
+
+ /*
+ * Fill the gap with the TID of the new item.
+ */
+ ItemPointerCopy(&itup->t_tid, (ItemPointer) replacepos);
+
+ /*
+ * Copy original (not new original) posting list's last TID into new
+ * item
+ */
+ ItemPointerCopy(BTreeTupleGetPostingN(oposting, nipd - 1), &itup->t_tid);
+ Assert(ItemPointerCompare(BTreeTupleGetMaxTID(nposting),
+ BTreeTupleGetHeapTID(itup)) < 0);
+
+ return nposting;
+}
+
/*----------
* _bt_insertonpg() -- Insert a tuple on a particular page in the index.
*
* This recursive procedure does the following things:
*
+ * + if necessary, splits an existing posting list on page.
+ * This is only needed when 'in_posting_offset' is non-zero.
* + if necessary, splits the target page, using 'itup_key' for
* suffix truncation on leaf pages (caller passes NULL for
* non-leaf pages).
- * + inserts the tuple.
+ * + inserts the new tuple (could be from split posting list).
* + if the page was split, pops the parent stack, and finds the
* right place to insert the new child pointer (by walking
* right using information stored in the parent stack).
@@ -918,7 +1010,8 @@ _bt_stepright(Relation rel, BTInsertState insertstate, BTStack stack)
*
* On entry, we must have the correct buffer in which to do the
* insertion, and the buffer must be pinned and write-locked. On return,
- * we will have dropped both the pin and the lock on the buffer.
+ * we will have dropped both the pin and the lock on the buffer. Caller
+ * should be prepared for us to scribble on 'itup'.
*
* This routine only performs retail tuple insertions. 'itup' should
* always be either a non-highkey leaf item, or a downlink (new high
@@ -936,11 +1029,14 @@ _bt_insertonpg(Relation rel,
BTStack stack,
IndexTuple itup,
OffsetNumber newitemoff,
+ int in_posting_offset,
bool split_only_page)
{
Page page;
BTPageOpaque lpageop;
Size itemsz;
+ IndexTuple nposting = NULL;
+ IndexTuple oposting;
page = BufferGetPage(buf);
lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -954,6 +1050,8 @@ _bt_insertonpg(Relation rel,
Assert(P_ISLEAF(lpageop) ||
BTreeTupleGetNAtts(itup, rel) <=
IndexRelationGetNumberOfKeyAttributes(rel));
+ /* retail insertions of posting list tuples are disallowed */
+ Assert(!BTreeTupleIsPosting(itup));
/* The caller should've finished any incomplete splits already. */
if (P_INCOMPLETE_SPLIT(lpageop))
@@ -965,6 +1063,42 @@ _bt_insertonpg(Relation rel,
* need to be consistent */
/*
+ * Do we need to split an existing posting list item?
+ */
+ if (in_posting_offset != 0)
+ {
+ ItemId itemid = PageGetItemId(page, newitemoff);
+
+ /*
+ * The new tuple is a duplicate with a heap TID that falls inside the
+ * range of an existing posting list tuple, so split posting list.
+ *
+ * Posting list splits always replace some existing TID in the posting
+ * list with the new item's heap TID (based on a posting list offset
+ * from caller) by removing rightmost heap TID from posting list. The
+ * new item's heap TID is swapped with that rightmost heap TID, almost
+ * as if the tuple inserted never overlapped with a posting list in
+ * the first place. This allows the insertion and page split code to
+ * have minimal special case handling of posting lists.
+ *
+ * The only extra handling required is to overwrite the original
+ * posting list with nposting, which is guaranteed to be the same size
+ * as the original, keeping the page space accounting simple. This
+ * takes place in either the page insert or page split critical
+ * section.
+ */
+ Assert(P_ISLEAF(lpageop));
+ Assert(!ItemIdIsDead(itemid));
+ Assert(in_posting_offset > 0);
+ oposting = (IndexTuple) PageGetItem(page, itemid);
+
+ nposting = _bt_form_newposting(itup, oposting, in_posting_offset);
+
+ /* Alter new item offset, since effective new item changed */
+ newitemoff = OffsetNumberNext(newitemoff);
+ }
+
+ /*
* Do we need to split the page to fit the item on it?
*
* Note: PageGetFreeSpace() subtracts sizeof(ItemIdData) from its result,
@@ -996,7 +1130,8 @@ _bt_insertonpg(Relation rel,
BlockNumberIsValid(RelationGetTargetBlock(rel))));
/* split the buffer into left and right halves */
- rbuf = _bt_split(rel, itup_key, buf, cbuf, newitemoff, itemsz, itup);
+ rbuf = _bt_split(rel, itup_key, buf, cbuf, newitemoff, itemsz, itup,
+ nposting, in_posting_offset);
PredicateLockPageSplit(rel,
BufferGetBlockNumber(buf),
BufferGetBlockNumber(rbuf));
@@ -1075,6 +1210,18 @@ _bt_insertonpg(Relation rel,
elog(PANIC, "failed to add new item to block %u in index \"%s\"",
itup_blkno, RelationGetRelationName(rel));
+ if (nposting)
+ {
+ /*
+ * Handle a posting list split by performing an in-place update of
+ * the existing posting list
+ */
+ Assert(P_ISLEAF(lpageop));
+ Assert(MAXALIGN(IndexTupleSize(oposting)) ==
+ MAXALIGN(IndexTupleSize(nposting)));
+ memcpy(oposting, nposting, MAXALIGN(IndexTupleSize(nposting)));
+ }
+
MarkBufferDirty(buf);
if (BufferIsValid(metabuf))
@@ -1116,6 +1263,7 @@ _bt_insertonpg(Relation rel,
XLogRecPtr recptr;
xlrec.offnum = itup_off;
+ xlrec.in_posting_offset = in_posting_offset;
XLogBeginInsert();
XLogRegisterData((char *) &xlrec, SizeOfBtreeInsert);
@@ -1153,6 +1301,9 @@ _bt_insertonpg(Relation rel,
XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
XLogRegisterBufData(0, (char *) itup, IndexTupleSize(itup));
+ if (nposting)
+ XLogRegisterBufData(0, (char *) nposting,
+ IndexTupleSize(nposting));
recptr = XLogInsert(RM_BTREE_ID, xlinfo);
@@ -1194,6 +1345,10 @@ _bt_insertonpg(Relation rel,
_bt_getrootheight(rel) >= BTREE_FASTPATH_MIN_LEVEL)
RelationSetTargetBlock(rel, cachedBlock);
}
+
+ /* be tidy */
+ if (nposting)
+ pfree(nposting);
}
/*
@@ -1211,10 +1366,16 @@ _bt_insertonpg(Relation rel,
*
* Returns the new right sibling of buf, pinned and write-locked.
* The pin and lock on buf are maintained.
+ *
+ * nposting is a replacement posting for the posting list at the
+ * offset immediately before the new item's offset. This is needed
+ * when caller performed "posting list split", and corresponds to the
+ * same step for retail insertions that don't split the page.
*/
static Buffer
_bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
- OffsetNumber newitemoff, Size newitemsz, IndexTuple newitem)
+ OffsetNumber newitemoff, Size newitemsz, IndexTuple newitem,
+ IndexTuple nposting, OffsetNumber in_posting_offset)
{
Buffer rbuf;
Page origpage;
@@ -1236,6 +1397,7 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
OffsetNumber firstright;
OffsetNumber maxoff;
OffsetNumber i;
+ OffsetNumber replacepostingoff = InvalidOffsetNumber;
bool newitemonleft,
isleaf;
IndexTuple lefthikey;
@@ -1243,6 +1405,13 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
int indnkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
/*
+ * Determine offset number of posting list that will be updated in place
+ * as part of split that follows a posting list split
+ */
+ if (nposting != NULL)
+ replacepostingoff = OffsetNumberPrev(newitemoff);
+
+ /*
* origpage is the original page to be split. leftpage is a temporary
* buffer that receives the left-sibling data, which will be copied back
* into origpage on success. rightpage is the new page that will receive
@@ -1273,6 +1442,13 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
* newitemoff == firstright. In all other cases it's clear which side of
* the split every tuple goes on from context. newitemonleft is usually
* (but not always) redundant information.
+ *
+ * Note: In theory, the split point choice logic should operate against a
+ * version of the page that already replaced the posting list at offset
+ * replacepostingoff with nposting where applicable. We don't bother with
+ * that, though. Both versions of the posting list must be the same size
+ * and have the same key values, so this omission can't affect the split
+ * point chosen in practice.
*/
firstright = _bt_findsplitloc(rel, origpage, newitemoff, newitemsz,
newitem, &newitemonleft);
@@ -1340,6 +1516,9 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
itemid = PageGetItemId(origpage, firstright);
itemsz = ItemIdGetLength(itemid);
item = (IndexTuple) PageGetItem(origpage, itemid);
+ /* Behave as if origpage posting list has already been swapped */
+ if (firstright == replacepostingoff)
+ item = nposting;
}
/*
@@ -1373,6 +1552,9 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
Assert(lastleftoff >= P_FIRSTDATAKEY(oopaque));
itemid = PageGetItemId(origpage, lastleftoff);
lastleft = (IndexTuple) PageGetItem(origpage, itemid);
+ /* Behave as if origpage posting list has already been swapped */
+ if (lastleftoff == replacepostingoff)
+ lastleft = nposting;
}
Assert(lastleft != item);
@@ -1480,8 +1662,23 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
itemsz = ItemIdGetLength(itemid);
item = (IndexTuple) PageGetItem(origpage, itemid);
+ /*
+ * did caller pass new replacement posting list tuple due to posting
+ * list split?
+ */
+ if (i == replacepostingoff)
+ {
+ /*
+ * swap origpage posting list with post-posting-list-split version
+ * from caller
+ */
+ Assert(isleaf);
+ Assert(itemsz == MAXALIGN(IndexTupleSize(nposting)));
+ item = nposting;
+ }
+
/* does new item belong before this one? */
- if (i == newitemoff)
+ else if (i == newitemoff)
{
if (newitemonleft)
{
@@ -1653,6 +1850,17 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
xlrec.firstright = firstright;
xlrec.newitemoff = newitemoff;
+ /*
+ * If the replacement posting tuple was put on the right page, we
+ * don't need to explicitly WAL log it, because it's included with
+ * all the other items on the right page. Otherwise, save
+ * in_posting_offset and newitem so that replay can reconstruct the
+ * replacement tuple.
+ */
+ xlrec.in_posting_offset = InvalidOffsetNumber;
+ if (replacepostingoff < firstright)
+ xlrec.in_posting_offset = in_posting_offset;
+
XLogBeginInsert();
XLogRegisterData((char *) &xlrec, SizeOfBtreeSplit);
@@ -1672,8 +1880,11 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
* is not stored if XLogInsert decides it needs a full-page image of
* the left page. We store the offset anyway, though, to support
* archive compression of these records.
+ *
+ * Also save newitem when a posting list split took place, since
+ * replay needs it to construct the new posting list.
*/
- if (newitemonleft)
+ if (newitemonleft || xlrec.in_posting_offset)
XLogRegisterBufData(0, (char *) newitem, MAXALIGN(newitemsz));
/* Log the left page's new high key */
@@ -1834,7 +2045,7 @@ _bt_insert_parent(Relation rel,
/* Recursively insert into the parent */
_bt_insertonpg(rel, NULL, pbuf, buf, stack->bts_parent,
- new_item, stack->bts_offset + 1,
+ new_item, stack->bts_offset + 1, 0,
is_only);
/* be tidy */
@@ -2304,6 +2515,277 @@ _bt_vacuum_one_page(Relation rel, Buffer buffer, Relation heapRel)
* Note: if we didn't find any LP_DEAD items, then the page's
* BTP_HAS_GARBAGE hint bit is falsely set. We do not bother expending a
* separate write to clear it, however. We will clear it when we split
- * the page.
+ * the page (or when deduplication runs).
+ */
+}
+
+/*
+ * Try to deduplicate items to free some space. If we don't proceed with
+ * deduplication, buffer will contain old state of the page.
+ *
+ * 'itemsz' is the size of the inserter caller's incoming/new tuple, not
+ * including line pointer overhead. This is the amount of space we'll need to
+ * free in order to let caller avoid splitting the page.
+ *
+ * This function should be called after LP_DEAD items were removed by
+ * _bt_vacuum_one_page() to prevent a page split. (It's possible that we'll
+ * have to kill additional LP_DEAD items, but that should be rare.)
+ */
+static void
+_bt_dedup_one_page(Relation rel, Buffer buffer, Relation heapRel, Size itemsz)
+{
+ OffsetNumber offnum,
+ minoff,
+ maxoff;
+ Page page = BufferGetPage(buffer);
+ Page newpage;
+ BTPageOpaque oopaque,
+ nopaque;
+ bool deduplicate = false;
+ BTDedupState *dedupState = NULL;
+ int natts = IndexRelationGetNumberOfAttributes(rel);
+ OffsetNumber deletable[MaxOffsetNumber];
+ int ndeletable = 0;
+
+ /*
+ * Don't use deduplication for indexes with INCLUDEd columns and unique
+ * indexes
+ */
+ deduplicate = (IndexRelationGetNumberOfKeyAttributes(rel) ==
+ IndexRelationGetNumberOfAttributes(rel) &&
+ !rel->rd_index->indisunique);
+ if (!deduplicate)
+ return;
+
+ oopaque = (BTPageOpaque) PageGetSpecialPointer(page);
+ /* init deduplication state needed to build posting tuples */
+ dedupState = (BTDedupState *) palloc0(sizeof(BTDedupState));
+ dedupState->ipd = NULL;
+ dedupState->ntuples = 0;
+ dedupState->itupprev = NULL;
+ dedupState->maxitemsize = BTMaxItemSize(page);
+ dedupState->maxpostingsize = 0;
+
+ minoff = P_FIRSTDATAKEY(oopaque);
+ maxoff = PageGetMaxOffsetNumber(page);
+
+ /*
+ * Delete dead tuples, if any. We cannot simply skip them in the loop
+ * below, because it's necessary to generate a special WAL record
+ * containing such tuples to compute latestRemovedXid on a standby server
+ * later.
+ *
+ * This should not affect performance, since it only can happen in a rare
+ * situation when BTP_HAS_GARBAGE flag was not set and _bt_vacuum_one_page
+ * was not called, or _bt_vacuum_one_page didn't remove all dead items.
+ */
+ for (offnum = minoff;
+ offnum <= maxoff;
+ offnum = OffsetNumberNext(offnum))
+ {
+ ItemId itemid = PageGetItemId(page, offnum);
+
+ if (ItemIdIsDead(itemid))
+ deletable[ndeletable++] = offnum;
+ }
+
+ if (ndeletable > 0)
+ {
+ /*
+ * Skip deduplication in rare cases where there were LP_DEAD items
+ * encountered here when that frees sufficient space for caller to
+ * avoid a page split
+ */
+ _bt_delitems_delete(rel, buffer, deletable, ndeletable, heapRel);
+ if (PageGetFreeSpace(page) >= itemsz)
+ {
+ pfree(dedupState);
+ return;
+ }
+
+ /* Continue with deduplication */
+ minoff = P_FIRSTDATAKEY(oopaque);
+ maxoff = PageGetMaxOffsetNumber(page);
+ }
+
+ /*
+ * Scan over all items to see which ones can be deduplicated
+ */
+ newpage = PageGetTempPageCopySpecial(page);
+ nopaque = (BTPageOpaque) PageGetSpecialPointer(newpage);
+
+ /* Make sure that new page won't have garbage flag set */
+ nopaque->btpo_flags &= ~BTP_HAS_GARBAGE;
+
+ /* Copy High Key if any */
+ if (!P_RIGHTMOST(oopaque))
+ {
+ ItemId hitemid = PageGetItemId(page, P_HIKEY);
+ Size hitemsz = ItemIdGetLength(hitemid);
+ IndexTuple hitem = (IndexTuple) PageGetItem(page, hitemid);
+
+ if (PageAddItem(newpage, (Item) hitem, hitemsz, P_HIKEY,
+ false, false) == InvalidOffsetNumber)
+ elog(ERROR, "failed to add highkey during deduplication");
+ }
+
+ /*
+ * Iterate over tuples on the page, try to deduplicate them into posting
+ * lists and insert into new page.
*/
+ for (offnum = minoff;
+ offnum <= maxoff;
+ offnum = OffsetNumberNext(offnum))
+ {
+ ItemId itemid = PageGetItemId(page, offnum);
+ IndexTuple itup = (IndexTuple) PageGetItem(page, itemid);
+
+ Assert(!ItemIdIsDead(itemid));
+
+ if (dedupState->itupprev == NULL)
+ {
+ /* Just set up base/first item in first iteration */
+ Assert(offnum == minoff);
+ dedupState->itupprev = CopyIndexTuple(itup);
+ dedupState->itupprev_off = offnum;
+ continue;
+ }
+
+ if (deduplicate &&
+ _bt_keep_natts_fast(rel, dedupState->itupprev, itup) > natts)
+ {
+ int itup_ntuples;
+ Size projpostingsz;
+
+ /*
+ * Tuples are equal.
+ *
+ * If posting list does not exceed tuple size limit then append
+ * the tuple to the pending posting list. Otherwise, insert it on
+ * page and continue with this tuple as new pending posting list.
+ */
+ itup_ntuples = BTreeTupleIsPosting(itup) ?
+ BTreeTupleGetNPosting(itup) : 1;
+
+ /*
+ * Project size of new posting list that would result from merging
+ * current tup with pending posting list (could just be prev item
+ * that's "pending").
+ *
+ * This accounting looks odd, but it's correct because ...
+ */
+ projpostingsz = MAXALIGN(IndexTupleSize(dedupState->itupprev) +
+ (dedupState->ntuples + itup_ntuples + 1) *
+ sizeof(ItemPointerData));
+
+ if (projpostingsz <= dedupState->maxitemsize)
+ _bt_stash_item_tid(dedupState, itup, offnum);
+ else
+ _bt_dedup_insert(newpage, dedupState);
+ }
+ else
+ {
+ /*
+ * Tuples are not equal, or we're done deduplicating this page.
+ *
+ * Insert pending posting list on page. This could just be a
+ * regular tuple.
+ */
+ _bt_dedup_insert(newpage, dedupState);
+ }
+
+ pfree(dedupState->itupprev);
+ dedupState->itupprev = CopyIndexTuple(itup);
+ dedupState->itupprev_off = offnum;
+
+ Assert(IndexTupleSize(dedupState->itupprev) <= dedupState->maxitemsize);
+ }
+
+ /* Handle the last item */
+ _bt_dedup_insert(newpage, dedupState);
+
+ /*
+ * If no items suitable for deduplication were found, newpage must be
+ * exactly the same as the original page, so just return from function.
+ */
+ if (dedupState->n_intervals == 0)
+ {
+ pfree(dedupState);
+ return;
+ }
+
+ START_CRIT_SECTION();
+
+ PageRestoreTempPage(newpage, page);
+ MarkBufferDirty(buffer);
+
+ /* Log the deduplication intervals */
+ if (RelationNeedsWAL(rel))
+ {
+ XLogRecPtr recptr;
+ xl_btree_dedup xlrec_dedup;
+
+ xlrec_dedup.n_intervals = dedupState->n_intervals;
+
+ XLogBeginInsert();
+ XLogRegisterBuffer(0, buffer, REGBUF_STANDARD);
+ XLogRegisterData((char *) &xlrec_dedup, SizeOfBtreeDedup);
+
+ /* only save non-empty part of the array */
+ if (dedupState->n_intervals > 0)
+ XLogRegisterData((char *) dedupState->dedup_intervals,
+ dedupState->n_intervals * sizeof(dedupInterval));
+
+ recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_DEDUP_PAGE);
+
+ PageSetLSN(page, recptr);
+ }
+
+ END_CRIT_SECTION();
+
+ /* be tidy */
+ pfree(dedupState);
+}
+
+/*
+ * Add new posting tuple item to the page based on itupprev and saved list of
+ * heap TIDs.
+ */
+void
+_bt_dedup_insert(Page page, BTDedupState *dedupState)
+{
+ IndexTuple to_insert;
+ OffsetNumber offnum = PageGetMaxOffsetNumber(page);
+
+ if (dedupState->ntuples == 0)
+ {
+ /*
+ * Use original itupprev, which may or may not be a posting list
+ * already from some earlier dedup attempt
+ */
+ to_insert = dedupState->itupprev;
+ }
+ else
+ {
+ IndexTuple postingtuple;
+
+ /* form a tuple with a posting list */
+ postingtuple = BTreeFormPostingTuple(dedupState->itupprev,
+ dedupState->ipd,
+ dedupState->ntuples);
+ to_insert = postingtuple;
+ pfree(dedupState->ipd);
+ }
+
+ Assert(IndexTupleSize(dedupState->itupprev) <= dedupState->maxitemsize);
+ /* Add the new item into the page */
+ offnum = OffsetNumberNext(offnum);
+
+ if (PageAddItem(page, (Item) to_insert, IndexTupleSize(to_insert),
+ offnum, false, false) == InvalidOffsetNumber)
+ elog(ERROR, "deduplication failed to add tuple to page");
+
+ if (dedupState->ntuples > 0)
+ pfree(to_insert);
+ dedupState->ntuples = 0;
}
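To make the deduplication pass in _bt_dedup_one_page() easier to review, here is a minimal standalone sketch of the grouping idea only. It is not the patch's code: SketchItem, SketchGroup and sketch_dedup are hypothetical names, and it caps a group by a TID count (max_tids), whereas the patch caps the projected posting tuple size in bytes against maxitemsize.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for index tuples; not the nbtree structures. */
typedef struct
{
	int			key;			/* stands in for the index key columns */
	unsigned	tid;			/* stands in for a heap TID */
} SketchItem;

typedef struct
{
	int			key;			/* key shared by every member of the group */
	size_t		first;			/* index of the first member in items[] */
	size_t		ntids;			/* number of TIDs merged into the group */
} SketchGroup;

/*
 * Merge adjacent items with equal keys into groups ("posting lists").
 * items[] must already be sorted by key, as on a leaf page.  A pending
 * group is closed when the key changes or when it already holds max_tids
 * entries.  out[] must have room for nitems groups.  Returns the number
 * of groups written.
 */
size_t
sketch_dedup(const SketchItem *items, size_t nitems, size_t max_tids,
			 SketchGroup *out)
{
	size_t		ngroups = 0;

	assert(max_tids >= 1);

	for (size_t i = 0; i < nitems; i++)
	{
		SketchGroup *pending = (ngroups > 0) ? &out[ngroups - 1] : NULL;

		/* Same key and still room: append to the pending group */
		if (pending != NULL && pending->key == items[i].key &&
			pending->ntids < max_tids)
		{
			pending->ntids++;
			continue;
		}

		/* Otherwise start a new pending group with this item */
		out[ngroups].key = items[i].key;
		out[ngroups].first = i;
		out[ngroups].ntids = 1;
		ngroups++;
	}

	return ngroups;
}
```

With six items keyed 1, 1, 1, 2, 3, 3 and max_tids = 2, the sketch produces four groups: the third duplicate of key 1 starts a fresh group because the pending one is full, which mirrors the "continue with this tuple as new pending posting list" branch.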
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 268f869..5314bbe 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -24,6 +24,7 @@
#include "access/nbtree.h"
#include "access/nbtxlog.h"
+#include "access/tableam.h"
#include "access/transam.h"
#include "access/xlog.h"
#include "access/xloginsert.h"
@@ -42,6 +43,11 @@ static bool _bt_lock_branch_parent(Relation rel, BlockNumber child,
BlockNumber *target, BlockNumber *rightsib);
static void _bt_log_reuse_page(Relation rel, BlockNumber blkno,
TransactionId latestRemovedXid);
+static TransactionId _bt_compute_xid_horizon_for_tuples(Relation rel,
+ Relation heapRel,
+ Buffer buf,
+ OffsetNumber *itemnos,
+ int nitems);
/*
* _bt_initmetapage() -- Fill a page buffer with a correct metapage image
@@ -983,14 +989,52 @@ _bt_page_recyclable(Page page)
void
_bt_delitems_vacuum(Relation rel, Buffer buf,
OffsetNumber *itemnos, int nitems,
+ OffsetNumber *remainingoffset,
+ IndexTuple *remaining, int nremaining,
BlockNumber lastBlockVacuumed)
{
Page page = BufferGetPage(buf);
BTPageOpaque opaque;
+ Size itemsz;
+ Size remaining_sz = 0;
+ char *remaining_buf = NULL;
+
+ /* XLOG stuff: build a buffer holding the remaining tuples */
+ if (nremaining && RelationNeedsWAL(rel))
+ {
+ Size offset = 0;
+
+ for (int i = 0; i < nremaining; i++)
+ remaining_sz += MAXALIGN(IndexTupleSize(remaining[i]));
+
+ remaining_buf = palloc0(remaining_sz);
+ for (int i = 0; i < nremaining; i++)
+ {
+ itemsz = IndexTupleSize(remaining[i]);
+ memcpy(remaining_buf + offset, (char *) remaining[i], itemsz);
+ offset += MAXALIGN(itemsz);
+ }
+ Assert(offset == remaining_sz);
+ }
/* No ereport(ERROR) until changes are logged */
START_CRIT_SECTION();
+ /* Handle posting tuples here */
+ for (int i = 0; i < nremaining; i++)
+ {
+ /* At first, delete the old tuple. */
+ PageIndexTupleDelete(page, remainingoffset[i]);
+
+ itemsz = IndexTupleSize(remaining[i]);
+ itemsz = MAXALIGN(itemsz);
+
+ /* Add tuple with remaining ItemPointers to the page. */
+ if (PageAddItem(page, (Item) remaining[i], itemsz, remainingoffset[i],
+ false, false) == InvalidOffsetNumber)
+ elog(ERROR, "failed to rewrite posting list item in index while doing vacuum");
+ }
+
/* Fix the page */
if (nitems > 0)
PageIndexMultiDelete(page, itemnos, nitems);
@@ -1020,6 +1064,8 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
xl_btree_vacuum xlrec_vacuum;
xlrec_vacuum.lastBlockVacuumed = lastBlockVacuumed;
+ xlrec_vacuum.nremaining = nremaining;
+ xlrec_vacuum.ndeleted = nitems;
XLogBeginInsert();
XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
@@ -1033,6 +1079,19 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
if (nitems > 0)
XLogRegisterBufData(0, (char *) itemnos, nitems * sizeof(OffsetNumber));
+ /*
+ * Save the offsets and the remaining tuples themselves. Replay must
+ * restore them in the correct order: first the remaining tuples, and
+ * only after that the other deleted items.
+ */
+ if (nremaining > 0)
+ {
+ Assert(remaining_buf != NULL);
+ XLogRegisterBufData(0, (char *) remainingoffset,
+ nremaining * sizeof(OffsetNumber));
+ XLogRegisterBufData(0, remaining_buf, remaining_sz);
+ }
+
recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_VACUUM);
PageSetLSN(page, recptr);
@@ -1042,6 +1101,91 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
}
/*
+ * Get the latestRemovedXid from the table entries pointed at by the index
+ * tuples being deleted.
+ *
+ * This is a version of index_compute_xid_horizon_for_tuples() specialized to
+ * nbtree, which can handle posting lists.
+ */
+static TransactionId
+_bt_compute_xid_horizon_for_tuples(Relation rel, Relation heapRel,
+ Buffer buf, OffsetNumber *itemnos,
+ int nitems)
+{
+ ItemPointerData *ttids;
+ TransactionId latestRemovedXid = InvalidTransactionId;
+ Page page = BufferGetPage(buf);
+ int arraynitems;
+ int finalnitems;
+
+ /*
+ * Initial size of array can fit everything when it turns out that there
+ * are no posting lists
+ */
+ arraynitems = nitems;
+ ttids = (ItemPointerData *) palloc(sizeof(ItemPointerData) * arraynitems);
+
+ finalnitems = 0;
+ /* identify what the index tuples about to be deleted point to */
+ for (int i = 0; i < nitems; i++)
+ {
+ ItemId itemid;
+ IndexTuple itup;
+
+ itemid = PageGetItemId(page, itemnos[i]);
+ itup = (IndexTuple) PageGetItem(page, itemid);
+
+ Assert(ItemIdIsDead(itemid));
+
+ if (!BTreeTupleIsPosting(itup))
+ {
+ /* Make sure that we have space for additional heap TID */
+ if (finalnitems + 1 > arraynitems)
+ {
+ arraynitems = arraynitems * 2;
+ ttids = (ItemPointerData *)
+ repalloc(ttids, sizeof(ItemPointerData) * arraynitems);
+ }
+
+ Assert(ItemPointerIsValid(&itup->t_tid));
+ ItemPointerCopy(&itup->t_tid, &ttids[finalnitems]);
+ finalnitems++;
+ }
+ else
+ {
+ int nposting = BTreeTupleGetNPosting(itup);
+
+ /* Make sure that we have space for additional heap TIDs */
+ if (finalnitems + nposting > arraynitems)
+ {
+ arraynitems = Max(arraynitems * 2, finalnitems + nposting);
+ ttids = (ItemPointerData *)
+ repalloc(ttids, sizeof(ItemPointerData) * arraynitems);
+ }
+
+ for (int j = 0; j < nposting; j++)
+ {
+ ItemPointer htid = BTreeTupleGetPostingN(itup, j);
+
+ Assert(ItemPointerIsValid(htid));
+ ItemPointerCopy(htid, &ttids[finalnitems]);
+ finalnitems++;
+ }
+ }
+ }
+
+ Assert(finalnitems >= nitems);
+
+ /* determine the actual xid horizon */
+ latestRemovedXid =
+ table_compute_xid_horizon_for_tuples(heapRel, ttids, finalnitems);
+
+ pfree(ttids);
+
+ return latestRemovedXid;
+}
+
+/*
* Delete item(s) from a btree page during single-page cleanup.
*
* As above, must only be used on leaf pages.
@@ -1067,8 +1211,8 @@ _bt_delitems_delete(Relation rel, Buffer buf,
if (XLogStandbyInfoActive() && RelationNeedsWAL(rel))
latestRemovedXid =
- index_compute_xid_horizon_for_tuples(rel, heapRel, buf,
- itemnos, nitems);
+ _bt_compute_xid_horizon_for_tuples(rel, heapRel, buf,
+ itemnos, nitems);
/* No ereport(ERROR) until changes are logged */
START_CRIT_SECTION();
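The array handling in _bt_compute_xid_horizon_for_tuples() above boils down to "start with one slot per index tuple, grow when a posting list needs more". A minimal sketch of that pattern outside of PostgreSQL follows. SketchTuple and sketch_collect_tids are hypothetical names, TIDs are plain unsigned values, and malloc/realloc stand in for palloc/repalloc (so, unlike the patch, allocation failure has to be checked by hand).

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in: a plain tuple has ntids == 1, a posting tuple > 1 */
typedef struct
{
	const unsigned *tids;
	int			ntids;
} SketchTuple;

/*
 * Flatten the TIDs of all tuples into one malloc'd array.  The array
 * starts with room for one TID per tuple, which is enough when there are
 * no posting lists, and is enlarged to Max(2 * capacity, needed) whenever
 * a tuple does not fit.  Returns NULL on allocation failure; otherwise the
 * caller must free() the result.  *nout receives the number of TIDs.
 */
unsigned *
sketch_collect_tids(const SketchTuple *tuples, int ntuples, int *nout)
{
	int			cap = (ntuples > 0) ? ntuples : 1;
	int			n = 0;
	unsigned   *out = malloc(sizeof(unsigned) * (size_t) cap);

	assert(nout != NULL);
	if (out == NULL)
		return NULL;

	for (int i = 0; i < ntuples; i++)
	{
		int			need = n + tuples[i].ntids;

		/* Make sure that we have space for this tuple's TIDs */
		if (need > cap)
		{
			int			newcap = (cap * 2 > need) ? cap * 2 : need;
			unsigned   *tmp = realloc(out, sizeof(unsigned) * (size_t) newcap);

			if (tmp == NULL)
			{
				free(out);
				return NULL;
			}
			out = tmp;
			cap = newcap;
		}

		for (int j = 0; j < tuples[i].ntids; j++)
			out[n++] = tuples[i].tids[j];
	}

	*nout = n;
	return out;
}
```

Taking the larger of "double" and "exactly what is needed" keeps the number of reallocations small while still coping with a single posting list that is bigger than the whole current array.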
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 4cfd528..6759531 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -97,6 +97,8 @@ static void btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
BTCycleId cycleid, TransactionId *oldestBtpoXact);
static void btvacuumpage(BTVacState *vstate, BlockNumber blkno,
BlockNumber orig_blkno);
+static ItemPointer btreevacuumPosting(BTVacState *vstate, IndexTuple itup,
+ int *nremaining);
/*
@@ -263,8 +265,8 @@ btgettuple(IndexScanDesc scan, ScanDirection dir)
*/
if (so->killedItems == NULL)
so->killedItems = (int *)
- palloc(MaxIndexTuplesPerPage * sizeof(int));
- if (so->numKilled < MaxIndexTuplesPerPage)
+ palloc(MaxPostingIndexTuplesPerPage * sizeof(int));
+ if (so->numKilled < MaxPostingIndexTuplesPerPage)
so->killedItems[so->numKilled++] = so->currPos.itemIndex;
}
@@ -1069,7 +1071,8 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
RBM_NORMAL, info->strategy);
LockBufferForCleanup(buf);
_bt_checkpage(rel, buf);
- _bt_delitems_vacuum(rel, buf, NULL, 0, vstate.lastBlockVacuumed);
+ _bt_delitems_vacuum(rel, buf, NULL, 0, NULL, NULL, 0,
+ vstate.lastBlockVacuumed);
_bt_relbuf(rel, buf);
}
@@ -1193,6 +1196,9 @@ restart:
OffsetNumber offnum,
minoff,
maxoff;
+ IndexTuple remaining[MaxOffsetNumber];
+ OffsetNumber remainingoffset[MaxOffsetNumber];
+ int nremaining;
/*
* Trade in the initial read lock for a super-exclusive write lock on
@@ -1229,6 +1235,7 @@ restart:
* callback function.
*/
ndeletable = 0;
+ nremaining = 0;
minoff = P_FIRSTDATAKEY(opaque);
maxoff = PageGetMaxOffsetNumber(page);
if (callback)
@@ -1242,31 +1249,79 @@ restart:
itup = (IndexTuple) PageGetItem(page,
PageGetItemId(page, offnum));
- htup = &(itup->t_tid);
- /*
- * During Hot Standby we currently assume that
- * XLOG_BTREE_VACUUM records do not produce conflicts. That is
- * only true as long as the callback function depends only
- * upon whether the index tuple refers to heap tuples removed
- * in the initial heap scan. When vacuum starts it derives a
- * value of OldestXmin. Backends taking later snapshots could
- * have a RecentGlobalXmin with a later xid than the vacuum's
- * OldestXmin, so it is possible that row versions deleted
- * after OldestXmin could be marked as killed by other
- * backends. The callback function *could* look at the index
- * tuple state in isolation and decide to delete the index
- * tuple, though currently it does not. If it ever did, we
- * would need to reconsider whether XLOG_BTREE_VACUUM records
- * should cause conflicts. If they did cause conflicts they
- * would be fairly harsh conflicts, since we haven't yet
- * worked out a way to pass a useful value for
- * latestRemovedXid on the XLOG_BTREE_VACUUM records. This
- * applies to *any* type of index that marks index tuples as
- * killed.
- */
- if (callback(htup, callback_state))
- deletable[ndeletable++] = offnum;
+ if (BTreeTupleIsPosting(itup))
+ {
+ int nnewipd = 0;
+ ItemPointer newipd = NULL;
+
+ newipd = btreevacuumPosting(vstate, itup, &nnewipd);
+
+ if (nnewipd == 0)
+ {
+ /*
+ * All TIDs from posting list must be deleted, we can
+ * delete whole tuple in a regular way.
+ */
+ deletable[ndeletable++] = offnum;
+ }
+ else if (nnewipd == BTreeTupleGetNPosting(itup))
+ {
+ /*
+ * All TIDs from posting tuple must remain. Do
+ * nothing, just cleanup.
+ */
+ pfree(newipd);
+ }
+ else if (nnewipd < BTreeTupleGetNPosting(itup))
+ {
+ /* Some TIDs from posting tuple must remain. */
+ Assert(nnewipd > 0);
+ Assert(newipd != NULL);
+
+ /*
+ * Form new tuple that contains only remaining TIDs.
+ * Remember this tuple and the offset of the old tuple
+ * to update it in place.
+ */
+ remainingoffset[nremaining] = offnum;
+ remaining[nremaining] =
+ BTreeFormPostingTuple(itup, newipd, nnewipd);
+ nremaining++;
+ pfree(newipd);
+
+ Assert(IndexTupleSize(itup) <= BTMaxItemSize(page));
+ }
+ }
+ else
+ {
+ htup = &(itup->t_tid);
+
+ /*
+ * During Hot Standby we currently assume that
+ * XLOG_BTREE_VACUUM records do not produce conflicts.
+ * That is only true as long as the callback function
+ * depends only upon whether the index tuple refers to
+ * heap tuples removed in the initial heap scan. When
+ * vacuum starts it derives a value of OldestXmin.
+ * Backends taking later snapshots could have a
+ * RecentGlobalXmin with a later xid than the vacuum's
+ * OldestXmin, so it is possible that row versions deleted
+ * after OldestXmin could be marked as killed by other
+ * backends. The callback function *could* look at the
+ * index tuple state in isolation and decide to delete the
+ * index tuple, though currently it does not. If it ever
+ * did, we would need to reconsider whether
+ * XLOG_BTREE_VACUUM records should cause conflicts. If
+ * they did cause conflicts they would be fairly harsh
+ * conflicts, since we haven't yet worked out a way to
+ * pass a useful value for latestRemovedXid on the
+ * XLOG_BTREE_VACUUM records. This applies to *any* type
+ * of index that marks index tuples as killed.
+ */
+ if (callback(htup, callback_state))
+ deletable[ndeletable++] = offnum;
+ }
}
}
@@ -1274,7 +1329,7 @@ restart:
* Apply any needed deletes. We issue just one _bt_delitems_vacuum()
* call per page, so as to minimize WAL traffic.
*/
- if (ndeletable > 0)
+ if (ndeletable > 0 || nremaining > 0)
{
/*
* Notice that the issued XLOG_BTREE_VACUUM WAL record includes
@@ -1291,6 +1346,7 @@ restart:
* that.
*/
_bt_delitems_vacuum(rel, buf, deletable, ndeletable,
+ remainingoffset, remaining, nremaining,
vstate->lastBlockVacuumed);
/*
@@ -1376,6 +1432,41 @@ restart:
}
/*
+ * btreevacuumPosting() -- vacuums a posting tuple.
+ *
+ * Returns new palloc'd posting list with remaining items.
+ * Posting list size is returned via nremaining.
+ *
+ * If all items are dead,
+ * nremaining is 0 and resulting posting list is NULL.
+ */
+static ItemPointer
+btreevacuumPosting(BTVacState *vstate, IndexTuple itup, int *nremaining)
+{
+ int remaining = 0;
+ int nitem = BTreeTupleGetNPosting(itup);
+ ItemPointer tmpitems = NULL,
+ items = BTreeTupleGetPosting(itup);
+
+ /*
+ * Check each TID in the posting list, saving live TIDs into tmpitems
+ */
+ for (int i = 0; i < nitem; i++)
+ {
+ if (vstate->callback(items + i, vstate->callback_state))
+ continue;
+
+ if (tmpitems == NULL)
+ tmpitems = palloc(sizeof(ItemPointerData) * nitem);
+
+ tmpitems[remaining++] = items[i];
+ }
+
+ *nremaining = remaining;
+ return tmpitems;
+}
+
+/*
* btcanreturn() -- Check whether btree indexes support index-only scans.
*
* btrees always do, so this is trivial.
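The three outcomes that btvacuumpage() distinguishes for a posting tuple in the diff above (delete the whole tuple, leave it alone, or rewrite it with the surviving TIDs) can be sketched in isolation as follows. This is only an illustration: SketchVacAction, sketch_dead_cb, sketch_vacuum_posting and the sample callback sketch_tid_is_even are hypothetical, TIDs are plain unsigned values, and the caller supplies the output buffer instead of the function allocating it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical callback type: returns true when the TID is dead */
typedef bool (*sketch_dead_cb) (unsigned tid, void *state);

typedef enum
{
	SKETCH_DELETE_TUPLE,		/* every TID is dead: delete the tuple */
	SKETCH_KEEP_TUPLE,			/* every TID is live: nothing to do */
	SKETCH_REWRITE_TUPLE		/* some TIDs remain: rewrite in place */
} SketchVacAction;

/* Sample callback for illustration only: treats even TIDs as dead */
bool
sketch_tid_is_even(unsigned tid, void *state)
{
	(void) state;
	return (tid % 2) == 0;
}

/*
 * Filter one posting list.  Live TIDs are copied, in order, into live[]
 * (which must have room for ntids entries) and counted in *nlive.  The
 * return value tells the caller what to do with the tuple as a whole.
 */
SketchVacAction
sketch_vacuum_posting(const unsigned *tids, int ntids,
					  sketch_dead_cb is_dead, void *state,
					  unsigned *live, int *nlive)
{
	int			remaining = 0;

	assert(ntids > 0 && live != NULL && nlive != NULL);

	for (int i = 0; i < ntids; i++)
	{
		if (is_dead(tids[i], state))
			continue;
		live[remaining++] = tids[i];
	}

	*nlive = remaining;
	if (remaining == 0)
		return SKETCH_DELETE_TUPLE;
	if (remaining == ntids)
		return SKETCH_KEEP_TUPLE;
	return SKETCH_REWRITE_TUPLE;
}
```

Only the SKETCH_REWRITE_TUPLE case corresponds to an entry in the remaining[]/remainingoffset[] arrays of the patch; the delete case goes through the ordinary deletable[] path.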
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 8e51246..c78c8e6 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -26,10 +26,18 @@
static void _bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp);
static OffsetNumber _bt_binsrch(Relation rel, BTScanInsert key, Buffer buf);
+static int _bt_binsrch_posting(BTScanInsert key, Page page,
+ OffsetNumber offnum);
static bool _bt_readpage(IndexScanDesc scan, ScanDirection dir,
OffsetNumber offnum);
static void _bt_saveitem(BTScanOpaque so, int itemIndex,
OffsetNumber offnum, IndexTuple itup);
+static void _bt_setuppostingitems(BTScanOpaque so, int itemIndex,
+ OffsetNumber offnum, ItemPointer iptr,
+ IndexTuple itup);
+static inline void _bt_savepostingitem(BTScanOpaque so, int itemIndex,
+ OffsetNumber offnum, ItemPointer iptr,
+ IndexTuple itup);
static bool _bt_steppage(IndexScanDesc scan, ScanDirection dir);
static bool _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir);
static bool _bt_parallel_readpage(IndexScanDesc scan, BlockNumber blkno,
@@ -434,7 +442,10 @@ _bt_binsrch(Relation rel,
* low) makes bounds invalid.
*
* Caller is responsible for invalidating bounds when it modifies the page
- * before calling here a second time.
+ * before calling here a second time, and for dealing with posting list
+ * tuple matches (callers can use insertstate's in_posting_offset field to
+ * determine which existing heap TID will need to be replaced by their
+ * scantid/new heap TID).
*/
OffsetNumber
_bt_binsrch_insert(Relation rel, BTInsertState insertstate)
@@ -453,6 +464,7 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
Assert(P_ISLEAF(opaque));
Assert(!key->nextkey);
+ Assert(insertstate->in_posting_offset == 0);
if (!insertstate->bounds_valid)
{
@@ -509,6 +521,17 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
if (result != 0)
stricthigh = high;
}
+
+ /*
+ * If tuple at offset located by binary search is a posting list whose
+ * TID range overlaps with caller's scantid, perform posting list
+ * binary search to set in_posting_offset for caller. Caller must
+ * split the posting list when in_posting_offset is set. This should
+ * happen infrequently.
+ */
+ if (unlikely(result == 0 && key->scantid != NULL))
+ insertstate->in_posting_offset =
+ _bt_binsrch_posting(key, page, mid);
}
/*
@@ -529,6 +552,68 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
}
/*----------
+ * _bt_binsrch_posting() -- posting list binary search.
+ *
+ * Returns offset into posting list where caller's scantid belongs.
+ *----------
+ */
+static int
+_bt_binsrch_posting(BTScanInsert key, Page page, OffsetNumber offnum)
+{
+ IndexTuple itup;
+ ItemId itemid;
+ int low,
+ high,
+ mid,
+ res;
+
+ /*
+ * If this isn't a posting tuple, then the index must be corrupt (if it is
+ * an ordinary non-pivot tuple then there must be an existing tuple with a
+ * heap TID that equals inserter's new heap TID/scantid). Defensively
+ * check that tuple is a posting list tuple whose posting list range
+ * includes caller's scantid.
+ *
+ * (This is also needed because contrib/amcheck's rootdescend option needs
+ * to be able to relocate a non-pivot tuple using _bt_binsrch_insert().)
+ */
+ Assert(P_ISLEAF((BTPageOpaque) PageGetSpecialPointer(page)));
+ Assert(!key->nextkey);
+ Assert(key->scantid != NULL);
+ itemid = PageGetItemId(page, offnum);
+ itup = (IndexTuple) PageGetItem(page, itemid);
+ if (!BTreeTupleIsPosting(itup))
+ return 0;
+
+ /*
+ * In the unlikely event that posting list tuple has LP_DEAD bit set,
+ * signal to caller that it should kill the item and restart its binary
+ * search.
+ */
+ if (ItemIdIsDead(itemid))
+ return -1;
+
+ /* "high" is past end of posting list for loop invariant */
+ low = 0;
+ high = BTreeTupleGetNPosting(itup);
+ Assert(high >= 2);
+
+ while (high > low)
+ {
+ mid = low + ((high - low) / 2);
+ res = ItemPointerCompare(key->scantid,
+ BTreeTupleGetPostingN(itup, mid));
+
+ if (res >= 1)
+ low = mid + 1;
+ else
+ high = mid;
+ }
+
+ return low;
+}
+
+/*----------
* _bt_compare() -- Compare insertion-type scankey to tuple on a page.
*
* page/offnum: location of btree item to be compared to.
@@ -537,9 +622,18 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
* <0 if scankey < tuple at offnum;
* 0 if scankey == tuple at offnum;
* >0 if scankey > tuple at offnum.
- * NULLs in the keys are treated as sortable values. Therefore
- * "equality" does not necessarily mean that the item should be
- * returned to the caller as a matching key!
+ *
+ * NULLs in the keys are treated as sortable values. Therefore
+ * "equality" does not necessarily mean that the item should be returned
+ * to the caller as a matching key. Similarly, an insertion scankey
+ * with its scantid set is treated as equal to a posting tuple whose TID
+ * range overlaps with its scantid. There generally won't be a
+ * matching TID in the posting tuple, which caller must handle
+ * themselves (e.g., by splitting the posting list tuple).
+ *
+ * It is generally guaranteed that any possible scankey with scantid set
+ * will have zero or one tuples in the index that are considered equal
+ * here.
*
* CRUCIAL NOTE: on a non-leaf page, the first data key is assumed to be
* "minus infinity": this routine will always claim it is less than the
@@ -563,6 +657,7 @@ _bt_compare(Relation rel,
ScanKey scankey;
int ncmpkey;
int ntupatts;
+ int32 result;
Assert(_bt_check_natts(rel, key->heapkeyspace, page, offnum));
Assert(key->keysz <= IndexRelationGetNumberOfKeyAttributes(rel));
@@ -597,7 +692,6 @@ _bt_compare(Relation rel,
{
Datum datum;
bool isNull;
- int32 result;
datum = index_getattr(itup, scankey->sk_attno, itupdesc, &isNull);
@@ -713,8 +807,24 @@ _bt_compare(Relation rel,
if (heapTid == NULL)
return 1;
+ /*
+ * scankey must be treated as equal to a posting list tuple if its scantid
+ * value falls within the range of the posting list. In all other cases
+ * there can only be a single heap TID value, which is compared directly
+ * as a simple scalar value.
+ */
Assert(ntupatts >= IndexRelationGetNumberOfKeyAttributes(rel));
- return ItemPointerCompare(key->scantid, heapTid);
+ result = ItemPointerCompare(key->scantid, heapTid);
+ if (!BTreeTupleIsPosting(itup) || result <= 0)
+ return result;
+ else
+ {
+ result = ItemPointerCompare(key->scantid, BTreeTupleGetMaxTID(itup));
+ if (result > 0)
+ return 1;
+ }
+
+ return 0;
}
/*
@@ -1451,6 +1561,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
/* initialize tuple workspace to empty */
so->currPos.nextTupleOffset = 0;
+ so->currPos.postingTupleOffset = 0;
/*
* Now that the current page has been made consistent, the macro should be
@@ -1485,8 +1596,30 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
if (_bt_checkkeys(scan, itup, indnatts, dir, &continuescan))
{
/* tuple passes all scan key conditions, so remember it */
- _bt_saveitem(so, itemIndex, offnum, itup);
- itemIndex++;
+ if (!BTreeTupleIsPosting(itup))
+ {
+ _bt_saveitem(so, itemIndex, offnum, itup);
+ itemIndex++;
+ }
+ else
+ {
+ /*
+ * Setup state to return posting list, and save first
+ * "logical" tuple
+ */
+ _bt_setuppostingitems(so, itemIndex, offnum,
+ BTreeTupleGetPostingN(itup, 0),
+ itup);
+ itemIndex++;
+ /* Save additional posting list "logical" tuples */
+ for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
+ {
+ _bt_savepostingitem(so, itemIndex, offnum,
+ BTreeTupleGetPostingN(itup, i),
+ itup);
+ itemIndex++;
+ }
+ }
}
/* When !continuescan, there can't be any more matches, so stop */
if (!continuescan)
@@ -1519,7 +1652,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
if (!continuescan)
so->currPos.moreRight = false;
- Assert(itemIndex <= MaxIndexTuplesPerPage);
+ Assert(itemIndex <= MaxPostingIndexTuplesPerPage);
so->currPos.firstItem = 0;
so->currPos.lastItem = itemIndex - 1;
so->currPos.itemIndex = 0;
@@ -1527,7 +1660,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
else
{
/* load items[] in descending order */
- itemIndex = MaxIndexTuplesPerPage;
+ itemIndex = MaxPostingIndexTuplesPerPage;
offnum = Min(offnum, maxoff);
@@ -1569,8 +1702,37 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
if (passes_quals && tuple_alive)
{
/* tuple passes all scan key conditions, so remember it */
- itemIndex--;
- _bt_saveitem(so, itemIndex, offnum, itup);
+ if (!BTreeTupleIsPosting(itup))
+ {
+ itemIndex--;
+ _bt_saveitem(so, itemIndex, offnum, itup);
+ }
+ else
+ {
+ int i = BTreeTupleGetNPosting(itup) - 1;
+
+ /*
+ * Setup state to return posting list, and save last
+ * "logical" tuple from posting list (since it's the first
+ * that will be returned to scan).
+ */
+ itemIndex--;
+ _bt_setuppostingitems(so, itemIndex, offnum,
+ BTreeTupleGetPostingN(itup, i--),
+ itup);
+
+ /*
+ * Return posting list "logical" tuples -- do this in
+ * descending order, to match overall scan order
+ */
+ for (; i >= 0; i--)
+ {
+ itemIndex--;
+ _bt_savepostingitem(so, itemIndex, offnum,
+ BTreeTupleGetPostingN(itup, i),
+ itup);
+ }
+ }
}
if (!continuescan)
{
@@ -1584,8 +1746,8 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
Assert(itemIndex >= 0);
so->currPos.firstItem = itemIndex;
- so->currPos.lastItem = MaxIndexTuplesPerPage - 1;
- so->currPos.itemIndex = MaxIndexTuplesPerPage - 1;
+ so->currPos.lastItem = MaxPostingIndexTuplesPerPage - 1;
+ so->currPos.itemIndex = MaxPostingIndexTuplesPerPage - 1;
}
return (so->currPos.firstItem <= so->currPos.lastItem);
@@ -1598,6 +1760,8 @@ _bt_saveitem(BTScanOpaque so, int itemIndex,
{
BTScanPosItem *currItem = &so->currPos.items[itemIndex];
+ Assert(!BTreeTupleIsPosting(itup));
+
currItem->heapTid = itup->t_tid;
currItem->indexOffset = offnum;
if (so->currTuples)
@@ -1611,6 +1775,61 @@ _bt_saveitem(BTScanOpaque so, int itemIndex,
}
/*
+ * Setup state to save posting items from a single posting list tuple. Saves
+ * the logical tuple that will be returned to scan first in passing.
+ *
+ * Saves an index item into so->currPos.items[itemIndex] for logical tuple
+ * that is returned to scan first. Second or subsequent heap TID for posting
+ * list should be saved by calling _bt_savepostingitem().
+ */
+static void
+_bt_setuppostingitems(BTScanOpaque so, int itemIndex, OffsetNumber offnum,
+ ItemPointer iptr, IndexTuple itup)
+{
+ BTScanPosItem *currItem = &so->currPos.items[itemIndex];
+
+ currItem->heapTid = *iptr;
+ currItem->indexOffset = offnum;
+
+ if (so->currTuples)
+ {
+ /* Save a truncated version of the IndexTuple */
+ Size itupsz = BTreeTupleGetPostingOffset(itup);
+
+ itupsz = MAXALIGN(itupsz);
+ currItem->tupleOffset = so->currPos.nextTupleOffset;
+ memcpy(so->currTuples + so->currPos.nextTupleOffset, itup, itupsz);
+ so->currPos.nextTupleOffset += itupsz;
+ so->currPos.postingTupleOffset = currItem->tupleOffset;
+ }
+}
+
+/*
+ * Save an index item into so->currPos.items[itemIndex] for posting tuple.
+ *
+ * Assumes that _bt_setuppostingitems() has already been called for current
+ * posting list tuple.
+ */
+static inline void
+_bt_savepostingitem(BTScanOpaque so, int itemIndex, OffsetNumber offnum,
+ ItemPointer iptr, IndexTuple itup)
+{
+ BTScanPosItem *currItem = &so->currPos.items[itemIndex];
+
+ currItem->heapTid = *iptr;
+ currItem->indexOffset = offnum;
+
+ if (so->currTuples)
+ {
+ /*
+ * Have index-only scans return the same truncated IndexTuple for
+ * every logical tuple that originates from the same posting list
+ */
+ currItem->tupleOffset = so->currPos.postingTupleOffset;
+ }
+}
+
+/*
* _bt_steppage() -- Step to next page containing valid data for scan
*
* On entry, if so->currPos.buf is valid the buffer is pinned but not locked;
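
The scantid handling in the nbtsearch.c changes above can be illustrated outside the backend. The following is only a sketch under stated assumptions: DemoTid, demo_tid_compare(), demo_compare_posting_range() and demo_binsrch_posting() are hypothetical stand-ins for ItemPointerData, ItemPointerCompare(), the tail of _bt_compare() and the posting list binary search. They are not part of the patch. The posting list is assumed to be sorted in ascending TID order.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for ItemPointerData: a (block, offset) heap TID */
typedef struct DemoTid
{
	uint32_t	block;
	uint16_t	offset;
} DemoTid;

/* Same contract as ItemPointerCompare(): <0, 0 or >0 */
static int
demo_tid_compare(const DemoTid *a, const DemoTid *b)
{
	if (a->block != b->block)
		return (a->block < b->block) ? -1 : 1;
	if (a->offset != b->offset)
		return (a->offset < b->offset) ? -1 : 1;
	return 0;
}

/*
 * Range comparison against a posting list: negative when scantid is below
 * the lowest TID, positive when it is above the highest TID, and 0 when it
 * falls anywhere inside the range (even if no exact match exists).
 */
static int
demo_compare_posting_range(const DemoTid *scantid,
						   const DemoTid *posting, int nposting)
{
	int			result;

	assert(nposting >= 2);
	result = demo_tid_compare(scantid, &posting[0]);
	if (result <= 0)
		return result;
	if (demo_tid_compare(scantid, &posting[nposting - 1]) > 0)
		return 1;
	return 0;
}

/*
 * Binary search for the position within the posting list at which scantid
 * would have to be placed.  "high" starts one past the end of the list to
 * keep the loop invariant.
 */
static int
demo_binsrch_posting(const DemoTid *scantid,
					 const DemoTid *posting, int nposting)
{
	int			low = 0;
	int			high = nposting;

	assert(nposting >= 2);
	while (high > low)
	{
		int			mid = low + (high - low) / 2;

		if (demo_tid_compare(scantid, &posting[mid]) > 0)
			low = mid + 1;
		else
			high = mid;
	}
	return low;
}
```

A scantid that compares as "equal" to the posting tuple by range is then located inside the list with the binary search; the returned position is where a posting list split would have to happen.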
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index ab19692..4198770 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -288,6 +288,8 @@ static void _bt_sortaddtup(Page page, Size itemsize,
static void _bt_buildadd(BTWriteState *wstate, BTPageState *state,
IndexTuple itup);
static void _bt_uppershutdown(BTWriteState *wstate, BTPageState *state);
+static void _bt_buildadd_posting(BTWriteState *wstate, BTPageState *state,
+ BTDedupState *dedupState);
static void _bt_load(BTWriteState *wstate,
BTSpool *btspool, BTSpool *btspool2);
static void _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent,
@@ -830,6 +832,8 @@ _bt_sortaddtup(Page page,
* the high key is to be truncated, offset 1 is deleted, and we insert
* the truncated high key at offset 1.
*
+ * Note that itup may be a posting list tuple.
+ *
* 'last' pointer indicates the last offset added to the page.
*----------
*/
@@ -963,6 +967,11 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
* Overwrite the old item with new truncated high key directly.
* oitup is already located at the physical beginning of tuple
* space, so this should directly reuse the existing tuple space.
+ *
+ * If the lastleft tuple was a posting tuple, we'll truncate its
+ * posting list in _bt_truncate as well. Note that this applies
+ * only to leaf pages, since internal pages never contain posting
+ * tuples.
*/
ii = PageGetItemId(opage, OffsetNumberPrev(last_off));
lastleft = (IndexTuple) PageGetItem(opage, ii);
@@ -1002,6 +1011,7 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
* the minimum key for the new page.
*/
state->btps_minkey = CopyIndexTuple(oitup);
+ Assert(BTreeTupleIsPivot(state->btps_minkey));
/*
* Set the sibling links for both pages.
@@ -1043,6 +1053,7 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
Assert(state->btps_minkey == NULL);
state->btps_minkey = CopyIndexTuple(itup);
/* _bt_sortaddtup() will perform full truncation later */
+ BTreeTupleClearBtIsPosting(state->btps_minkey);
BTreeTupleSetNAtts(state->btps_minkey, 0);
}
@@ -1128,6 +1139,136 @@ _bt_uppershutdown(BTWriteState *wstate, BTPageState *state)
}
/*
+ * Add new tuple (posting or non-posting) to the page while building index.
+ */
+static void
+_bt_buildadd_posting(BTWriteState *wstate, BTPageState *state,
+ BTDedupState *dedupState)
+{
+ IndexTuple to_insert;
+
+ /* Return, if there is no tuple to insert */
+ if (state == NULL)
+ return;
+
+ if (dedupState->ntuples == 0)
+ to_insert = dedupState->itupprev;
+ else
+ {
+ IndexTuple postingtuple;
+
+ /* form a tuple with a posting list */
+ postingtuple = BTreeFormPostingTuple(dedupState->itupprev,
+ dedupState->ipd,
+ dedupState->ntuples);
+ to_insert = postingtuple;
+ pfree(dedupState->ipd);
+ }
+
+ _bt_buildadd(wstate, state, to_insert);
+
+ if (dedupState->ntuples > 0)
+ pfree(to_insert);
+ dedupState->ntuples = 0;
+}
+
+/*
+ * Save item pointer(s) of itup to the posting list in dedupState.
+ *
+ * 'itup' is current tuple on page, which comes immediately after equal
+ * 'itupprev' tuple stashed in dedup state at the point we're called.
+ *
+ * Helper function for _bt_load() and _bt_dedup_one_page(), called when it
+ * becomes clear that pending itupprev item will be part of a new/pending
+ * posting list, or when a pending/new posting list will contain a new heap
+ * TID from itup.
+ *
+ * Note: caller is responsible for the BTMaxItemSize() check.
+ */
+void
+_bt_stash_item_tid(BTDedupState *dedupState, IndexTuple itup,
+ OffsetNumber itup_offnum)
+{
+ int nposting = 0;
+
+ if (dedupState->ntuples == 0)
+ {
+ dedupState->ipd = palloc0(dedupState->maxitemsize);
+
+ /*
+ * itupprev hasn't had its posting list TIDs copied into ipd yet (must
+ * have been first on page and/or in new posting list?). Do so now.
+ *
+ * This is delayed because it wasn't initially clear whether or not
+ * itupprev would be merged with the next tuple, or stay as-is. By
+ * now caller compared it against itup and found that it was equal, so
+ * we can go ahead and add its TIDs.
+ */
+ if (!BTreeTupleIsPosting(dedupState->itupprev))
+ {
+ memcpy(dedupState->ipd, dedupState->itupprev,
+ sizeof(ItemPointerData));
+ dedupState->ntuples++;
+ }
+ else
+ {
+ /* if itupprev is posting, add all its TIDs to the posting list */
+ nposting = BTreeTupleGetNPosting(dedupState->itupprev);
+ memcpy(dedupState->ipd,
+ BTreeTupleGetPosting(dedupState->itupprev),
+ sizeof(ItemPointerData) * nposting);
+ dedupState->ntuples += nposting;
+ }
+
+ /* Save info about deduplicated items for future xlog record */
+ dedupState->n_intervals++;
+ /* Save offnum of the first item belonging to the group */
+ dedupState->dedup_intervals[dedupState->n_intervals - 1].from = dedupState->itupprev_off;
+ /*
+ * Update the number of deduplicated items, belonging to this group.
+ * Count each item just once, no matter if it was posting tuple or not
+ */
+ dedupState->dedup_intervals[dedupState->n_intervals - 1].ntups++;
+ }
+
+ /*
+ * Add current tup to ipd for pending posting list for new version of
+ * page.
+ */
+ if (!BTreeTupleIsPosting(itup))
+ {
+ memcpy(dedupState->ipd + dedupState->ntuples, itup,
+ sizeof(ItemPointerData));
+ dedupState->ntuples++;
+ }
+ else
+ {
+ /*
+ * if tuple is posting, add all its TIDs to the pending list that will
+ * become new posting list later on
+ */
+ nposting = BTreeTupleGetNPosting(itup);
+ memcpy(dedupState->ipd + dedupState->ntuples,
+ BTreeTupleGetPosting(itup),
+ sizeof(ItemPointerData) * nposting);
+ dedupState->ntuples += nposting;
+ }
+
+ /*
+ * Update the number of deduplicated items, belonging to this group.
+ * Count each item just once, no matter if it was posting tuple or not
+ */
+ dedupState->dedup_intervals[dedupState->n_intervals - 1].ntups++;
+
+ /* TODO just a debug message. delete it in final version of the patch */
+ if (itup_offnum != InvalidOffsetNumber)
+ elog(DEBUG4, "_bt_stash_item_tid. N %d : from %u ntups %u",
+ dedupState->n_intervals,
+ dedupState->dedup_intervals[dedupState->n_intervals - 1].from,
+ dedupState->dedup_intervals[dedupState->n_intervals - 1].ntups);
+}
+
+/*
* Read tuples in correct sort order from tuplesort, and load them into
* btree leaves.
*/
@@ -1141,9 +1282,20 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
bool load1;
TupleDesc tupdes = RelationGetDescr(wstate->index);
int i,
- keysz = IndexRelationGetNumberOfKeyAttributes(wstate->index);
+ keysz = IndexRelationGetNumberOfKeyAttributes(wstate->index),
+ natts = IndexRelationGetNumberOfAttributes(wstate->index);
SortSupport sortKeys;
int64 tuples_done = 0;
+ bool deduplicate = false;
+ BTDedupState *dedupState = NULL;
+
+ /*
+ * Don't use deduplication for indexes with INCLUDEd columns or for
+ * unique indexes
+ */
+ deduplicate = (IndexRelationGetNumberOfKeyAttributes(wstate->index) ==
+ IndexRelationGetNumberOfAttributes(wstate->index) &&
+ !wstate->index->rd_index->indisunique);
if (merge)
{
@@ -1257,19 +1409,88 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
}
else
{
- /* merge is unnecessary */
- while ((itup = tuplesort_getindextuple(btspool->sortstate,
- true)) != NULL)
+ if (!deduplicate)
{
- /* When we see first tuple, create first index page */
- if (state == NULL)
- state = _bt_pagestate(wstate, 0);
+ /* merge is unnecessary */
+ while ((itup = tuplesort_getindextuple(btspool->sortstate,
+ true)) != NULL)
+ {
+ /* When we see first tuple, create first index page */
+ if (state == NULL)
+ state = _bt_pagestate(wstate, 0);
- _bt_buildadd(wstate, state, itup);
+ _bt_buildadd(wstate, state, itup);
- /* Report progress */
- pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
- ++tuples_done);
+ /* Report progress */
+ pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
+ ++tuples_done);
+ }
+ }
+ else
+ {
+ /* init deduplication state needed to build posting tuples */
+ dedupState = (BTDedupState *) palloc0(sizeof(BTDedupState));
+ dedupState->ipd = NULL;
+ dedupState->ntuples = 0;
+ dedupState->itupprev = NULL;
+ dedupState->maxitemsize = 0;
+ dedupState->maxpostingsize = 0;
+
+ while ((itup = tuplesort_getindextuple(btspool->sortstate,
+ true)) != NULL)
+ {
+ /* When we see first tuple, create first index page */
+ if (state == NULL)
+ {
+ state = _bt_pagestate(wstate, 0);
+ dedupState->maxitemsize = BTMaxItemSize(state->btps_page);
+ }
+
+ if (dedupState->itupprev != NULL)
+ {
+ int n_equal_atts = _bt_keep_natts_fast(wstate->index,
+ dedupState->itupprev, itup);
+
+ if (n_equal_atts > natts)
+ {
+ /*
+ * Tuples are equal. Create or update the pending posting list.
+ *
+ * If the posting list would become too big, insert it on the
+ * page and continue.
+ */
+ if ((dedupState->ntuples + 1) * sizeof(ItemPointerData) <
+ dedupState->maxpostingsize)
+ _bt_stash_item_tid(dedupState, itup, InvalidOffsetNumber);
+ else
+ _bt_buildadd_posting(wstate, state, dedupState);
+ }
+ else
+ {
+ /*
+ * Tuples are not equal. Insert itupprev into index.
+ * Save current tuple for the next iteration.
+ */
+ _bt_buildadd_posting(wstate, state, dedupState);
+ }
+ }
+
+ /*
+ * Save the tuple to compare it with the next one and maybe
+ * unite them into a posting tuple.
+ */
+ if (dedupState->itupprev)
+ pfree(dedupState->itupprev);
+ dedupState->itupprev = CopyIndexTuple(itup);
+
+ /* compute max size of posting list */
+ dedupState->maxpostingsize = dedupState->maxitemsize -
+ IndexInfoFindDataOffset(dedupState->itupprev->t_info) -
+ MAXALIGN(IndexTupleSize(dedupState->itupprev));
+ }
+
+ /* Handle the last item */
+ _bt_buildadd_posting(wstate, state, dedupState);
}
}
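
The deduplicating branch of _bt_load() above boils down to a single pass over sorted input that merges runs of equal keys until a size limit is hit. Below is a minimal standalone sketch of that loop, not the patch's code. DemoEntry and demo_dedup_sorted() are hypothetical names, keys are plain ints, and the limit is expressed as a TID count (max_tids) instead of the byte budget that maxpostingsize represents in the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sorted input item: an integer key plus an opaque heap TID */
typedef struct DemoEntry
{
	int			key;
	unsigned	tid;
} DemoEntry;

/*
 * Walk entries (already sorted by key, then tid) and merge runs of equal
 * keys into groups of at most max_tids TIDs.  The size of every emitted
 * group is stored in group_sizes, which must have room for n elements.
 * Returns the number of groups, i.e. the number of index tuples that
 * would be added to leaf pages.
 */
static size_t
demo_dedup_sorted(const DemoEntry *entries, size_t n,
				  size_t max_tids, size_t *group_sizes)
{
	size_t		ngroups = 0;
	size_t		pending = 0;	/* TIDs in the pending group */

	assert(max_tids >= 1);

	for (size_t i = 0; i < n; i++)
	{
		int			same_key = (pending > 0 &&
								entries[i].key == entries[i - 1].key);

		/* Flush the pending group when the key changes or it is full */
		if (pending > 0 && (!same_key || pending == max_tids))
		{
			group_sizes[ngroups++] = pending;
			pending = 0;
		}
		pending++;
	}

	/* Handle the last pending group */
	if (pending > 0)
		group_sizes[ngroups++] = pending;

	return ngroups;
}
```

A group of size 1 corresponds to a plain index tuple and a larger group corresponds to a posting tuple, mirroring the ntuples == 0 test in _bt_buildadd_posting().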
diff --git a/src/backend/access/nbtree/nbtsplitloc.c b/src/backend/access/nbtree/nbtsplitloc.c
index 1c1029b..54cecc8 100644
--- a/src/backend/access/nbtree/nbtsplitloc.c
+++ b/src/backend/access/nbtree/nbtsplitloc.c
@@ -183,6 +183,9 @@ _bt_findsplitloc(Relation rel,
state.minfirstrightsz = SIZE_MAX;
state.newitemoff = newitemoff;
+ /* newitem cannot be a posting list item */
+ Assert(!BTreeTupleIsPosting(newitem));
+
/*
* maxsplits should never exceed maxoff because there will be at most as
* many candidate split points as there are points _between_ tuples, once
@@ -459,17 +462,52 @@ _bt_recsplitloc(FindSplitData *state,
int16 leftfree,
rightfree;
Size firstrightitemsz;
+ Size postingsubhikey = 0;
bool newitemisfirstonright;
/* Is the new item going to be the first item on the right page? */
newitemisfirstonright = (firstoldonright == state->newitemoff
&& !newitemonleft);
+ /*
+ * FIXME: Accessing every single tuple like this adds cycles to cases that
+ * cannot possibly benefit (i.e. cases where we know that there cannot be
+ * posting lists). Maybe we should add a way to not bother when we are
+ * certain that this is the case.
+ *
+ * We could either have _bt_split() pass us a flag, or invent a page flag
+ * that indicates that the page might have posting lists, as an
+ * optimization. There is no shortage of btpo_flags bits for stuff like
+ * this.
+ */
if (newitemisfirstonright)
+ {
firstrightitemsz = state->newitemsz;
+
+ /* Calculate posting list overhead, if any */
+ if (state->is_leaf && BTreeTupleIsPosting(state->newitem))
+ postingsubhikey = IndexTupleSize(state->newitem) -
+ BTreeTupleGetPostingOffset(state->newitem);
+ }
else
+ {
firstrightitemsz = firstoldonrightsz;
+ /* Calculate posting list overhead, if any */
+ if (state->is_leaf)
+ {
+ ItemId itemid;
+ IndexTuple newhighkey;
+
+ itemid = PageGetItemId(state->page, firstoldonright);
+ newhighkey = (IndexTuple) PageGetItem(state->page, itemid);
+
+ if (BTreeTupleIsPosting(newhighkey))
+ postingsubhikey = IndexTupleSize(newhighkey) -
+ BTreeTupleGetPostingOffset(newhighkey);
+ }
+ }
+
/* Account for all the old tuples */
leftfree = state->leftspace - olddataitemstoleft;
rightfree = state->rightspace -
@@ -492,9 +530,13 @@ _bt_recsplitloc(FindSplitData *state,
* adding a heap TID to the left half's new high key when splitting at the
* leaf level. In practice the new high key will often be smaller and
* will rarely be larger, but conservatively assume the worst case.
+ * Truncation always truncates away any posting list that appears in the
+ * first right tuple, though, so it's safe to subtract that overhead
+ * (while still conservatively assuming that truncation might have to add
+ * back a single heap TID using the pivot tuple heap TID representation).
*/
if (state->is_leaf)
- leftfree -= (int16) (firstrightitemsz +
+ leftfree -= (int16) ((firstrightitemsz - postingsubhikey) +
MAXALIGN(sizeof(ItemPointerData)));
else
leftfree -= (int16) firstrightitemsz;
@@ -691,7 +733,8 @@ _bt_afternewitemoff(FindSplitData *state, OffsetNumber maxoff,
itemid = PageGetItemId(state->page, OffsetNumberPrev(state->newitemoff));
tup = (IndexTuple) PageGetItem(state->page, itemid);
/* Do cheaper test first */
- if (!_bt_adjacenthtid(&tup->t_tid, &state->newitem->t_tid))
+ if (BTreeTupleIsPosting(tup) ||
+ !_bt_adjacenthtid(&tup->t_tid, &state->newitem->t_tid))
return false;
/* Check same conditions as rightmost item case, too */
keepnatts = _bt_keep_natts_fast(state->rel, tup, state->newitem);
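
The free space accounting changed in _bt_recsplitloc() above can be checked with a small arithmetic example. This is a sketch only: demo_highkey_charge() is a hypothetical helper, DEMO_MAXALIGN assumes 8-byte maximum alignment, and DEMO_TID_SIZE assumes the usual 6-byte ItemPointerData.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: MAXALIGN rounds up to a multiple of 8 bytes */
#define DEMO_MAXALIGN(len) (((size_t) (len) + 7) & ~(size_t) 7)
/* Assumption: sizeof(ItemPointerData) is 6 bytes */
#define DEMO_TID_SIZE 6

/*
 * Worst-case space charged to the left page for its new high key at the
 * leaf level.  For a posting tuple everything from posting_offset to the
 * end of the tuple is posting list overhead that truncation removes, so
 * it is subtracted.  One MAXALIGN'd heap TID is always added back.
 */
static size_t
demo_highkey_charge(size_t firstright_size, size_t posting_offset,
					int is_posting)
{
	size_t		postingsubhikey = 0;

	if (is_posting)
		postingsubhikey = firstright_size - posting_offset;

	return (firstright_size - postingsubhikey) +
		DEMO_MAXALIGN(DEMO_TID_SIZE);
}
```

With a 64-byte first right tuple whose posting list starts at byte 16, the charge drops from 72 bytes to 24 bytes, which is why more split points become legal on pages full of posting tuples.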
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 4c7b2d0..e3d7f4f 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -97,8 +97,6 @@ _bt_mkscankey(Relation rel, IndexTuple itup)
indoption = rel->rd_indoption;
tupnatts = itup ? BTreeTupleGetNAtts(itup, rel) : 0;
- Assert(tupnatts <= IndexRelationGetNumberOfAttributes(rel));
-
/*
* We'll execute search using scan key constructed on key columns.
* Truncated attributes and non-key attributes are omitted from the final
@@ -110,9 +108,20 @@ _bt_mkscankey(Relation rel, IndexTuple itup)
key->anynullkeys = false; /* initial assumption */
key->nextkey = false;
key->pivotsearch = false;
+ key->scantid = NULL;
key->keysz = Min(indnkeyatts, tupnatts);
- key->scantid = key->heapkeyspace && itup ?
- BTreeTupleGetHeapTID(itup) : NULL;
+
+ Assert(tupnatts <= IndexRelationGetNumberOfAttributes(rel));
+ Assert(!itup || !BTreeTupleIsPosting(itup) || key->heapkeyspace);
+
+ /*
+ * When caller passes a tuple with a heap TID, use it to set scantid. Note
+ * that this handles posting list tuples by setting scantid to the lowest
+ * heap TID in the posting list.
+ */
+ if (itup && key->heapkeyspace)
+ key->scantid = BTreeTupleGetHeapTID(itup);
+
skey = key->scankeys;
for (i = 0; i < indnkeyatts; i++)
{
@@ -1786,10 +1795,35 @@ _bt_killitems(IndexScanDesc scan)
{
ItemId iid = PageGetItemId(page, offnum);
IndexTuple ituple = (IndexTuple) PageGetItem(page, iid);
+ bool killtuple = false;
- if (ItemPointerEquals(&ituple->t_tid, &kitem->heapTid))
+ if (BTreeTupleIsPosting(ituple))
{
- /* found the item */
+ int pi = i + 1;
+ int nposting = BTreeTupleGetNPosting(ituple);
+ int j;
+
+ for (j = 0; j < nposting; j++)
+ {
+ ItemPointer item = BTreeTupleGetPostingN(ituple, j);
+
+ if (!ItemPointerEquals(item, &kitem->heapTid))
+ break; /* out of posting list loop */
+
+ /* Read-ahead to later kitems */
+ if (pi < numKilled)
+ kitem = &so->currPos.items[so->killedItems[pi++]];
+ }
+
+ if (j == nposting)
+ killtuple = true;
+ }
+ else if (ItemPointerEquals(&ituple->t_tid, &kitem->heapTid))
+ killtuple = true;
+
+ if (killtuple)
+ {
+ /* found the item/all posting list items */
ItemIdMarkDead(iid);
killedsomething = true;
break; /* out of inner search loop */
@@ -2145,6 +2179,24 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
pivot = index_truncate_tuple(itupdesc, firstright, keepnatts);
+ if (BTreeTupleIsPosting(firstright))
+ {
+ BTreeTupleClearBtIsPosting(pivot);
+ BTreeTupleSetNAtts(pivot, keepnatts);
+ if (keepnatts == natts)
+ {
+ /*
+ * index_truncate_tuple() just returned a copy of the
+ * original, so make sure that the size of the new pivot tuple
+ * doesn't have posting list overhead
+ */
+ pivot->t_info &= ~INDEX_SIZE_MASK;
+ pivot->t_info |= MAXALIGN(BTreeTupleGetPostingOffset(firstright));
+ }
+ }
+
+ Assert(!BTreeTupleIsPosting(pivot));
+
/*
* If there is a distinguishing key attribute within new pivot tuple,
* there is no need to add an explicit heap TID attribute
@@ -2161,6 +2213,8 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
* attribute to the new pivot tuple.
*/
Assert(natts != nkeyatts);
+ Assert(!BTreeTupleIsPosting(lastleft) &&
+ !BTreeTupleIsPosting(firstright));
newsize = IndexTupleSize(pivot) + MAXALIGN(sizeof(ItemPointerData));
tidpivot = palloc0(newsize);
memcpy(tidpivot, pivot, IndexTupleSize(pivot));
@@ -2168,6 +2222,24 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
pfree(pivot);
pivot = tidpivot;
}
+ else if (BTreeTupleIsPosting(firstright))
+ {
+ /*
+ * No truncation was possible, since key attributes are all equal. We
+ * can always truncate away a posting list, though.
+ *
+ * It's necessary to add a heap TID attribute to the new pivot tuple.
+ */
+ newsize = MAXALIGN(BTreeTupleGetPostingOffset(firstright)) +
+ MAXALIGN(sizeof(ItemPointerData));
+ pivot = palloc0(newsize);
+ memcpy(pivot, firstright, BTreeTupleGetPostingOffset(firstright));
+
+ pivot->t_info &= ~INDEX_SIZE_MASK;
+ pivot->t_info |= newsize;
+ BTreeTupleClearBtIsPosting(pivot);
+ BTreeTupleSetAltHeapTID(pivot);
+ }
else
{
/*
@@ -2175,7 +2247,8 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
* It's necessary to add a heap TID attribute to the new pivot tuple.
*/
Assert(natts == nkeyatts);
- newsize = IndexTupleSize(firstright) + MAXALIGN(sizeof(ItemPointerData));
+ newsize = MAXALIGN(IndexTupleSize(firstright)) +
+ MAXALIGN(sizeof(ItemPointerData));
pivot = palloc0(newsize);
memcpy(pivot, firstright, IndexTupleSize(firstright));
}
@@ -2193,6 +2266,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
* nbtree (e.g., there is no pg_attribute entry).
*/
Assert(itup_key->heapkeyspace);
+ Assert(!BTreeTupleIsPosting(pivot));
pivot->t_info &= ~INDEX_SIZE_MASK;
pivot->t_info |= newsize;
@@ -2205,7 +2279,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
*/
pivotheaptid = (ItemPointer) ((char *) pivot + newsize -
sizeof(ItemPointerData));
- ItemPointerCopy(&lastleft->t_tid, pivotheaptid);
+ ItemPointerCopy(BTreeTupleGetMaxTID(lastleft), pivotheaptid);
/*
* Lehman and Yao require that the downlink to the right page, which is to
@@ -2216,9 +2290,12 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
* tiebreaker.
*/
#ifndef DEBUG_NO_TRUNCATE
- Assert(ItemPointerCompare(&lastleft->t_tid, &firstright->t_tid) < 0);
- Assert(ItemPointerCompare(pivotheaptid, &lastleft->t_tid) >= 0);
- Assert(ItemPointerCompare(pivotheaptid, &firstright->t_tid) < 0);
+ Assert(ItemPointerCompare(BTreeTupleGetMaxTID(lastleft),
+ BTreeTupleGetHeapTID(firstright)) < 0);
+ Assert(ItemPointerCompare(pivotheaptid,
+ BTreeTupleGetHeapTID(lastleft)) >= 0);
+ Assert(ItemPointerCompare(pivotheaptid,
+ BTreeTupleGetHeapTID(firstright)) < 0);
#else
/*
@@ -2231,7 +2308,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
* attribute values along with lastleft's heap TID value when lastleft's
* TID happens to be greater than firstright's TID.
*/
- ItemPointerCopy(&firstright->t_tid, pivotheaptid);
+ ItemPointerCopy(BTreeTupleGetHeapTID(firstright), pivotheaptid);
/*
* Pivot heap TID should never be fully equal to firstright. Note that
@@ -2240,7 +2317,8 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
*/
ItemPointerSetOffsetNumber(pivotheaptid,
OffsetNumberPrev(ItemPointerGetOffsetNumber(pivotheaptid)));
- Assert(ItemPointerCompare(pivotheaptid, &firstright->t_tid) < 0);
+ Assert(ItemPointerCompare(pivotheaptid,
+ BTreeTupleGetHeapTID(firstright)) < 0);
#endif
BTreeTupleSetNAtts(pivot, nkeyatts);
@@ -2321,15 +2399,25 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
* The approach taken here usually provides the same answer as _bt_keep_natts
* will (for the same pair of tuples from a heapkeyspace index), since the
* majority of btree opclasses can never indicate that two datums are equal
- * unless they're bitwise equal (once detoasted). Similarly, result may
- * differ from the _bt_keep_natts result when either tuple has TOASTed datums,
- * though this is barely possible in practice.
+ * unless they're bitwise equal after detoasting.
*
* These issues must be acceptable to callers, typically because they're only
* concerned about making suffix truncation as effective as possible without
* leaving excessive amounts of free space on either side of page split.
* Callers can rely on the fact that attributes considered equal here are
* definitely also equal according to _bt_keep_natts.
+ *
+ * When an index only uses opclasses where equality is "precise", this
+ * function is guaranteed to give the same result as _bt_keep_natts(). This
+ * makes it safe to use this function to determine whether or not two tuples
+ * can be folded together into a single posting tuple. Posting list
+ * deduplication cannot be used with nondeterministic collations for this
+ * reason.
+ *
+ * FIXME: Actually invent the needed "equality-is-precise" opclass
+ * infrastructure. See dedicated -hackers thread:
+ *
+ * https://postgr.es/m/CAH2-Wzn3Ee49Gmxb7V1VJ3-AC8fWn-Fr8pfWQebHe8rYRxt5OQ@mail.gmail.com
*/
int
_bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
@@ -2354,8 +2442,38 @@ _bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
if (isNull1 != isNull2)
break;
+ /*
+ * XXX: The ideal outcome from the point of view of the posting list
+ * patch is that the definition of an opclass with "precise equality"
+ * becomes: "equality operator function must give exactly the same
+ * answer as datum_image_eq() would, provided that we aren't using a
+ * nondeterministic collation". (Nondeterministic collations are
+ * clearly not compatible with deduplication.)
+ *
+ * This will be a lot faster than actually using the authoritative
+ * insertion scankey in some cases. This approach also seems more
+ * elegant, since suffix truncation gets to follow exactly the same
+ * definition of "equal" as posting list deduplication -- there is a
+ * subtle interplay between deduplication and suffix truncation, and
+ * it would be nice to know for sure that they have exactly the same
+ * idea about what equality is.
+ *
+ * This ideal outcome still avoids problems with TOAST. We cannot
+ * repeat bugs like the amcheck bug that was fixed in bugfix commit
+ * eba775345d23d2c999bbb412ae658b6dab36e3e8. datum_image_eq()
+ * considers binary equality, though only _after_ each datum is
+ * decompressed.
+ *
+ * If this ideal solution isn't possible, then we can fall back on
+ * defining "precise equality" as: "type's output function must
+ * produce identical textual output for any two datums that compare
+ * equal when using a safe/equality-is-precise operator class (unless
+ * using a nondeterministic collation)". That would mean that we'd
+ * have to make deduplication call _bt_keep_natts() instead (or some
+ * other function that uses authoritative insertion scankey).
+ */
if (!isNull1 &&
- !datumIsEqual(datum1, datum2, att->attbyval, att->attlen))
+ !datum_image_eq(datum1, datum2, att->attbyval, att->attlen))
break;
keepnatts++;
@@ -2407,22 +2525,30 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
tupnatts = BTreeTupleGetNAtts(itup, rel);
+ /* !heapkeyspace indexes do not support deduplication */
+ if (!heapkeyspace && BTreeTupleIsPosting(itup))
+ return false;
+
+ /* INCLUDE indexes do not support deduplication */
+ if (natts != nkeyatts && BTreeTupleIsPosting(itup))
+ return false;
+
if (P_ISLEAF(opaque))
{
if (offnum >= P_FIRSTDATAKEY(opaque))
{
/*
- * Non-pivot tuples currently never use alternative heap TID
- * representation -- even those within heapkeyspace indexes
+ * Non-pivot tuple should never be explicitly marked as a pivot
+ * tuple
*/
- if ((itup->t_info & INDEX_ALT_TID_MASK) != 0)
+ if (BTreeTupleIsPivot(itup))
return false;
/*
* Leaf tuples that are not the page high key (non-pivot tuples)
* should never be truncated. (Note that tupnatts must have been
- * inferred, rather than coming from an explicit on-disk
- * representation.)
+ * inferred, even with a posting list tuple, because only pivot
+ * tuples store tupnatts directly.)
*/
return tupnatts == natts;
}
@@ -2466,12 +2592,12 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
* non-zero, or when there is no explicit representation and the
* tuple is evidently not a pre-pg_upgrade tuple.
*
- * Prior to v11, downlinks always had P_HIKEY as their offset. Use
- * that to decide if the tuple is a pre-v11 tuple.
+ * Prior to v11, downlinks always had P_HIKEY as their offset.
+ * Accept that as an alternative indication of a valid
+ * !heapkeyspace negative infinity tuple.
*/
return tupnatts == 0 ||
- ((itup->t_info & INDEX_ALT_TID_MASK) == 0 &&
- ItemPointerGetOffsetNumber(&(itup->t_tid)) == P_HIKEY);
+ ItemPointerGetOffsetNumber(&(itup->t_tid)) == P_HIKEY;
}
else
{
@@ -2497,7 +2623,11 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
* heapkeyspace index pivot tuples, regardless of whether or not there are
* non-key attributes.
*/
- if ((itup->t_info & INDEX_ALT_TID_MASK) == 0)
+ if (!BTreeTupleIsPivot(itup))
+ return false;
+
+ /* Pivot tuple should not use posting list representation (redundant) */
+ if (BTreeTupleIsPosting(itup))
return false;
/*
@@ -2567,11 +2697,87 @@ _bt_check_third_page(Relation rel, Relation heap, bool needheaptidspace,
BTMaxItemSizeNoHeapTid(page),
RelationGetRelationName(rel)),
errdetail("Index row references tuple (%u,%u) in relation \"%s\".",
- ItemPointerGetBlockNumber(&newtup->t_tid),
- ItemPointerGetOffsetNumber(&newtup->t_tid),
+ ItemPointerGetBlockNumber(BTreeTupleGetHeapTID(newtup)),
+ ItemPointerGetOffsetNumber(BTreeTupleGetHeapTID(newtup)),
RelationGetRelationName(heap)),
errhint("Values larger than 1/3 of a buffer page cannot be indexed.\n"
"Consider a function index of an MD5 hash of the value, "
"or use full text indexing."),
errtableconstraint(heap, RelationGetRelationName(rel))));
}
+
+/*
+ * Given a basic tuple that contains key datum and posting list,
+ * build a posting tuple.
+ *
+ * Basic tuple can be a posting tuple, but we only use key part of it,
+ * all ItemPointers must be passed via ipd.
+ *
+ * If nipd == 1, fall back to building a non-posting tuple.
+ * This is necessary to avoid storage overhead after a posting tuple was vacuumed.
+ */
+IndexTuple
+BTreeFormPostingTuple(IndexTuple tuple, ItemPointerData *ipd, int nipd)
+{
+ uint32 keysize,
+ newsize = 0;
+ IndexTuple itup;
+
+ /* We only need key part of the tuple */
+ if (BTreeTupleIsPosting(tuple))
+ keysize = BTreeTupleGetPostingOffset(tuple);
+ else
+ keysize = IndexTupleSize(tuple);
+
+ Assert(nipd > 0);
+
+ /* Add space needed for posting list */
+ if (nipd > 1)
+ newsize = SHORTALIGN(keysize) + sizeof(ItemPointerData) * nipd;
+ else
+ newsize = keysize;
+
+ newsize = MAXALIGN(newsize);
+ itup = palloc0(newsize);
+ memcpy(itup, tuple, keysize);
+ itup->t_info &= ~INDEX_SIZE_MASK;
+ itup->t_info |= newsize;
+
+ if (nipd > 1)
+ {
+ /* Form posting tuple, fill posting fields */
+
+ /* Set meta info about the posting list */
+ itup->t_info |= INDEX_ALT_TID_MASK;
+ BTreeSetPostingMeta(itup, nipd, SHORTALIGN(keysize));
+
+ /* sort the list to preserve TID order invariant */
+ qsort((void *) ipd, nipd, sizeof(ItemPointerData),
+ (int (*) (const void *, const void *)) ItemPointerCompare);
+
+ /* Copy posting list into the posting tuple */
+ memcpy(BTreeTupleGetPosting(itup), ipd,
+ sizeof(ItemPointerData) * nipd);
+ }
+ else
+ {
+ /* To finish building of a non-posting tuple, copy TID from ipd */
+ itup->t_info &= ~INDEX_ALT_TID_MASK;
+ ItemPointerCopy(ipd, &itup->t_tid);
+ }
+
+ return itup;
+}
+
+/*
+ * Opposite of BTreeFormPostingTuple.
+ * Returns a regular (non-posting) tuple that contains the key; its TID is
+ * the nth TID of the original tuple's posting list.
+ * The result tuple is palloc'd in the caller's context.
+ */
+IndexTuple
+BTreeGetNthTupleOfPosting(IndexTuple tuple, int n)
+{
+ Assert(BTreeTupleIsPosting(tuple));
+ return BTreeFormPostingTuple(tuple, BTreeTupleGetPostingN(tuple, n), 1);
+}
diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index dd5315c..de9bc3b 100644
--- a/src/backend/access/nbtree/nbtxlog.c
+++ b/src/backend/access/nbtree/nbtxlog.c
@@ -181,9 +181,35 @@ btree_xlog_insert(bool isleaf, bool ismeta, XLogReaderState *record)
page = BufferGetPage(buffer);
- if (PageAddItem(page, (Item) datapos, datalen, xlrec->offnum,
- false, false) == InvalidOffsetNumber)
- elog(PANIC, "btree_xlog_insert: failed to add item");
+ if (xlrec->in_posting_offset != InvalidOffsetNumber)
+ {
+ /* oposting must be at offset before new item */
+ ItemId itemid = PageGetItemId(page, OffsetNumberPrev(xlrec->offnum));
+ IndexTuple oposting = (IndexTuple) PageGetItem(page, itemid);
+ IndexTuple newitem = (IndexTuple) datapos;
+ IndexTuple nposting;
+
+ nposting = _bt_form_newposting(newitem, oposting,
+ xlrec->in_posting_offset);
+ Assert(isleaf);
+
+ Assert(MAXALIGN(IndexTupleSize(oposting)) ==
+ MAXALIGN(IndexTupleSize(nposting)));
+
+ /* replace existing posting */
+ memcpy(oposting, nposting, MAXALIGN(IndexTupleSize(nposting)));
+
+ /* insert new item */
+ if (PageAddItem(page, (Item) newitem, MAXALIGN(IndexTupleSize(newitem)),
+ xlrec->offnum, false, false) == InvalidOffsetNumber)
+ elog(PANIC, "btree_xlog_insert: failed to add item");
+ }
+ else
+ {
+ if (PageAddItem(page, (Item) datapos, datalen, xlrec->offnum,
+ false, false) == InvalidOffsetNumber)
+ elog(PANIC, "btree_xlog_insert: failed to add item");
+ }
PageSetLSN(page, lsn);
MarkBufferDirty(buffer);
@@ -265,20 +291,43 @@ btree_xlog_split(bool onleft, XLogReaderState *record)
BTPageOpaque lopaque = (BTPageOpaque) PageGetSpecialPointer(lpage);
OffsetNumber off;
IndexTuple newitem = NULL,
- left_hikey = NULL;
+ left_hikey = NULL,
+ nposting = NULL;
Size newitemsz = 0,
left_hikeysz = 0;
Page newlpage;
- OffsetNumber leftoff;
+ OffsetNumber leftoff,
+ replacepostingoff = InvalidOffsetNumber;
datapos = XLogRecGetBlockData(record, 0, &datalen);
- if (onleft)
+ if (onleft || xlrec->in_posting_offset)
{
newitem = (IndexTuple) datapos;
newitemsz = MAXALIGN(IndexTupleSize(newitem));
datapos += newitemsz;
datalen -= newitemsz;
+
+ /*
+ * Repeat logic implemented in _bt_insertonpg():
+ *
+ * If the new tuple is a duplicate with a heap TID that falls
+ * inside the range of an existing posting list tuple,
+ * generate new posting tuple to replace original one
+ * and update new tuple so that its heap TID contains
+ * the rightmost heap TID of original posting tuple.
+ */
+ if (xlrec->in_posting_offset)
+ {
+ ItemId itemid = PageGetItemId(lpage, xlrec->newitemoff);
+ IndexTuple oposting = (IndexTuple) PageGetItem(lpage, itemid);
+
+ nposting = _bt_form_newposting(newitem, oposting,
+ xlrec->in_posting_offset);
+ /* Alter new item offset, since effective new item changed */
+ xlrec->newitemoff = OffsetNumberNext(xlrec->newitemoff);
+ replacepostingoff = OffsetNumberPrev(xlrec->newitemoff);
+ }
}
/* Extract left hikey and its size (assuming 16-bit alignment) */
@@ -304,6 +353,15 @@ btree_xlog_split(bool onleft, XLogReaderState *record)
Size itemsz;
IndexTuple item;
+ if (off == replacepostingoff)
+ {
+ if (PageAddItem(newlpage, (Item) nposting, MAXALIGN(IndexTupleSize(nposting)),
+ leftoff, false, false) == InvalidOffsetNumber)
+ elog(ERROR, "failed to add new item to left page after split");
+ leftoff = OffsetNumberNext(leftoff);
+ continue;
+ }
+
/* add the new item if it was inserted on left page */
if (onleft && off == xlrec->newitemoff)
{
@@ -380,14 +438,146 @@ btree_xlog_split(bool onleft, XLogReaderState *record)
}
static void
+btree_xlog_dedup(XLogReaderState *record)
+{
+ XLogRecPtr lsn = record->EndRecPtr;
+ Buffer buf;
+ Page newpage;
+ xl_btree_dedup *xlrec = (xl_btree_dedup *) XLogRecGetData(record);
+
+ if (XLogReadBufferForRedo(record, 0, &buf) == BLK_NEEDS_REDO)
+ {
+ /*
+ * Initialize a temporary empty page and copy all the items
+ * to that in item number order.
+ */
+ Page page = (Page) BufferGetPage(buf);
+ BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+ BTPageOpaque nopaque;
+ OffsetNumber offnum, minoff, maxoff;
+ BTDedupState *dedupState = NULL;
+ char *data = ((char *) xlrec + SizeOfBtreeDedup);
+ dedupInterval dedup_intervals[MaxOffsetNumber];
+ int nth_interval = 0;
+ OffsetNumber n_dedup_tups = 0;
+
+ dedupState = (BTDedupState *) palloc0(sizeof(BTDedupState));
+ dedupState->ipd = NULL;
+ dedupState->ntuples = 0;
+ dedupState->itupprev = NULL;
+ dedupState->maxitemsize = BTMaxItemSize(page);
+ dedupState->maxpostingsize = 0;
+
+ memcpy(dedup_intervals, data,
+ xlrec->n_intervals*sizeof(dedupInterval));
+
+ /* Scan over all items to see which ones can be deduplicated */
+ minoff = P_FIRSTDATAKEY(opaque);
+ maxoff = PageGetMaxOffsetNumber(page);
+ newpage = PageGetTempPageCopySpecial(page);
+ nopaque = (BTPageOpaque) PageGetSpecialPointer(newpage);
+
+ /* Make sure that new page won't have garbage flag set */
+ nopaque->btpo_flags &= ~BTP_HAS_GARBAGE;
+
+ /* Copy High Key if any */
+ if (!P_RIGHTMOST(opaque))
+ {
+ ItemId itemid = PageGetItemId(page, P_HIKEY);
+ Size itemsz = ItemIdGetLength(itemid);
+ IndexTuple item = (IndexTuple) PageGetItem(page, itemid);
+
+ if (PageAddItem(newpage, (Item) item, itemsz, P_HIKEY,
+ false, false) == InvalidOffsetNumber)
+ elog(ERROR, "failed to add highkey during deduplication");
+ }
+
+ /*
+ * Iterate over tuples on the page to deduplicate them into posting
+ * lists and insert into new page
+ */
+ for (offnum = minoff;
+ offnum <= maxoff;
+ offnum = OffsetNumberNext(offnum))
+ {
+ ItemId itemId = PageGetItemId(page, offnum);
+ IndexTuple itup = (IndexTuple) PageGetItem(page, itemId);
+
+ elog(DEBUG4, "btree_xlog_dedup. offnum %u, nth_interval %u, from %u ntups %u",
+ offnum,
+ nth_interval,
+ dedup_intervals[nth_interval].from,
+ dedup_intervals[nth_interval].ntups);
+
+ if (dedupState->itupprev == NULL)
+ {
+ /* Just set up base/first item in first iteration */
+ Assert(offnum == minoff);
+ dedupState->itupprev = CopyIndexTuple(itup);
+ dedupState->itupprev_off = offnum;
+ continue;
+ }
+
+ /*
+ * Instead of comparing tuple's keys, which may be costly, use
+ * information from xlog record. If current tuple belongs to the
+ * group of deduplicated items, repeat logic of _bt_dedup_one_page
+ * and stash it to form a posting list afterwards.
+ */
+ if (dedupState->itupprev_off >= dedup_intervals[nth_interval].from
+ && n_dedup_tups < dedup_intervals[nth_interval].ntups)
+ {
+ _bt_stash_item_tid(dedupState, itup, InvalidOffsetNumber);
+
+ elog(DEBUG4, "btree_xlog_dedup. stash offnum %u, nth_interval %u, from %u ntups %u",
+ offnum,
+ nth_interval,
+ dedup_intervals[nth_interval].from,
+ dedup_intervals[nth_interval].ntups);
+
+ /* count first tuple in the group */
+ if (dedupState->itupprev_off == dedup_intervals[nth_interval].from)
+ n_dedup_tups++;
+
+ /* count added tuple */
+ n_dedup_tups++;
+ }
+ else
+ {
+ _bt_dedup_insert(newpage, dedupState);
+
+ /* reset state */
+ if (n_dedup_tups > 0)
+ nth_interval++;
+ n_dedup_tups = 0;
+ }
+
+ pfree(dedupState->itupprev);
+ dedupState->itupprev = CopyIndexTuple(itup);
+ dedupState->itupprev_off = offnum;
+ }
+
+ /* Handle the last item */
+ _bt_dedup_insert(newpage, dedupState);
+
+ PageRestoreTempPage(newpage, page);
+ PageSetLSN(page, lsn);
+ MarkBufferDirty(buf);
+ }
+
+ if (BufferIsValid(buf))
+ UnlockReleaseBuffer(buf);
+}
+
+static void
btree_xlog_vacuum(XLogReaderState *record)
{
XLogRecPtr lsn = record->EndRecPtr;
Buffer buffer;
Page page;
BTPageOpaque opaque;
-#ifdef UNUSED
xl_btree_vacuum *xlrec = (xl_btree_vacuum *) XLogRecGetData(record);
+#ifdef UNUSED
/*
* This section of code is thought to be no longer needed, after analysis
@@ -478,14 +668,34 @@ btree_xlog_vacuum(XLogReaderState *record)
if (len > 0)
{
- OffsetNumber *unused;
- OffsetNumber *unend;
+ if (xlrec->nremaining)
+ {
+ OffsetNumber *remainingoffset;
+ IndexTuple remaining;
+ Size itemsz;
+
+ remainingoffset = (OffsetNumber *)
+ (ptr + xlrec->ndeleted * sizeof(OffsetNumber));
+ remaining = (IndexTuple) ((char *) remainingoffset +
+ xlrec->nremaining * sizeof(OffsetNumber));
+
+ /* Handle posting tuples */
+ for (int i = 0; i < xlrec->nremaining; i++)
+ {
+ PageIndexTupleDelete(page, remainingoffset[i]);
- unused = (OffsetNumber *) ptr;
- unend = (OffsetNumber *) ((char *) ptr + len);
+ itemsz = MAXALIGN(IndexTupleSize(remaining));
- if ((unend - unused) > 0)
- PageIndexMultiDelete(page, unused, unend - unused);
+ if (PageAddItem(page, (Item) remaining, itemsz, remainingoffset[i],
+ false, false) == InvalidOffsetNumber)
+ elog(PANIC, "btree_xlog_vacuum: failed to add remaining item");
+
+ remaining = (IndexTuple) ((char *) remaining + itemsz);
+ }
+ }
+
+ if (xlrec->ndeleted)
+ PageIndexMultiDelete(page, (OffsetNumber *) ptr, xlrec->ndeleted);
}
/*
@@ -838,6 +1048,9 @@ btree_redo(XLogReaderState *record)
case XLOG_BTREE_SPLIT_R:
btree_xlog_split(false, record);
break;
+ case XLOG_BTREE_DEDUP_PAGE:
+ btree_xlog_dedup(record);
+ break;
case XLOG_BTREE_VACUUM:
btree_xlog_vacuum(record);
break;
diff --git a/src/backend/access/rmgrdesc/nbtdesc.c b/src/backend/access/rmgrdesc/nbtdesc.c
index a14eb79..802e27b 100644
--- a/src/backend/access/rmgrdesc/nbtdesc.c
+++ b/src/backend/access/rmgrdesc/nbtdesc.c
@@ -30,7 +30,8 @@ btree_desc(StringInfo buf, XLogReaderState *record)
{
xl_btree_insert *xlrec = (xl_btree_insert *) rec;
- appendStringInfo(buf, "off %u", xlrec->offnum);
+ appendStringInfo(buf, "off %u; in_posting_offset %u",
+ xlrec->offnum, xlrec->in_posting_offset);
break;
}
case XLOG_BTREE_SPLIT_L:
@@ -38,16 +39,27 @@ btree_desc(StringInfo buf, XLogReaderState *record)
{
xl_btree_split *xlrec = (xl_btree_split *) rec;
+ /* FIXME: even master doesn't have newitemoff */
appendStringInfo(buf, "level %u, firstright %d",
xlrec->level, xlrec->firstright);
break;
}
+ case XLOG_BTREE_DEDUP_PAGE:
+ {
+ xl_btree_dedup *xlrec = (xl_btree_dedup *) rec;
+
+ appendStringInfo(buf, "items were deduplicated to %d items",
+ xlrec->n_intervals);
+ break;
+ }
case XLOG_BTREE_VACUUM:
{
xl_btree_vacuum *xlrec = (xl_btree_vacuum *) rec;
- appendStringInfo(buf, "lastBlockVacuumed %u",
- xlrec->lastBlockVacuumed);
+ appendStringInfo(buf, "lastBlockVacuumed %u; nremaining %u; ndeleted %u",
+ xlrec->lastBlockVacuumed,
+ xlrec->nremaining,
+ xlrec->ndeleted);
break;
}
case XLOG_BTREE_DELETE:
@@ -131,6 +143,9 @@ btree_identify(uint8 info)
case XLOG_BTREE_SPLIT_R:
id = "SPLIT_R";
break;
+ case XLOG_BTREE_DEDUP_PAGE:
+ id = "DEDUPLICATE";
+ break;
case XLOG_BTREE_VACUUM:
id = "VACUUM";
break;
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 52eafe6..d1af18f 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -234,8 +234,7 @@ typedef struct BTMetaPageData
* t_tid | t_info | key values | INCLUDE columns, if any
*
* t_tid points to the heap TID, which is a tiebreaker key column as of
- * BTREE_VERSION 4. Currently, the INDEX_ALT_TID_MASK status bit is never
- * set for non-pivot tuples.
+ * BTREE_VERSION 4.
*
* All other types of index tuples ("pivot" tuples) only have key columns,
* since pivot tuples only exist to represent how the key space is
@@ -252,6 +251,38 @@ typedef struct BTMetaPageData
* omitted rather than truncated, since its representation is different to
* the non-pivot representation.)
*
+ * Non-pivot posting tuple format:
+ * t_tid | t_info | key values | INCLUDE columns, if any | posting_list[]
+ *
+ * In order to store duplicated keys more effectively, we use special format
+ * of tuples - posting tuples. posting_list is an array of ItemPointerData.
+ *
+ * Deduplication never applies to unique indexes or indexes with INCLUDEd
+ * columns.
+ *
+ * To distinguish posting tuples, we use the INDEX_ALT_TID_MASK flag in t_info
+ * and the BT_IS_POSTING flag in t_tid.
+ * These flags redefine the content of the posting tuple's t_tid:
+ * - t_tid.ip_blkid contains the offset of the posting list.
+ * - t_tid offset field contains the number of posting items in this tuple.
+ *
+ * The 12 least significant offset bits from t_tid are used to represent
+ * the number of posting items in posting tuples, leaving 4 status
+ * bits (BT_RESERVED_OFFSET_MASK bits), 3 of which are reserved for
+ * future use.
+ * BT_N_POSTING_OFFSET_MASK is large enough to store any number of posting
+ * items, which is constrained by BTMaxItemSize.
+ *
+ * If page contains so many duplicates, that they do not fit into one posting
+ * tuple (bounded by BTMaxItemSize and ), page may contain several posting
+ * tuples with the same key.
+ * Also page can contain both posting and non-posting tuples with the same key.
+ * Currently, posting tuples always contain at least two TIDs in the posting
+ * list.
+ *
+ * Posting tuples always have the same number of attributes as the index has
+ * generally.
+ *
* Pivot tuple format:
*
* t_tid | t_info | key values | [heap TID]
@@ -281,23 +312,145 @@ typedef struct BTMetaPageData
* bits (BT_RESERVED_OFFSET_MASK bits), 3 of which that are reserved for
* future use. BT_N_KEYS_OFFSET_MASK should be large enough to store any
* number of columns/attributes <= INDEX_MAX_KEYS.
+ * BT_IS_POSTING bit must be unset for pivot tuples, since we use it
+ * to distinguish posting tuples from pivot tuples.
*
* Note well: The macros that deal with the number of attributes in tuples
- * assume that a tuple with INDEX_ALT_TID_MASK set must be a pivot tuple,
- * and that a tuple without INDEX_ALT_TID_MASK set must be a non-pivot
- * tuple (or must have the same number of attributes as the index has
- * generally in the case of !heapkeyspace indexes). They will need to be
- * updated if non-pivot tuples ever get taught to use INDEX_ALT_TID_MASK
- * for something else.
+ * assume that a tuple with INDEX_ALT_TID_MASK set must be a pivot tuple or
+ * non-pivot posting tuple, and that a tuple without INDEX_ALT_TID_MASK set
+ * must be a non-pivot tuple (or must have the same number of attributes as
+ * the index has generally in the case of !heapkeyspace indexes).
*/
#define INDEX_ALT_TID_MASK INDEX_AM_RESERVED_BIT
/* Item pointer offset bits */
#define BT_RESERVED_OFFSET_MASK 0xF000
#define BT_N_KEYS_OFFSET_MASK 0x0FFF
+#define BT_N_POSTING_OFFSET_MASK 0x0FFF
#define BT_HEAP_TID_ATTR 0x1000
+#define BT_IS_POSTING 0x2000
+
+/*
+ * MaxPostingIndexTuplesPerPage is an upper bound on the number of tuples
+ * that can fit on one btree leaf page.
+ *
+ * Btree leaf pages may contain posting tuples, which store duplicates
+ * in a more effective way, so MaxPostingIndexTuplesPerPage is larger than
+ * MaxIndexTuplesPerPage.
+ *
+ * Each leaf page must contain at least three items, so estimate it as
+ * if we have three posting tuples with minimal size keys.
+ */
+#define MaxPostingIndexTuplesPerPage \
+ ((int) ((BLCKSZ - SizeOfPageHeaderData - \
+ 3*((MAXALIGN(sizeof(IndexTupleData) + 1) + sizeof(ItemIdData))) )) / \
+ (sizeof(ItemPointerData)))
+
+/*
+ * Helper for BTDedupState.
+ * Each entry represents a group of 'ntups' consecutive items starting on
+ * 'from' offset that were deduplicated into a single posting tuple.
+ */
+typedef struct dedupInterval
+{
+ OffsetNumber from;
+ OffsetNumber ntups;
+} dedupInterval;
+
+/*
+ * Btree-private state needed to build posting tuples.
+ * ipd is a posting list - an array of ItemPointerData.
+ *
+ * Iterating over tuples during index build or applying deduplication to a
+ * single page, we remember a tuple in itupprev, then compare the next one
+ * with it. If tuples are equal, save their TIDs in the posting list.
+ * ntuples contains the size of the posting list.
+ *
+ * Use maxitemsize and maxpostingsize to ensure that resulting posting tuple
+ * will satisfy BTMaxItemSize.
+ */
+typedef struct BTDedupState
+{
+ Size maxitemsize;
+ Size maxpostingsize;
+ IndexTuple itupprev;
+
+ /*
+ * array with info about deduplicated items on the page.
+ *
+ * It contains one entry for each group of consecutive items that
+ * were deduplicated into a single posting tuple.
+ *
+ * This array is saved in the xlog record, which allows deduplication
+ * to be replayed faster without actually comparing tuple keys.
+ */
+ dedupInterval dedup_intervals[MaxOffsetNumber];
+ /* current number of items in dedup_intervals array */
+ int n_intervals;
+ /* temp state variable to keep a 'possible' start of dedup interval */
+ OffsetNumber itupprev_off;
+
+ int ntuples;
+ ItemPointerData *ipd;
+} BTDedupState;
+
+/*
+ * N.B.: BTreeTupleIsPivot() should only be used in code that deals with
+ * heapkeyspace indexes specifically. BTreeTupleIsPosting() works with all
+ * nbtree indexes, though.
+ */
+#define BTreeTupleIsPivot(itup) \
+ ( \
+ ((itup)->t_info & INDEX_ALT_TID_MASK && \
+ ((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) == 0))\
+ )
+#define BTreeTupleIsPosting(itup) \
+ ( \
+ ((itup)->t_info & INDEX_ALT_TID_MASK && \
+ ((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) != 0))\
+ )
+
+#define BTreeTupleClearBtIsPosting(itup) \
+ do { \
+ ItemPointerSetOffsetNumber(&(itup)->t_tid, \
+ ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & ~BT_IS_POSTING); \
+ } while(0)
+
+#define BTreeTupleGetNPosting(itup) \
+ ( \
+ AssertMacro(BTreeTupleIsPosting(itup)), \
+ ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_N_POSTING_OFFSET_MASK \
+ )
+#define BTreeTupleSetNPosting(itup, n) \
+ do { \
+ ItemPointerSetOffsetNumber(&(itup)->t_tid, (n) & BT_N_POSTING_OFFSET_MASK); \
+ Assert((itup)->t_info & INDEX_ALT_TID_MASK); \
+ Assert(!((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) != 0)); \
+ ItemPointerSetOffsetNumber(&(itup)->t_tid, \
+ ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) | BT_IS_POSTING); \
+ } while(0)
-/* Get/set downlink block number */
+/*
+ * If tuple is posting, t_tid.ip_blkid contains offset of the posting list
+ */
+#define BTreeTupleGetPostingOffset(itup) \
+ ( \
+ AssertMacro(BTreeTupleIsPosting(itup)), \
+ ItemPointerGetBlockNumberNoCheck(&((itup)->t_tid)) \
+ )
+#define BTreeSetPostingMeta(itup, nposting, off) \
+ do { \
+ BTreeTupleSetNPosting(itup, nposting); \
+ Assert(BTreeTupleIsPosting(itup)); \
+ ItemPointerSetBlockNumber(&((itup)->t_tid), (off)); \
+ } while(0)
+
+#define BTreeTupleGetPosting(itup) \
+ (ItemPointer) ((char*) (itup) + BTreeTupleGetPostingOffset(itup))
+#define BTreeTupleGetPostingN(itup,n) \
+ (BTreeTupleGetPosting(itup) + (n))
+
+/* Get/set downlink block number */
#define BTreeInnerTupleGetDownLink(itup) \
ItemPointerGetBlockNumberNoCheck(&((itup)->t_tid))
#define BTreeInnerTupleSetDownLink(itup, blkno) \
@@ -326,40 +479,73 @@ typedef struct BTMetaPageData
*/
#define BTreeTupleGetNAtts(itup, rel) \
( \
- (itup)->t_info & INDEX_ALT_TID_MASK ? \
+ ((itup)->t_info & INDEX_ALT_TID_MASK && \
+ ((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) == 0)) ? \
( \
ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_N_KEYS_OFFSET_MASK \
) \
: \
IndexRelationGetNumberOfAttributes(rel) \
)
-#define BTreeTupleSetNAtts(itup, n) \
- do { \
- (itup)->t_info |= INDEX_ALT_TID_MASK; \
- ItemPointerSetOffsetNumber(&(itup)->t_tid, (n) & BT_N_KEYS_OFFSET_MASK); \
- } while(0)
+
+static inline void
+BTreeTupleSetNAtts(IndexTuple itup, int n)
+{
+ Assert(!BTreeTupleIsPosting(itup));
+ itup->t_info |= INDEX_ALT_TID_MASK;
+ ItemPointerSetOffsetNumber(&itup->t_tid, n & BT_N_KEYS_OFFSET_MASK);
+}
/*
- * Get tiebreaker heap TID attribute, if any. Macro works with both pivot
- * and non-pivot tuples, despite differences in how heap TID is represented.
+ * Get tiebreaker heap TID attribute, if any. Works with both pivot and
+ * non-pivot tuples, despite differences in how heap TID is represented.
+ *
+ * This returns the first/lowest heap TID in the case of a posting list tuple.
*/
-#define BTreeTupleGetHeapTID(itup) \
- ( \
- (itup)->t_info & INDEX_ALT_TID_MASK && \
- (ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_HEAP_TID_ATTR) != 0 ? \
- ( \
- (ItemPointer) (((char *) (itup) + IndexTupleSize(itup)) - \
- sizeof(ItemPointerData)) \
- ) \
- : (itup)->t_info & INDEX_ALT_TID_MASK ? NULL : (ItemPointer) &((itup)->t_tid) \
- )
+static inline ItemPointer
+BTreeTupleGetHeapTID(IndexTuple itup)
+{
+ if (BTreeTupleIsPivot(itup))
+ {
+ /* Pivot tuple heap TID representation? */
+ if ((ItemPointerGetOffsetNumberNoCheck(&itup->t_tid) &
+ BT_HEAP_TID_ATTR) != 0)
+ return (ItemPointer) ((char *) itup + IndexTupleSize(itup) -
+ sizeof(ItemPointerData));
+
+ /* Heap TID attribute was truncated */
+ return NULL;
+ }
+ else if (BTreeTupleIsPosting(itup))
+ return BTreeTupleGetPosting(itup);
+
+ return &(itup->t_tid);
+}
+
+/*
+ * Get maximum heap TID attribute, which could be the only TID in the case of
+ * a non-pivot tuple that does not have a posting list. Works with
+ * non-pivot tuples only.
+ */
+static inline ItemPointer
+BTreeTupleGetMaxTID(IndexTuple itup)
+{
+ Assert(!BTreeTupleIsPivot(itup));
+
+ if (BTreeTupleIsPosting(itup))
+ return (ItemPointer) (BTreeTupleGetPosting(itup) +
+ (BTreeTupleGetNPosting(itup) - 1));
+
+ return &(itup->t_tid);
+}
+
/*
* Set the heap TID attribute for a tuple that uses the INDEX_ALT_TID_MASK
- * representation (currently limited to pivot tuples)
+ * representation
*/
#define BTreeTupleSetAltHeapTID(itup) \
do { \
- Assert((itup)->t_info & INDEX_ALT_TID_MASK); \
+ Assert(BTreeTupleIsPivot(itup)); \
ItemPointerSetOffsetNumber(&(itup)->t_tid, \
ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) | BT_HEAP_TID_ATTR); \
} while(0)
@@ -500,6 +686,13 @@ typedef struct BTInsertStateData
Buffer buf;
/*
+ * if _bt_binsrch_insert() found the location inside existing posting
+ * list, save the position inside the list. This will be -1 in rare cases
+ * where the overlapping posting list is LP_DEAD.
+ */
+ int in_posting_offset;
+
+ /*
* Cache of bounds within the current buffer. Only used for insertions
* where _bt_check_unique is called. See _bt_binsrch_insert and
* _bt_findinsertloc for details.
@@ -534,7 +727,9 @@ typedef BTInsertStateData *BTInsertState;
* If we are doing an index-only scan, we save the entire IndexTuple for each
* matched item, otherwise only its heap TID and offset. The IndexTuples go
* into a separate workspace array; each BTScanPosItem stores its tuple's
- * offset within that array.
+ * offset within that array. Posting list tuples store a version of the
+ * tuple that does not include the posting list, allowing the same key to be
+ * returned for each logical tuple associated with the posting list.
*/
typedef struct BTScanPosItem /* what we remember about each match */
@@ -563,9 +758,13 @@ typedef struct BTScanPosData
/*
* If we are doing an index-only scan, nextTupleOffset is the first free
- * location in the associated tuple storage workspace.
+ * location in the associated tuple storage workspace. Posting list
+ * tuples need postingTupleOffset to store the current location of the
+ * tuple that is returned multiple times (once per heap TID in posting
+ * list).
*/
int nextTupleOffset;
+ int postingTupleOffset;
/*
* The items array is always ordered in index order (ie, increasing
@@ -578,7 +777,7 @@ typedef struct BTScanPosData
int lastItem; /* last valid index in items[] */
int itemIndex; /* current index in items[] */
- BTScanPosItem items[MaxIndexTuplesPerPage]; /* MUST BE LAST */
+ BTScanPosItem items[MaxPostingIndexTuplesPerPage]; /* MUST BE LAST */
} BTScanPosData;
typedef BTScanPosData *BTScanPos;
@@ -732,6 +931,9 @@ extern bool _bt_doinsert(Relation rel, IndexTuple itup,
IndexUniqueCheck checkUnique, Relation heapRel);
extern Buffer _bt_getstackbuf(Relation rel, BTStack stack, BlockNumber child);
extern void _bt_finish_split(Relation rel, Buffer bbuf, BTStack stack);
+extern IndexTuple _bt_form_newposting(IndexTuple itup, IndexTuple oposting,
+ OffsetNumber in_posting_offset);
+extern void _bt_dedup_insert(Page page, BTDedupState *dedupState);
/*
* prototypes for functions in nbtsplitloc.c
@@ -762,6 +964,8 @@ extern void _bt_delitems_delete(Relation rel, Buffer buf,
OffsetNumber *itemnos, int nitems, Relation heapRel);
extern void _bt_delitems_vacuum(Relation rel, Buffer buf,
OffsetNumber *itemnos, int nitems,
+ OffsetNumber *remainingoffset,
+ IndexTuple *remaining, int nremaining,
BlockNumber lastBlockVacuumed);
extern int _bt_pagedel(Relation rel, Buffer buf);
@@ -812,6 +1016,9 @@ extern bool _bt_check_natts(Relation rel, bool heapkeyspace, Page page,
OffsetNumber offnum);
extern void _bt_check_third_page(Relation rel, Relation heap,
bool needheaptidspace, Page page, IndexTuple newtup);
+extern IndexTuple BTreeFormPostingTuple(IndexTuple tuple, ItemPointerData *ipd,
+ int nipd);
+extern IndexTuple BTreeGetNthTupleOfPosting(IndexTuple tuple, int n);
/*
* prototypes for functions in nbtvalidate.c
@@ -824,5 +1031,7 @@ extern bool btvalidate(Oid opclassoid);
extern IndexBuildResult *btbuild(Relation heap, Relation index,
struct IndexInfo *indexInfo);
extern void _bt_parallel_build_main(dsm_segment *seg, shm_toc *toc);
+extern void _bt_stash_item_tid(BTDedupState *dedupState, IndexTuple itup,
+ OffsetNumber itup_offnum);
#endif /* NBTREE_H */
diff --git a/src/include/access/nbtxlog.h b/src/include/access/nbtxlog.h
index afa614d..075baaf 100644
--- a/src/include/access/nbtxlog.h
+++ b/src/include/access/nbtxlog.h
@@ -28,7 +28,8 @@
#define XLOG_BTREE_INSERT_META 0x20 /* same, plus update metapage */
#define XLOG_BTREE_SPLIT_L 0x30 /* add index tuple with split */
#define XLOG_BTREE_SPLIT_R 0x40 /* as above, new item on right */
-/* 0x50 and 0x60 are unused */
+#define XLOG_BTREE_DEDUP_PAGE 0x50 /* compactify tuples on the page */
+/* 0x60 is unused */
#define XLOG_BTREE_DELETE 0x70 /* delete leaf index tuples for a page */
#define XLOG_BTREE_UNLINK_PAGE 0x80 /* delete a half-dead page */
#define XLOG_BTREE_UNLINK_PAGE_META 0x90 /* same, and update metapage */
@@ -61,16 +62,21 @@ typedef struct xl_btree_metadata
* This data record is used for INSERT_LEAF, INSERT_UPPER, INSERT_META.
* Note that INSERT_META implies it's not a leaf page.
*
- * Backup Blk 0: original page (data contains the inserted tuple)
+ * Backup Blk 0: original page (data contains the inserted tuple);
+ * if in_posting_offset is valid, this is an insertion
+ * into existing posting tuple at offnum.
+ * redo must repeat logic of _bt_insertonpg().
* Backup Blk 1: child's left sibling, if INSERT_UPPER or INSERT_META
* Backup Blk 2: xl_btree_metadata, if INSERT_META
+ *
*/
typedef struct xl_btree_insert
{
OffsetNumber offnum;
+ OffsetNumber in_posting_offset;
} xl_btree_insert;
-#define SizeOfBtreeInsert (offsetof(xl_btree_insert, offnum) + sizeof(OffsetNumber))
+#define SizeOfBtreeInsert (offsetof(xl_btree_insert, in_posting_offset) + sizeof(OffsetNumber))
/*
* On insert with split, we save all the items going into the right sibling
@@ -96,6 +102,11 @@ typedef struct xl_btree_insert
* An IndexTuple representing the high key of the left page must follow with
* either variant.
*
+ * If the split included an insertion into the middle of a posting tuple, and
+ * thus required posting tuple replacement, it also contains 'in_posting_offset',
+ * which is used to form the replacement tuple and repeat _bt_insertonpg() logic.
+ * It is added to xlog only if replacing item remains on the left page.
+ *
* Backup Blk 1: new right page
*
* The right page's data portion contains the right page's tuples in the form
@@ -113,9 +124,26 @@ typedef struct xl_btree_split
uint32 level; /* tree level of page being split */
OffsetNumber firstright; /* first item moved to right page */
OffsetNumber newitemoff; /* new item's offset (if placed on left page) */
+ OffsetNumber in_posting_offset; /* offset inside posting tuple */
} xl_btree_split;
-#define SizeOfBtreeSplit (offsetof(xl_btree_split, newitemoff) + sizeof(OffsetNumber))
+#define SizeOfBtreeSplit (offsetof(xl_btree_split, in_posting_offset) + sizeof(OffsetNumber))
+
+/*
+ * When page is deduplicated, consecutive groups of tuples with equal keys
+ * are compactified into posting tuples.
+ * The WAL record keeps the number of resulting posting tuples (n_intervals),
+ * followed by an array of dedupInterval structures that hold the information
+ * needed to replay page deduplication without extra comparisons of tuple keys.
+ */
+typedef struct xl_btree_dedup
+{
+ int n_intervals;
+
+ /* TARGET DEDUP INTERVALS FOLLOW AT THE END */
+} xl_btree_dedup;
+#define SizeOfBtreeDedup (sizeof(int))
+
/*
* This is what we need to know about delete of individual leaf index tuples.
@@ -173,10 +201,19 @@ typedef struct xl_btree_vacuum
{
BlockNumber lastBlockVacuumed;
- /* TARGET OFFSET NUMBERS FOLLOW */
+ /*
+ * This field helps us to find beginning of the remaining tuples from
+ * postings which follow array of offset numbers.
+ */
+ uint32 nremaining;
+ uint32 ndeleted;
+
+ /* REMAINING OFFSET NUMBERS FOLLOW (nremaining values) */
+ /* REMAINING TUPLES TO INSERT FOLLOW (if nremaining > 0) */
+ /* TARGET OFFSET NUMBERS FOLLOW (if any) */
} xl_btree_vacuum;
-#define SizeOfBtreeVacuum (offsetof(xl_btree_vacuum, lastBlockVacuumed) + sizeof(BlockNumber))
+#define SizeOfBtreeVacuum (offsetof(xl_btree_vacuum, ndeleted) + sizeof(uint32))
/*
* This is what we need to know about marking an empty branch for deletion.
diff --git a/src/tools/valgrind.supp b/src/tools/valgrind.supp
index ec47a22..71a03e3 100644
--- a/src/tools/valgrind.supp
+++ b/src/tools/valgrind.supp
@@ -212,3 +212,24 @@
Memcheck:Cond
fun:PyObject_Realloc
}
+
+# Temporarily work around bug in datum_image_eq's handling of the cstring
+# (typLen == -2) case. datumIsEqual() is not affected, but also doesn't handle
+# TOAST'ed values correctly.
+#
+# FIXME: Remove both suppressions when bug is fixed on master branch
+{
+ temporary_workaround_1
+ Memcheck:Addr1
+ fun:bcmp
+ fun:datum_image_eq
+ fun:_bt_keep_natts_fast
+}
+
+{
+ temporary_workaround_8
+ Memcheck:Addr8
+ fun:bcmp
+ fun:datum_image_eq
+ fun:_bt_keep_natts_fast
+}
On Wed, Sep 11, 2019 at 5:38 AM Anastasia Lubennikova
<a.lubennikova@postgrespro.ru> wrote:
I reviewed them and everything looks good except the idea of not
splitting dead posting tuples.
According to the comments on scan->ignore_killed_tuples in genam.c:107,
it may lead to incorrect tuple order on a replica.
I'm not sure whether it leads to any real problem, though, or whether it
will be resolved by subsequent visibility checks.
Fair enough, but I didn't do that because it's compelling on its own
-- it isn't. I did it because it seemed like the best way to handle
posting list splits in a version of the patch where LP_DEAD bits can
be set on posting list tuples. I think that we have 3 high level
options here:
1. We don't support kill_prior_tuple/LP_DEAD bit setting with posting
lists at all. This is clearly the easiest approach.
2. We do what I did in v11 of the patch -- we make it so that
_bt_insertonpg() and _bt_split() never have to deal with LP_DEAD
posting lists that they must split in passing.
3. We add additional code to _bt_insertonpg() and _bt_split() to deal
with the rare case where they must split an LP_DEAD posting list,
probably by unsetting the bit or something like that. Obviously it
would be wrong to leave the LP_DEAD bit set for the newly inserted
heap tuple's TID that must go in a posting list that had its LP_DEAD
bit set -- that would make it dead to index scans even after its xact
successfully committed.
I think that you already agree that we want to have the
kill_prior_tuple optimizations with posting lists, so #1 isn't really
an option. That just leaves #2 and #3. Since posting list splits are
already assumed to be quite rare, it seemed far simpler to take the
conservative approach of forcing clean-up that removes LP_DEAD bits so
that _bt_insertonpg() and _bt_split() don't have to think about it.
Obviously I think it's important that we make as few changes as
possible to _bt_insertonpg() and _bt_split(), in general.
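To make the hazard behind option #3 concrete, here is a toy model in C. It is only a sketch under stated assumptions: ToyPostingTuple, toy_posting_add_tid() and toy_scan_sees_tid() are hypothetical names, heap TIDs are modeled as plain ints, and the posting list simply grows instead of going through the real posting list split. What it shows is that any code path that adds a TID to an LP_DEAD posting list has to clear the bit, because a scan that trusts the bit would otherwise never return the new TID.

```c
#include <assert.h>

#define TOY_MAX_TIDS 8

/* Hypothetical stand-in for a posting list tuple plus its line pointer bit. */
typedef struct ToyPostingTuple
{
    int lp_dead;            /* 1 if all TIDs were known dead when the bit was set */
    int ntids;              /* number of TIDs currently stored */
    int tids[TOY_MAX_TIDS]; /* sorted heap TIDs, modeled as plain ints */
} ToyPostingTuple;

/*
 * Add a TID in sorted position.  The new TID may belong to a live heap
 * tuple, so the LP_DEAD hint can no longer describe the whole tuple and
 * is cleared.
 */
void
toy_posting_add_tid(ToyPostingTuple *p, int tid)
{
    int i;

    assert(p->ntids < TOY_MAX_TIDS);
    for (i = p->ntids; i > 0 && p->tids[i - 1] > tid; i--)
        p->tids[i] = p->tids[i - 1];
    p->tids[i] = tid;
    p->ntids++;
    p->lp_dead = 0;
}

/* A scan that honors the LP_DEAD hint skips the whole tuple when it is set. */
int
toy_scan_sees_tid(const ToyPostingTuple *p, int tid)
{
    if (p->lp_dead)
        return 0;
    for (int i = 0; i < p->ntids; i++)
        if (p->tids[i] == tid)
            return 1;
    return 0;
}

/* Returns 1 when the newly added TID is visible to the toy scan. */
int
toy_lp_dead_example(void)
{
    ToyPostingTuple p = {1, 3, {10, 20, 30}};

    toy_posting_add_tid(&p, 25);
    return toy_scan_sees_tid(&p, 25) && p.tids[2] == 25 && p.ntids == 4;
}
```

If the `p->lp_dead = 0` line were dropped, toy_scan_sees_tid() would return 0 for TID 25, which is the "dead to index scans even after its xact successfully committed" outcome described in option #3.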
I don't understand what you mean about visibility checks. There is
nothing truly special about the way in which _bt_findinsertloc() will
sometimes have to kill LP_DEAD items so that _bt_insertonpg() and
_bt_split() don't have to think about LP_DEAD posting lists. As far as
recovery is concerned, it is just another XLOG_BTREE_DELETE record,
like any other. Note that there is a second call to
_bt_binsrch_insert() within _bt_findinsertloc() when it has to
generate a new XLOG_BTREE_DELETE record (by calling
_bt_dedup_one_page(), which calls _bt_delitems_delete() in a way that
isn't dependent on the BTP_HAS_GARBAGE status bit being set).
Anyway, it's worth adding more comments in
_bt_killitems() explaining why it's safe.
There is no question that the little snippet of code I added to
_bt_killitems() in v11 is still too complicated. We also have to
consider cases where the array overflows because the scan direction
was changed (see the kill_prior_tuple comment block in btgettuple()).
Yeah, it's messy.
Attached is v12, which contains WAL optimizations for posting split and
page deduplication.
Cool.
* The xl_btree_split record no longer contains the posting tuple; instead,
it keeps 'in_posting_offset' and repeats the logic of _bt_insertonpg(), as
you proposed upthread.
That looks good.
* I introduced a new xlog record, XLOG_BTREE_DEDUP_PAGE, which contains
info about groups of tuples deduplicated into posting tuples. In principle,
it is possible to fit it into some existing record, but I preferred to keep
things clear.
I definitely think that inventing a new WAL record was the right thing to do.
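The idea of replaying deduplication from logged intervals, without comparing keys, can be pictured with a small C model. This is a sketch only: ToyDedupInterval and its field names are assumptions made for illustration (the patch's dedupInterval layout is not shown in this message), and the model only tracks how many items the page holds after replay.

```c
#include <assert.h>

/* Hypothetical interval descriptor; the field names are assumptions. */
typedef struct ToyDedupInterval
{
    int baseoff; /* position of the first tuple in the group */
    int nitems;  /* consecutive tuples merged into one posting tuple */
} ToyDedupInterval;

/*
 * Replay a deduplication pass from logged intervals.  Each interval
 * collapses nitems tuples into a single posting tuple, so no key
 * comparisons are needed.  Intervals must be sorted and non-overlapping.
 * Returns the number of items on the page after replay.
 */
int
toy_replay_dedup(int nitems_before, const ToyDedupInterval *intervals,
                 int n_intervals)
{
    int nitems_after = nitems_before;
    int next_free = 0;

    for (int i = 0; i < n_intervals; i++)
    {
        assert(intervals[i].nitems >= 2);
        assert(intervals[i].baseoff >= next_free);
        assert(intervals[i].baseoff + intervals[i].nitems <= nitems_before);

        nitems_after -= intervals[i].nitems - 1;
        next_free = intervals[i].baseoff + intervals[i].nitems;
    }
    return nitems_after;
}

/* Ten tuples; positions 0..2 and 5..8 are merged into two posting tuples. */
int
toy_replay_example(void)
{
    const ToyDedupInterval intervals[] = {{0, 3}, {5, 4}};

    return toy_replay_dedup(10, intervals, 2);
}
```

In this model the record size depends on the number of intervals rather than on the number of tuples on the page.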
I haven't measured how these changes affect WAL size yet.
Do you have any suggestions on how to automate testing of new WAL records?
Is there any suitable place in regression tests?
I don't know about the regression tests (I doubt that there is a
natural place for such a test), but I came up with a rough test case.
I more or less copied the approach that you took with the index build
WAL reduction patches, though I also figured out a way of subtracting
heapam WAL overhead to get a real figure. I attach the test case --
note that you'll need to use the "land" database with this. (This test
case might need to be improved, but it's a good start.)
* I also noticed that _bt_dedup_one_page() can be optimized to return early
when no tuples were deduplicated. I wonder if we can introduce an internal
statistic to tune deduplication? That would mean returning to the idea of
BT_COMPRESS_THRESHOLD, which can help to avoid extra work for pages that
have very few duplicates or pages that are already full of posting lists.
I think that the BT_COMPRESS_THRESHOLD idea is closely related to
making _bt_dedup_one_page() behave incrementally.
On my machine, v12 of the patch actually uses slightly more WAL than
v11 did with the nbtree_wal_test.sql test case -- it's 6510 MB of
nbtree WAL in v12 vs. 6502 MB in v11 (note that v11 benefits from WAL
compression, so if I turned that off v12 would probably win by a small
amount). Both numbers are wildly excessive, though. The master branch
figure is only 2011 MB, which is only about 1.8x the size of the index
on the master branch. And this is for a test case that makes the index
6.5x smaller, so the gap between total index size and total WAL volume
is huge here -- the volume of WAL is nearly 40x greater than the index
size!
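The "nearly 40x" figure follows from the numbers quoted above. A minimal C sketch of the arithmetic, using only those rounded inputs (so the result is approximate); the function name is made up for illustration:

```c
#include <assert.h>

/*
 * Ratio of v12 nbtree WAL volume to v12 index size, derived from the
 * rounded figures quoted in the message.
 */
double
toy_v12_wal_to_index_ratio(void)
{
    const double master_wal_mb = 2011.0;     /* nbtree WAL on master */
    const double master_wal_per_index = 1.8; /* WAL is ~1.8x index size */
    const double index_shrink_factor = 6.5;  /* patch makes index 6.5x smaller */
    const double v12_wal_mb = 6510.0;        /* nbtree WAL with v12 */

    double master_index_mb = master_wal_mb / master_wal_per_index;
    double v12_index_mb = master_index_mb / index_shrink_factor;

    return v12_wal_mb / v12_index_mb;
}
```

With these inputs the master index comes out at roughly 1117 MB and the deduplicated index at roughly 172 MB, which puts the ratio at about 38.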
You are right to wonder what the result would be if we put
BT_COMPRESS_THRESHOLD back in. It would probably significantly reduce
the volume of WAL, because _bt_dedup_one_page() would no longer
"thrash". However, I strongly suspect that that wouldn't be good
enough at reducing the WAL volume down to something acceptable. That
will require an approach to WAL-logging that is much more logical than
physical. The nbtree_wal_test.sql test case involves a case where page
splits mostly don't WAL-log things that were previously WAL-logged by
simple inserts, because nbtsplitloc.c has us split in a right-heavy
fashion when there are lots of duplicates. In other words, the
_bt_split() optimization to WAL volume naturally works very well with
the test case, or really any case with lots of duplicates, so the
"write amplification" to the total volume of WAL is relatively small
on the master branch.
I think that the new WAL record has to be created once per posting
list that is generated, not once per page that is deduplicated --
that's the only way that I can see that avoids a huge increase in
total WAL volume. Even if we assume that I am wrong about there being
value in making deduplication incremental, it is still necessary to
make the WAL-logging behave incrementally. Otherwise you end up
needlessly rewriting things that didn't actually change way too often.
That's definitely not okay. Why worry about bringing 40x down to 20x,
or even 10x? It needs to be comparable to the master branch.
To be honest, I don't believe that incremental deduplication can really
improve anything, because no matter how many items were compressed we still
rewrite all items from the original page to the new one, so why not do our
best? What do we save by this incremental approach?
The point of being incremental is not to save work in cases where a
page split is inevitable anyway. Rather, the idea is that we can be
even more lazy, and avoid doing work that will never be needed --
maybe delaying page splits actually means preventing them entirely.
Or, we can spread out the work over time, so that the amount of WAL
per checkpoint is smoother than what we would get with a batch
approach. My mental model of page splits is that there are sometimes
many of them on the same page again and again in a very short time
period, but more often the chance of any individual page being split
is low. Even the rightmost page of a serial PK index isn't truly an
exception, because a new rightmost page isn't "the same page" as the
original rightmost page -- it is its new right sibling.
Since we're going to have to optimize the WAL logging anyway, it will
be relatively easy to experiment with incremental deduplication within
_bt_dedup_one_page(). The WAL logging is the hard part, so let's
focus on that rather than worrying too much about whether or not
incrementally doing all the work (not just the WAL logging) makes
sense. It's still too early to be sure about whether or not that's a
good idea.
--
Peter Geoghegan
On Wed, Sep 11, 2019 at 5:38 AM Anastasia Lubennikova
<a.lubennikova@postgrespro.ru> wrote:
Attached is v12, which contains WAL optimizations for posting split and
page deduplication.
Hmm. So v12 seems to have some problems with the WAL logging for
posting list splits. With wal_debug = on and
wal_consistency_checking='all', I can get a replica to fail
consistency checking very quickly when "make installcheck" is run on
the primary:
4448/2019-09-11 15:01:06 PDT LOG: REDO @ 0/30423A0; LSN 0/30425A0:
prev 0/3041C78; xid 506; len 3; blkref #0: rel 1663/16385/2608, blk 56
FPW - Heap/INSERT: off 20 flags 0x00
4448/2019-09-11 15:01:06 PDT LOG: REDO @ 0/30425A0; LSN 0/3042F78:
prev 0/30423A0; xid 506; len 4; blkref #0: rel 1663/16385/2673, blk 13
FPW - Btree/INSERT_LEAF: off 138; in_posting_offset 0
4448/2019-09-11 15:01:06 PDT LOG: REDO @ 0/3042F78; LSN 0/3043788:
prev 0/30425A0; xid 506; len 4; blkref #0: rel 1663/16385/2674, blk 37
FPW - Btree/INSERT_LEAF: off 68; in_posting_offset 0
4448/2019-09-11 15:01:06 PDT LOG: REDO @ 0/3043788; LSN 0/30437C0:
prev 0/3042F78; xid 506; len 28 - Transaction/ABORT: 2019-09-11
15:01:06.291717-07; rels: pg_tblspc/16388/PG_13_201909071/16385/16399
4448/2019-09-11 15:01:06 PDT LOG: REDO @ 0/30437C0; LSN 0/3043A30:
prev 0/3043788; xid 507; len 3; blkref #0: rel 1663/16385/1247, blk 9
FPW - Heap/INSERT: off 9 flags 0x00
4448/2019-09-11 15:01:06 PDT LOG: REDO @ 0/3043A30; LSN 0/3043D08:
prev 0/30437C0; xid 507; len 4; blkref #0: rel 1663/16385/2703, blk 2
FPW - Btree/INSERT_LEAF: off 51; in_posting_offset 0
4448/2019-09-11 15:01:06 PDT LOG: REDO @ 0/3043D08; LSN 0/3044948:
prev 0/3043A30; xid 507; len 4; blkref #0: rel 1663/16385/2704, blk 1
FPW - Btree/INSERT_LEAF: off 169; in_posting_offset 0
4448/2019-09-11 15:01:06 PDT LOG: REDO @ 0/3044948; LSN 0/3044B58:
prev 0/3043D08; xid 507; len 3; blkref #0: rel 1663/16385/2608, blk 56
FPW - Heap/INSERT: off 21 flags 0x00
4448/2019-09-11 15:01:06 PDT LOG: REDO @ 0/3044B58; LSN 0/30454A0:
prev 0/3044948; xid 507; len 4; blkref #0: rel 1663/16385/2673, blk 8
FPW - Btree/INSERT_LEAF: off 156; in_posting_offset 0
4448/2019-09-11 15:01:06 PDT LOG: REDO @ 0/30454A0; LSN 0/3045CC0:
prev 0/3044B58; xid 507; len 4; blkref #0: rel 1663/16385/2674, blk 37
FPW - Btree/INSERT_LEAF: off 71; in_posting_offset 0
4448/2019-09-11 15:01:06 PDT LOG: REDO @ 0/3045CC0; LSN 0/3045F48:
prev 0/30454A0; xid 507; len 3; blkref #0: rel 1663/16385/1247, blk 9
FPW - Heap/INSERT: off 10 flags 0x00
4448/2019-09-11 15:01:06 PDT LOG: REDO @ 0/3045F48; LSN 0/3046240:
prev 0/3045CC0; xid 507; len 4; blkref #0: rel 1663/16385/2703, blk 2
FPW - Btree/INSERT_LEAF: off 51; in_posting_offset 0
4448/2019-09-11 15:01:06 PDT LOG: REDO @ 0/3046240; LSN 0/3046E70:
prev 0/3045F48; xid 507; len 4; blkref #0: rel 1663/16385/2704, blk 1
FPW - Btree/INSERT_LEAF: off 44; in_posting_offset 0
4448/2019-09-11 15:01:06 PDT LOG: REDO @ 0/3046E70; LSN 0/3047090:
prev 0/3046240; xid 507; len 3; blkref #0: rel 1663/16385/2608, blk 56
FPW - Heap/INSERT: off 22 flags 0x00
4448/2019-09-11 15:01:06 PDT LOG: REDO @ 0/3047090; LSN 0/30479E0:
prev 0/3046E70; xid 507; len 4; blkref #0: rel 1663/16385/2673, blk 8
FPW - Btree/INSERT_LEAF: off 156; in_posting_offset 0
4448/2019-09-11 15:01:06 PDT LOG: REDO @ 0/30479E0; LSN 0/3048420:
prev 0/3047090; xid 507; len 4; blkref #0: rel 1663/16385/2674, blk 38
FPW - Btree/INSERT_LEAF: off 10; in_posting_offset 0
4448/2019-09-11 15:01:06 PDT LOG: REDO @ 0/3048420; LSN 0/30486B0:
prev 0/30479E0; xid 507; len 3; blkref #0: rel 1663/16385/1259, blk 0
FPW - Heap/INSERT: off 6 flags 0x00
4448/2019-09-11 15:01:06 PDT LOG: REDO @ 0/30486B0; LSN 0/3048C30:
prev 0/3048420; xid 507; len 4; blkref #0: rel 1663/16385/2662, blk 2
FPW - Btree/INSERT_LEAF: off 119; in_posting_offset 0
4448/2019-09-11 15:01:06 PDT LOG: REDO @ 0/3048C30; LSN 0/3049668:
prev 0/30486B0; xid 507; len 4; blkref #0: rel 1663/16385/2663, blk 1
FPW - Btree/INSERT_LEAF: off 42; in_posting_offset 0
4448/2019-09-11 15:01:06 PDT LOG: REDO @ 0/3049668; LSN 0/304A550:
prev 0/3048C30; xid 507; len 4; blkref #0: rel 1663/16385/3455, blk 1
FPW - Btree/INSERT_LEAF: off 2; in_posting_offset 1
4448/2019-09-11 15:01:06 PDT FATAL: inconsistent page found, rel
1663/16385/3455, forknum 0, blkno 1
4448/2019-09-11 15:01:06 PDT CONTEXT: WAL redo at 0/3049668 for
Btree/INSERT_LEAF: off 2; in_posting_offset 1
4447/2019-09-11 15:01:06 PDT LOG: startup process (PID 4448) exited
with exit code 1
4447/2019-09-11 15:01:06 PDT LOG: terminating any other active server processes
4447/2019-09-11 15:01:06 PDT LOG: database system is shut down
I regularly use this test case for the patch -- I think that I fixed a
similar problem in v11, when I changed the same WAL logging, but I
didn't mention it until now. I will debug this myself in a few days,
though you may prefer to do it before then.
--
Peter Geoghegan
On Wed, Sep 11, 2019 at 3:09 PM Peter Geoghegan <pg@bowt.ie> wrote:
Hmm. So v12 seems to have some problems with the WAL logging for
posting list splits. With wal_debug = on and
wal_consistency_checking='all', I can get a replica to fail
consistency checking very quickly when "make installcheck" is run on
the primary
I see the bug here. The problem is that we WAL-log a version of the
new item that already has its heap TID changed. On the primary, the
call to _bt_form_newposting() has a new item with the original heap
TID, which is then rewritten before being inserted -- that's correct.
But during recovery, we *start out with* a version of the new item
that *already* had its heap TID swapped. So we have nowhere to get the
original heap TID from during recovery.
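The bug can be reproduced in a toy model. The sketch below is a simplification, not the patch's code: toy_posting_split() is a hypothetical stand-in for what _bt_form_newposting() plus the new item rewrite do together, TIDs are plain ints, and the posting list has a fixed size. It assumes the split places the incoming TID inside the posting list and hands the displaced rightmost TID back as the new item's TID.

```c
#include <assert.h>
#include <string.h>

#define TOY_NTIDS 4

/*
 * Toy posting list split.  The incoming TID goes into the sorted posting
 * array at position off, later TIDs shift right, and the displaced
 * rightmost TID is handed back through *newtid (the "rewritten" new item).
 */
void
toy_posting_split(int posting[TOY_NTIDS], int off, int *newtid)
{
    int displaced = posting[TOY_NTIDS - 1];

    assert(off >= 0 && off < TOY_NTIDS);
    memmove(&posting[off + 1], &posting[off],
            (TOY_NTIDS - 1 - off) * sizeof(int));
    posting[off] = *newtid;
    *newtid = displaced;
}

/*
 * Run the split on a "primary" and replay it on a "standby".
 * log_original_tid selects what the WAL record carries: the original new
 * item TID (1) or the already-rewritten one (0).
 * Returns 1 if the standby ends up identical to the primary.
 */
int
toy_replay_matches(int log_original_tid)
{
    int primary[TOY_NTIDS] = {10, 20, 30, 40};
    int standby[TOY_NTIDS] = {10, 20, 30, 40};
    int orig_tid = 25;
    int primary_newitem = orig_tid;
    int logged_tid;

    /* Primary: posting becomes {10, 20, 25, 30}; new item TID becomes 40. */
    toy_posting_split(primary, 2, &primary_newitem);

    /* Standby: redo repeats the same logic with whatever was logged. */
    logged_tid = log_original_tid ? orig_tid : primary_newitem;
    toy_posting_split(standby, 2, &logged_tid);

    return memcmp(primary, standby, sizeof(primary)) == 0 &&
        logged_tid == primary_newitem;
}
```

toy_replay_matches(0) models the v12 behavior: the standby builds {10, 20, 40, 30} because TID 25 is no longer available anywhere, and the pages diverge. toy_replay_matches(1) models logging the original TID, as the hacky fix does.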
Attached patch fixes the problem in a hacky way -- it WAL-logs the
original heap TID, just in case. Obviously this fix isn't usable, but
it should make the problem clearer.
Can you come up with a proper fix, please? I can think of one way of
doing it, but I'll leave the details to you.
The same issue exists in _bt_split(), so the tests will still fail
with wal_consistency_checking -- it just takes a lot longer to reach a
point where an inconsistent page is found, because posting list splits
that occur at the same point that we need to split a page are much
rarer than posting list splits that occur when we simply need to
insert, without splitting the page. I suggest using
wal_consistency_checking to test the fix that you come up with. As I
mentioned, I regularly use it. Also note that there are further
subtleties to doing this within _bt_split() -- see the FIXME comments
there.
Thanks
--
Peter Geoghegan
Attachments:
0001-Save-original-new-heap-TID-in-insert-WAL-record.patch
From 8efe8f8f94d8f3195ba65b964799ca2c75f971fd Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Wed, 11 Sep 2019 17:46:11 -0700
Subject: [PATCH] Save original new heap TID in insert WAL record.
---
src/backend/access/nbtree/nbtinsert.c | 14 ++++++++++++++
src/backend/access/nbtree/nbtxlog.c | 3 +++
src/include/access/nbtxlog.h | 4 +++-
3 files changed, 20 insertions(+), 1 deletion(-)
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index 8fb17d6784..119e3fe5a6 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -1037,6 +1037,7 @@ _bt_insertonpg(Relation rel,
Size itemsz;
IndexTuple nposting = NULL;
IndexTuple oposting;
+ ItemPointerData orig;
page = BufferGetPage(buf);
lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -1061,6 +1062,7 @@ _bt_insertonpg(Relation rel,
itemsz = IndexTupleSize(itup);
itemsz = MAXALIGN(itemsz); /* be safe, PageAddItem will do this but we
* need to be consistent */
+ memset(&orig, 0, sizeof(ItemPointerData));
/*
* Do we need to split an existing posting list item?
@@ -1092,6 +1094,8 @@ _bt_insertonpg(Relation rel,
Assert(in_posting_offset > 0);
oposting = (IndexTuple) PageGetItem(page, itemid);
+ /* HACK Save orig heap TID for WAL logging */
+ ItemPointerCopy(&itup->t_tid, &orig);
nposting = _bt_form_newposting(itup, oposting, in_posting_offset);
/* Alter new item offset, since effective new item changed */
@@ -1264,6 +1268,7 @@ _bt_insertonpg(Relation rel,
xlrec.offnum = itup_off;
xlrec.in_posting_offset = in_posting_offset;
+ xlrec.orig = orig;
XLogBeginInsert();
XLogRegisterData((char *) &xlrec, SizeOfBtreeInsert);
@@ -1856,6 +1861,15 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
* with all the other items on the right page.
* Otherwise, save in_posting_offset and newitem to construct
* replacing tuple.
+ *
+ * FIXME: The same "original new item TID vs. rewritten new item TID"
+ * issue exists here, but I haven't done anything with that.
+ *
+ * FIXME: Be careful about splits where the new item is also the first
+ * item on the right half -- that would make the posting list that we
+ * have to update in-place the last item on the left. This is hard to
+ * test because nbtsplitloc.c will avoid choosing a split point
+ * between these two.
*/
xlrec.in_posting_offset = InvalidOffsetNumber;
if (replacepostingoff < firstright)
diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index de9bc3b101..5bb38beda1 100644
--- a/src/backend/access/nbtree/nbtxlog.c
+++ b/src/backend/access/nbtree/nbtxlog.c
@@ -189,6 +189,9 @@ btree_xlog_insert(bool isleaf, bool ismeta, XLogReaderState *record)
IndexTuple newitem = (IndexTuple) datapos;
IndexTuple nposting;
+ /* Restore newitem to actual original state in _bt_insertonpg() */
+ newitem = CopyIndexTuple(newitem);
+ ItemPointerCopy(&xlrec->orig, &newitem->t_tid);
nposting = _bt_form_newposting(newitem, oposting,
xlrec->in_posting_offset);
Assert(isleaf);
diff --git a/src/include/access/nbtxlog.h b/src/include/access/nbtxlog.h
index 075baaf6eb..2813e569dc 100644
--- a/src/include/access/nbtxlog.h
+++ b/src/include/access/nbtxlog.h
@@ -15,6 +15,7 @@
#include "access/xlogreader.h"
#include "lib/stringinfo.h"
+#include "storage/itemptr.h"
#include "storage/off.h"
/*
@@ -74,9 +75,10 @@ typedef struct xl_btree_insert
{
OffsetNumber offnum;
OffsetNumber in_posting_offset;
+ ItemPointerData orig;
} xl_btree_insert;
-#define SizeOfBtreeInsert (offsetof(xl_btree_insert, in_posting_offset) + sizeof(OffsetNumber))
+#define SizeOfBtreeInsert (offsetof(xl_btree_insert, orig) + sizeof(ItemPointerData))
/*
* On insert with split, we save all the items going into the right sibling
--
2.17.1
On Wed, Sep 11, 2019 at 2:04 PM Peter Geoghegan <pg@bowt.ie> wrote:
I think that the new WAL record has to be created once per posting
list that is generated, not once per page that is deduplicated --
that's the only way that I can see that avoids a huge increase in
total WAL volume. Even if we assume that I am wrong about there being
value in making deduplication incremental, it is still necessary to
make the WAL-logging behave incrementally.
Attached is v13 of the patch, which shows what I mean. You could say
that v13 makes _bt_dedup_one_page() do a few extra things that are
kind of similar to the things that nbtsplitloc.c does for _bt_split().
More specifically, the v13-0001-* patch includes code that makes
_bt_dedup_one_page() "goal orientated" -- it calculates how much space
will be freed when _bt_dedup_one_page() goes on to deduplicate those
items on the page that it has already "decided to deduplicate". The
v13-0002-* patch makes _bt_dedup_one_page() actually use this ability
-- it makes _bt_dedup_one_page() give up on deduplication when it is
clear that the items that are already "pending deduplication" will
free enough space for its caller to at least avoid a page split. This
revision of the patch doesn't truly make deduplication incremental. It
is only a proof of concept that shows how _bt_dedup_one_page() can
*decide* that it will free "enough" space, whatever that may mean, so
that it can finish early. The task of making _bt_dedup_one_page()
actually avoid lots of work when it finishes early remains.
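The "goal orientated" behavior can be sketched as a loop that keeps a running total of the space saved and stops merging once the incoming item would fit. This is a toy model under stated assumptions: TOY_SAVING_PER_MERGE is an invented constant, keys are plain ints, and the saving is credited per merged tuple, whereas the patch credits it when a posting tuple is actually added to the new page. Like v13-0002, the loop keeps walking the remaining items after it stops merging.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed, illustrative saving per merged duplicate; not a real nbtree size. */
#define TOY_SAVING_PER_MERGE 10

/*
 * Walk sorted keys, "merging" each key that equals its predecessor, and
 * stop merging once the accumulated saving covers newitemsz.  The loop
 * itself still visits every item.  Returns the number of merged items
 * and stores the bytes saved in *pagesaving.
 */
int
toy_dedup_until_fit(const int *keys, int nkeys, size_t newitemsz,
                    size_t *pagesaving)
{
    int merged = 0;
    int deduplicate = 1;

    assert(nkeys >= 0);
    *pagesaving = 0;
    for (int i = 1; i < nkeys; i++)
    {
        if (deduplicate && keys[i] == keys[i - 1])
        {
            merged++;
            *pagesaving += TOY_SAVING_PER_MERGE;
        }

        /* Enough space freed to avoid the page split: stop merging. */
        if (*pagesaving >= newitemsz)
            deduplicate = 0;
    }
    return merged;
}

/* Fixed example page: four copies of key 1 followed by three of key 2. */
int
toy_merged_for(size_t newitemsz)
{
    static const int keys[] = {1, 1, 1, 1, 2, 2, 2};
    size_t pagesaving;

    return toy_dedup_until_fit(keys, 7, newitemsz, &pagesaving);
}
```

With a 25-byte incoming item, only the run of key 1 is merged; with an item too large to ever fit, every duplicate on the page is merged.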
As I said yesterday, I'm not asking you to accept that v13-0002-* is
an improvement. At least not yet. In fact, "finishes early" due to the
v13-0002-* logic clearly makes everything a lot slower, since
_bt_dedup_one_page() will "thrash" even more than earlier versions of
the patch. This is especially problematic with WAL-logged relations --
the test case that I shared yesterday goes from about 6GB to 10GB with
v13-0002-* applied. But we need to fundamentally rethink the approach
to the rewriting + WAL-logging by _bt_dedup_one_page() anyway. (Note
that total index space utilization is barely affected by the
v13-0002-* patch, so clearly that much works well.)
Other changes:
* Small tweaks to amcheck (nothing interesting, really).
* Small tweaks to the _bt_killitems() stuff.
* Moved all of the deduplication helper functions to nbtinsert.c. This
is where deduplication gets complicated, so I think that it should all
live there. (i.e. nbtsort.c will call nbtinsert.c code, never the
other way around.)
Note that I haven't merged any of the changes from v12 of the patch
from yesterday. I didn't merge the posting list WAL logging changes
because of the bug I reported, but I would have were it not for that.
The WAL logging for _bt_dedup_one_page() added to v12 didn't appear to
be more efficient than your original approach (i.e. calling
log_newpage_buffer()), so I have stuck with your original approach.
It would be good to hear your thoughts on this _bt_dedup_one_page()
WAL volume/"write amplification" issue.
--
Peter Geoghegan
Attachments:
v13-0002-Stop-deduplicating-when-a-page-split-is-avoided.patch
From a7d4cafc92358e6095a48c0b42ccbe06b7b8bd5f Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Thu, 12 Sep 2019 16:19:54 -0700
Subject: [PATCH v13 2/3] Stop deduplicating when a page split is avoided.
Currently this is a big loss for performance, especially with WAL-logged
relations, though it barely affects total space utilization compared to
recent versions of the patch. With incremental rewriting of the page
and incremental WAL logging, this could actually be a win for
performance.
In any case it seems like a good thing for deduplication to be able to
operate in a "goal-orientated" way. The exact details will need to be
validated by extensive benchmarking.
---
src/backend/access/nbtree/nbtinsert.c | 15 +++++++++++++++
1 file changed, 15 insertions(+)
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index 52651fcbe4..f3b945edf9 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -2675,6 +2675,21 @@ _bt_dedup_one_page(Relation rel, Buffer buffer, Relation heapRel,
pagesaving += _bt_dedup_insert(newpage, dedupState);
}
+ /*
+ * When we have deduplicated enough to avoid page split, don't bother
+ * deduplicating any more items.
+ *
+ * FIXME: If rewriting the page and doing the WAL logging were
+ * incremental, we could actually break out of the loop and save real
+ * work. As things stand this is a loss for performance, but it
+ * barely affects space utilization. (The number of blocks are the
+ * same as before, except for rounding effects. The minimum number of
+ * items on each page for each index "increases" when this is enabled,
+ * however.)
+ */
+ if (pagesaving >= newitemsz)
+ deduplicate = false;
+
pfree(dedupState->itupprev);
dedupState->itupprev = CopyIndexTuple(itup);
}
--
2.17.1
v13-0003-DEBUG-Add-pageinspect-instrumentation.patch
From 711db4cd083528bb9c39cd66ed9faee0141e108a Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Mon, 10 Sep 2018 19:53:51 -0700
Subject: [PATCH v13 3/3] DEBUG: Add pageinspect instrumentation.
Have pageinspect display user-visible attribute values, heap TID, max
heap TID, and the number of TIDs in a tuple (can be > 1 in the case of
posting list tuples). Also adds a column that shows whether or not the
LP_DEAD bit has been set.
This patch is not proposed for inclusion in PostgreSQL; it's included
for the convenience of reviewers.
The following query can be used with this hacked pageinspect, which
visualizes the internal pages:
"""
with recursive index_details as (
select
'my_test_index'::text idx
),
size_in_pages_index as (
select
(pg_relation_size(idx::regclass) / (2^13))::int4 size_pages
from
index_details
),
page_stats as (
select
index_details.*,
stats.*
from
index_details,
size_in_pages_index,
lateral (select i from generate_series(1, size_pages - 1) i) series,
lateral (select * from bt_page_stats(idx, i)) stats),
internal_page_stats as (
select
*
from
page_stats
where
type != 'l'),
meta_stats as (
select
*
from
index_details s,
lateral (select * from bt_metap(s.idx)) meta),
internal_items as (
select
*
from
internal_page_stats
order by
btpo desc),
-- XXX: Note ordering dependency within this CTE, on internal_items
ordered_internal_items(item, blk, level) as (
select
1,
blkno,
btpo
from
internal_items
where
btpo_prev = 0
and btpo = (select level from meta_stats)
union
select
case when level = btpo then o.item + 1 else 1 end,
blkno,
btpo
from
internal_items i,
ordered_internal_items o
where
i.btpo_prev = o.blk or (btpo_prev = 0 and btpo = o.level - 1)
)
select
--idx,
btpo as level,
item as l_item,
blkno,
--btpo_prev,
--btpo_next,
btpo_flags,
type,
live_items,
dead_items,
avg_item_size,
page_size,
free_size,
-- Only non-rightmost pages have high key. Show heap TID for both pivot and non-pivot tuples here.
case when btpo_next != 0 then (select data || coalesce(', (htid)=(''' || htid || ''')', '')
from bt_page_items(idx, blkno) where itemoffset = 1) end as highkey
from
ordered_internal_items o
join internal_items i on o.blk = i.blkno
order by btpo desc, item;
"""
---
contrib/pageinspect/btreefuncs.c | 91 ++++++++++++++++---
contrib/pageinspect/expected/btree.out | 6 +-
contrib/pageinspect/pageinspect--1.6--1.7.sql | 25 +++++
3 files changed, 108 insertions(+), 14 deletions(-)
diff --git a/contrib/pageinspect/btreefuncs.c b/contrib/pageinspect/btreefuncs.c
index 8d27c9b0f6..b3ea978117 100644
--- a/contrib/pageinspect/btreefuncs.c
+++ b/contrib/pageinspect/btreefuncs.c
@@ -29,6 +29,7 @@
#include "pageinspect.h"
+#include "access/genam.h"
#include "access/nbtree.h"
#include "access/relation.h"
#include "catalog/namespace.h"
@@ -243,6 +244,7 @@ bt_page_stats(PG_FUNCTION_ARGS)
*/
struct user_args
{
+ Relation rel;
Page page;
OffsetNumber offset;
};
@@ -254,9 +256,9 @@ struct user_args
* ------------------------------------------------------
*/
static Datum
-bt_page_print_tuples(FuncCallContext *fctx, Page page, OffsetNumber offset)
+bt_page_print_tuples(FuncCallContext *fctx, Page page, OffsetNumber offset, Relation rel)
{
- char *values[6];
+ char *values[10];
HeapTuple tuple;
ItemId id;
IndexTuple itup;
@@ -265,6 +267,7 @@ bt_page_print_tuples(FuncCallContext *fctx, Page page, OffsetNumber offset)
int dlen;
char *dump;
char *ptr;
+ ItemPointer min_htid, max_htid;
id = PageGetItemId(page, offset);
@@ -283,16 +286,77 @@ bt_page_print_tuples(FuncCallContext *fctx, Page page, OffsetNumber offset)
values[j++] = psprintf("%c", IndexTupleHasVarwidths(itup) ? 't' : 'f');
ptr = (char *) itup + IndexInfoFindDataOffset(itup->t_info);
- dlen = IndexTupleSize(itup) - IndexInfoFindDataOffset(itup->t_info);
- dump = palloc0(dlen * 3 + 1);
- values[j] = dump;
- for (off = 0; off < dlen; off++)
+ if (rel)
{
- if (off > 0)
- *dump++ = ' ';
- sprintf(dump, "%02x", *(ptr + off) & 0xff);
- dump += 2;
+ TupleDesc itupdesc = RelationGetDescr(rel);
+ Datum datvalues[INDEX_MAX_KEYS];
+ bool isnull[INDEX_MAX_KEYS];
+ int natts;
+ int indnkeyatts = rel->rd_index->indnkeyatts;
+
+ natts = BTreeTupleGetNAtts(itup, rel);
+
+ itupdesc->natts = Min(indnkeyatts, natts);
+ memset(&isnull, 0xFF, sizeof(isnull));
+ index_deform_tuple(itup, itupdesc, datvalues, isnull);
+ rel->rd_index->indnkeyatts = natts;
+ values[j++] = BuildIndexValueDescription(rel, datvalues, isnull);
+ itupdesc->natts = IndexRelationGetNumberOfAttributes(rel);
+ rel->rd_index->indnkeyatts = indnkeyatts;
}
+ else
+ {
+ dlen = IndexTupleSize(itup) - IndexInfoFindDataOffset(itup->t_info);
+ dump = palloc0(dlen * 3 + 1);
+ values[j++] = dump;
+ for (off = 0; off < dlen; off++)
+ {
+ if (off > 0)
+ *dump++ = ' ';
+ sprintf(dump, "%02x", *(ptr + off) & 0xff);
+ dump += 2;
+ }
+ }
+
+ if (rel && !_bt_heapkeyspace(rel))
+ {
+ min_htid = NULL;
+ max_htid = NULL;
+ }
+ else
+ {
+ min_htid = BTreeTupleGetHeapTID(itup);
+ if (BTreeTupleIsPosting(itup))
+ max_htid = BTreeTupleGetMaxTID(itup);
+ else
+ max_htid = NULL;
+ }
+
+ if (min_htid)
+ values[j++] = psprintf("(%u,%u)",
+ ItemPointerGetBlockNumberNoCheck(min_htid),
+ ItemPointerGetOffsetNumberNoCheck(min_htid));
+ else
+ values[j++] = NULL;
+
+ if (max_htid)
+ values[j++] = psprintf("(%u,%u)",
+ ItemPointerGetBlockNumberNoCheck(max_htid),
+ ItemPointerGetOffsetNumberNoCheck(max_htid));
+ else
+ values[j++] = NULL;
+
+ if (min_htid == NULL)
+ values[j++] = psprintf("0");
+ else if (!BTreeTupleIsPosting(itup))
+ values[j++] = psprintf("1");
+ else
+ values[j++] = psprintf("%d", (int) BTreeTupleGetNPosting(itup));
+
+ if (!ItemIdIsDead(id))
+ values[j++] = psprintf("f");
+ else
+ values[j++] = psprintf("t");
tuple = BuildTupleFromCStrings(fctx->attinmeta, values);
@@ -366,11 +430,11 @@ bt_page_items(PG_FUNCTION_ARGS)
uargs = palloc(sizeof(struct user_args));
+ uargs->rel = rel;
uargs->page = palloc(BLCKSZ);
memcpy(uargs->page, BufferGetPage(buffer), BLCKSZ);
UnlockReleaseBuffer(buffer);
- relation_close(rel, AccessShareLock);
uargs->offset = FirstOffsetNumber;
@@ -397,12 +461,13 @@ bt_page_items(PG_FUNCTION_ARGS)
if (fctx->call_cntr < fctx->max_calls)
{
- result = bt_page_print_tuples(fctx, uargs->page, uargs->offset);
+ result = bt_page_print_tuples(fctx, uargs->page, uargs->offset, uargs->rel);
uargs->offset++;
SRF_RETURN_NEXT(fctx, result);
}
else
{
+ relation_close(uargs->rel, AccessShareLock);
pfree(uargs->page);
pfree(uargs);
SRF_RETURN_DONE(fctx);
@@ -482,7 +547,7 @@ bt_page_items_bytea(PG_FUNCTION_ARGS)
if (fctx->call_cntr < fctx->max_calls)
{
- result = bt_page_print_tuples(fctx, uargs->page, uargs->offset);
+ result = bt_page_print_tuples(fctx, uargs->page, uargs->offset, NULL);
uargs->offset++;
SRF_RETURN_NEXT(fctx, result);
}
diff --git a/contrib/pageinspect/expected/btree.out b/contrib/pageinspect/expected/btree.out
index 07c2dcd771..0f6dccaadc 100644
--- a/contrib/pageinspect/expected/btree.out
+++ b/contrib/pageinspect/expected/btree.out
@@ -40,7 +40,11 @@ ctid | (0,1)
itemlen | 16
nulls | f
vars | f
-data | 01 00 00 00 00 00 00 01
+data | (a)=(72057594037927937)
+htid | (0,1)
+max_htid |
+nheap_tids | 1
+isdead | f
SELECT * FROM bt_page_items('test1_a_idx', 2);
ERROR: block number out of range
diff --git a/contrib/pageinspect/pageinspect--1.6--1.7.sql b/contrib/pageinspect/pageinspect--1.6--1.7.sql
index 2433a21af2..00473da938 100644
--- a/contrib/pageinspect/pageinspect--1.6--1.7.sql
+++ b/contrib/pageinspect/pageinspect--1.6--1.7.sql
@@ -24,3 +24,28 @@ CREATE FUNCTION bt_metap(IN relname text,
OUT last_cleanup_num_tuples real)
AS 'MODULE_PATHNAME', 'bt_metap'
LANGUAGE C STRICT PARALLEL SAFE;
+
+--
+-- bt_page_items()
+--
+DROP FUNCTION bt_page_items(IN relname text, IN blkno int4,
+ OUT itemoffset smallint,
+ OUT ctid tid,
+ OUT itemlen smallint,
+ OUT nulls bool,
+ OUT vars bool,
+ OUT data text);
+CREATE FUNCTION bt_page_items(IN relname text, IN blkno int4,
+ OUT itemoffset smallint,
+ OUT ctid tid,
+ OUT itemlen smallint,
+ OUT nulls bool,
+ OUT vars bool,
+ OUT data text,
+ OUT htid tid,
+ OUT max_htid tid,
+ OUT nheap_tids int4,
+ OUT isdead boolean)
+RETURNS SETOF record
+AS 'MODULE_PATHNAME', 'bt_page_items'
+LANGUAGE C STRICT PARALLEL SAFE;
--
2.17.1
Attachment: v13-0001-Add-deduplication-to-nbtree.patch (application/octet-stream)
From 7b25e930eb60750e1e8c9f31182fb6ac8e6dfac0 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Thu, 29 Aug 2019 14:35:35 -0700
Subject: [PATCH v13 1/3] Add deduplication to nbtree.
---
contrib/amcheck/verify_nbtree.c | 164 +++++--
src/backend/access/nbtree/README | 74 +++-
src/backend/access/nbtree/nbtinsert.c | 555 +++++++++++++++++++++++-
src/backend/access/nbtree/nbtpage.c | 148 ++++++-
src/backend/access/nbtree/nbtree.c | 147 +++++--
src/backend/access/nbtree/nbtsearch.c | 243 ++++++++++-
src/backend/access/nbtree/nbtsort.c | 148 ++++++-
src/backend/access/nbtree/nbtsplitloc.c | 47 +-
src/backend/access/nbtree/nbtutils.c | 253 +++++++++--
src/backend/access/nbtree/nbtxlog.c | 88 +++-
src/backend/access/rmgrdesc/nbtdesc.c | 16 +-
src/include/access/nbtree.h | 242 +++++++++--
src/include/access/nbtxlog.h | 36 +-
src/tools/valgrind.supp | 21 +
14 files changed, 1998 insertions(+), 184 deletions(-)
diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index 05e7d678ed..83519cb7cf 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -145,6 +145,7 @@ static void bt_tuple_present_callback(Relation index, HeapTuple htup,
bool tupleIsAlive, void *checkstate);
static IndexTuple bt_normalize_tuple(BtreeCheckState *state,
IndexTuple itup);
+static inline IndexTuple bt_posting_logical_tuple(IndexTuple itup, int n);
static bool bt_rootdescend(BtreeCheckState *state, IndexTuple itup);
static inline bool offset_is_negative_infinity(BTPageOpaque opaque,
OffsetNumber offset);
@@ -419,12 +420,13 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
/*
* Size Bloom filter based on estimated number of tuples in index,
* while conservatively assuming that each block must contain at least
- * MaxIndexTuplesPerPage / 5 non-pivot tuples. (Non-leaf pages cannot
- * contain non-pivot tuples. That's okay because they generally make
- * up no more than about 1% of all pages in the index.)
+ * MaxPostingIndexTuplesPerPage / 3 "logical" tuples. heapallindexed
+ * verification fingerprints posting list heap TIDs as plain non-pivot
+ * tuples, complete with index keys. This allows its heap scan to
+ * behave as if posting lists do not exist.
*/
total_pages = RelationGetNumberOfBlocks(rel);
- total_elems = Max(total_pages * (MaxIndexTuplesPerPage / 5),
+ total_elems = Max(total_pages * (MaxPostingIndexTuplesPerPage / 3),
(int64) state->rel->rd_rel->reltuples);
/* Random seed relies on backend srandom() call to avoid repetition */
seed = random();
@@ -924,6 +926,7 @@ bt_target_page_check(BtreeCheckState *state)
size_t tupsize;
BTScanInsert skey;
bool lowersizelimit;
+ ItemPointer scantid;
CHECK_FOR_INTERRUPTS();
@@ -994,29 +997,73 @@ bt_target_page_check(BtreeCheckState *state)
/*
* Readonly callers may optionally verify that non-pivot tuples can
- * each be found by an independent search that starts from the root
+ * each be found by an independent search that starts from the root.
+ * Note that we deliberately don't do individual searches for each
+ * "logical" posting list tuple, since the posting list itself is
+ * validated by other checks.
*/
if (state->rootdescend && P_ISLEAF(topaque) &&
!bt_rootdescend(state, itup))
{
char *itid,
*htid;
+ ItemPointer tid = BTreeTupleGetHeapTID(itup);
itid = psprintf("(%u,%u)", state->targetblock, offset);
htid = psprintf("(%u,%u)",
- ItemPointerGetBlockNumber(&(itup->t_tid)),
- ItemPointerGetOffsetNumber(&(itup->t_tid)));
+ ItemPointerGetBlockNumber(tid),
+ ItemPointerGetOffsetNumber(tid));
ereport(ERROR,
(errcode(ERRCODE_INDEX_CORRUPTED),
errmsg("could not find tuple using search from root page in index \"%s\"",
RelationGetRelationName(state->rel)),
- errdetail_internal("Index tid=%s points to heap tid=%s page lsn=%X/%X.",
+ errdetail_internal("Index tid=%s min heap tid=%s page lsn=%X/%X.",
itid, htid,
(uint32) (state->targetlsn >> 32),
(uint32) state->targetlsn)));
}
+ /*
+ * If tuple is actually a posting list, make sure posting list TIDs
+ * are in order.
+ */
+ if (BTreeTupleIsPosting(itup))
+ {
+ ItemPointerData last;
+ ItemPointer current;
+
+ ItemPointerCopy(BTreeTupleGetHeapTID(itup), &last);
+
+ for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
+ {
+
+ current = BTreeTupleGetPostingN(itup, i);
+
+ if (ItemPointerCompare(current, &last) <= 0)
+ {
+ char *itid,
+ *htid;
+
+ itid = psprintf("(%u,%u)", state->targetblock, offset);
+ htid = psprintf("(%u,%u)",
+ ItemPointerGetBlockNumberNoCheck(current),
+ ItemPointerGetOffsetNumberNoCheck(current));
+
+ ereport(ERROR,
+ (errcode(ERRCODE_INDEX_CORRUPTED),
+ errmsg("posting list heap TIDs out of order in index \"%s\"",
+ RelationGetRelationName(state->rel)),
+ errdetail_internal("Index tid=%s min heap tid=%s page lsn=%X/%X.",
+ itid, htid,
+ (uint32) (state->targetlsn >> 32),
+ (uint32) state->targetlsn)));
+ }
+
+ ItemPointerCopy(current, &last);
+ }
+ }
+
/* Build insertion scankey for current page offset */
skey = bt_mkscankey_pivotsearch(state->rel, itup);
@@ -1074,12 +1121,32 @@ bt_target_page_check(BtreeCheckState *state)
{
IndexTuple norm;
- norm = bt_normalize_tuple(state, itup);
- bloom_add_element(state->filter, (unsigned char *) norm,
- IndexTupleSize(norm));
- /* Be tidy */
- if (norm != itup)
- pfree(norm);
+ if (BTreeTupleIsPosting(itup))
+ {
+ /* Fingerprint all elements as distinct "logical" tuples */
+ for (int i = 0; i < BTreeTupleGetNPosting(itup); i++)
+ {
+ IndexTuple logtuple;
+
+ logtuple = bt_posting_logical_tuple(itup, i);
+ norm = bt_normalize_tuple(state, logtuple);
+ bloom_add_element(state->filter, (unsigned char *) norm,
+ IndexTupleSize(norm));
+ /* Be tidy */
+ if (norm != logtuple)
+ pfree(norm);
+ pfree(logtuple);
+ }
+ }
+ else
+ {
+ norm = bt_normalize_tuple(state, itup);
+ bloom_add_element(state->filter, (unsigned char *) norm,
+ IndexTupleSize(norm));
+ /* Be tidy */
+ if (norm != itup)
+ pfree(norm);
+ }
}
/*
@@ -1087,7 +1154,8 @@ bt_target_page_check(BtreeCheckState *state)
*
* If there is a high key (if this is not the rightmost page on its
* entire level), check that high key actually is upper bound on all
- * page items.
+ * page items. If this is a posting list tuple, we'll need to set
+ * scantid to be highest TID in posting list.
*
* We prefer to check all items against high key rather than checking
* just the last and trusting that the operator class obeys the
@@ -1127,6 +1195,9 @@ bt_target_page_check(BtreeCheckState *state)
* tuple. (See also: "Notes About Data Representation" in the nbtree
* README.)
*/
+ scantid = skey->scantid;
+ if (state->heapkeyspace && !BTreeTupleIsPivot(itup))
+ skey->scantid = BTreeTupleGetMaxTID(itup);
if (!P_RIGHTMOST(topaque) &&
!(P_ISLEAF(topaque) ? invariant_leq_offset(state, skey, P_HIKEY) :
invariant_l_offset(state, skey, P_HIKEY)))
@@ -1150,6 +1221,7 @@ bt_target_page_check(BtreeCheckState *state)
(uint32) (state->targetlsn >> 32),
(uint32) state->targetlsn)));
}
+ skey->scantid = scantid;
/*
* * Item order check *
@@ -1164,11 +1236,13 @@ bt_target_page_check(BtreeCheckState *state)
*htid,
*nitid,
*nhtid;
+ ItemPointer tid;
itid = psprintf("(%u,%u)", state->targetblock, offset);
+ tid = BTreeTupleGetHeapTID(itup);
htid = psprintf("(%u,%u)",
- ItemPointerGetBlockNumberNoCheck(&(itup->t_tid)),
- ItemPointerGetOffsetNumberNoCheck(&(itup->t_tid)));
+ ItemPointerGetBlockNumberNoCheck(tid),
+ ItemPointerGetOffsetNumberNoCheck(tid));
nitid = psprintf("(%u,%u)", state->targetblock,
OffsetNumberNext(offset));
@@ -1177,9 +1251,11 @@ bt_target_page_check(BtreeCheckState *state)
state->target,
OffsetNumberNext(offset));
itup = (IndexTuple) PageGetItem(state->target, itemid);
+
+ tid = BTreeTupleGetHeapTID(itup);
nhtid = psprintf("(%u,%u)",
- ItemPointerGetBlockNumberNoCheck(&(itup->t_tid)),
- ItemPointerGetOffsetNumberNoCheck(&(itup->t_tid)));
+ ItemPointerGetBlockNumberNoCheck(tid),
+ ItemPointerGetOffsetNumberNoCheck(tid));
ereport(ERROR,
(errcode(ERRCODE_INDEX_CORRUPTED),
@@ -1189,10 +1265,10 @@ bt_target_page_check(BtreeCheckState *state)
"higher index tid=%s (points to %s tid=%s) "
"page lsn=%X/%X.",
itid,
- P_ISLEAF(topaque) ? "heap" : "index",
+ P_ISLEAF(topaque) ? "min heap" : "index",
htid,
nitid,
- P_ISLEAF(topaque) ? "heap" : "index",
+ P_ISLEAF(topaque) ? "min heap" : "index",
nhtid,
(uint32) (state->targetlsn >> 32),
(uint32) state->targetlsn)));
@@ -1953,10 +2029,10 @@ bt_tuple_present_callback(Relation index, HeapTuple htup, Datum *values,
* verification. In particular, it won't try to normalize opclass-equal
* datums with potentially distinct representations (e.g., btree/numeric_ops
* index datums will not get their display scale normalized-away here).
- * Normalization may need to be expanded to handle more cases in the future,
- * though. For example, it's possible that non-pivot tuples could in the
- * future have alternative logically equivalent representations due to using
- * the INDEX_ALT_TID_MASK bit to implement intelligent deduplication.
+ * Caller does normalization for non-pivot tuples that have a posting list,
+ * since dummy CREATE INDEX callback code generates new tuples with the same
+ * normalized representation. Deduplication is performed opportunistically,
+ * and in general there is no guarantee about how or when it will be applied.
*/
static IndexTuple
bt_normalize_tuple(BtreeCheckState *state, IndexTuple itup)
@@ -1969,6 +2045,9 @@ bt_normalize_tuple(BtreeCheckState *state, IndexTuple itup)
IndexTuple reformed;
int i;
+ /* Caller should only pass "logical" non-pivot tuples here */
+ Assert(!BTreeTupleIsPosting(itup) && !BTreeTupleIsPivot(itup));
+
/* Easy case: It's immediately clear that tuple has no varlena datums */
if (!IndexTupleHasVarwidths(itup))
return itup;
@@ -2031,6 +2110,30 @@ bt_normalize_tuple(BtreeCheckState *state, IndexTuple itup)
return reformed;
}
+/*
+ * Produce palloc()'d "logical" tuple for nth posting list entry.
+ *
+ * In general, deduplication is not supposed to change the logical contents of
+ * an index. Multiple logical index tuples are folded together into one
+ * physical posting list index tuple when convenient.
+ *
+ * heapallindexed verification must normalize-away this variation in
+ * representation by converting posting list tuples into two or more "logical"
+ * tuples. Each logical tuple must be fingerprinted separately -- there must
+ * be one logical tuple for each corresponding Bloom filter probe during the
+ * heap scan.
+ *
+ * Note: Caller needs to call bt_normalize_tuple() with returned tuple.
+ */
+static inline IndexTuple
+bt_posting_logical_tuple(IndexTuple itup, int n)
+{
+ Assert(BTreeTupleIsPosting(itup));
+
+ /* Returns non-posting-list tuple */
+ return BTreeFormPostingTuple(itup, BTreeTupleGetPostingN(itup, n), 1);
+}
+
/*
* Search for itup in index, starting from fast root page. itup must be a
* non-pivot tuple. This is only supported with heapkeyspace indexes, since
@@ -2087,6 +2190,7 @@ bt_rootdescend(BtreeCheckState *state, IndexTuple itup)
insertstate.itup = itup;
insertstate.itemsz = MAXALIGN(IndexTupleSize(itup));
insertstate.itup_key = key;
+ insertstate.in_posting_offset = 0;
insertstate.bounds_valid = false;
insertstate.buf = lbuf;
@@ -2094,7 +2198,9 @@ bt_rootdescend(BtreeCheckState *state, IndexTuple itup)
offnum = _bt_binsrch_insert(state->rel, &insertstate);
/* Compare first >= matching item on leaf page, if any */
page = BufferGetPage(lbuf);
+ /* Should match on first heap TID when tuple has a posting list */
if (offnum <= PageGetMaxOffsetNumber(page) &&
+ insertstate.in_posting_offset <= 0 &&
_bt_compare(state->rel, key, page, offnum) == 0)
exists = true;
_bt_relbuf(state->rel, lbuf);
@@ -2560,14 +2666,18 @@ static inline ItemPointer
BTreeTupleGetHeapTIDCareful(BtreeCheckState *state, IndexTuple itup,
bool nonpivot)
{
- ItemPointer result = BTreeTupleGetHeapTID(itup);
+ ItemPointer result;
BlockNumber targetblock = state->targetblock;
- if (result == NULL && nonpivot)
+ /* Shouldn't be called with !heapkeyspace index */
+ Assert(state->heapkeyspace);
+ if (BTreeTupleIsPivot(itup) == nonpivot)
ereport(ERROR,
(errcode(ERRCODE_INDEX_CORRUPTED),
errmsg("block %u or its right sibling block or child block in index \"%s\" contains non-pivot tuple that lacks a heap TID",
targetblock, RelationGetRelationName(state->rel))));
+ result = BTreeTupleGetHeapTID(itup);
+
return result;
}
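For readers following the amcheck changes above, here is a standalone sketch of the posting list ordering check (not part of the patch). It assumes a simplified TID type; DemoTid, demo_tid_cmp() and demo_posting_first_misordered() are hypothetical names standing in for ItemPointerData, ItemPointerCompare() and the loop added to bt_target_page_check(). As in the patch, a TID that is equal to its predecessor counts as out of order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for ItemPointerData: a (block, offset) heap TID */
typedef struct DemoTid
{
    uint32_t    block;
    uint16_t    offset;
} DemoTid;

/* Order TIDs by block, then offset; returns <0, 0 or >0 */
int
demo_tid_cmp(const DemoTid *a, const DemoTid *b)
{
    if (a->block != b->block)
        return (a->block < b->block) ? -1 : 1;
    if (a->offset != b->offset)
        return (a->offset < b->offset) ? -1 : 1;
    return 0;
}

/*
 * Return the position of the first TID that is not strictly greater than
 * its predecessor, or -1 when the whole posting list is in order.
 */
int
demo_posting_first_misordered(const DemoTid *posting, int n)
{
    assert(posting != NULL && n > 0);

    for (int i = 1; i < n; i++)
    {
        if (demo_tid_cmp(&posting[i], &posting[i - 1]) <= 0)
            return i;
    }

    return -1;
}
```

The real check reports the offending tuple through ereport(ERROR, ...) with ERRCODE_INDEX_CORRUPTED rather than returning a position; the return value here only keeps the sketch free of backend dependencies.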
diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index 6db203e75c..54cb9db49d 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -432,7 +432,10 @@ because we allow LP_DEAD to be set with only a share lock (it's exactly
like a hint bit for a heap tuple), but physically removing tuples requires
exclusive lock. In the current code we try to remove LP_DEAD tuples when
we are otherwise faced with having to split a page to do an insertion (and
-hence have exclusive lock on it already).
+hence have exclusive lock on it already). Deduplication can also prevent
+a page split, but removing LP_DEAD tuples is the preferred approach.
+(Note that posting list tuples can only have their LP_DEAD bit set when
+every "logical" tuple represented within the posting list is known dead.)
This leaves the index in a state where it has no entry for a dead tuple
that still exists in the heap. This is not a problem for the current
@@ -710,6 +713,75 @@ the fallback strategy assumes that duplicates are mostly inserted in
ascending heap TID order. The page is split in a way that leaves the left
half of the page mostly full, and the right half of the page mostly empty.
+Notes about deduplication
+-------------------------
+
+We deduplicate non-pivot tuples in non-unique indexes to reduce storage
+overhead, and to avoid or at least delay page splits. Deduplication alters
+the physical representation of tuples without changing the logical contents
+of the index, and without adding overhead to read queries. Non-pivot
+tuples are folded together into a single physical tuple with a posting list
+(a simple array of heap TIDs with the standard item pointer format).
+Deduplication is always applied lazily, at the point where it would
+otherwise be necessary to perform a page split. It occurs only after
+any LP_DEAD items have been removed, as our last line of defense against
+splitting a leaf page. We can set the LP_DEAD bit with posting list
+tuples, though only when all table tuples are known dead. (Bitmap scans
+cannot perform LP_DEAD bit setting, and are the common case with indexes
+that contain lots of duplicates, so this downside is considered
+acceptable.)
+
+Large groups of logical duplicates tend to appear together on the same leaf
+page due to the special duplicate logic used when choosing a split point.
+This facilitates lazy/dynamic deduplication. Deduplication can reliably
+deduplicate a large localized group of duplicates before it can span
+multiple leaf pages. Posting list tuples are subject to the same 1/3 of a
+page restriction as any other tuple.
+
+Lazy deduplication allows the page space accounting used during page splits
+to have absolutely minimal special case logic for posting lists. A posting
+list can be thought of as extra payload that suffix truncation will
+reliably truncate away as needed during page splits, just like non-key
+columns from an INCLUDE index tuple. An incoming tuple (which might cause
+a page split) can always be thought of as a non-posting-list tuple that
+must be inserted alongside existing items, without needing to consider
+deduplication. Most of the time, that's what actually happens: incoming
+tuples are either not duplicates, or are duplicates with a heap TID that
+doesn't overlap with any existing posting list tuple. When the incoming
+tuple really does overlap with an existing posting list, a posting list
+split is performed. Posting list splits work in a way that more or less
+preserves the illusion that all incoming tuples do not need to be merged
+with any existing posting list tuple.
+
+Posting list splits work by "overriding" the details of the incoming tuple.
+The heap TID of the incoming tuple is altered to make it match the
+rightmost heap TID from the existing/originally overlapping posting list.
+The offset number that the new/incoming tuple is to be inserted at is
+incremented so that it will be inserted to the right of the existing
+posting list. The insertion (or page split) operation that completes the
+insert does one extra step: an in-place update of the posting list. The
+update changes the posting list such that the "true" heap TID from the
+original incoming tuple is now contained in the posting list. We make
+space in the posting list by removing the heap TID that became the new
+item. The size of the posting list won't change, and so the page split
+space accounting does not need to care about posting lists. Also, overall
+space utilization is improved by keeping existing posting lists large.
+
+The representation of posting lists is identical to the posting lists used
+by GIN, so it would be straightforward to apply GIN's varbyte encoding
+compression scheme to individual posting lists. Posting list compression
+would break the assumptions made by posting list splits about page space
+accounting, though, so it's not clear how compression could be integrated
+with nbtree. Besides, posting list compression does not offer a compelling
+trade-off for nbtree, since in general nbtree is optimized for consistent
+performance with many concurrent readers and writers. A major goal of
+nbtree's lazy approach to deduplication is to limit the performance impact
+of deduplication with random updates. Even concurrent append-only inserts
+of the same key value will tend to have inserts of individual index tuples
+in an order that doesn't quite match heap TID order. In general, delaying
+deduplication avoids many unnecessary posting list splits, and minimizes
+page level fragmentation.
+
Notes About Data Representation
-------------------------------
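To make the posting list split described in the README text above more concrete, here is a standalone sketch (not part of the patch). It assumes a simplified TID type and a plain array in place of a posting list tuple; DemoTid, demo_tid_cmp() and demo_posting_split() are hypothetical names. The logic mirrors the memmove-and-swap step in _bt_insertonpg(): the list keeps its size, the incoming TID takes its sorted position, and the old rightmost TID becomes the item that is inserted to the right of the posting list.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for ItemPointerData: a (block, offset) heap TID */
typedef struct DemoTid
{
    uint32_t    block;
    uint16_t    offset;
} DemoTid;

/* Order TIDs by block, then offset; returns <0, 0 or >0 */
int
demo_tid_cmp(const DemoTid *a, const DemoTid *b)
{
    if (a->block != b->block)
        return (a->block < b->block) ? -1 : 1;
    if (a->offset != b->offset)
        return (a->offset < b->offset) ? -1 : 1;
    return 0;
}

/*
 * Split a sorted posting list of n TIDs at position off (0 < off < n).
 *
 * On entry *newtid sorts between posting[off - 1] and posting[off].
 * On return the posting list contains *newtid's original value at
 * position off, and *newtid holds the list's former rightmost TID.
 */
void
demo_posting_split(DemoTid *posting, int n, int off, DemoTid *newtid)
{
    DemoTid     rightmost = posting[n - 1];

    assert(off > 0 && off < n);
    assert(demo_tid_cmp(&posting[off - 1], newtid) < 0);
    assert(demo_tid_cmp(newtid, &posting[off]) < 0);

    /* Shift TIDs one place right, dropping the original rightmost TID */
    memmove(&posting[off + 1], &posting[off],
            (size_t) (n - off - 1) * sizeof(DemoTid));

    /* The incoming TID fills the gap; the old rightmost TID is the new item */
    posting[off] = *newtid;
    *newtid = rightmost;
}
```

For example, splitting {(1,1), (1,3), (1,5), (1,7)} at position 2 with an incoming TID of (1,4) leaves the list as {(1,1), (1,3), (1,4), (1,5)} and turns the incoming item into (1,7). Because the list length never changes, the caller's space accounting can treat the insertion as an ordinary non-posting tuple.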
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index b84bf1c3df..52651fcbe4 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -47,21 +47,26 @@ static void _bt_insertonpg(Relation rel, BTScanInsert itup_key,
BTStack stack,
IndexTuple itup,
OffsetNumber newitemoff,
+ int in_posting_offset,
bool split_only_page);
static Buffer _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf,
Buffer cbuf, OffsetNumber newitemoff, Size newitemsz,
- IndexTuple newitem);
+ IndexTuple newitem, IndexTuple nposting);
static void _bt_insert_parent(Relation rel, Buffer buf, Buffer rbuf,
BTStack stack, bool is_root, bool is_only);
static bool _bt_pgaddtup(Page page, Size itemsize, IndexTuple itup,
OffsetNumber itup_off);
static void _bt_vacuum_one_page(Relation rel, Buffer buffer, Relation heapRel);
+static void _bt_dedup_one_page(Relation rel, Buffer buffer, Relation heapRel,
+ Size newitemsz);
+static Size _bt_dedup_insert(Page page, BTDedupState *dedupState);
/*
* _bt_doinsert() -- Handle insertion of a single index tuple in the tree.
*
* This routine is called by the public interface routine, btinsert.
- * By here, itup is filled in, including the TID.
+ * By here, itup is filled in, including the TID. Caller should be
+ * prepared for us to scribble on 'itup'.
*
* If checkUnique is UNIQUE_CHECK_NO or UNIQUE_CHECK_PARTIAL, this
* will allow duplicates. Otherwise (UNIQUE_CHECK_YES or
@@ -123,6 +128,7 @@ _bt_doinsert(Relation rel, IndexTuple itup,
/* PageAddItem will MAXALIGN(), but be consistent */
insertstate.itemsz = MAXALIGN(IndexTupleSize(itup));
insertstate.itup_key = itup_key;
+ insertstate.in_posting_offset = 0;
insertstate.bounds_valid = false;
insertstate.buf = InvalidBuffer;
@@ -300,7 +306,7 @@ top:
newitemoff = _bt_findinsertloc(rel, &insertstate, checkingunique,
stack, heapRel);
_bt_insertonpg(rel, itup_key, insertstate.buf, InvalidBuffer, stack,
- itup, newitemoff, false);
+ itup, newitemoff, insertstate.in_posting_offset, false);
}
else
{
@@ -435,6 +441,7 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
/* okay, we gotta fetch the heap tuple ... */
curitup = (IndexTuple) PageGetItem(page, curitemid);
+ Assert(!BTreeTupleIsPosting(curitup));
htid = curitup->t_tid;
/*
@@ -689,6 +696,7 @@ _bt_findinsertloc(Relation rel,
BTScanInsert itup_key = insertstate->itup_key;
Page page = BufferGetPage(insertstate->buf);
BTPageOpaque lpageop;
+ OffsetNumber location;
lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -751,13 +759,23 @@ _bt_findinsertloc(Relation rel,
/*
* If the target page is full, see if we can obtain enough space by
- * erasing LP_DEAD items
+ * erasing LP_DEAD items. If that doesn't work out, and if the index
+ * isn't a unique index, try deduplication.
*/
- if (PageGetFreeSpace(page) < insertstate->itemsz &&
- P_HAS_GARBAGE(lpageop))
+ if (PageGetFreeSpace(page) < insertstate->itemsz)
{
- _bt_vacuum_one_page(rel, insertstate->buf, heapRel);
- insertstate->bounds_valid = false;
+ if (P_HAS_GARBAGE(lpageop))
+ {
+ _bt_vacuum_one_page(rel, insertstate->buf, heapRel);
+ insertstate->bounds_valid = false;
+ }
+
+ if (!checkingunique && PageGetFreeSpace(page) < insertstate->itemsz)
+ {
+ _bt_dedup_one_page(rel, insertstate->buf, heapRel,
+ insertstate->itemsz);
+ insertstate->bounds_valid = false; /* paranoia */
+ }
}
}
else
@@ -839,7 +857,31 @@ _bt_findinsertloc(Relation rel,
Assert(P_RIGHTMOST(lpageop) ||
_bt_compare(rel, itup_key, page, P_HIKEY) <= 0);
- return _bt_binsrch_insert(rel, insertstate);
+ location = _bt_binsrch_insert(rel, insertstate);
+
+ /*
+ * Insertion is not prepared for the case where an LP_DEAD posting list
+ * tuple must be split. In the unlikely event that this happens, call
+ * _bt_dedup_one_page() to force it to kill all LP_DEAD items.
+ */
+ if (unlikely(insertstate->in_posting_offset == -1))
+ {
+ _bt_dedup_one_page(rel, insertstate->buf, heapRel, 0);
+ Assert(!P_HAS_GARBAGE(lpageop));
+
+ /* Must reset insertstate ahead of new _bt_binsrch_insert() call */
+ insertstate->bounds_valid = false;
+ insertstate->in_posting_offset = 0;
+ location = _bt_binsrch_insert(rel, insertstate);
+
+ /*
+ * Might still have to split some other posting list now, but that
+ * should never be LP_DEAD
+ */
+ Assert(insertstate->in_posting_offset >= 0);
+ }
+
+ return location;
}
/*
@@ -905,10 +947,12 @@ _bt_stepright(Relation rel, BTInsertState insertstate, BTStack stack)
*
* This recursive procedure does the following things:
*
+ * + if necessary, splits an existing posting list on page.
+ * This is only needed when 'in_posting_offset' is non-zero.
* + if necessary, splits the target page, using 'itup_key' for
* suffix truncation on leaf pages (caller passes NULL for
* non-leaf pages).
- * + inserts the tuple.
+ * + inserts the new tuple (could be from split posting list).
* + if the page was split, pops the parent stack, and finds the
* right place to insert the new child pointer (by walking
* right using information stored in the parent stack).
@@ -918,7 +962,8 @@ _bt_stepright(Relation rel, BTInsertState insertstate, BTStack stack)
*
* On entry, we must have the correct buffer in which to do the
* insertion, and the buffer must be pinned and write-locked. On return,
- * we will have dropped both the pin and the lock on the buffer.
+ * we will have dropped both the pin and the lock on the buffer. Caller
+ * should be prepared for us to scribble on 'itup'.
*
* This routine only performs retail tuple insertions. 'itup' should
* always be either a non-highkey leaf item, or a downlink (new high
@@ -936,11 +981,14 @@ _bt_insertonpg(Relation rel,
BTStack stack,
IndexTuple itup,
OffsetNumber newitemoff,
+ int in_posting_offset,
bool split_only_page)
{
Page page;
BTPageOpaque lpageop;
Size itemsz;
+ IndexTuple nposting = NULL;
+ IndexTuple oposting;
page = BufferGetPage(buf);
lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -954,6 +1002,8 @@ _bt_insertonpg(Relation rel,
Assert(P_ISLEAF(lpageop) ||
BTreeTupleGetNAtts(itup, rel) <=
IndexRelationGetNumberOfKeyAttributes(rel));
+ /* retail insertions of posting list tuples are disallowed */
+ Assert(!BTreeTupleIsPosting(itup));
/* The caller should've finished any incomplete splits already. */
if (P_INCOMPLETE_SPLIT(lpageop))
@@ -964,6 +1014,72 @@ _bt_insertonpg(Relation rel,
itemsz = MAXALIGN(itemsz); /* be safe, PageAddItem will do this but we
* need to be consistent */
+ /*
+ * Do we need to split an existing posting list item?
+ */
+ if (in_posting_offset != 0)
+ {
+ ItemId itemid = PageGetItemId(page, newitemoff);
+ int nipd;
+ char *replacepos;
+ char *rightpos;
+ Size nbytes;
+
+ /*
+ * The new tuple is a duplicate with a heap TID that falls inside the
+ * range of an existing posting list tuple, so split posting list.
+ *
+ * Posting list splits always replace some existing TID in the posting
+ * list with the new item's heap TID (based on a posting list offset
+ * from caller) by removing rightmost heap TID from posting list. The
+ * new item's heap TID is swapped with that rightmost heap TID, almost
+ * as if the tuple inserted never overlapped with a posting list in
+ * the first place. This allows the insertion and page split code to
+ * have minimal special case handling of posting lists.
+ *
+ * The only extra handling required is to overwrite the original
+ * posting list with nposting, which is guaranteed to be the same size
+ * as the original, keeping the page space accounting simple. This
+ * takes place in either the page insert or page split critical
+ * section.
+ */
+ Assert(P_ISLEAF(lpageop));
+ Assert(!ItemIdIsDead(itemid));
+ Assert(in_posting_offset > 0);
+ oposting = (IndexTuple) PageGetItem(page, itemid);
+ Assert(BTreeTupleIsPosting(oposting));
+ nipd = BTreeTupleGetNPosting(oposting);
+ Assert(in_posting_offset < nipd);
+
+ nposting = CopyIndexTuple(oposting);
+ replacepos = (char *) BTreeTupleGetPostingN(nposting, in_posting_offset);
+ rightpos = replacepos + sizeof(ItemPointerData);
+ nbytes = (nipd - in_posting_offset - 1) * sizeof(ItemPointerData);
+
+ /*
+ * Move item pointers in posting list to make a gap for the new item's
+ * heap TID (shift TIDs one place to the right, losing original
+ * rightmost TID).
+ */
+ memmove(rightpos, replacepos, nbytes);
+
+ /*
+ * Fill the gap that was just opened in the posting list with the new
+ * item's heap TID
+ */
+ ItemPointerCopy(&itup->t_tid, (ItemPointer) replacepos);
+
+ /*
+ * Copy the rightmost TID of the original posting list (oposting, not
+ * nposting) into the new item
+ */
+ ItemPointerCopy(BTreeTupleGetPostingN(oposting, nipd - 1), &itup->t_tid);
+ Assert(ItemPointerCompare(BTreeTupleGetMaxTID(nposting),
+ BTreeTupleGetHeapTID(itup)) < 0);
+ /* Alter new item offset, since effective new item changed */
+ newitemoff = OffsetNumberNext(newitemoff);
+ }
+
/*
* Do we need to split the page to fit the item on it?
*
@@ -996,7 +1112,8 @@ _bt_insertonpg(Relation rel,
BlockNumberIsValid(RelationGetTargetBlock(rel))));
/* split the buffer into left and right halves */
- rbuf = _bt_split(rel, itup_key, buf, cbuf, newitemoff, itemsz, itup);
+ rbuf = _bt_split(rel, itup_key, buf, cbuf, newitemoff, itemsz, itup,
+ nposting);
PredicateLockPageSplit(rel,
BufferGetBlockNumber(buf),
BufferGetBlockNumber(rbuf));
@@ -1075,6 +1192,18 @@ _bt_insertonpg(Relation rel,
elog(PANIC, "failed to add new item to block %u in index \"%s\"",
itup_blkno, RelationGetRelationName(rel));
+ if (nposting)
+ {
+ /*
+ * Handle a posting list split by performing an in-place update of
+ * the existing posting list
+ */
+ Assert(P_ISLEAF(lpageop));
+ Assert(MAXALIGN(IndexTupleSize(oposting)) ==
+ MAXALIGN(IndexTupleSize(nposting)));
+ memcpy(oposting, nposting, MAXALIGN(IndexTupleSize(nposting)));
+ }
+
MarkBufferDirty(buf);
if (BufferIsValid(metabuf))
@@ -1116,6 +1245,9 @@ _bt_insertonpg(Relation rel,
XLogRecPtr recptr;
xlrec.offnum = itup_off;
+ xlrec.postingsz = 0;
+ if (nposting)
+ xlrec.postingsz = MAXALIGN(IndexTupleSize(itup));
XLogBeginInsert();
XLogRegisterData((char *) &xlrec, SizeOfBtreeInsert);
@@ -1153,6 +1285,9 @@ _bt_insertonpg(Relation rel,
XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
XLogRegisterBufData(0, (char *) itup, IndexTupleSize(itup));
+ if (nposting)
+ XLogRegisterBufData(0, (char *) nposting,
+ IndexTupleSize(nposting));
recptr = XLogInsert(RM_BTREE_ID, xlinfo);
@@ -1194,6 +1329,10 @@ _bt_insertonpg(Relation rel,
_bt_getrootheight(rel) >= BTREE_FASTPATH_MIN_LEVEL)
RelationSetTargetBlock(rel, cachedBlock);
}
+
+ /* be tidy */
+ if (nposting)
+ pfree(nposting);
}
/*
@@ -1211,10 +1350,16 @@ _bt_insertonpg(Relation rel,
*
* Returns the new right sibling of buf, pinned and write-locked.
* The pin and lock on buf are maintained.
+ *
+ * nposting is a replacement posting list tuple for the posting list at the
+ * offset immediately before the new item's offset. This is needed
+ * when caller performed "posting list split", and corresponds to the
+ * same step for retail insertions that don't split the page.
*/
static Buffer
_bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
- OffsetNumber newitemoff, Size newitemsz, IndexTuple newitem)
+ OffsetNumber newitemoff, Size newitemsz, IndexTuple newitem,
+ IndexTuple nposting)
{
Buffer rbuf;
Page origpage;
@@ -1236,12 +1381,20 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
OffsetNumber firstright;
OffsetNumber maxoff;
OffsetNumber i;
+ OffsetNumber replacepostingoff = InvalidOffsetNumber;
bool newitemonleft,
isleaf;
IndexTuple lefthikey;
int indnatts = IndexRelationGetNumberOfAttributes(rel);
int indnkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
+ /*
+ * Determine offset number of posting list that will be updated in place
+ * as part of split that follows a posting list split
+ */
+ if (nposting != NULL)
+ replacepostingoff = OffsetNumberPrev(newitemoff);
+
/*
* origpage is the original page to be split. leftpage is a temporary
* buffer that receives the left-sibling data, which will be copied back
@@ -1273,6 +1426,13 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
* newitemoff == firstright. In all other cases it's clear which side of
* the split every tuple goes on from context. newitemonleft is usually
* (but not always) redundant information.
+ *
+ * Note: In theory, the split point choice logic should operate against a
+ * version of the page that already replaced the posting list at offset
+ * replacepostingoff with nposting where applicable. We don't bother with
+ * that, though. Both versions of the posting list must be the same size
+ * and have the same key values, so this omission can't affect the split
+ * point chosen in practice.
*/
firstright = _bt_findsplitloc(rel, origpage, newitemoff, newitemsz,
newitem, &newitemonleft);
@@ -1340,6 +1500,9 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
itemid = PageGetItemId(origpage, firstright);
itemsz = ItemIdGetLength(itemid);
item = (IndexTuple) PageGetItem(origpage, itemid);
+ /* Behave as if origpage posting list has already been swapped */
+ if (firstright == replacepostingoff)
+ item = nposting;
}
/*
@@ -1373,6 +1536,9 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
Assert(lastleftoff >= P_FIRSTDATAKEY(oopaque));
itemid = PageGetItemId(origpage, lastleftoff);
lastleft = (IndexTuple) PageGetItem(origpage, itemid);
+ /* Behave as if origpage posting list has already been swapped */
+ if (lastleftoff == replacepostingoff)
+ lastleft = nposting;
}
Assert(lastleft != item);
@@ -1480,8 +1646,23 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
itemsz = ItemIdGetLength(itemid);
item = (IndexTuple) PageGetItem(origpage, itemid);
+ /*
+ * did caller pass new replacement posting list tuple due to posting
+ * list split?
+ */
+ if (i == replacepostingoff)
+ {
+ /*
+ * swap origpage posting list with post-posting-list-split version
+ * from caller
+ */
+ Assert(isleaf);
+ Assert(itemsz == MAXALIGN(IndexTupleSize(nposting)));
+ item = nposting;
+ }
+
/* does new item belong before this one? */
- if (i == newitemoff)
+ else if (i == newitemoff)
{
if (newitemonleft)
{
@@ -1652,6 +1833,7 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
xlrec.level = ropaque->btpo.level;
xlrec.firstright = firstright;
xlrec.newitemoff = newitemoff;
+ xlrec.replacepostingoff = replacepostingoff;
XLogBeginInsert();
XLogRegisterData((char *) &xlrec, SizeOfBtreeSplit);
@@ -1676,6 +1858,10 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
if (newitemonleft)
XLogRegisterBufData(0, (char *) newitem, MAXALIGN(newitemsz));
+ if (replacepostingoff)
+ XLogRegisterBufData(0, (char *) nposting,
+ MAXALIGN(IndexTupleSize(nposting)));
+
/* Log the left page's new high key */
itemid = PageGetItemId(origpage, P_HIKEY);
item = (IndexTuple) PageGetItem(origpage, itemid);
@@ -1834,7 +2020,7 @@ _bt_insert_parent(Relation rel,
/* Recursively insert into the parent */
_bt_insertonpg(rel, NULL, pbuf, buf, stack->bts_parent,
- new_item, stack->bts_offset + 1,
+ new_item, stack->bts_offset + 1, 0,
is_only);
/* be tidy */
@@ -2304,6 +2490,343 @@ _bt_vacuum_one_page(Relation rel, Buffer buffer, Relation heapRel)
* Note: if we didn't find any LP_DEAD items, then the page's
* BTP_HAS_GARBAGE hint bit is falsely set. We do not bother expending a
* separate write to clear it, however. We will clear it when we split
- * the page.
+ * the page (or when deduplication runs).
*/
}
+
+/*
+ * Try to deduplicate items to free some space. If we don't proceed with
+ * deduplication, buffer will contain old state of the page.
+ *
+ * 'itemsz' is the size of the inserter caller's incoming/new tuple, not
+ * including line pointer overhead. This is the amount of space we'll need to
+ * free in order to let caller avoid splitting the page.
+ *
+ * This function should be called after LP_DEAD items were removed by
+ * _bt_vacuum_one_page() to prevent a page split. (It's possible that we'll
+ * have to kill additional LP_DEAD items, but that should be rare.)
+ */
+static void
+_bt_dedup_one_page(Relation rel, Buffer buffer, Relation heapRel,
+ Size newitemsz)
+{
+ OffsetNumber offnum,
+ minoff,
+ maxoff;
+ Page page = BufferGetPage(buffer);
+ Page newpage;
+ BTPageOpaque oopaque,
+ nopaque;
+ bool deduplicate = false;
+ BTDedupState *dedupState = NULL;
+ int natts = IndexRelationGetNumberOfAttributes(rel);
+ OffsetNumber deletable[MaxOffsetNumber];
+ int ndeletable = 0;
+ Size pagesaving = 0;
+
+ /*
+ * Don't use deduplication for indexes with INCLUDEd columns and unique
+ * indexes
+ */
+ deduplicate = (IndexRelationGetNumberOfKeyAttributes(rel) ==
+ IndexRelationGetNumberOfAttributes(rel) &&
+ !rel->rd_index->indisunique);
+ if (!deduplicate)
+ return;
+
+ oopaque = (BTPageOpaque) PageGetSpecialPointer(page);
+ /* init deduplication state needed to build posting tuples */
+ dedupState = (BTDedupState *) palloc0(sizeof(BTDedupState));
+ dedupState->ipd = NULL;
+ dedupState->ntuples = 0;
+ dedupState->alltupsize = 0;
+ dedupState->itupprev = NULL;
+ dedupState->maxitemsize = BTMaxItemSize(page);
+ dedupState->maxpostingsize = 0;
+
+ minoff = P_FIRSTDATAKEY(oopaque);
+ maxoff = PageGetMaxOffsetNumber(page);
+
+ /*
+ * Delete dead tuples, if any. We cannot simply skip them in the loop
+ * below, because a special WAL record containing such tuples must be
+ * generated so that latestRemovedXid can be computed on a standby server
+ * later.
+ *
+ * This should not affect performance, since it can only happen in the rare
+ * case where BTP_HAS_GARBAGE was not set and _bt_vacuum_one_page was not
+ * called, or where _bt_vacuum_one_page didn't remove all dead items.
+ */
+ for (offnum = minoff;
+ offnum <= maxoff;
+ offnum = OffsetNumberNext(offnum))
+ {
+ ItemId itemid = PageGetItemId(page, offnum);
+
+ if (ItemIdIsDead(itemid))
+ deletable[ndeletable++] = offnum;
+ }
+
+ if (ndeletable > 0)
+ {
+ /*
+ * Skip deduplication in the rare case where removing the LP_DEAD items
+ * found here frees enough space for the caller to avoid a page
+ * split
+ */
+ _bt_delitems_delete(rel, buffer, deletable, ndeletable, heapRel);
+ if (PageGetFreeSpace(page) >= newitemsz)
+ {
+ pfree(dedupState);
+ return;
+ }
+
+ /* Continue with deduplication */
+ minoff = P_FIRSTDATAKEY(oopaque);
+ maxoff = PageGetMaxOffsetNumber(page);
+ }
+
+ /*
+ * Scan over all items to see which ones can be deduplicated
+ */
+ newpage = PageGetTempPageCopySpecial(page);
+ nopaque = (BTPageOpaque) PageGetSpecialPointer(newpage);
+
+ /* Make sure that new page won't have garbage flag set */
+ nopaque->btpo_flags &= ~BTP_HAS_GARBAGE;
+
+ /* Copy High Key if any */
+ if (!P_RIGHTMOST(oopaque))
+ {
+ ItemId hitemid = PageGetItemId(page, P_HIKEY);
+ Size hitemsz = ItemIdGetLength(hitemid);
+ IndexTuple hitem = (IndexTuple) PageGetItem(page, hitemid);
+
+ if (PageAddItem(newpage, (Item) hitem, hitemsz, P_HIKEY,
+ false, false) == InvalidOffsetNumber)
+ elog(ERROR, "failed to add highkey during deduplication");
+ }
+
+ /* Passed-in newitemsz is MAXALIGNED but does not include line pointer */
+ newitemsz += sizeof(ItemIdData);
+
+ /*
+ * Iterate over tuples on the page, try to deduplicate them into posting
+ * lists and insert into new page.
+ */
+ for (offnum = minoff;
+ offnum <= maxoff;
+ offnum = OffsetNumberNext(offnum))
+ {
+ ItemId itemid = PageGetItemId(page, offnum);
+ IndexTuple itup = (IndexTuple) PageGetItem(page, itemid);
+
+ Assert(!ItemIdIsDead(itemid));
+
+ if (dedupState->itupprev == NULL)
+ {
+ /* Just set up base/first item in first iteration */
+ Assert(offnum == minoff);
+ dedupState->itupprev = CopyIndexTuple(itup);
+ continue;
+ }
+
+ if (deduplicate &&
+ _bt_keep_natts_fast(rel, dedupState->itupprev, itup) > natts)
+ {
+ int itup_ntuples;
+ Size projpostingsz;
+
+ /*
+ * Tuples are equal.
+ *
+ * If posting list does not exceed tuple size limit then append
+ * the tuple to the pending posting list. Otherwise, insert it on
+ * page and continue with this tuple as new pending posting list.
+ */
+ itup_ntuples = BTreeTupleIsPosting(itup) ?
+ BTreeTupleGetNPosting(itup) : 1;
+
+ /*
+ * Project size of new posting list that would result from merging
+ * current tup with pending posting list (could just be prev item
+ * that's "pending").
+ *
+ * This accounting looks odd, but it's correct because ...
+ */
+ projpostingsz = MAXALIGN(IndexTupleSize(dedupState->itupprev) +
+ (dedupState->ntuples + itup_ntuples + 1) *
+ sizeof(ItemPointerData));
+
+ if (projpostingsz <= dedupState->maxitemsize)
+ _bt_dedup_item_tid(dedupState, itup);
+ else
+ pagesaving += _bt_dedup_insert(newpage, dedupState);
+ }
+ else
+ {
+ /*
+ * Tuples are not equal, or we're done deduplicating items on this
+ * page.
+ *
+ * Insert pending posting list on page. This could just be a
+ * regular tuple.
+ */
+ pagesaving += _bt_dedup_insert(newpage, dedupState);
+ }
+
+ pfree(dedupState->itupprev);
+ dedupState->itupprev = CopyIndexTuple(itup);
+ }
+
+ /* Handle the last item */
+ pagesaving += _bt_dedup_insert(newpage, dedupState);
+
+ START_CRIT_SECTION();
+
+ PageRestoreTempPage(newpage, page);
+ MarkBufferDirty(buffer);
+
+ /* Log full page write */
+ if (RelationNeedsWAL(rel))
+ {
+ XLogRecPtr recptr;
+
+ recptr = log_newpage_buffer(buffer, true);
+ PageSetLSN(page, recptr);
+ }
+
+ END_CRIT_SECTION();
+
+ /* be tidy */
+ pfree(dedupState);
+}
+
+/*
+ * Save item pointer(s) of itup to the posting list in dedupState.
+ *
+ * 'itup' is current tuple on page, which comes immediately after equal
+ * 'itupprev' tuple stashed in dedup state at the point we're called.
+ *
+ * Helper function for _bt_load() and _bt_dedup_one_page(), called when it
+ * becomes clear that pending itupprev item will be part of a new/pending
+ * posting list, or when a pending/new posting list will contain a new heap
+ * TID from itup.
+ *
+ * Note: caller is responsible for the BTMaxItemSize() check.
+ */
+void
+_bt_dedup_item_tid(BTDedupState *dedupState, IndexTuple itup)
+{
+ int nposting = 0;
+
+ if (dedupState->ntuples == 0)
+ {
+ dedupState->ipd = palloc0(dedupState->maxitemsize);
+ dedupState->alltupsize =
+ MAXALIGN(IndexTupleSize(dedupState->itupprev)) +
+ sizeof(ItemIdData);
+
+ /*
+ * itupprev hasn't had its posting list TIDs copied into ipd yet (must
+ * have been first on page and/or in new posting list?). Do so now.
+ *
+ * This is delayed because it wasn't initially clear whether or not
+ * itupprev would be merged with the next tuple, or stay as-is. By
+ * now caller compared it against itup and found that it was equal, so
+ * we can go ahead and add its TIDs.
+ */
+ if (!BTreeTupleIsPosting(dedupState->itupprev))
+ {
+ memcpy(dedupState->ipd, dedupState->itupprev,
+ sizeof(ItemPointerData));
+ dedupState->ntuples++;
+ }
+ else
+ {
+ /* if itupprev is posting, add all its TIDs to the posting list */
+ nposting = BTreeTupleGetNPosting(dedupState->itupprev);
+ memcpy(dedupState->ipd,
+ BTreeTupleGetPosting(dedupState->itupprev),
+ sizeof(ItemPointerData) * nposting);
+ dedupState->ntuples += nposting;
+ }
+ }
+
+ /*
+ * Add current tup to ipd for pending posting list for new version of
+ * page.
+ */
+ if (!BTreeTupleIsPosting(itup))
+ {
+ memcpy(dedupState->ipd + dedupState->ntuples, itup,
+ sizeof(ItemPointerData));
+ dedupState->ntuples++;
+ }
+ else
+ {
+ /*
+ * if tuple is posting, add all its TIDs to the pending list that will
+ * become new posting list later on
+ */
+ nposting = BTreeTupleGetNPosting(itup);
+ memcpy(dedupState->ipd + dedupState->ntuples,
+ BTreeTupleGetPosting(itup),
+ sizeof(ItemPointerData) * nposting);
+ dedupState->ntuples += nposting;
+ }
+
+ dedupState->alltupsize +=
+ MAXALIGN(IndexTupleSize(itup)) + sizeof(ItemIdData);
+}
+
+/*
+ * Add new posting tuple item to the page based on itupprev and saved list of
+ * heap TIDs.
+ *
+ * Returns space saving on page.
+ */
+static Size
+_bt_dedup_insert(Page page, BTDedupState *dedupState)
+{
+ IndexTuple itup;
+ Size spacesaving = 0;
+
+ if (dedupState->ntuples == 0)
+ {
+ /*
+ * Use original itupprev, which may or may not be a posting list
+ * already from some earlier dedup attempt
+ */
+ itup = dedupState->itupprev;
+ }
+ else
+ {
+ IndexTuple postingtuple;
+
+ /* form a tuple with a posting list */
+ postingtuple = BTreeFormPostingTuple(dedupState->itupprev,
+ dedupState->ipd,
+ dedupState->ntuples);
+
+ spacesaving = dedupState->alltupsize -
+ (MAXALIGN(IndexTupleSize(postingtuple)) + sizeof(ItemIdData));
+ Assert(spacesaving > 0 && spacesaving < BLCKSZ);
+ itup = postingtuple;
+ pfree(dedupState->ipd);
+ }
+
+ Assert(IndexTupleSize(dedupState->itupprev) <= dedupState->maxitemsize);
+ /* Add the new item into the page */
+ if (PageAddItem(page, (Item) itup, IndexTupleSize(itup),
+ OffsetNumberNext(PageGetMaxOffsetNumber(page)), false,
+ false) == InvalidOffsetNumber)
+ elog(ERROR, "deduplication failed to add tuple to page");
+
+ if (dedupState->ntuples > 0)
+ pfree(itup);
+ dedupState->ntuples = 0;
+ dedupState->alltupsize = 0;
+
+ return spacesaving;
+}
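
Reviewer note on the deduplication pass above: the control flow of
_bt_dedup_one_page() is easier to follow with the page and WAL details
removed. The following is a minimal standalone sketch, not code from the
patch. SketchItem and sketch_dedup() are hypothetical names, and the
BTMaxItemSize() limit is modelled as a plain cap on the number of TIDs per
output item, which is an assumption made only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for an index tuple: a key plus the number of heap
 * TIDs it carries (1 for a plain tuple, more for a posting list tuple).
 */
typedef struct SketchItem
{
	int32_t		key;
	int			ntids;
} SketchItem;

/*
 * Merge runs of adjacent equal keys from items[] into out[].
 *
 * An output item never holds more than max_tids TIDs.  When merging the
 * current item would exceed that cap, the pending item is flushed and the
 * current item becomes the new pending item, as in the patch's
 * "projpostingsz > maxitemsize" branch.
 *
 * out[] must have room for nitems entries.  Returns the number of output
 * items.
 */
size_t
sketch_dedup(const SketchItem *items, size_t nitems, int max_tids,
			 SketchItem *out)
{
	size_t		nout = 0;
	SketchItem	pending;

	assert(max_tids >= 1);
	if (nitems == 0)
		return 0;

	/* First item only sets up the pending state */
	pending = items[0];

	for (size_t i = 1; i < nitems; i++)
	{
		if (items[i].key == pending.key &&
			pending.ntids + items[i].ntids <= max_tids)
		{
			/* Equal key and still within the cap: absorb its TIDs */
			pending.ntids += items[i].ntids;
		}
		else
		{
			/* Different key, or cap reached: flush and start over */
			out[nout++] = pending;
			pending = items[i];
		}
	}

	/* Handle the last pending item */
	out[nout++] = pending;
	return nout;
}
```

With six single-TID items keyed 1, 1, 1, 2, 3, 3 and a cap of two TIDs, the
sketch produces four items: a two-TID item for key 1, a leftover plain item
for key 1, a plain item for key 2, and a two-TID item for key 3.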
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 268f869a36..5314bbe2a9 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -24,6 +24,7 @@
#include "access/nbtree.h"
#include "access/nbtxlog.h"
+#include "access/tableam.h"
#include "access/transam.h"
#include "access/xlog.h"
#include "access/xloginsert.h"
@@ -42,6 +43,11 @@ static bool _bt_lock_branch_parent(Relation rel, BlockNumber child,
BlockNumber *target, BlockNumber *rightsib);
static void _bt_log_reuse_page(Relation rel, BlockNumber blkno,
TransactionId latestRemovedXid);
+static TransactionId _bt_compute_xid_horizon_for_tuples(Relation rel,
+ Relation heapRel,
+ Buffer buf,
+ OffsetNumber *itemnos,
+ int nitems);
/*
* _bt_initmetapage() -- Fill a page buffer with a correct metapage image
@@ -983,14 +989,52 @@ _bt_page_recyclable(Page page)
void
_bt_delitems_vacuum(Relation rel, Buffer buf,
OffsetNumber *itemnos, int nitems,
+ OffsetNumber *remainingoffset,
+ IndexTuple *remaining, int nremaining,
BlockNumber lastBlockVacuumed)
{
Page page = BufferGetPage(buf);
BTPageOpaque opaque;
+ Size itemsz;
+ Size remaining_sz = 0;
+ char *remaining_buf = NULL;
+
+ /* XLOG stuff: build a buffer holding the remaining tuples */
+ if (nremaining && RelationNeedsWAL(rel))
+ {
+ Size offset = 0;
+
+ for (int i = 0; i < nremaining; i++)
+ remaining_sz += MAXALIGN(IndexTupleSize(remaining[i]));
+
+ remaining_buf = palloc0(remaining_sz);
+ for (int i = 0; i < nremaining; i++)
+ {
+ itemsz = IndexTupleSize(remaining[i]);
+ memcpy(remaining_buf + offset, (char *) remaining[i], itemsz);
+ offset += MAXALIGN(itemsz);
+ }
+ Assert(offset == remaining_sz);
+ }
/* No ereport(ERROR) until changes are logged */
START_CRIT_SECTION();
+ /* Handle posting tuples here */
+ for (int i = 0; i < nremaining; i++)
+ {
+ /* First, delete the old tuple. */
+ PageIndexTupleDelete(page, remainingoffset[i]);
+
+ itemsz = IndexTupleSize(remaining[i]);
+ itemsz = MAXALIGN(itemsz);
+
+ /* Add tuple with remaining ItemPointers to the page. */
+ if (PageAddItem(page, (Item) remaining[i], itemsz, remainingoffset[i],
+ false, false) == InvalidOffsetNumber)
+ elog(ERROR, "failed to rewrite posting list item in index while doing vacuum");
+ }
+
/* Fix the page */
if (nitems > 0)
PageIndexMultiDelete(page, itemnos, nitems);
@@ -1020,6 +1064,8 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
xl_btree_vacuum xlrec_vacuum;
xlrec_vacuum.lastBlockVacuumed = lastBlockVacuumed;
+ xlrec_vacuum.nremaining = nremaining;
+ xlrec_vacuum.ndeleted = nitems;
XLogBeginInsert();
XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
@@ -1033,6 +1079,19 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
if (nitems > 0)
XLogRegisterBufData(0, (char *) itemnos, nitems * sizeof(OffsetNumber));
+ /*
+ * Save the offset numbers and the remaining tuples themselves. It's
+ * important to restore them in the correct order: replay must first
+ * handle the remaining tuples, and only then the other deleted items.
+ */
+ if (nremaining > 0)
+ {
+ Assert(remaining_buf != NULL);
+ XLogRegisterBufData(0, (char *) remainingoffset,
+ nremaining * sizeof(OffsetNumber));
+ XLogRegisterBufData(0, remaining_buf, remaining_sz);
+ }
+
recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_VACUUM);
PageSetLSN(page, recptr);
@@ -1041,6 +1100,91 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
END_CRIT_SECTION();
}
+/*
+ * Get the latestRemovedXid from the table entries pointed at by the index
+ * tuples being deleted.
+ *
+ * This is a version of index_compute_xid_horizon_for_tuples() specialized to
+ * nbtree, which can handle posting lists.
+ */
+static TransactionId
+_bt_compute_xid_horizon_for_tuples(Relation rel, Relation heapRel,
+ Buffer buf, OffsetNumber *itemnos,
+ int nitems)
+{
+ ItemPointerData *ttids;
+ TransactionId latestRemovedXid = InvalidTransactionId;
+ Page page = BufferGetPage(buf);
+ int arraynitems;
+ int finalnitems;
+
+ /*
+ * Initial size of array can fit everything when it turns out that there
+ * are no posting lists
+ */
+ arraynitems = nitems;
+ ttids = (ItemPointerData *) palloc(sizeof(ItemPointerData) * arraynitems);
+
+ finalnitems = 0;
+ /* identify what the index tuples about to be deleted point to */
+ for (int i = 0; i < nitems; i++)
+ {
+ ItemId itemid;
+ IndexTuple itup;
+
+ itemid = PageGetItemId(page, itemnos[i]);
+ itup = (IndexTuple) PageGetItem(page, itemid);
+
+ Assert(ItemIdIsDead(itemid));
+
+ if (!BTreeTupleIsPosting(itup))
+ {
+ /* Make sure that we have space for additional heap TID */
+ if (finalnitems + 1 > arraynitems)
+ {
+ arraynitems = arraynitems * 2;
+ ttids = (ItemPointerData *)
+ repalloc(ttids, sizeof(ItemPointerData) * arraynitems);
+ }
+
+ Assert(ItemPointerIsValid(&itup->t_tid));
+ ItemPointerCopy(&itup->t_tid, &ttids[finalnitems]);
+ finalnitems++;
+ }
+ else
+ {
+ int nposting = BTreeTupleGetNPosting(itup);
+
+ /* Make sure that we have space for additional heap TIDs */
+ if (finalnitems + nposting > arraynitems)
+ {
+ arraynitems = Max(arraynitems * 2, finalnitems + nposting);
+ ttids = (ItemPointerData *)
+ repalloc(ttids, sizeof(ItemPointerData) * arraynitems);
+ }
+
+ for (int j = 0; j < nposting; j++)
+ {
+ ItemPointer htid = BTreeTupleGetPostingN(itup, j);
+
+ Assert(ItemPointerIsValid(htid));
+ ItemPointerCopy(htid, &ttids[finalnitems]);
+ finalnitems++;
+ }
+ }
+ }
+
+ Assert(finalnitems >= nitems);
+
+ /* determine the actual xid horizon */
+ latestRemovedXid =
+ table_compute_xid_horizon_for_tuples(heapRel, ttids, finalnitems);
+
+ pfree(ttids);
+
+ return latestRemovedXid;
+}
+
/*
* Delete item(s) from a btree page during single-page cleanup.
*
@@ -1067,8 +1211,8 @@ _bt_delitems_delete(Relation rel, Buffer buf,
if (XLogStandbyInfoActive() && RelationNeedsWAL(rel))
latestRemovedXid =
- index_compute_xid_horizon_for_tuples(rel, heapRel, buf,
- itemnos, nitems);
+ _bt_compute_xid_horizon_for_tuples(rel, heapRel, buf,
+ itemnos, nitems);
/* No ereport(ERROR) until changes are logged */
START_CRIT_SECTION();
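
Reviewer note on _bt_compute_xid_horizon_for_tuples() above: the function
flattens a mix of plain tuples and posting list tuples into one TID array,
growing the array only when a posting list makes the initial estimate too
small. Below is a minimal standalone sketch of that growth strategy. It is
not code from the patch: SketchTid, SketchTuple and sketch_collect_tids()
are hypothetical names, and malloc/realloc stand in for palloc/repalloc.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for ItemPointerData */
typedef struct SketchTid
{
	unsigned	block;
	unsigned short offset;
} SketchTid;

/* Hypothetical stand-in for an index tuple carrying one or more TIDs */
typedef struct SketchTuple
{
	const SketchTid *tids;
	int			ntids;
} SketchTuple;

/*
 * Collect every TID referenced by tuples[] into one malloc'd array.
 *
 * The array starts with room for one TID per tuple, which is exact when
 * there are no posting lists.  When that is not enough, capacity grows to
 * the larger of "double" and "what is needed right now".
 *
 * Returns NULL on allocation failure; otherwise the caller frees the result.
 */
SketchTid *
sketch_collect_tids(const SketchTuple *tuples, int ntuples, int *nout)
{
	int			cap = ntuples > 0 ? ntuples : 1;
	int			n = 0;
	SketchTid  *arr = malloc(sizeof(SketchTid) * cap);

	if (arr == NULL)
		return NULL;

	for (int i = 0; i < ntuples; i++)
	{
		int			need = n + tuples[i].ntids;

		assert(tuples[i].ntids >= 1);
		if (need > cap)
		{
			int			newcap = (cap * 2 > need) ? cap * 2 : need;
			SketchTid  *tmp = realloc(arr, sizeof(SketchTid) * newcap);

			if (tmp == NULL)
			{
				free(arr);
				return NULL;
			}
			arr = tmp;
			cap = newcap;
		}

		for (int j = 0; j < tuples[i].ntids; j++)
			arr[n++] = tuples[i].tids[j];
	}

	*nout = n;
	return arr;
}
```

Taking the maximum of the doubled capacity and the immediate requirement
matters because a single posting list can hold far more TIDs than doubling
would cover in one step.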
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 4cfd5289ad..67595319d7 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -97,6 +97,8 @@ static void btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
BTCycleId cycleid, TransactionId *oldestBtpoXact);
static void btvacuumpage(BTVacState *vstate, BlockNumber blkno,
BlockNumber orig_blkno);
+static ItemPointer btreevacuumPosting(BTVacState *vstate, IndexTuple itup,
+ int *nremaining);
/*
@@ -263,8 +265,8 @@ btgettuple(IndexScanDesc scan, ScanDirection dir)
*/
if (so->killedItems == NULL)
so->killedItems = (int *)
- palloc(MaxIndexTuplesPerPage * sizeof(int));
- if (so->numKilled < MaxIndexTuplesPerPage)
+ palloc(MaxPostingIndexTuplesPerPage * sizeof(int));
+ if (so->numKilled < MaxPostingIndexTuplesPerPage)
so->killedItems[so->numKilled++] = so->currPos.itemIndex;
}
@@ -1069,7 +1071,8 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
RBM_NORMAL, info->strategy);
LockBufferForCleanup(buf);
_bt_checkpage(rel, buf);
- _bt_delitems_vacuum(rel, buf, NULL, 0, vstate.lastBlockVacuumed);
+ _bt_delitems_vacuum(rel, buf, NULL, 0, NULL, NULL, 0,
+ vstate.lastBlockVacuumed);
_bt_relbuf(rel, buf);
}
@@ -1193,6 +1196,9 @@ restart:
OffsetNumber offnum,
minoff,
maxoff;
+ IndexTuple remaining[MaxOffsetNumber];
+ OffsetNumber remainingoffset[MaxOffsetNumber];
+ int nremaining;
/*
* Trade in the initial read lock for a super-exclusive write lock on
@@ -1229,6 +1235,7 @@ restart:
* callback function.
*/
ndeletable = 0;
+ nremaining = 0;
minoff = P_FIRSTDATAKEY(opaque);
maxoff = PageGetMaxOffsetNumber(page);
if (callback)
@@ -1242,31 +1249,79 @@ restart:
itup = (IndexTuple) PageGetItem(page,
PageGetItemId(page, offnum));
- htup = &(itup->t_tid);
- /*
- * During Hot Standby we currently assume that
- * XLOG_BTREE_VACUUM records do not produce conflicts. That is
- * only true as long as the callback function depends only
- * upon whether the index tuple refers to heap tuples removed
- * in the initial heap scan. When vacuum starts it derives a
- * value of OldestXmin. Backends taking later snapshots could
- * have a RecentGlobalXmin with a later xid than the vacuum's
- * OldestXmin, so it is possible that row versions deleted
- * after OldestXmin could be marked as killed by other
- * backends. The callback function *could* look at the index
- * tuple state in isolation and decide to delete the index
- * tuple, though currently it does not. If it ever did, we
- * would need to reconsider whether XLOG_BTREE_VACUUM records
- * should cause conflicts. If they did cause conflicts they
- * would be fairly harsh conflicts, since we haven't yet
- * worked out a way to pass a useful value for
- * latestRemovedXid on the XLOG_BTREE_VACUUM records. This
- * applies to *any* type of index that marks index tuples as
- * killed.
- */
- if (callback(htup, callback_state))
- deletable[ndeletable++] = offnum;
+ if (BTreeTupleIsPosting(itup))
+ {
+ int nnewipd = 0;
+ ItemPointer newipd = NULL;
+
+ newipd = btreevacuumPosting(vstate, itup, &nnewipd);
+
+ if (nnewipd == 0)
+ {
+ /*
+ * All TIDs from the posting list must be deleted, so we
+ * can delete the whole tuple in the regular way.
+ */
+ deletable[ndeletable++] = offnum;
+ }
+ else if (nnewipd == BTreeTupleGetNPosting(itup))
+ {
+ /*
+ * All TIDs from posting tuple must remain. Do
+ * nothing, just cleanup.
+ */
+ pfree(newipd);
+ }
+ else if (nnewipd < BTreeTupleGetNPosting(itup))
+ {
+ /* Some TIDs from posting tuple must remain. */
+ Assert(nnewipd > 0);
+ Assert(newipd != NULL);
+
+ /*
+ * Form new tuple that contains only remaining TIDs.
+ * Remember this tuple and the offset of the old tuple
+ * to update it in place.
+ */
+ remainingoffset[nremaining] = offnum;
+ remaining[nremaining] =
+ BTreeFormPostingTuple(itup, newipd, nnewipd);
+ nremaining++;
+ pfree(newipd);
+
+ Assert(IndexTupleSize(itup) <= BTMaxItemSize(page));
+ }
+ }
+ else
+ {
+ htup = &(itup->t_tid);
+
+ /*
+ * During Hot Standby we currently assume that
+ * XLOG_BTREE_VACUUM records do not produce conflicts.
+ * That is only true as long as the callback function
+ * depends only upon whether the index tuple refers to
+ * heap tuples removed in the initial heap scan. When
+ * vacuum starts it derives a value of OldestXmin.
+ * Backends taking later snapshots could have a
+ * RecentGlobalXmin with a later xid than the vacuum's
+ * OldestXmin, so it is possible that row versions deleted
+ * after OldestXmin could be marked as killed by other
+ * backends. The callback function *could* look at the
+ * index tuple state in isolation and decide to delete the
+ * index tuple, though currently it does not. If it ever
+ * did, we would need to reconsider whether
+ * XLOG_BTREE_VACUUM records should cause conflicts. If
+ * they did cause conflicts they would be fairly harsh
+ * conflicts, since we haven't yet worked out a way to
+ * pass a useful value for latestRemovedXid on the
+ * XLOG_BTREE_VACUUM records. This applies to *any* type
+ * of index that marks index tuples as killed.
+ */
+ if (callback(htup, callback_state))
+ deletable[ndeletable++] = offnum;
+ }
}
}
@@ -1274,7 +1329,7 @@ restart:
* Apply any needed deletes. We issue just one _bt_delitems_vacuum()
* call per page, so as to minimize WAL traffic.
*/
- if (ndeletable > 0)
+ if (ndeletable > 0 || nremaining > 0)
{
/*
* Notice that the issued XLOG_BTREE_VACUUM WAL record includes
@@ -1291,6 +1346,7 @@ restart:
* that.
*/
_bt_delitems_vacuum(rel, buf, deletable, ndeletable,
+ remainingoffset, remaining, nremaining,
vstate->lastBlockVacuumed);
/*
@@ -1375,6 +1431,41 @@ restart:
}
}
+/*
+ * btreevacuumPosting() -- vacuums a posting tuple.
+ *
+ * Returns new palloc'd posting list with remaining items.
+ * Posting list size is returned via nremaining.
+ *
+ * If all items are dead,
+ * nremaining is 0 and resulting posting list is NULL.
+ */
+static ItemPointer
+btreevacuumPosting(BTVacState *vstate, IndexTuple itup, int *nremaining)
+{
+ int remaining = 0;
+ int nitem = BTreeTupleGetNPosting(itup);
+ ItemPointer tmpitems = NULL,
+ items = BTreeTupleGetPosting(itup);
+
+ /*
+ * Check each TID in the posting list, saving live TIDs into tmpitems
+ */
+ for (int i = 0; i < nitem; i++)
+ {
+ if (vstate->callback(items + i, vstate->callback_state))
+ continue;
+
+ if (tmpitems == NULL)
+ tmpitems = palloc(sizeof(ItemPointerData) * nitem);
+
+ tmpitems[remaining++] = items[i];
+ }
+
+ *nremaining = remaining;
+ return tmpitems;
+}
+
/*
* btcanreturn() -- Check whether btree indexes support index-only scans.
*
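
Reviewer note on btreevacuumPosting() above: the helper filters one posting
list through the vacuum callback and allocates the result array lazily, so
a fully dead posting list costs no allocation. A minimal standalone sketch
follows. It is not code from the patch: SketchTid, sketch_dead_cb,
sketch_vacuum_posting() and the sample callback sketch_offset_is_odd() are
hypothetical names, and malloc stands in for palloc.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for ItemPointerData */
typedef struct SketchTid
{
	unsigned	block;
	unsigned short offset;
} SketchTid;

/* Returns true when the heap tuple behind this TID is dead */
typedef bool (*sketch_dead_cb) (const SketchTid *tid, void *state);

/*
 * Return a malloc'd array holding the TIDs from items[] that the callback
 * does not report as dead, and store their count in *nremaining.
 *
 * The array is allocated on first use, so the result is NULL with
 * *nremaining == 0 when every TID is dead.  On allocation failure the
 * result is NULL and *nremaining is set to -1.
 */
SketchTid *
sketch_vacuum_posting(const SketchTid *items, int nitems,
					  sketch_dead_cb is_dead, void *state,
					  int *nremaining)
{
	SketchTid  *kept = NULL;
	int			n = 0;

	assert(nitems >= 1);
	for (int i = 0; i < nitems; i++)
	{
		if (is_dead(&items[i], state))
			continue;

		if (kept == NULL)
		{
			kept = malloc(sizeof(SketchTid) * nitems);
			if (kept == NULL)
			{
				*nremaining = -1;
				return NULL;
			}
		}
		kept[n++] = items[i];
	}

	*nremaining = n;
	return kept;
}

/* Sample callback for illustration only: treats odd offsets as dead */
bool
sketch_offset_is_odd(const SketchTid *tid, void *state)
{
	(void) state;
	return (tid->offset % 2) == 1;
}
```

A caller would then distinguish three outcomes, as btvacuumpage() does in
the patch: zero remaining TIDs means the whole tuple is deleted, an
unchanged count means the tuple is left alone, and anything in between
means the tuple is rewritten in place with only the remaining TIDs.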
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 8e512461a0..af5e136af7 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -26,10 +26,18 @@
static void _bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp);
static OffsetNumber _bt_binsrch(Relation rel, BTScanInsert key, Buffer buf);
+static int _bt_binsrch_posting(BTScanInsert key, Page page,
+ OffsetNumber offnum);
static bool _bt_readpage(IndexScanDesc scan, ScanDirection dir,
OffsetNumber offnum);
static void _bt_saveitem(BTScanOpaque so, int itemIndex,
OffsetNumber offnum, IndexTuple itup);
+static void _bt_setuppostingitems(BTScanOpaque so, int itemIndex,
+ OffsetNumber offnum, ItemPointer heapTid,
+ IndexTuple itup);
+static inline void _bt_savepostingitem(BTScanOpaque so, int itemIndex,
+ OffsetNumber offnum,
+ ItemPointer heapTid);
static bool _bt_steppage(IndexScanDesc scan, ScanDirection dir);
static bool _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir);
static bool _bt_parallel_readpage(IndexScanDesc scan, BlockNumber blkno,
@@ -434,7 +442,10 @@ _bt_binsrch(Relation rel,
* low) makes bounds invalid.
*
* Caller is responsible for invalidating bounds when it modifies the page
- * before calling here a second time.
+ * before calling here a second time, and for dealing with posting list
+ * tuple matches (callers can use insertstate's in_posting_offset field to
+ * determine which existing heap TID will need to be replaced by their
+ * scantid/new heap TID).
*/
OffsetNumber
_bt_binsrch_insert(Relation rel, BTInsertState insertstate)
@@ -453,6 +464,7 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
Assert(P_ISLEAF(opaque));
Assert(!key->nextkey);
+ Assert(insertstate->in_posting_offset == 0);
if (!insertstate->bounds_valid)
{
@@ -509,6 +521,17 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
if (result != 0)
stricthigh = high;
}
+
+ /*
+ * If tuple at offset located by binary search is a posting list whose
+ * TID range overlaps with caller's scantid, perform posting list
+ * binary search to set in_posting_offset for caller. Caller must
+ * split the posting list when in_posting_offset is set. This should
+ * happen infrequently.
+ */
+ if (unlikely(result == 0 && key->scantid != NULL))
+ insertstate->in_posting_offset =
+ _bt_binsrch_posting(key, page, mid);
}
/*
@@ -528,6 +551,68 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
return low;
}
+/*----------
+ * _bt_binsrch_posting() -- posting list binary search.
+ *
+ * Returns offset into posting list where caller's scantid belongs.
+ *----------
+ */
+static int
+_bt_binsrch_posting(BTScanInsert key, Page page, OffsetNumber offnum)
+{
+ IndexTuple itup;
+ ItemId itemid;
+ int low,
+ high,
+ mid,
+ res;
+
+ /*
+ * If this isn't a posting tuple, then the index must be corrupt (if it is
+ * an ordinary non-pivot tuple then there must be an existing tuple with a
+ * heap TID that equals inserter's new heap TID/scantid). Defensively
+ * check that tuple is a posting list tuple whose posting list range
+ * includes caller's scantid.
+ *
+ * (This is also needed because contrib/amcheck's rootdescend option needs
+ * to be able to relocate a non-pivot tuple using _bt_binsrch_insert().)
+ */
+ Assert(P_ISLEAF((BTPageOpaque) PageGetSpecialPointer(page)));
+ Assert(!key->nextkey);
+ Assert(key->scantid != NULL);
+ itemid = PageGetItemId(page, offnum);
+ itup = (IndexTuple) PageGetItem(page, itemid);
+ if (!BTreeTupleIsPosting(itup))
+ return 0;
+
+ /*
+ * In the unlikely event that posting list tuple has LP_DEAD bit set,
+ * signal to caller that it should kill the item and restart its binary
+ * search.
+ */
+ if (ItemIdIsDead(itemid))
+ return -1;
+
+ /* "high" is past end of posting list for loop invariant */
+ low = 0;
+ high = BTreeTupleGetNPosting(itup);
+ Assert(high >= 2);
+
+ while (high > low)
+ {
+ mid = low + ((high - low) / 2);
+ res = ItemPointerCompare(key->scantid,
+ BTreeTupleGetPostingN(itup, mid));
+
+ if (res >= 1)
+ low = mid + 1;
+ else
+ high = mid;
+ }
+
+ return low;
+}
+
/*----------
* _bt_compare() -- Compare insertion-type scankey to tuple on a page.
*
@@ -537,9 +622,18 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
* <0 if scankey < tuple at offnum;
* 0 if scankey == tuple at offnum;
* >0 if scankey > tuple at offnum.
- * NULLs in the keys are treated as sortable values. Therefore
- * "equality" does not necessarily mean that the item should be
- * returned to the caller as a matching key!
+ *
+ * NULLs in the keys are treated as sortable values. Therefore
+ * "equality" does not necessarily mean that the item should be returned
+ * to the caller as a matching key. Similarly, an insertion scankey
+ * with its scantid set is treated as equal to a posting tuple whose TID
+ * range overlaps with their scantid. There generally won't be a
+ * matching TID in the posting tuple, which caller must handle
+ * themselves (e.g., by splitting the posting list tuple).
+ *
+ * It is generally guaranteed that any possible scankey with scantid set
+ * will have zero or one tuples in the index that are considered equal
+ * here.
*
* CRUCIAL NOTE: on a non-leaf page, the first data key is assumed to be
* "minus infinity": this routine will always claim it is less than the
@@ -563,6 +657,7 @@ _bt_compare(Relation rel,
ScanKey scankey;
int ncmpkey;
int ntupatts;
+ int32 result;
Assert(_bt_check_natts(rel, key->heapkeyspace, page, offnum));
Assert(key->keysz <= IndexRelationGetNumberOfKeyAttributes(rel));
@@ -597,7 +692,6 @@ _bt_compare(Relation rel,
{
Datum datum;
bool isNull;
- int32 result;
datum = index_getattr(itup, scankey->sk_attno, itupdesc, &isNull);
@@ -713,8 +807,24 @@ _bt_compare(Relation rel,
if (heapTid == NULL)
return 1;
+ /*
+ * scankey must be treated as equal to a posting list tuple if its scantid
+ * value falls within the range of the posting list. In all other cases
+ * there can only be a single heap TID value, which is compared directly
+ * as a simple scalar value.
+ */
Assert(ntupatts >= IndexRelationGetNumberOfKeyAttributes(rel));
- return ItemPointerCompare(key->scantid, heapTid);
+ result = ItemPointerCompare(key->scantid, heapTid);
+ if (!BTreeTupleIsPosting(itup) || result <= 0)
+ return result;
+ else
+ {
+ result = ItemPointerCompare(key->scantid, BTreeTupleGetMaxTID(itup));
+ if (result > 0)
+ return 1;
+ }
+
+ return 0;
}
/*
@@ -1451,6 +1561,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
/* initialize tuple workspace to empty */
so->currPos.nextTupleOffset = 0;
+ so->currPos.postingTupleOffset = 0;
/*
* Now that the current page has been made consistent, the macro should be
@@ -1485,8 +1596,29 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
if (_bt_checkkeys(scan, itup, indnatts, dir, &continuescan))
{
/* tuple passes all scan key conditions, so remember it */
- _bt_saveitem(so, itemIndex, offnum, itup);
- itemIndex++;
+ if (!BTreeTupleIsPosting(itup))
+ {
+ _bt_saveitem(so, itemIndex, offnum, itup);
+ itemIndex++;
+ }
+ else
+ {
+ /*
+ * Setup state to return posting list, and save first
+ * "logical" tuple
+ */
+ _bt_setuppostingitems(so, itemIndex, offnum,
+ BTreeTupleGetPostingN(itup, 0),
+ itup);
+ itemIndex++;
+ /* Save additional posting list "logical" tuples */
+ for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
+ {
+ _bt_savepostingitem(so, itemIndex, offnum,
+ BTreeTupleGetPostingN(itup, i));
+ itemIndex++;
+ }
+ }
}
/* When !continuescan, there can't be any more matches, so stop */
if (!continuescan)
@@ -1519,7 +1651,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
if (!continuescan)
so->currPos.moreRight = false;
- Assert(itemIndex <= MaxIndexTuplesPerPage);
+ Assert(itemIndex <= MaxPostingIndexTuplesPerPage);
so->currPos.firstItem = 0;
so->currPos.lastItem = itemIndex - 1;
so->currPos.itemIndex = 0;
@@ -1527,7 +1659,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
else
{
/* load items[] in descending order */
- itemIndex = MaxIndexTuplesPerPage;
+ itemIndex = MaxPostingIndexTuplesPerPage;
offnum = Min(offnum, maxoff);
@@ -1569,8 +1701,36 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
if (passes_quals && tuple_alive)
{
/* tuple passes all scan key conditions, so remember it */
- itemIndex--;
- _bt_saveitem(so, itemIndex, offnum, itup);
+ if (!BTreeTupleIsPosting(itup))
+ {
+ itemIndex--;
+ _bt_saveitem(so, itemIndex, offnum, itup);
+ }
+ else
+ {
+ int i = BTreeTupleGetNPosting(itup) - 1;
+
+ /*
+ * Set up state to return posting list, and save last
+ * "logical" tuple from posting list (since it's the first
+ * that will be returned to scan).
+ */
+ itemIndex--;
+ _bt_setuppostingitems(so, itemIndex, offnum,
+ BTreeTupleGetPostingN(itup, i--),
+ itup);
+
+ /*
+ * Return posting list "logical" tuples -- do this in
+ * descending order, to match overall scan order
+ */
+ for (; i >= 0; i--)
+ {
+ itemIndex--;
+ _bt_savepostingitem(so, itemIndex, offnum,
+ BTreeTupleGetPostingN(itup, i));
+ }
+ }
}
if (!continuescan)
{
@@ -1584,8 +1744,8 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
Assert(itemIndex >= 0);
so->currPos.firstItem = itemIndex;
- so->currPos.lastItem = MaxIndexTuplesPerPage - 1;
- so->currPos.itemIndex = MaxIndexTuplesPerPage - 1;
+ so->currPos.lastItem = MaxPostingIndexTuplesPerPage - 1;
+ so->currPos.itemIndex = MaxPostingIndexTuplesPerPage - 1;
}
return (so->currPos.firstItem <= so->currPos.lastItem);
@@ -1598,6 +1758,8 @@ _bt_saveitem(BTScanOpaque so, int itemIndex,
{
BTScanPosItem *currItem = &so->currPos.items[itemIndex];
+ Assert(!BTreeTupleIsPosting(itup));
+
currItem->heapTid = itup->t_tid;
currItem->indexOffset = offnum;
if (so->currTuples)
@@ -1610,6 +1772,59 @@ _bt_saveitem(BTScanOpaque so, int itemIndex,
}
}
+/*
+ * Set up state to save posting items from a single posting list tuple.
+ *
+ * Saves an index item into so->currPos.items[itemIndex] for the logical
+ * tuple that is returned to the scan first. The second and subsequent heap
+ * TIDs from the posting list should be saved by calling
+ * _bt_savepostingitem().
+ */
+static void
+_bt_setuppostingitems(BTScanOpaque so, int itemIndex, OffsetNumber offnum,
+ ItemPointer heapTid, IndexTuple itup)
+{
+ BTScanPosItem *currItem = &so->currPos.items[itemIndex];
+
+ currItem->heapTid = *heapTid;
+ currItem->indexOffset = offnum;
+
+ if (so->currTuples)
+ {
+ /* Save a truncated version of the IndexTuple */
+ Size itupsz = BTreeTupleGetPostingOffset(itup);
+
+ itupsz = MAXALIGN(itupsz);
+ currItem->tupleOffset = so->currPos.nextTupleOffset;
+ memcpy(so->currTuples + so->currPos.nextTupleOffset, itup, itupsz);
+ so->currPos.nextTupleOffset += itupsz;
+ so->currPos.postingTupleOffset = currItem->tupleOffset;
+ }
+}
+
+/*
+ * Save an index item into so->currPos.items[itemIndex] for posting tuple.
+ *
+ * Assumes that _bt_setuppostingitems() has already been called for current
+ * posting list tuple.
+ */
+static inline void
+_bt_savepostingitem(BTScanOpaque so, int itemIndex, OffsetNumber offnum,
+ ItemPointer heapTid)
+{
+ BTScanPosItem *currItem = &so->currPos.items[itemIndex];
+
+ currItem->heapTid = *heapTid;
+ currItem->indexOffset = offnum;
+
+ /*
+ * Have index-only scans return the same truncated IndexTuple for every
+ * logical tuple that originates from the same posting list
+ */
+ if (so->currTuples)
+ currItem->tupleOffset = so->currPos.postingTupleOffset;
+}
+
/*
* _bt_steppage() -- Step to next page containing valid data for scan
*
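To make the new _bt_compare() tiebreak rule in the nbtsearch.c hunks above easier to review, here is a minimal standalone sketch of it. It is not code from the patch: DemoTid, demo_tid_compare() and demo_compare_scantid() are hypothetical stand-ins for ItemPointerData, ItemPointerCompare() and the tail of _bt_compare(), and a plain tuple is modelled as a range whose lowest and highest TID are the same.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for ItemPointerData: a (block, offset) pair. */
typedef struct DemoTid
{
	uint32_t	block;
	uint16_t	offset;
} DemoTid;

/* Order TIDs by block number, then by offset; returns -1, 0 or 1. */
static int
demo_tid_compare(const DemoTid *a, const DemoTid *b)
{
	if (a->block != b->block)
		return (a->block < b->block) ? -1 : 1;
	if (a->offset != b->offset)
		return (a->offset < b->offset) ? -1 : 1;
	return 0;
}

/*
 * Compare scantid against a tuple whose heap TIDs span [mintid, maxtid].
 * A scantid inside the range is treated as equal to the tuple.  For a
 * plain (non-posting) tuple the caller passes mintid == maxtid, which
 * degenerates to a simple scalar comparison.
 */
static int
demo_compare_scantid(const DemoTid *scantid,
					 const DemoTid *mintid, const DemoTid *maxtid)
{
	int			result;

	assert(demo_tid_compare(mintid, maxtid) <= 0);

	result = demo_tid_compare(scantid, mintid);
	if (result <= 0)
		return result;			/* below or at the lowest TID */
	if (demo_tid_compare(scantid, maxtid) > 0)
		return 1;				/* beyond the highest TID */
	return 0;					/* falls within the posting list range */
}
```

The sketch only mirrors the range check. It says nothing about where inside the posting list the scantid would land, which the patch has to resolve separately at insertion time.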
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index ab19692006..9f193768f2 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -288,6 +288,8 @@ static void _bt_sortaddtup(Page page, Size itemsize,
static void _bt_buildadd(BTWriteState *wstate, BTPageState *state,
IndexTuple itup);
static void _bt_uppershutdown(BTWriteState *wstate, BTPageState *state);
+static void _bt_buildadd_posting(BTWriteState *wstate, BTPageState *state,
+ BTDedupState *dedupState);
static void _bt_load(BTWriteState *wstate,
BTSpool *btspool, BTSpool *btspool2);
static void _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent,
@@ -830,6 +832,8 @@ _bt_sortaddtup(Page page,
* the high key is to be truncated, offset 1 is deleted, and we insert
* the truncated high key at offset 1.
*
+ * Note that itup may be a posting list tuple.
+ *
* 'last' pointer indicates the last offset added to the page.
*----------
*/
@@ -963,6 +967,11 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
* Overwrite the old item with new truncated high key directly.
* oitup is already located at the physical beginning of tuple
* space, so this should directly reuse the existing tuple space.
+ *
+ * If lastleft tuple was a posting tuple, we'll truncate its
+ * posting list in _bt_truncate as well. Note that this applies
+ * only to leaf pages, since internal pages never contain
+ * posting tuples.
*/
ii = PageGetItemId(opage, OffsetNumberPrev(last_off));
lastleft = (IndexTuple) PageGetItem(opage, ii);
@@ -1002,6 +1011,7 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
* the minimum key for the new page.
*/
state->btps_minkey = CopyIndexTuple(oitup);
+ Assert(BTreeTupleIsPivot(state->btps_minkey));
/*
* Set the sibling links for both pages.
@@ -1043,6 +1053,7 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
Assert(state->btps_minkey == NULL);
state->btps_minkey = CopyIndexTuple(itup);
/* _bt_sortaddtup() will perform full truncation later */
+ BTreeTupleClearBtIsPosting(state->btps_minkey);
BTreeTupleSetNAtts(state->btps_minkey, 0);
}
@@ -1127,6 +1138,40 @@ _bt_uppershutdown(BTWriteState *wstate, BTPageState *state)
_bt_blwritepage(wstate, metapage, BTREE_METAPAGE);
}
+/*
+ * Add new tuple (posting or non-posting) to the page while building index.
+ */
+static void
+_bt_buildadd_posting(BTWriteState *wstate, BTPageState *state,
+ BTDedupState *dedupState)
+{
+ IndexTuple to_insert;
+
+ /* Return if there is no tuple to insert */
+ if (state == NULL)
+ return;
+
+ if (dedupState->ntuples == 0)
+ to_insert = dedupState->itupprev;
+ else
+ {
+ IndexTuple postingtuple;
+
+ /* form a tuple with a posting list */
+ postingtuple = BTreeFormPostingTuple(dedupState->itupprev,
+ dedupState->ipd,
+ dedupState->ntuples);
+ to_insert = postingtuple;
+ pfree(dedupState->ipd);
+ }
+
+ _bt_buildadd(wstate, state, to_insert);
+
+ if (dedupState->ntuples > 0)
+ pfree(to_insert);
+ dedupState->ntuples = 0;
+}
+
/*
* Read tuples in correct sort order from tuplesort, and load them into
* btree leaves.
@@ -1141,9 +1186,20 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
bool load1;
TupleDesc tupdes = RelationGetDescr(wstate->index);
int i,
- keysz = IndexRelationGetNumberOfKeyAttributes(wstate->index);
+ keysz = IndexRelationGetNumberOfKeyAttributes(wstate->index),
+ natts = IndexRelationGetNumberOfAttributes(wstate->index);
SortSupport sortKeys;
int64 tuples_done = 0;
+ bool deduplicate = false;
+ BTDedupState *dedupState = NULL;
+
+ /*
+ * Don't use deduplication for indexes with INCLUDEd columns or for
+ * unique indexes
+ */
+ deduplicate = (IndexRelationGetNumberOfKeyAttributes(wstate->index) ==
+ IndexRelationGetNumberOfAttributes(wstate->index) &&
+ !wstate->index->rd_index->indisunique);
if (merge)
{
@@ -1257,19 +1313,89 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
}
else
{
- /* merge is unnecessary */
- while ((itup = tuplesort_getindextuple(btspool->sortstate,
- true)) != NULL)
+ if (!deduplicate)
{
- /* When we see first tuple, create first index page */
- if (state == NULL)
- state = _bt_pagestate(wstate, 0);
+ /* merge is unnecessary */
+ while ((itup = tuplesort_getindextuple(btspool->sortstate,
+ true)) != NULL)
+ {
+ /* When we see first tuple, create first index page */
+ if (state == NULL)
+ state = _bt_pagestate(wstate, 0);
- _bt_buildadd(wstate, state, itup);
+ _bt_buildadd(wstate, state, itup);
- /* Report progress */
- pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
- ++tuples_done);
+ /* Report progress */
+ pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
+ ++tuples_done);
+ }
+ }
+ else
+ {
+ /* init deduplication state needed to build posting tuples */
+ dedupState = (BTDedupState *) palloc0(sizeof(BTDedupState));
+ dedupState->ipd = NULL;
+ dedupState->ntuples = 0;
+ dedupState->alltupsize = 0;
+ dedupState->itupprev = NULL;
+ dedupState->maxitemsize = 0;
+ dedupState->maxpostingsize = 0;
+
+ while ((itup = tuplesort_getindextuple(btspool->sortstate,
+ true)) != NULL)
+ {
+ /* When we see first tuple, create first index page */
+ if (state == NULL)
+ {
+ state = _bt_pagestate(wstate, 0);
+ dedupState->maxitemsize = BTMaxItemSize(state->btps_page);
+ }
+
+ if (dedupState->itupprev != NULL)
+ {
+ int n_equal_atts = _bt_keep_natts_fast(wstate->index,
+ dedupState->itupprev, itup);
+
+ if (n_equal_atts > natts)
+ {
+ /*
+ * Tuples are equal. Create or update posting list.
+ *
+ * If the posting list would become too big, insert it
+ * on the page and continue.
+ */
+ if ((dedupState->ntuples + 1) * sizeof(ItemPointerData) <
+ dedupState->maxpostingsize)
+ _bt_dedup_item_tid(dedupState, itup);
+ else
+ _bt_buildadd_posting(wstate, state, dedupState);
+ }
+ else
+ {
+ /*
+ * Tuples are not equal. Insert itupprev into index.
+ * Save current tuple for the next iteration.
+ */
+ _bt_buildadd_posting(wstate, state, dedupState);
+ }
+ }
+
+ /*
+ * Save the tuple to compare it with the next one and maybe
+ * unite them into a posting tuple.
+ */
+ if (dedupState->itupprev)
+ pfree(dedupState->itupprev);
+ dedupState->itupprev = CopyIndexTuple(itup);
+
+ /* compute max size of posting list */
+ dedupState->maxpostingsize = dedupState->maxitemsize -
+ IndexInfoFindDataOffset(dedupState->itupprev->t_info) -
+ MAXALIGN(IndexTupleSize(dedupState->itupprev));
+ }
+
+ /* Handle the last item */
+ _bt_buildadd_posting(wstate, state, dedupState);
}
}
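The deduplicating branch of _bt_load() above is easier to follow with the page machinery stripped away, so here is a rough standalone sketch of the grouping loop. It is an illustration under simplifying assumptions, not the patch's code: keys are plain ints, the cap is a TID count (max_tids) rather than the byte-based maxpostingsize, and demo_dedup_sorted() is a hypothetical name. The pending counter plays the role of itupprev plus dedupState->ntuples.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Walk keys[] (already in sort order) and record how many heap TIDs each
 * emitted index tuple would carry.  A run of equal keys is folded into one
 * posting tuple until it holds max_tids TIDs; the next duplicate then
 * starts a new pending tuple.  Returns the number of tuples emitted.
 */
static size_t
demo_dedup_sorted(const int *keys, size_t nkeys, size_t max_tids,
				  size_t *group_sizes, size_t max_groups)
{
	size_t		ngroups = 0;
	size_t		pending = 0;	/* TIDs accumulated for the pending tuple */

	assert(max_tids >= 1);

	for (size_t i = 0; i < nkeys; i++)
	{
		/* pending > 0 implies i >= 1, so keys[i - 1] is a valid access */
		int			same_key = (pending > 0 && keys[i] == keys[i - 1]);

		if (same_key && pending < max_tids)
		{
			pending++;			/* fold into the pending posting list */
			continue;
		}

		/* Different key, or posting list is full: flush pending tuple */
		if (pending > 0)
		{
			assert(ngroups < max_groups);
			group_sizes[ngroups++] = pending;
		}
		pending = 1;			/* current tuple becomes the pending one */
	}

	/* Handle the last item */
	if (pending > 0)
	{
		assert(ngroups < max_groups);
		group_sizes[ngroups++] = pending;
	}
	return ngroups;
}
```

For example, the sorted keys 1,1,1,2,3,3,3,3,3 with max_tids = 3 come out as four tuples carrying 3, 1, 3 and 2 TIDs. The real loop differs in that the cap depends on the key size of each tuple, since maxpostingsize is recomputed from itupprev.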
diff --git a/src/backend/access/nbtree/nbtsplitloc.c b/src/backend/access/nbtree/nbtsplitloc.c
index 1c1029b6c4..54cecc85c5 100644
--- a/src/backend/access/nbtree/nbtsplitloc.c
+++ b/src/backend/access/nbtree/nbtsplitloc.c
@@ -183,6 +183,9 @@ _bt_findsplitloc(Relation rel,
state.minfirstrightsz = SIZE_MAX;
state.newitemoff = newitemoff;
+ /* newitem cannot be a posting list item */
+ Assert(!BTreeTupleIsPosting(newitem));
+
/*
* maxsplits should never exceed maxoff because there will be at most as
* many candidate split points as there are points _between_ tuples, once
@@ -459,17 +462,52 @@ _bt_recsplitloc(FindSplitData *state,
int16 leftfree,
rightfree;
Size firstrightitemsz;
+ Size postingsubhikey = 0;
bool newitemisfirstonright;
/* Is the new item going to be the first item on the right page? */
newitemisfirstonright = (firstoldonright == state->newitemoff
&& !newitemonleft);
+ /*
+ * FIXME: Accessing every single tuple like this adds cycles to cases that
+ * cannot possibly benefit (i.e. cases where we know that there cannot be
+ * posting lists). Maybe we should add a way to not bother when we are
+ * certain that this is the case.
+ *
+ * We could either have _bt_split() pass us a flag, or invent a page flag
+ * that indicates that the page might have posting lists, as an
+ * optimization. There is no shortage of btpo_flags bits for stuff like
+ * this.
+ */
if (newitemisfirstonright)
+ {
firstrightitemsz = state->newitemsz;
+
+ /* Calculate posting list overhead, if any */
+ if (state->is_leaf && BTreeTupleIsPosting(state->newitem))
+ postingsubhikey = IndexTupleSize(state->newitem) -
+ BTreeTupleGetPostingOffset(state->newitem);
+ }
else
+ {
firstrightitemsz = firstoldonrightsz;
+ /* Calculate posting list overhead, if any */
+ if (state->is_leaf)
+ {
+ ItemId itemid;
+ IndexTuple newhighkey;
+
+ itemid = PageGetItemId(state->page, firstoldonright);
+ newhighkey = (IndexTuple) PageGetItem(state->page, itemid);
+
+ if (BTreeTupleIsPosting(newhighkey))
+ postingsubhikey = IndexTupleSize(newhighkey) -
+ BTreeTupleGetPostingOffset(newhighkey);
+ }
+ }
+
/* Account for all the old tuples */
leftfree = state->leftspace - olddataitemstoleft;
rightfree = state->rightspace -
@@ -492,9 +530,13 @@ _bt_recsplitloc(FindSplitData *state,
* adding a heap TID to the left half's new high key when splitting at the
* leaf level. In practice the new high key will often be smaller and
* will rarely be larger, but conservatively assume the worst case.
+ * Truncation always truncates away any posting list that appears in the
+ * first right tuple, though, so it's safe to subtract that overhead
+ * (while still conservatively assuming that truncation might have to add
+ * back a single heap TID using the pivot tuple heap TID representation).
*/
if (state->is_leaf)
- leftfree -= (int16) (firstrightitemsz +
+ leftfree -= (int16) ((firstrightitemsz - postingsubhikey) +
MAXALIGN(sizeof(ItemPointerData)));
else
leftfree -= (int16) firstrightitemsz;
@@ -691,7 +733,8 @@ _bt_afternewitemoff(FindSplitData *state, OffsetNumber maxoff,
itemid = PageGetItemId(state->page, OffsetNumberPrev(state->newitemoff));
tup = (IndexTuple) PageGetItem(state->page, itemid);
/* Do cheaper test first */
- if (!_bt_adjacenthtid(&tup->t_tid, &state->newitem->t_tid))
+ if (BTreeTupleIsPosting(tup) ||
+ !_bt_adjacenthtid(&tup->t_tid, &state->newitem->t_tid))
return false;
/* Check same conditions as rightmost item case, too */
keepnatts = _bt_keep_natts_fast(state->rel, tup, state->newitem);
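The space accounting change in _bt_recsplitloc() above comes down to a little arithmetic, sketched here in isolation. This is only an illustration: demo_hikey_worst_case() is a hypothetical helper, and the 8-byte MAXALIGN and 6-byte ItemPointerData are assumptions that hold on common 64-bit builds but are hard-coded here rather than taken from the PostgreSQL headers.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed platform constants, hard-coded for the sketch */
#define DEMO_MAXALIGN(len) (((size_t) (len) + 7) & ~((size_t) 7))
#define DEMO_TID_SIZE 6

/*
 * Worst-case space that the new left-page high key can need at the leaf
 * level.  Truncation always removes a posting list from the first right
 * tuple, so only the bytes before posting_offset count, plus room for one
 * heap TID in the pivot tuple representation.
 */
static size_t
demo_hikey_worst_case(size_t firstright_size, size_t posting_offset,
					  int is_posting)
{
	size_t		posting_overhead = 0;

	if (is_posting)
	{
		assert(posting_offset <= firstright_size);
		posting_overhead = firstright_size - posting_offset;
	}

	return (firstright_size - posting_overhead) +
		DEMO_MAXALIGN(DEMO_TID_SIZE);
}
```

With a 64-byte first right tuple whose posting list starts at byte 16, the sketch charges 16 + 8 = 24 bytes to the left page instead of 64 + 8 = 72, which is why subtracting postingsubhikey can make more split points legal.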
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index bc855dd25d..f7575ed48c 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -97,8 +97,6 @@ _bt_mkscankey(Relation rel, IndexTuple itup)
indoption = rel->rd_indoption;
tupnatts = itup ? BTreeTupleGetNAtts(itup, rel) : 0;
- Assert(tupnatts <= IndexRelationGetNumberOfAttributes(rel));
-
/*
* We'll execute search using scan key constructed on key columns.
* Truncated attributes and non-key attributes are omitted from the final
@@ -110,9 +108,20 @@ _bt_mkscankey(Relation rel, IndexTuple itup)
key->anynullkeys = false; /* initial assumption */
key->nextkey = false;
key->pivotsearch = false;
+ key->scantid = NULL;
key->keysz = Min(indnkeyatts, tupnatts);
- key->scantid = key->heapkeyspace && itup ?
- BTreeTupleGetHeapTID(itup) : NULL;
+
+ Assert(tupnatts <= IndexRelationGetNumberOfAttributes(rel));
+ Assert(!itup || !BTreeTupleIsPosting(itup) || key->heapkeyspace);
+
+ /*
+ * When caller passes a tuple with a heap TID, use it to set scantid. Note
+ * that this handles posting list tuples by setting scantid to the lowest
+ * heap TID in the posting list.
+ */
+ if (itup && key->heapkeyspace)
+ key->scantid = BTreeTupleGetHeapTID(itup);
+
skey = key->scankeys;
for (i = 0; i < indnkeyatts; i++)
{
@@ -1386,6 +1395,7 @@ _bt_checkkeys(IndexScanDesc scan, IndexTuple tuple, int tupnatts,
* attribute passes the qual.
*/
Assert(ScanDirectionIsForward(dir));
+ Assert(BTreeTupleIsPivot(tuple));
continue;
}
@@ -1547,6 +1557,7 @@ _bt_check_rowcompare(ScanKey skey, IndexTuple tuple, int tupnatts,
* attribute passes the qual.
*/
Assert(ScanDirectionIsForward(dir));
+ Assert(BTreeTupleIsPivot(tuple));
cmpresult = 0;
if (subkey->sk_flags & SK_ROW_END)
break;
@@ -1786,10 +1797,35 @@ _bt_killitems(IndexScanDesc scan)
{
ItemId iid = PageGetItemId(page, offnum);
IndexTuple ituple = (IndexTuple) PageGetItem(page, iid);
+ bool killtuple = false;
- if (ItemPointerEquals(&ituple->t_tid, &kitem->heapTid))
+ if (BTreeTupleIsPosting(ituple))
{
- /* found the item */
+ int pi = i + 1;
+ int nposting = BTreeTupleGetNPosting(ituple);
+ int j;
+
+ for (j = 0; j < nposting; j++)
+ {
+ ItemPointer item = BTreeTupleGetPostingN(ituple, j);
+
+ if (!ItemPointerEquals(item, &kitem->heapTid))
+ break; /* out of posting list loop */
+
+ /* Read-ahead to later kitems */
+ if (pi < numKilled)
+ kitem = &so->currPos.items[so->killedItems[pi++]];
+ }
+
+ if (j == nposting)
+ killtuple = true;
+ }
+ else if (ItemPointerEquals(&ituple->t_tid, &kitem->heapTid))
+ killtuple = true;
+
+ if (killtuple)
+ {
+ /* found the item/all posting list items */
ItemIdMarkDead(iid);
killedsomething = true;
break; /* out of inner search loop */
@@ -2140,6 +2176,24 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
pivot = index_truncate_tuple(itupdesc, firstright, keepnatts);
+ if (BTreeTupleIsPosting(firstright))
+ {
+ BTreeTupleClearBtIsPosting(pivot);
+ BTreeTupleSetNAtts(pivot, keepnatts);
+ if (keepnatts == natts)
+ {
+ /*
+ * index_truncate_tuple() just returned a copy of the
+ * original, so make sure that the size of the new pivot tuple
+ * doesn't have posting list overhead
+ */
+ pivot->t_info &= ~INDEX_SIZE_MASK;
+ pivot->t_info |= MAXALIGN(BTreeTupleGetPostingOffset(firstright));
+ }
+ }
+
+ Assert(!BTreeTupleIsPosting(pivot));
+
/*
* If there is a distinguishing key attribute within new pivot tuple,
* there is no need to add an explicit heap TID attribute
@@ -2156,6 +2210,8 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
* attribute to the new pivot tuple.
*/
Assert(natts != nkeyatts);
+ Assert(!BTreeTupleIsPosting(lastleft) &&
+ !BTreeTupleIsPosting(firstright));
newsize = IndexTupleSize(pivot) + MAXALIGN(sizeof(ItemPointerData));
tidpivot = palloc0(newsize);
memcpy(tidpivot, pivot, IndexTupleSize(pivot));
@@ -2163,6 +2219,24 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
pfree(pivot);
pivot = tidpivot;
}
+ else if (BTreeTupleIsPosting(firstright))
+ {
+ /*
+ * No truncation was possible, since key attributes are all equal. We
+ * can always truncate away a posting list, though.
+ *
+ * It's necessary to add a heap TID attribute to the new pivot tuple.
+ */
+ newsize = MAXALIGN(BTreeTupleGetPostingOffset(firstright)) +
+ MAXALIGN(sizeof(ItemPointerData));
+ pivot = palloc0(newsize);
+ memcpy(pivot, firstright, BTreeTupleGetPostingOffset(firstright));
+
+ pivot->t_info &= ~INDEX_SIZE_MASK;
+ pivot->t_info |= newsize;
+ BTreeTupleClearBtIsPosting(pivot);
+ BTreeTupleSetAltHeapTID(pivot);
+ }
else
{
/*
@@ -2170,7 +2244,8 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
* It's necessary to add a heap TID attribute to the new pivot tuple.
*/
Assert(natts == nkeyatts);
- newsize = IndexTupleSize(firstright) + MAXALIGN(sizeof(ItemPointerData));
+ newsize = MAXALIGN(IndexTupleSize(firstright)) +
+ MAXALIGN(sizeof(ItemPointerData));
pivot = palloc0(newsize);
memcpy(pivot, firstright, IndexTupleSize(firstright));
}
@@ -2188,6 +2263,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
* nbtree (e.g., there is no pg_attribute entry).
*/
Assert(itup_key->heapkeyspace);
+ Assert(!BTreeTupleIsPosting(pivot));
pivot->t_info &= ~INDEX_SIZE_MASK;
pivot->t_info |= newsize;
@@ -2200,7 +2276,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
*/
pivotheaptid = (ItemPointer) ((char *) pivot + newsize -
sizeof(ItemPointerData));
- ItemPointerCopy(&lastleft->t_tid, pivotheaptid);
+ ItemPointerCopy(BTreeTupleGetMaxTID(lastleft), pivotheaptid);
/*
* Lehman and Yao require that the downlink to the right page, which is to
@@ -2211,9 +2287,12 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
* tiebreaker.
*/
#ifndef DEBUG_NO_TRUNCATE
- Assert(ItemPointerCompare(&lastleft->t_tid, &firstright->t_tid) < 0);
- Assert(ItemPointerCompare(pivotheaptid, &lastleft->t_tid) >= 0);
- Assert(ItemPointerCompare(pivotheaptid, &firstright->t_tid) < 0);
+ Assert(ItemPointerCompare(BTreeTupleGetMaxTID(lastleft),
+ BTreeTupleGetHeapTID(firstright)) < 0);
+ Assert(ItemPointerCompare(pivotheaptid,
+ BTreeTupleGetHeapTID(lastleft)) >= 0);
+ Assert(ItemPointerCompare(pivotheaptid,
+ BTreeTupleGetHeapTID(firstright)) < 0);
#else
/*
@@ -2226,7 +2305,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
* attribute values along with lastleft's heap TID value when lastleft's
* TID happens to be greater than firstright's TID.
*/
- ItemPointerCopy(&firstright->t_tid, pivotheaptid);
+ ItemPointerCopy(BTreeTupleGetHeapTID(firstright), pivotheaptid);
/*
* Pivot heap TID should never be fully equal to firstright. Note that
@@ -2235,7 +2314,8 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
*/
ItemPointerSetOffsetNumber(pivotheaptid,
OffsetNumberPrev(ItemPointerGetOffsetNumber(pivotheaptid)));
- Assert(ItemPointerCompare(pivotheaptid, &firstright->t_tid) < 0);
+ Assert(ItemPointerCompare(pivotheaptid,
+ BTreeTupleGetHeapTID(firstright)) < 0);
#endif
BTreeTupleSetNAtts(pivot, nkeyatts);
@@ -2316,15 +2396,25 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
* The approach taken here usually provides the same answer as _bt_keep_natts
* will (for the same pair of tuples from a heapkeyspace index), since the
* majority of btree opclasses can never indicate that two datums are equal
- * unless they're bitwise equal (once detoasted). Similarly, result may
- * differ from the _bt_keep_natts result when either tuple has TOASTed datums,
- * though this is barely possible in practice.
+ * unless they're bitwise equal after detoasting.
*
* These issues must be acceptable to callers, typically because they're only
* concerned about making suffix truncation as effective as possible without
* leaving excessive amounts of free space on either side of page split.
* Callers can rely on the fact that attributes considered equal here are
* definitely also equal according to _bt_keep_natts.
+ *
+ * When an index only uses opclasses where equality is "precise", this
+ * function is guaranteed to give the same result as _bt_keep_natts(). This
+ * makes it safe to use this function to determine whether or not two tuples
+ * can be folded together into a single posting tuple. Posting list
+ * deduplication cannot be used with nondeterministic collations for this
+ * reason.
+ *
+ * FIXME: Actually invent the needed "equality-is-precise" opclass
+ * infrastructure. See dedicated -hackers thread:
+ *
+ * https://postgr.es/m/CAH2-Wzn3Ee49Gmxb7V1VJ3-AC8fWn-Fr8pfWQebHe8rYRxt5OQ@mail.gmail.com
*/
int
_bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
@@ -2349,8 +2439,38 @@ _bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
if (isNull1 != isNull2)
break;
+ /*
+ * XXX: The ideal outcome from the point of view of the posting list
+ * patch is that the definition of an opclass with "precise equality"
+ * becomes: "equality operator function must give exactly the same
+ * answer as datum_image_eq() would, provided that we aren't using a
+ * nondeterministic collation". (Nondeterministic collations are
+ * clearly not compatible with deduplication.)
+ *
+ * This will be a lot faster than actually using the authoritative
+ * insertion scankey in some cases. This approach also seems more
+ * elegant, since suffix truncation gets to follow exactly the same
+ * definition of "equal" as posting list deduplication -- there is a
+ * subtle interplay between deduplication and suffix truncation, and
+ * it would be nice to know for sure that they have exactly the same
+ * idea about what equality is.
+ *
+ * This ideal outcome still avoids problems with TOAST. We cannot
+ * repeat bugs like the amcheck bug that was fixed in bugfix commit
+ * eba775345d23d2c999bbb412ae658b6dab36e3e8. datum_image_eq()
+ * considers binary equality, though only _after_ each datum is
+ * decompressed.
+ *
+ * If this ideal solution isn't possible, then we can fall back on
+ * defining "precise equality" as: "type's output function must
+ * produce identical textual output for any two datums that compare
+ * equal when using a safe/equality-is-precise operator class (unless
+ * using a nondeterministic collation)". That would mean that we'd
+ * have to make deduplication call _bt_keep_natts() instead (or some
+ * other function that uses authoritative insertion scankey).
+ */
if (!isNull1 &&
- !datumIsEqual(datum1, datum2, att->attbyval, att->attlen))
+ !datum_image_eq(datum1, datum2, att->attbyval, att->attlen))
break;
keepnatts++;
@@ -2402,22 +2522,30 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
tupnatts = BTreeTupleGetNAtts(itup, rel);
+ /* !heapkeyspace indexes do not support deduplication */
+ if (!heapkeyspace && BTreeTupleIsPosting(itup))
+ return false;
+
+ /* INCLUDE indexes do not support deduplication */
+ if (natts != nkeyatts && BTreeTupleIsPosting(itup))
+ return false;
+
if (P_ISLEAF(opaque))
{
if (offnum >= P_FIRSTDATAKEY(opaque))
{
/*
- * Non-pivot tuples currently never use alternative heap TID
- * representation -- even those within heapkeyspace indexes
+ * Non-pivot tuple should never be explicitly marked as a pivot
+ * tuple
*/
- if ((itup->t_info & INDEX_ALT_TID_MASK) != 0)
+ if (BTreeTupleIsPivot(itup))
return false;
/*
* Leaf tuples that are not the page high key (non-pivot tuples)
* should never be truncated. (Note that tupnatts must have been
- * inferred, rather than coming from an explicit on-disk
- * representation.)
+ * inferred, even with a posting list tuple, because only pivot
+ * tuples store tupnatts directly.)
*/
return tupnatts == natts;
}
@@ -2461,12 +2589,12 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
* non-zero, or when there is no explicit representation and the
* tuple is evidently not a pre-pg_upgrade tuple.
*
- * Prior to v11, downlinks always had P_HIKEY as their offset. Use
- * that to decide if the tuple is a pre-v11 tuple.
+ * Prior to v11, downlinks always had P_HIKEY as their offset.
+ * Accept that as an alternative indication of a valid
+ * !heapkeyspace negative infinity tuple.
*/
return tupnatts == 0 ||
- ((itup->t_info & INDEX_ALT_TID_MASK) == 0 &&
- ItemPointerGetOffsetNumber(&(itup->t_tid)) == P_HIKEY);
+ ItemPointerGetOffsetNumber(&(itup->t_tid)) == P_HIKEY;
}
else
{
@@ -2492,7 +2620,11 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
* heapkeyspace index pivot tuples, regardless of whether or not there are
* non-key attributes.
*/
- if ((itup->t_info & INDEX_ALT_TID_MASK) == 0)
+ if (!BTreeTupleIsPivot(itup))
+ return false;
+
+ /* Pivot tuple should not use posting list representation (redundant) */
+ if (BTreeTupleIsPosting(itup))
return false;
/*
@@ -2562,11 +2694,74 @@ _bt_check_third_page(Relation rel, Relation heap, bool needheaptidspace,
BTMaxItemSizeNoHeapTid(page),
RelationGetRelationName(rel)),
errdetail("Index row references tuple (%u,%u) in relation \"%s\".",
- ItemPointerGetBlockNumber(&newtup->t_tid),
- ItemPointerGetOffsetNumber(&newtup->t_tid),
+ ItemPointerGetBlockNumber(BTreeTupleGetHeapTID(newtup)),
+ ItemPointerGetOffsetNumber(BTreeTupleGetHeapTID(newtup)),
RelationGetRelationName(heap)),
errhint("Values larger than 1/3 of a buffer page cannot be indexed.\n"
"Consider a function index of an MD5 hash of the value, "
"or use full text indexing."),
errtableconstraint(heap, RelationGetRelationName(rel))));
}
+
+/*
+ * Given a base tuple that supplies the key datums and an array of heap
+ * TIDs (ipd), build a posting tuple.
+ *
+ * The base tuple may itself be a posting tuple, but only its key part is
+ * used; all ItemPointers must be passed via ipd.
+ *
+ * If nipd == 1, fall back to building a non-posting tuple. This avoids
+ * storage overhead after a posting tuple has been vacuumed.
+ */
+IndexTuple
+BTreeFormPostingTuple(IndexTuple tuple, ItemPointerData *ipd, int nipd)
+{
+ uint32 keysize,
+ newsize = 0;
+ IndexTuple itup;
+
+ /* We only need key part of the tuple */
+ if (BTreeTupleIsPosting(tuple))
+ keysize = BTreeTupleGetPostingOffset(tuple);
+ else
+ keysize = IndexTupleSize(tuple);
+
+ Assert(nipd > 0);
+
+ /* Add space needed for posting list */
+ if (nipd > 1)
+ newsize = SHORTALIGN(keysize) + sizeof(ItemPointerData) * nipd;
+ else
+ newsize = keysize;
+
+ newsize = MAXALIGN(newsize);
+ itup = palloc0(newsize);
+ memcpy(itup, tuple, keysize);
+ itup->t_info &= ~INDEX_SIZE_MASK;
+ itup->t_info |= newsize;
+
+ if (nipd > 1)
+ {
+ /* Form posting tuple, fill posting fields */
+
+ /* Set meta info about the posting list */
+ itup->t_info |= INDEX_ALT_TID_MASK;
+ BTreeSetPostingMeta(itup, nipd, SHORTALIGN(keysize));
+
+ /* sort the list to preserve TID order invariant */
+ qsort((void *) ipd, nipd, sizeof(ItemPointerData),
+ (int (*) (const void *, const void *)) ItemPointerCompare);
+
+ /* Copy posting list into the posting tuple */
+ memcpy(BTreeTupleGetPosting(itup), ipd,
+ sizeof(ItemPointerData) * nipd);
+ }
+ else
+ {
+ /* To finish building a non-posting tuple, copy TID from ipd */
+ itup->t_info &= ~INDEX_ALT_TID_MASK;
+ ItemPointerCopy(ipd, &itup->t_tid);
+ }
+
+ return itup;
+}
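As a quick sanity check of the size computation in BTreeFormPostingTuple() above, the arithmetic can be reproduced standalone. This is a sketch, not the patch's code: demo_posting_tuple_size() is a hypothetical helper, and the alignment macros assume 2-byte SHORTALIGN, 8-byte MAXALIGN and a 6-byte ItemPointerData instead of using the real headers.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed platform constants, hard-coded for the sketch */
#define DEMO_SHORTALIGN(len) (((size_t) (len) + 1) & ~((size_t) 1))
#define DEMO_MAXALIGN(len) (((size_t) (len) + 7) & ~((size_t) 7))
#define DEMO_TID_SIZE 6

/*
 * Size of the tuple built from a key part of keysize bytes and nipd heap
 * TIDs.  With a single TID no posting list is stored, because the TID
 * lives in the tuple header as usual.
 */
static size_t
demo_posting_tuple_size(size_t keysize, int nipd)
{
	size_t		newsize;

	assert(nipd > 0);

	if (nipd > 1)
		newsize = DEMO_SHORTALIGN(keysize) + (size_t) nipd * DEMO_TID_SIZE;
	else
		newsize = keysize;

	return DEMO_MAXALIGN(newsize);
}
```

For a 16-byte key part, three duplicates stored as one posting tuple take 40 bytes under these assumptions, against 3 * 16 = 48 bytes as separate tuples, before counting the two line pointers that are also saved.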
diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index dd5315c1aa..d4d7c09ff0 100644
--- a/src/backend/access/nbtree/nbtxlog.c
+++ b/src/backend/access/nbtree/nbtxlog.c
@@ -178,12 +178,34 @@ btree_xlog_insert(bool isleaf, bool ismeta, XLogReaderState *record)
{
Size datalen;
char *datapos = XLogRecGetBlockData(record, 0, &datalen);
+ IndexTuple nposting = NULL;
page = BufferGetPage(buffer);
- if (PageAddItem(page, (Item) datapos, datalen, xlrec->offnum,
- false, false) == InvalidOffsetNumber)
- elog(PANIC, "btree_xlog_insert: failed to add item");
+ if (xlrec->postingsz > 0)
+ {
+ IndexTuple oposting;
+
+ Assert(isleaf);
+
+ /* oposting must be at offset before new item */
+ oposting = (IndexTuple) PageGetItem(page,
+ PageGetItemId(page, OffsetNumberPrev(xlrec->offnum)));
+ if (PageAddItem(page, (Item) datapos, xlrec->postingsz,
+ xlrec->offnum, false, false) == InvalidOffsetNumber)
+ elog(PANIC, "btree_xlog_insert: failed to add item");
+ nposting = (IndexTuple) (datapos + xlrec->postingsz);
+
+ Assert(MAXALIGN(IndexTupleSize(oposting)) ==
+ MAXALIGN(IndexTupleSize(nposting)));
+ memcpy(oposting, nposting, MAXALIGN(IndexTupleSize(nposting)));
+ }
+ else
+ {
+ if (PageAddItem(page, (Item) datapos, datalen, xlrec->offnum,
+ false, false) == InvalidOffsetNumber)
+ elog(PANIC, "btree_xlog_insert: failed to add item");
+ }
PageSetLSN(page, lsn);
MarkBufferDirty(buffer);
@@ -265,9 +287,11 @@ btree_xlog_split(bool onleft, XLogReaderState *record)
BTPageOpaque lopaque = (BTPageOpaque) PageGetSpecialPointer(lpage);
OffsetNumber off;
IndexTuple newitem = NULL,
- left_hikey = NULL;
+ left_hikey = NULL,
+ nposting = NULL;
Size newitemsz = 0,
- left_hikeysz = 0;
+ left_hikeysz = 0,
+ npostingsz = 0;
Page newlpage;
OffsetNumber leftoff;
@@ -281,6 +305,17 @@ btree_xlog_split(bool onleft, XLogReaderState *record)
datalen -= newitemsz;
}
+ if (xlrec->replacepostingoff)
+ {
+ Assert(xlrec->replacepostingoff ==
+ OffsetNumberPrev(xlrec->newitemoff));
+
+ nposting = (IndexTuple) datapos;
+ npostingsz = MAXALIGN(IndexTupleSize(nposting));
+ datapos += npostingsz;
+ datalen -= npostingsz;
+ }
+
/* Extract left hikey and its size (assuming 16-bit alignment) */
left_hikey = (IndexTuple) datapos;
left_hikeysz = MAXALIGN(IndexTupleSize(left_hikey));
@@ -304,6 +339,15 @@ btree_xlog_split(bool onleft, XLogReaderState *record)
Size itemsz;
IndexTuple item;
+ if (off == xlrec->replacepostingoff)
+ {
+ if (PageAddItem(newlpage, (Item) nposting, npostingsz,
+ leftoff, false, false) == InvalidOffsetNumber)
+ elog(ERROR, "failed to add new item to left page after split");
+ leftoff = OffsetNumberNext(leftoff);
+ continue;
+ }
+
/* add the new item if it was inserted on left page */
if (onleft && off == xlrec->newitemoff)
{
@@ -386,8 +430,8 @@ btree_xlog_vacuum(XLogReaderState *record)
Buffer buffer;
Page page;
BTPageOpaque opaque;
-#ifdef UNUSED
xl_btree_vacuum *xlrec = (xl_btree_vacuum *) XLogRecGetData(record);
+#ifdef UNUSED
/*
* This section of code is thought to be no longer needed, after analysis
@@ -478,14 +522,34 @@ btree_xlog_vacuum(XLogReaderState *record)
if (len > 0)
{
- OffsetNumber *unused;
- OffsetNumber *unend;
+ if (xlrec->nremaining)
+ {
+ OffsetNumber *remainingoffset;
+ IndexTuple remaining;
+ Size itemsz;
- unused = (OffsetNumber *) ptr;
- unend = (OffsetNumber *) ((char *) ptr + len);
+ remainingoffset = (OffsetNumber *)
+ (ptr + xlrec->ndeleted * sizeof(OffsetNumber));
+ remaining = (IndexTuple) ((char *) remainingoffset +
+ xlrec->nremaining * sizeof(OffsetNumber));
- if ((unend - unused) > 0)
- PageIndexMultiDelete(page, unused, unend - unused);
+ /* Handle posting tuples */
+ for (int i = 0; i < xlrec->nremaining; i++)
+ {
+ PageIndexTupleDelete(page, remainingoffset[i]);
+
+ itemsz = MAXALIGN(IndexTupleSize(remaining));
+
+ if (PageAddItem(page, (Item) remaining, itemsz, remainingoffset[i],
+ false, false) == InvalidOffsetNumber)
+ elog(PANIC, "btree_xlog_vacuum: failed to add remaining item");
+
+ remaining = (IndexTuple) ((char *) remaining + itemsz);
+ }
+ }
+
+ if (xlrec->ndeleted)
+ PageIndexMultiDelete(page, (OffsetNumber *) ptr, xlrec->ndeleted);
}
/*
diff --git a/src/backend/access/rmgrdesc/nbtdesc.c b/src/backend/access/rmgrdesc/nbtdesc.c
index 4ee6d04a68..71763da4c8 100644
--- a/src/backend/access/rmgrdesc/nbtdesc.c
+++ b/src/backend/access/rmgrdesc/nbtdesc.c
@@ -30,7 +30,8 @@ btree_desc(StringInfo buf, XLogReaderState *record)
{
xl_btree_insert *xlrec = (xl_btree_insert *) rec;
- appendStringInfo(buf, "off %u", xlrec->offnum);
+ appendStringInfo(buf, "off %u; postingsz %u",
+ xlrec->offnum, xlrec->postingsz);
break;
}
case XLOG_BTREE_SPLIT_L:
@@ -38,16 +39,21 @@ btree_desc(StringInfo buf, XLogReaderState *record)
{
xl_btree_split *xlrec = (xl_btree_split *) rec;
- appendStringInfo(buf, "level %u, firstright %d, newitemoff %d",
- xlrec->level, xlrec->firstright, xlrec->newitemoff);
+ appendStringInfo(buf, "level %u, firstright %d, newitemoff %d, replacepostingoff %d",
+ xlrec->level,
+ xlrec->firstright,
+ xlrec->newitemoff,
+ xlrec->replacepostingoff);
break;
}
case XLOG_BTREE_VACUUM:
{
xl_btree_vacuum *xlrec = (xl_btree_vacuum *) rec;
- appendStringInfo(buf, "lastBlockVacuumed %u",
- xlrec->lastBlockVacuumed);
+ appendStringInfo(buf, "lastBlockVacuumed %u; nremaining %u; ndeleted %u",
+ xlrec->lastBlockVacuumed,
+ xlrec->nremaining,
+ xlrec->ndeleted);
break;
}
case XLOG_BTREE_DELETE:
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 4a80e84aa7..eade328511 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -234,8 +234,7 @@ typedef struct BTMetaPageData
* t_tid | t_info | key values | INCLUDE columns, if any
*
* t_tid points to the heap TID, which is a tiebreaker key column as of
- * BTREE_VERSION 4. Currently, the INDEX_ALT_TID_MASK status bit is never
- * set for non-pivot tuples.
+ * BTREE_VERSION 4.
*
* All other types of index tuples ("pivot" tuples) only have key columns,
* since pivot tuples only exist to represent how the key space is
@@ -252,6 +251,38 @@ typedef struct BTMetaPageData
* omitted rather than truncated, since its representation is different to
* the non-pivot representation.)
*
+ * Non-pivot posting tuple format:
+ * t_tid | t_info | key values | INCLUDE columns, if any | posting_list[]
+ *
+ * In order to store duplicated keys more effectively, we use a special
+ * tuple format: posting tuples. posting_list is an array of ItemPointerData.
+ *
+ * Deduplication never applies to unique indexes or indexes with INCLUDEd
+ * columns.
+ *
+ * To distinguish posting tuples we use the INDEX_ALT_TID_MASK flag in t_info
+ * and the BT_IS_POSTING flag in t_tid.
+ * These flags redefine the content of the posting tuple's t_tid:
+ * - t_tid.ip_blkid contains the offset of the posting list.
+ * - the t_tid offset field contains the number of posting items this tuple
+ * contains.
+ *
+ * The 12 least significant offset bits from t_tid are used to represent
+ * the number of posting items in posting tuples, leaving 4 status
+ * bits (BT_RESERVED_OFFSET_MASK bits), 3 of which are reserved for
+ * future use.
+ * BT_N_POSTING_OFFSET_MASK is large enough to store any number of posting
+ * items, which is constrained by BTMaxItemSize.
+ *
+ * If page contains so many duplicates, that they do not fit into one posting
+ * tuple (bounded by BTMaxItemSize and ), page may contain several posting
+ * tuples with the same key.
+ * Also page can contain both posting and non-posting tuples with the same key.
+ * Currently, posting tuples always contain at least two TIDs in the posting
+ * list.
+ *
+ * Posting tuples always have the same number of attributes as the index has
+ * generally.
+ *
* Pivot tuple format:
*
* t_tid | t_info | key values | [heap TID]
@@ -281,23 +312,119 @@ typedef struct BTMetaPageData
* bits (BT_RESERVED_OFFSET_MASK bits), 3 of which that are reserved for
* future use. BT_N_KEYS_OFFSET_MASK should be large enough to store any
* number of columns/attributes <= INDEX_MAX_KEYS.
+ * BT_IS_POSTING bit must be unset for pivot tuples, since we use it
+ * to distinguish posting tuples from pivot tuples.
*
* Note well: The macros that deal with the number of attributes in tuples
- * assume that a tuple with INDEX_ALT_TID_MASK set must be a pivot tuple,
- * and that a tuple without INDEX_ALT_TID_MASK set must be a non-pivot
- * tuple (or must have the same number of attributes as the index has
- * generally in the case of !heapkeyspace indexes). They will need to be
- * updated if non-pivot tuples ever get taught to use INDEX_ALT_TID_MASK
- * for something else.
+ * assume that a tuple with INDEX_ALT_TID_MASK set must be a pivot tuple or
+ * non-pivot posting tuple, and that a tuple without INDEX_ALT_TID_MASK set
+ * must be a non-pivot tuple (or must have the same number of attributes as
+ * the index has generally in the case of !heapkeyspace indexes).
*/
#define INDEX_ALT_TID_MASK INDEX_AM_RESERVED_BIT
/* Item pointer offset bits */
#define BT_RESERVED_OFFSET_MASK 0xF000
#define BT_N_KEYS_OFFSET_MASK 0x0FFF
+#define BT_N_POSTING_OFFSET_MASK 0x0FFF
#define BT_HEAP_TID_ATTR 0x1000
+#define BT_IS_POSTING 0x2000
-/* Get/set downlink block number */
+/*
+ * MaxPostingIndexTuplesPerPage is an upper bound on the number of tuples
+ * that can fit on one btree leaf page.
+ *
+ * Btree leaf pages may contain posting tuples, which store duplicates
+ * more compactly, so MaxPostingIndexTuplesPerPage is larger than
+ * MaxIndexTuplesPerPage.
+ *
+ * Each leaf page must contain at least three items, so estimate it as
+ * if we have three posting tuples with minimal size keys.
+ */
+#define MaxPostingIndexTuplesPerPage \
+ ((int) ((BLCKSZ - SizeOfPageHeaderData - \
+ 3*((MAXALIGN(sizeof(IndexTupleData) + 1) + sizeof(ItemIdData))) )) / \
+ (sizeof(ItemPointerData)))
+
+/*
+ * Btree-private state needed to build posting tuples.
+ * ipd is a posting list - an array of ItemPointerData.
+ *
+ * Iterating over tuples during index build or applying deduplication to a
+ * single page, we remember a tuple in itupprev, then compare the next one
+ * with it. If tuples are equal, save their TIDs in the posting list.
+ * ntuples contains the size of the posting list.
+ *
+ * Use maxitemsize and maxpostingsize to ensure that resulting posting tuple
+ * will satisfy BTMaxItemSize.
+ */
+typedef struct BTDedupState
+{
+ Size maxitemsize;
+ Size maxpostingsize;
+ IndexTuple itupprev;
+ int ntuples;
+ Size alltupsize;
+ ItemPointerData *ipd;
+} BTDedupState;
+
+/*
+ * N.B.: BTreeTupleIsPivot() should only be used in code that deals with
+ * heapkeyspace indexes specifically. BTreeTupleIsPosting() works with all
+ * nbtree indexes, though.
+ */
+#define BTreeTupleIsPivot(itup) \
+ ( \
+ ((itup)->t_info & INDEX_ALT_TID_MASK && \
+ ((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) == 0))\
+ )
+#define BTreeTupleIsPosting(itup) \
+ ( \
+ ((itup)->t_info & INDEX_ALT_TID_MASK && \
+ ((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) != 0))\
+ )
+
+#define BTreeTupleClearBtIsPosting(itup) \
+ do { \
+ ItemPointerSetOffsetNumber(&(itup)->t_tid, \
+ ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & ~BT_IS_POSTING); \
+ } while(0)
+
+#define BTreeTupleGetNPosting(itup) \
+ ( \
+ AssertMacro(BTreeTupleIsPosting(itup)), \
+ ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_N_POSTING_OFFSET_MASK \
+ )
+#define BTreeTupleSetNPosting(itup, n) \
+ do { \
+ ItemPointerSetOffsetNumber(&(itup)->t_tid, (n) & BT_N_POSTING_OFFSET_MASK); \
+ Assert((itup)->t_info & INDEX_ALT_TID_MASK); \
+ Assert(!((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) != 0)); \
+ ItemPointerSetOffsetNumber(&(itup)->t_tid, \
+ ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) | BT_IS_POSTING); \
+ } while(0)
+
+/*
+ * If tuple is posting, t_tid.ip_blkid contains offset of the posting list
+ */
+#define BTreeTupleGetPostingOffset(itup) \
+ ( \
+ AssertMacro(BTreeTupleIsPosting(itup)), \
+ ItemPointerGetBlockNumberNoCheck(&((itup)->t_tid)) \
+ )
+#define BTreeSetPostingMeta(itup, nposting, off) \
+ do { \
+ BTreeTupleSetNPosting(itup, nposting); \
+ Assert(BTreeTupleIsPosting(itup)); \
+ ItemPointerSetBlockNumber(&((itup)->t_tid), (off)); \
+ } while(0)
+
+#define BTreeTupleGetPosting(itup) \
+ (ItemPointer) ((char*) (itup) + BTreeTupleGetPostingOffset(itup))
+#define BTreeTupleGetPostingN(itup,n) \
+ (BTreeTupleGetPosting(itup) + (n))
+
+/* Get/set downlink block number */
#define BTreeInnerTupleGetDownLink(itup) \
ItemPointerGetBlockNumberNoCheck(&((itup)->t_tid))
#define BTreeInnerTupleSetDownLink(itup, blkno) \
@@ -326,40 +453,73 @@ typedef struct BTMetaPageData
*/
#define BTreeTupleGetNAtts(itup, rel) \
( \
- (itup)->t_info & INDEX_ALT_TID_MASK ? \
+ ((itup)->t_info & INDEX_ALT_TID_MASK && \
+ ((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) == 0)) ? \
( \
ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_N_KEYS_OFFSET_MASK \
) \
: \
IndexRelationGetNumberOfAttributes(rel) \
)
-#define BTreeTupleSetNAtts(itup, n) \
- do { \
- (itup)->t_info |= INDEX_ALT_TID_MASK; \
- ItemPointerSetOffsetNumber(&(itup)->t_tid, (n) & BT_N_KEYS_OFFSET_MASK); \
- } while(0)
+
+static inline void
+BTreeTupleSetNAtts(IndexTuple itup, int n)
+{
+ Assert(!BTreeTupleIsPosting(itup));
+ itup->t_info |= INDEX_ALT_TID_MASK;
+ ItemPointerSetOffsetNumber(&itup->t_tid, n & BT_N_KEYS_OFFSET_MASK);
+}
/*
- * Get tiebreaker heap TID attribute, if any. Macro works with both pivot
- * and non-pivot tuples, despite differences in how heap TID is represented.
+ * Get tiebreaker heap TID attribute, if any. Works with both pivot and
+ * non-pivot tuples, despite differences in how heap TID is represented.
+ *
+ * This returns the first/lowest heap TID in the case of a posting list tuple.
*/
-#define BTreeTupleGetHeapTID(itup) \
- ( \
- (itup)->t_info & INDEX_ALT_TID_MASK && \
- (ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_HEAP_TID_ATTR) != 0 ? \
- ( \
- (ItemPointer) (((char *) (itup) + IndexTupleSize(itup)) - \
- sizeof(ItemPointerData)) \
- ) \
- : (itup)->t_info & INDEX_ALT_TID_MASK ? NULL : (ItemPointer) &((itup)->t_tid) \
- )
+static inline ItemPointer
+BTreeTupleGetHeapTID(IndexTuple itup)
+{
+ if (BTreeTupleIsPivot(itup))
+ {
+ /* Pivot tuple heap TID representation? */
+ if ((ItemPointerGetOffsetNumberNoCheck(&itup->t_tid) &
+ BT_HEAP_TID_ATTR) != 0)
+ return (ItemPointer) ((char *) itup + IndexTupleSize(itup) -
+ sizeof(ItemPointerData));
+
+ /* Heap TID attribute was truncated */
+ return NULL;
+ }
+ else if (BTreeTupleIsPosting(itup))
+ return BTreeTupleGetPosting(itup);
+
+ return &(itup->t_tid);
+}
+
+/*
+ * Get maximum heap TID attribute, which could be the only TID in the case of
+ * a non-pivot tuple that does not have a posting list. Works with
+ * non-pivot tuples only.
+ */
+static inline ItemPointer
+BTreeTupleGetMaxTID(IndexTuple itup)
+{
+ Assert(!BTreeTupleIsPivot(itup));
+
+ if (BTreeTupleIsPosting(itup))
+ return (ItemPointer) (BTreeTupleGetPosting(itup) +
+ (BTreeTupleGetNPosting(itup) - 1));
+
+ return &(itup->t_tid);
+}
+
/*
* Set the heap TID attribute for a tuple that uses the INDEX_ALT_TID_MASK
- * representation (currently limited to pivot tuples)
+ * representation
*/
#define BTreeTupleSetAltHeapTID(itup) \
do { \
- Assert((itup)->t_info & INDEX_ALT_TID_MASK); \
+ Assert(BTreeTupleIsPivot(itup)); \
ItemPointerSetOffsetNumber(&(itup)->t_tid, \
ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) | BT_HEAP_TID_ATTR); \
} while(0)
@@ -499,6 +659,13 @@ typedef struct BTInsertStateData
/* Buffer containing leaf page we're likely to insert itup on */
Buffer buf;
+ /*
+ * if _bt_binsrch_insert() found the location inside existing posting
+ * list, save the position inside the list. This will be -1 in rare cases
+ * where the overlapping posting list is LP_DEAD.
+ */
+ int in_posting_offset;
+
/*
* Cache of bounds within the current buffer. Only used for insertions
* where _bt_check_unique is called. See _bt_binsrch_insert and
@@ -534,7 +701,9 @@ typedef BTInsertStateData *BTInsertState;
* If we are doing an index-only scan, we save the entire IndexTuple for each
* matched item, otherwise only its heap TID and offset. The IndexTuples go
* into a separate workspace array; each BTScanPosItem stores its tuple's
- * offset within that array.
+ * offset within that array. Posting list tuples store a version of the
+ * tuple that does not include the posting list, allowing the same key to be
+ * returned for each logical tuple associated with the posting list.
*/
typedef struct BTScanPosItem /* what we remember about each match */
@@ -563,9 +732,13 @@ typedef struct BTScanPosData
/*
* If we are doing an index-only scan, nextTupleOffset is the first free
- * location in the associated tuple storage workspace.
+ * location in the associated tuple storage workspace. Posting list
+ * tuples need postingTupleOffset to store the current location of the
+ * tuple that is returned multiple times (once per heap TID in posting
+ * list).
*/
int nextTupleOffset;
+ int postingTupleOffset;
/*
* The items array is always ordered in index order (ie, increasing
@@ -578,7 +751,7 @@ typedef struct BTScanPosData
int lastItem; /* last valid index in items[] */
int itemIndex; /* current index in items[] */
- BTScanPosItem items[MaxIndexTuplesPerPage]; /* MUST BE LAST */
+ BTScanPosItem items[MaxPostingIndexTuplesPerPage]; /* MUST BE LAST */
} BTScanPosData;
typedef BTScanPosData *BTScanPos;
@@ -732,6 +905,7 @@ extern bool _bt_doinsert(Relation rel, IndexTuple itup,
IndexUniqueCheck checkUnique, Relation heapRel);
extern void _bt_finish_split(Relation rel, Buffer bbuf, BTStack stack);
extern Buffer _bt_getstackbuf(Relation rel, BTStack stack, BlockNumber child);
+extern void _bt_dedup_item_tid(BTDedupState *dedupState, IndexTuple itup);
/*
* prototypes for functions in nbtsplitloc.c
@@ -762,6 +936,8 @@ extern void _bt_delitems_delete(Relation rel, Buffer buf,
OffsetNumber *itemnos, int nitems, Relation heapRel);
extern void _bt_delitems_vacuum(Relation rel, Buffer buf,
OffsetNumber *itemnos, int nitems,
+ OffsetNumber *remainingoffset,
+ IndexTuple *remaining, int nremaining,
BlockNumber lastBlockVacuumed);
extern int _bt_pagedel(Relation rel, Buffer buf);
@@ -812,6 +988,8 @@ extern bool _bt_check_natts(Relation rel, bool heapkeyspace, Page page,
OffsetNumber offnum);
extern void _bt_check_third_page(Relation rel, Relation heap,
bool needheaptidspace, Page page, IndexTuple newtup);
+extern IndexTuple BTreeFormPostingTuple(IndexTuple tuple, ItemPointerData *ipd,
+ int nipd);
/*
* prototypes for functions in nbtvalidate.c
diff --git a/src/include/access/nbtxlog.h b/src/include/access/nbtxlog.h
index 91b9ee00cf..35a65522f7 100644
--- a/src/include/access/nbtxlog.h
+++ b/src/include/access/nbtxlog.h
@@ -61,16 +61,26 @@ typedef struct xl_btree_metadata
* This data record is used for INSERT_LEAF, INSERT_UPPER, INSERT_META.
* Note that INSERT_META implies it's not a leaf page.
*
- * Backup Blk 0: original page (data contains the inserted tuple)
+ * Backup Blk 0: original page (data contains the inserted tuple);
+ * if postingsz is not 0, data also contains 'nposting' -
+ * tuple to replace original.
+ *
+ * TODO probably it would be enough to keep just a flag to point
+ * out that data contains 'nposting' and compute its offset as
+ * we know it follows the tuple, but I am afraid that it will
+ * break alignment, will it?
+ *
* Backup Blk 1: child's left sibling, if INSERT_UPPER or INSERT_META
* Backup Blk 2: xl_btree_metadata, if INSERT_META
+ *
*/
typedef struct xl_btree_insert
{
OffsetNumber offnum;
+ uint32 postingsz;
} xl_btree_insert;
-#define SizeOfBtreeInsert (offsetof(xl_btree_insert, offnum) + sizeof(OffsetNumber))
+#define SizeOfBtreeInsert (offsetof(xl_btree_insert, postingsz) + sizeof(uint32))
/*
* On insert with split, we save all the items going into the right sibling
@@ -95,6 +105,12 @@ typedef struct xl_btree_insert
* An IndexTuple representing the high key of the left page must follow with
* either variant.
*
+ * If the split included an insertion into the middle of a posting tuple, and
+ * thus required posting tuple replacement, the record also contains
+ * 'nposting', which must replace the original posting tuple at offset
+ * replacepostingoff.
+ * TODO further optimization is to add it to xlog only if it remains on the
+ * left page.
+ *
* Backup Blk 1: new right page
*
* The right page's data portion contains the right page's tuples in the form
@@ -112,9 +128,10 @@ typedef struct xl_btree_split
uint32 level; /* tree level of page being split */
OffsetNumber firstright; /* first item moved to right page */
OffsetNumber newitemoff; /* new item's offset (useful for _L variant) */
+ OffsetNumber replacepostingoff; /* offset of the posting item to replace */
} xl_btree_split;
-#define SizeOfBtreeSplit (offsetof(xl_btree_split, newitemoff) + sizeof(OffsetNumber))
+#define SizeOfBtreeSplit (offsetof(xl_btree_split, replacepostingoff) + sizeof(OffsetNumber))
/*
* This is what we need to know about delete of individual leaf index tuples.
@@ -172,10 +189,19 @@ typedef struct xl_btree_vacuum
{
BlockNumber lastBlockVacuumed;
- /* TARGET OFFSET NUMBERS FOLLOW */
+ /*
+ * This field helps us to find beginning of the remaining tuples from
+ * postings which follow array of offset numbers.
+ */
+ uint32 nremaining;
+ uint32 ndeleted;
+
+ /* TARGET OFFSET NUMBERS FOLLOW (ndeleted values, if any) */
+ /* REMAINING OFFSET NUMBERS FOLLOW (nremaining values) */
+ /* REMAINING TUPLES TO INSERT FOLLOW (if nremaining > 0) */
} xl_btree_vacuum;
-#define SizeOfBtreeVacuum (offsetof(xl_btree_vacuum, lastBlockVacuumed) + sizeof(BlockNumber))
+#define SizeOfBtreeVacuum (offsetof(xl_btree_vacuum, ndeleted) + sizeof(uint32))
/*
* This is what we need to know about marking an empty branch for deletion.
diff --git a/src/tools/valgrind.supp b/src/tools/valgrind.supp
index ec47a228ae..71a03e39d3 100644
--- a/src/tools/valgrind.supp
+++ b/src/tools/valgrind.supp
@@ -212,3 +212,24 @@
Memcheck:Cond
fun:PyObject_Realloc
}
+
+# Temporarily work around bug in datum_image_eq's handling of the cstring
+# (typLen == -2) case. datumIsEqual() is not affected, but also doesn't handle
+# TOAST'ed values correctly.
+#
+# FIXME: Remove both suppressions when bug is fixed on master branch
+{
+ temporary_workaround_1
+ Memcheck:Addr1
+ fun:bcmp
+ fun:datum_image_eq
+ fun:_bt_keep_natts_fast
+}
+
+{
+ temporary_workaround_8
+ Memcheck:Addr8
+ fun:bcmp
+ fun:datum_image_eq
+ fun:_bt_keep_natts_fast
+}
--
2.17.1
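To make the t_tid offset-field layout from the nbtree.h hunk easier to follow, here is a minimal standalone sketch. The three mask values are copied from the patch; the `demo_*` names and the bare `uint16_t` stand-in for the offset field are assumptions made for illustration only, and the sketch ignores the INDEX_ALT_TID_MASK bit in t_info that the real macros also check.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Mask values copied from the patch's nbtree.h hunk. */
#define BT_N_POSTING_OFFSET_MASK 0x0FFF
#define BT_HEAP_TID_ATTR         0x1000
#define BT_IS_POSTING            0x2000

/* Hypothetical stand-in for the 16-bit offset field of t_tid. */
typedef uint16_t demo_offset;

/* Store a posting-item count in the low 12 bits and set BT_IS_POSTING. */
static demo_offset
demo_set_nposting(unsigned n)
{
	assert(n <= BT_N_POSTING_OFFSET_MASK);
	return (demo_offset) ((n & BT_N_POSTING_OFFSET_MASK) | BT_IS_POSTING);
}

/* True if the status bits mark this offset field as a posting tuple. */
static bool
demo_is_posting(demo_offset off)
{
	return (off & BT_IS_POSTING) != 0;
}

/* Read the posting-item count back out of the low 12 bits. */
static unsigned
demo_get_nposting(demo_offset off)
{
	assert(demo_is_posting(off));
	return off & BT_N_POSTING_OFFSET_MASK;
}
```

With this model, `demo_get_nposting(demo_set_nposting(3))` yields 3, and a plain heap offset such as 7 is not reported as a posting tuple because bit 0x2000 is clear.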
On Tue, Sep 1, 2015 at 12:33 PM Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
Hi, Tomas!
On Mon, Aug 31, 2015 at 6:26 PM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
On 08/31/2015 09:41 AM, Anastasia Lubennikova wrote:
I'm going to begin work on effective storage of duplicate keys in B-tree
index.
The main idea is to implement posting lists and posting trees for B-tree
index pages as it's already done for GIN.
In a nutshell, effective storing of duplicates in GIN is organised as
follows.
Index stores single index tuple for each unique key. That index tuple
points to posting list which contains pointers to heap tuples (TIDs). If
too many rows having the same key, multiple pages are allocated for the
TIDs and these constitute so called posting tree.
You can find wonderful detailed descriptions in gin readme
<https://github.com/postgres/postgres/blob/master/src/backend/access/gin/README>
and articles <http://www.cybertec.at/gin-just-an-index-type/>.
It also makes possible to apply compression algorithm to posting
list/tree and significantly decrease index size. Read more in
presentation (part 1)
<http://www.pgcon.org/2014/schedule/attachments/329_PGCon2014-GIN.pdf>.
Now new B-tree index tuple must be inserted for each table row that we
index.
It can possibly cause page split. Because of MVCC even unique index
could contain duplicates.
Storing duplicates in posting list/tree helps to avoid superfluous splits.
So it seems to be very useful improvement. Of course it requires a lot
of changes in B-tree implementation, so I need approval from community.
In general, index size is often a serious issue - cases where indexes need more space than tables are not quite uncommon in my experience. So I think the efforts to lower space requirements for indexes are good.
But if we introduce posting lists into btree indexes, how different are they from GIN? It seems to me that if I create a GIN index (using btree_gin), I do get mostly the same thing you propose, no?
Yes, in general GIN is a btree with effective duplicates handling plus support for splitting single datums into multiple keys.
This proposal is mostly porting duplicates handling from GIN to btree.
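As a rough illustration of why porting this handling to btree saves space, the sketch below compares the bytes needed for N duplicates stored as N separate index tuples against one tuple carrying N TIDs. The sizes used (8-byte tuple header, 4-byte line pointer, 6-byte TID, 8-byte MAXALIGN) are assumptions for a typical 64-bit build, the `demo_*` names are hypothetical, and the posting layout is a simplified model rather than the exact on-disk format of any patch version.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed sizes for a typical 64-bit build (not taken from this thread). */
#define DEMO_MAXALIGN(x)   (((x) + 7u) & ~(size_t) 7u)
#define DEMO_TUPLE_HEADER  8u	/* assumed sizeof(IndexTupleData) */
#define DEMO_LINE_POINTER  4u	/* assumed sizeof(ItemIdData) */
#define DEMO_TID           6u	/* assumed sizeof(ItemPointerData) */

/* Bytes used when every duplicate gets its own index tuple. */
static size_t
demo_plain_bytes(size_t ndup, size_t keysz)
{
	assert(ndup > 0);
	return ndup * (DEMO_MAXALIGN(DEMO_TUPLE_HEADER + keysz) + DEMO_LINE_POINTER);
}

/* Bytes used when one tuple stores the key once plus an array of TIDs. */
static size_t
demo_posting_bytes(size_t ndup, size_t keysz)
{
	size_t		keypart = DEMO_MAXALIGN(DEMO_TUPLE_HEADER + keysz);

	assert(ndup > 0);
	return DEMO_MAXALIGN(keypart + ndup * DEMO_TID) + DEMO_LINE_POINTER;
}
```

Under these assumptions, 100 duplicates of a 4-byte key take 2000 bytes as separate tuples and 620 bytes as a single posting tuple.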
Is it worth making a provision for controlling how duplicates are
sorted? If we speak about GIN, why not take into account our
experiments with RUM (https://github.com/postgrespro/rum)?
Sure, there are differences - GIN indexes don't handle UNIQUE indexes,
The difference between btree_gin and btree is not only UNIQUE feature.
1) There is no gingettuple in GIN. GIN supports only bitmap scans. And it's not feasible to add gingettuple to GIN. At least with same semantics as it is in btree.
2) GIN doesn't support multicolumn indexes in the way btree does. Multicolumn GIN is more like set of separate singlecolumn GINs: it doesn't have composite keys.
3) btree_gin can't effectively handle range searches. "a < x < b" would be handled as "a < x" intersect "x < b". That is extremely inefficient. It is possible to fix. However, there is no clear proposal yet for how to fit this case into the GIN interface.
but the compression can only be effective when there are duplicate rows. So either the index is not UNIQUE (so the b-tree feature is not needed), or there are many updates.
From my observations users can use btree_gin only in some cases. They like compression, but can't use btree_gin mostly because of #1.
Which brings me to the other benefit of btree indexes - they are designed for high concurrency. How much is this going to be affected by introducing the posting lists?
I'd note that the current duplicate handling in PostgreSQL is a hack on top of the original btree. It is designed that way in PostgreSQL's btree access method, not in btree in general.
Posting lists shouldn't change concurrency much. Currently, in btree you have to lock one page exclusively when you're inserting new value.
When posting list is small and fits one page you have to do similar thing: exclusive lock of one page to insert new value.
When you have a posting tree, you have to take an exclusive lock on one page of the posting tree.
One can say that concurrency would become worse because the index would become smaller and the number of pages would become smaller too. Since the number of pages would be smaller, backends are more likely to contend for the same page. But this argument can be used against any compression and for any bloat.
------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
13.09.2019 4:04, Peter Geoghegan wrote:
On Wed, Sep 11, 2019 at 2:04 PM Peter Geoghegan <pg@bowt.ie> wrote:
I think that the new WAL record has to be created once per posting
list that is generated, not once per page that is deduplicated --
that's the only way that I can see that avoids a huge increase in
total WAL volume. Even if we assume that I am wrong about there being
value in making deduplication incremental, it is still necessary to
make the WAL-logging behave incrementally.
It would be good to hear your thoughts on this _bt_dedup_one_page()
WAL volume/"write amplification" issue.
Attached is v14 based on v12 (v13 changes are not merged).
In this version, I fixed the bug you mentioned and also fixed nbtinsert,
so that it doesn't save newposting in xlog record anymore.
I tested patch with nbtree_wal_test, and found out that the real issue is
not the dedup WAL records themselves, but the full page writes that they
trigger.
Here are test results (config is standard, except fsync=off to speed up
tests):
'FPW on' and 'FPW off' are tests on v14.
FORCE_NO_IMAGE is the test on v14 with REGBUF_NO_IMAGE in _bt_dedup_one_page().
+-------------------+-----------+-----------+----------------+-----------+
| --- | FPW on | FPW off | FORCE_NO_IMAGE | master |
+-------------------+-----------+-----------+----------------+-----------+
| time | 09:12 min | 06:56 min | 06:24 min | 08:10 min |
| nbtree_wal_volume | 8083 MB | 2128 MB | 2327 MB | 2439 MB |
| index_size | 169 MB | 169 MB | 169 MB | 1118 MB |
+-------------------+-----------+-----------+----------------+-----------+
With random insertions into btree it's highly likely that deduplication
will often be the first write to a page after a checkpoint, and thus will
trigger FPW, even if only a few tuples were compressed.
That's why there is no significant difference from the
log_newpage_buffer() approach.
And that's why "lazy" deduplication doesn't help to decrease the amount
of WAL.
Also, since the index is packed way better than before, it probably
benefits less from wal_compression.
One possible "fix" to decrease WAL amplification is to add the
REGBUF_NO_IMAGE flag to XLogRegisterBuffer in _bt_dedup_one_page().
As you can see from the test results, it really eliminates the problem of
excessive WAL volume.
However, I doubt that it is a crash-safe idea.
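The behaviour described above can be modelled roughly as follows. This is a simplified sketch of the decision, not the server's actual code: the `demo_*` names and the plain 64-bit LSN type are assumptions, and forced images and backup-mode page writes are left out. A registered buffer gets a full-page image when full page writes are on, the caller has not asked to suppress the image, and the page has not been modified since the last checkpoint's redo pointer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a WAL position. */
typedef uint64_t demo_lsn;

/*
 * Simplified model: does this buffer need a full-page image?
 * no_image_flag plays the role of REGBUF_NO_IMAGE in this sketch.
 */
static bool
demo_needs_full_page_image(demo_lsn page_lsn, demo_lsn redo_ptr,
						   bool full_page_writes, bool no_image_flag)
{
	if (no_image_flag)
		return false;			/* caller suppressed the image */
	if (!full_page_writes)
		return false;			/* 'FPW off' case */

	/* Page untouched since the last checkpoint: first write logs an image. */
	return page_lsn <= redo_ptr;
}
```

In this model a deduplication pass that is the first change to a page after a checkpoint always pays for a full image, no matter how few tuples it merges, which matches the 'FPW on' column in the table.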
Another, and more realistic approach is to make deduplication less
intensive:
if freed space is less than some threshold, fall back to not changing
page at all and not generating xlog record.
That is probably why the patch became faster after I added
BT_COMPRESS_THRESHOLD in early versions: not because deduplication
itself is CPU bound or something, but because the WAL load decreased.
So I propose to develop this idea. The question is how to choose threshold.
I wouldn't like to introduce new user settings. Any ideas?
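The threshold idea could look roughly like the sketch below. Everything here is hypothetical: the `demo_*` names, the idea of comparing free space before and after a trial pass, and the example value of 256 bytes are illustrations only, not something taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical cutoff; how to choose the real value is the open question. */
#define DEMO_DEDUP_MIN_FREED 256

/*
 * Decide whether a deduplication pass is worth applying and WAL-logging.
 * used_before and used_after are the bytes occupied by tuples on the page
 * before and after a trial pass.
 */
static bool
demo_dedup_worthwhile(size_t used_before, size_t used_after, size_t min_freed)
{
	assert(used_after <= used_before);
	return (used_before - used_after) >= min_freed;
}
```

If the function returns false, the page would be left unchanged and no xlog record (and hence no full-page image) would be generated for it.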
I also noticed that the number of checkpoints differs between tests:
select checkpoints_req from pg_stat_bgwriter ;
+-----------------+---------+---------+----------------+--------+
| --- | FPW on | FPW off | FORCE_NO_IMAGE | master |
+-----------------+---------+---------+----------------+--------+
| checkpoints_req | 16 | 7 | 8 | 10 |
+-----------------+---------+---------+----------------+--------+
And I struggle to explain the reason for this.
Do you understand what can cause the difference?
--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachments:
v14-0001-Add-deduplication-to-nbtree.patch (text/x-patch)
diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index 05e7d67..399743d 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -924,6 +924,7 @@ bt_target_page_check(BtreeCheckState *state)
size_t tupsize;
BTScanInsert skey;
bool lowersizelimit;
+ ItemPointer scantid;
CHECK_FOR_INTERRUPTS();
@@ -994,29 +995,73 @@ bt_target_page_check(BtreeCheckState *state)
/*
* Readonly callers may optionally verify that non-pivot tuples can
- * each be found by an independent search that starts from the root
+ * each be found by an independent search that starts from the root.
+ * Note that we deliberately don't do individual searches for each
+ * "logical" posting list tuple, since the posting list itself is
+ * validated by other checks.
*/
if (state->rootdescend && P_ISLEAF(topaque) &&
!bt_rootdescend(state, itup))
{
char *itid,
*htid;
+ ItemPointer tid = BTreeTupleGetHeapTID(itup);
itid = psprintf("(%u,%u)", state->targetblock, offset);
htid = psprintf("(%u,%u)",
- ItemPointerGetBlockNumber(&(itup->t_tid)),
- ItemPointerGetOffsetNumber(&(itup->t_tid)));
+ ItemPointerGetBlockNumber(tid),
+ ItemPointerGetOffsetNumber(tid));
ereport(ERROR,
(errcode(ERRCODE_INDEX_CORRUPTED),
errmsg("could not find tuple using search from root page in index \"%s\"",
RelationGetRelationName(state->rel)),
- errdetail_internal("Index tid=%s points to heap tid=%s page lsn=%X/%X.",
+ errdetail_internal("Index tid=%s min heap tid=%s page lsn=%X/%X.",
itid, htid,
(uint32) (state->targetlsn >> 32),
(uint32) state->targetlsn)));
}
+ /*
+ * If tuple is actually a posting list, make sure posting list TIDs
+ * are in order.
+ */
+ if (BTreeTupleIsPosting(itup))
+ {
+ ItemPointerData last;
+ ItemPointer current;
+
+ ItemPointerCopy(BTreeTupleGetHeapTID(itup), &last);
+
+ for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
+ {
+
+ current = BTreeTupleGetPostingN(itup, i);
+
+ if (ItemPointerCompare(current, &last) <= 0)
+ {
+ char *itid,
+ *htid;
+
+ itid = psprintf("(%u,%u)", state->targetblock, offset);
+ htid = psprintf("(%u,%u)",
+ ItemPointerGetBlockNumberNoCheck(current),
+ ItemPointerGetOffsetNumberNoCheck(current));
+
+ ereport(ERROR,
+ (errcode(ERRCODE_INDEX_CORRUPTED),
+ errmsg("posting list heap TIDs out of order in index \"%s\"",
+ RelationGetRelationName(state->rel)),
+ errdetail_internal("Index tid=%s min heap tid=%s page lsn=%X/%X.",
+ itid, htid,
+ (uint32) (state->targetlsn >> 32),
+ (uint32) state->targetlsn)));
+ }
+
+ ItemPointerCopy(current, &last);
+ }
+ }
+
/* Build insertion scankey for current page offset */
skey = bt_mkscankey_pivotsearch(state->rel, itup);
@@ -1074,12 +1119,33 @@ bt_target_page_check(BtreeCheckState *state)
{
IndexTuple norm;
- norm = bt_normalize_tuple(state, itup);
- bloom_add_element(state->filter, (unsigned char *) norm,
- IndexTupleSize(norm));
- /* Be tidy */
- if (norm != itup)
- pfree(norm);
+ if (BTreeTupleIsPosting(itup))
+ {
+ IndexTuple onetup;
+
+ /* Fingerprint all elements of posting tuple one by one */
+ for (int i = 0; i < BTreeTupleGetNPosting(itup); i++)
+ {
+ onetup = BTreeGetNthTupleOfPosting(itup, i);
+
+ norm = bt_normalize_tuple(state, onetup);
+ bloom_add_element(state->filter, (unsigned char *) norm,
+ IndexTupleSize(norm));
+ /* Be tidy */
+ if (norm != onetup)
+ pfree(norm);
+ pfree(onetup);
+ }
+ }
+ else
+ {
+ norm = bt_normalize_tuple(state, itup);
+ bloom_add_element(state->filter, (unsigned char *) norm,
+ IndexTupleSize(norm));
+ /* Be tidy */
+ if (norm != itup)
+ pfree(norm);
+ }
}
/*
@@ -1087,7 +1153,8 @@ bt_target_page_check(BtreeCheckState *state)
*
* If there is a high key (if this is not the rightmost page on its
* entire level), check that high key actually is upper bound on all
- * page items.
+ * page items. If this is a posting list tuple, we'll need to set
+ * scantid to be highest TID in posting list.
*
* We prefer to check all items against high key rather than checking
* just the last and trusting that the operator class obeys the
@@ -1127,6 +1194,9 @@ bt_target_page_check(BtreeCheckState *state)
* tuple. (See also: "Notes About Data Representation" in the nbtree
* README.)
*/
+ scantid = skey->scantid;
+ if (state->heapkeyspace && !BTreeTupleIsPivot(itup))
+ skey->scantid = BTreeTupleGetMaxTID(itup);
if (!P_RIGHTMOST(topaque) &&
!(P_ISLEAF(topaque) ? invariant_leq_offset(state, skey, P_HIKEY) :
invariant_l_offset(state, skey, P_HIKEY)))
@@ -1150,6 +1220,7 @@ bt_target_page_check(BtreeCheckState *state)
(uint32) (state->targetlsn >> 32),
(uint32) state->targetlsn)));
}
+ skey->scantid = scantid;
/*
* * Item order check *
@@ -1164,11 +1235,13 @@ bt_target_page_check(BtreeCheckState *state)
*htid,
*nitid,
*nhtid;
+ ItemPointer tid;
itid = psprintf("(%u,%u)", state->targetblock, offset);
+ tid = BTreeTupleGetHeapTID(itup);
htid = psprintf("(%u,%u)",
- ItemPointerGetBlockNumberNoCheck(&(itup->t_tid)),
- ItemPointerGetOffsetNumberNoCheck(&(itup->t_tid)));
+ ItemPointerGetBlockNumberNoCheck(tid),
+ ItemPointerGetOffsetNumberNoCheck(tid));
nitid = psprintf("(%u,%u)", state->targetblock,
OffsetNumberNext(offset));
@@ -1177,9 +1250,11 @@ bt_target_page_check(BtreeCheckState *state)
state->target,
OffsetNumberNext(offset));
itup = (IndexTuple) PageGetItem(state->target, itemid);
+
+ tid = BTreeTupleGetHeapTID(itup);
nhtid = psprintf("(%u,%u)",
- ItemPointerGetBlockNumberNoCheck(&(itup->t_tid)),
- ItemPointerGetOffsetNumberNoCheck(&(itup->t_tid)));
+ ItemPointerGetBlockNumberNoCheck(tid),
+ ItemPointerGetOffsetNumberNoCheck(tid));
ereport(ERROR,
(errcode(ERRCODE_INDEX_CORRUPTED),
@@ -1189,10 +1264,10 @@ bt_target_page_check(BtreeCheckState *state)
"higher index tid=%s (points to %s tid=%s) "
"page lsn=%X/%X.",
itid,
- P_ISLEAF(topaque) ? "heap" : "index",
+ P_ISLEAF(topaque) ? "min heap" : "index",
htid,
nitid,
- P_ISLEAF(topaque) ? "heap" : "index",
+ P_ISLEAF(topaque) ? "min heap" : "index",
nhtid,
(uint32) (state->targetlsn >> 32),
(uint32) state->targetlsn)));
@@ -1953,10 +2028,10 @@ bt_tuple_present_callback(Relation index, HeapTuple htup, Datum *values,
* verification. In particular, it won't try to normalize opclass-equal
* datums with potentially distinct representations (e.g., btree/numeric_ops
* index datums will not get their display scale normalized-away here).
- * Normalization may need to be expanded to handle more cases in the future,
- * though. For example, it's possible that non-pivot tuples could in the
- * future have alternative logically equivalent representations due to using
- * the INDEX_ALT_TID_MASK bit to implement intelligent deduplication.
+ * Caller does normalization for non-pivot tuples that have a posting list,
+ * since dummy CREATE INDEX callback code generates new tuples with the same
+ * normalized representation. Deduplication is performed opportunistically,
+ * and in general there is no guarantee about how or when it will be applied.
*/
static IndexTuple
bt_normalize_tuple(BtreeCheckState *state, IndexTuple itup)
@@ -2087,6 +2162,7 @@ bt_rootdescend(BtreeCheckState *state, IndexTuple itup)
insertstate.itup = itup;
insertstate.itemsz = MAXALIGN(IndexTupleSize(itup));
insertstate.itup_key = key;
+ insertstate.in_posting_offset = 0;
insertstate.bounds_valid = false;
insertstate.buf = lbuf;
@@ -2094,7 +2170,9 @@ bt_rootdescend(BtreeCheckState *state, IndexTuple itup)
offnum = _bt_binsrch_insert(state->rel, &insertstate);
/* Compare first >= matching item on leaf page, if any */
page = BufferGetPage(lbuf);
+ /* Should match on first heap TID when tuple has a posting list */
if (offnum <= PageGetMaxOffsetNumber(page) &&
+ insertstate.in_posting_offset <= 0 &&
_bt_compare(state->rel, key, page, offnum) == 0)
exists = true;
_bt_relbuf(state->rel, lbuf);
@@ -2560,14 +2638,18 @@ static inline ItemPointer
BTreeTupleGetHeapTIDCareful(BtreeCheckState *state, IndexTuple itup,
bool nonpivot)
{
- ItemPointer result = BTreeTupleGetHeapTID(itup);
+ ItemPointer result;
BlockNumber targetblock = state->targetblock;
- if (result == NULL && nonpivot)
+ /* Shouldn't be called with a !heapkeyspace index */
+ Assert(state->heapkeyspace);
+ if (BTreeTupleIsPivot(itup) == nonpivot)
ereport(ERROR,
(errcode(ERRCODE_INDEX_CORRUPTED),
errmsg("block %u or its right sibling block or child block in index \"%s\" contains non-pivot tuple that lacks a heap TID",
targetblock, RelationGetRelationName(state->rel))));
+ result = BTreeTupleGetHeapTID(itup);
+
return result;
}
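
For anyone reading without the tree at hand, the posting list ordering check added to bt_target_page_check() above boils down to the following. This is only a standalone sketch, not code from the patch: Tid, tid_compare() and posting_first_misordered() are hypothetical names standing in for ItemPointerData, ItemPointerCompare() and the verification loop. It assumes, as the amcheck code does, that heap TIDs within one posting list must be strictly ascending.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for ItemPointerData: a (block, offset) pair. */
typedef struct Tid
{
    uint32_t block;
    uint16_t offset;
} Tid;

/* Three-way comparison: block number first, then offset number. */
static int
tid_compare(const Tid *a, const Tid *b)
{
    if (a->block != b->block)
        return (a->block < b->block) ? -1 : 1;
    if (a->offset != b->offset)
        return (a->offset < b->offset) ? -1 : 1;
    return 0;
}

/*
 * Return the index of the first TID that is not strictly greater than
 * its predecessor, or -1 if the whole list is strictly ascending.
 * Equal neighbours count as corruption too, hence the "<= 0" test.
 */
static int
posting_first_misordered(const Tid *tids, int ntids)
{
    assert(ntids >= 0);

    for (int i = 1; i < ntids; i++)
    {
        if (tid_compare(&tids[i], &tids[i - 1]) <= 0)
            return i;
    }
    return -1;
}
```

A caller would report corruption whenever the result is not -1, much like the ereport() in the hunk above does for the first offending TID.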
diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index 6db203e..50ec9ef 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -432,7 +432,10 @@ because we allow LP_DEAD to be set with only a share lock (it's exactly
like a hint bit for a heap tuple), but physically removing tuples requires
exclusive lock. In the current code we try to remove LP_DEAD tuples when
we are otherwise faced with having to split a page to do an insertion (and
-hence have exclusive lock on it already).
+hence have exclusive lock on it already). Deduplication can also prevent
+a page split, but removing LP_DEAD tuples is the preferred approach.
+(Note that posting list tuples can only have their LP_DEAD bit set when
+every "logical" tuple represented within the posting list is known dead.)
This leaves the index in a state where it has no entry for a dead tuple
that still exists in the heap. This is not a problem for the current
@@ -710,6 +713,77 @@ the fallback strategy assumes that duplicates are mostly inserted in
ascending heap TID order. The page is split in a way that leaves the left
half of the page mostly full, and the right half of the page mostly empty.
+Notes about deduplication
+-------------------------
+
+We deduplicate non-pivot tuples in non-unique indexes to reduce storage
+overhead, and to avoid or at least delay page splits. Deduplication alters
+the physical representation of tuples without changing the logical contents
+of the index, and without adding overhead to read queries. Non-pivot
+tuples are folded together into a single physical tuple with a posting list
+(a simple array of heap TIDs with the standard item pointer format).
+Deduplication is always applied lazily, at the point where it would
+otherwise be necessary to perform a page split. It occurs only after any
+LP_DEAD items have been removed, as our last line of defense against
+splitting a leaf page. We can set the LP_DEAD bit with posting list
+tuples, though only when all table tuples are known dead. (Bitmap scans
+cannot perform LP_DEAD bit setting, and are the common case with indexes
+that contain lots of duplicates, so this downside is considered
+acceptable.)
+
+Large groups of logical duplicates tend to appear together on the same leaf
+page due to the special duplicate logic used when choosing a split point.
+This facilitates lazy/dynamic deduplication. Deduplication can reliably
+deduplicate a large localized group of duplicates before it can span
+multiple leaf pages. Posting list tuples are subject to the same 1/3 of a
+page restriction as any other tuple.
+
+Lazy deduplication allows the page space accounting used during page splits
+to have absolutely minimal special case logic for posting lists. A posting
+list can be thought of as extra payload that suffix truncation will
+reliably truncate away as needed during page splits, just like non-key
+columns from an INCLUDE index tuple. An incoming tuple (which might cause
+a page split) can always be thought of as a non-posting-list tuple that
+must be inserted alongside existing items, without needing to consider
+deduplication. Most of the time, that's what actually happens: incoming
+tuples are either not duplicates, or are duplicates with a heap TID that
+doesn't overlap with any existing posting list tuple (lazy deduplication
+avoids rewriting posting lists repeatedly when heap TIDs are inserted
+slightly out of order by concurrent inserters). When the incoming tuple
+really does overlap with an existing posting list, a posting list split is
+performed. Posting list splits work in a way that more or less preserves
+the illusion that all incoming tuples do not need to be merged with any
+existing posting list tuple.
+
+Posting list splits work by "overriding" the details of the incoming tuple.
+The heap TID of the incoming tuple is altered to make it match the
+rightmost heap TID from the existing/originally overlapping posting list.
+The offset number that the new/incoming tuple is to be inserted at is
+incremented so that it will be inserted to the right of the existing
+posting list. The insertion (or page split) operation that completes the
+insert does one extra step: an in-place update of the posting list. The
+update changes the posting list such that the "true" heap TID from the
+original incoming tuple is now contained in the posting list. We make
+space in the posting list by removing the heap TID that became the new
+item. The size of the posting list won't change, and so the page split
+space accounting does not need to care about posting lists. Also, overall
+space utilization is improved by keeping existing posting lists large.
+
+The representation of posting lists is identical to the posting lists used
+by GIN, so it would be straightforward to apply GIN's varbyte encoding
+compression scheme to individual posting lists. Posting list compression
+would break the assumptions made by posting list splits about page space
+accounting, though, so it's not clear how compression could be integrated
+with nbtree. Besides, posting list compression does not offer a compelling
+trade-off for nbtree, since in general nbtree is optimized for consistent
+performance with many concurrent readers and writers. A major goal of
+nbtree's lazy approach to deduplication is to limit the performance impact
+of deduplication with random updates. Even concurrent append-only inserts
+of the same key value will tend to have inserts of individual index tuples
+in an order that doesn't quite match heap TID order. In general, delaying
+deduplication avoids many unnecessary posting list splits, and minimizes
+page level fragmentation.
+
Notes About Data Representation
-------------------------------
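
To make the "posting list split" paragraphs in the README hunk above a bit more concrete, here is a rough standalone sketch of the TID swap. It is not the patch's code: Tid and posting_split() are hypothetical names, and a plain array stands in for the posting list stored inside an index tuple. The sketch assumes the caller already found the position inside the list where the incoming TID belongs (what the patch calls in_posting_offset).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for ItemPointerData: a (block, offset) pair. */
typedef struct Tid
{
    uint32_t block;
    uint16_t offset;
} Tid;

/*
 * Insert *newtid at position 'offset' of a sorted posting list without
 * changing the list's length.  The TIDs from 'offset' onwards shift one
 * place to the right, and the original rightmost TID falls off the end.
 * That displaced TID is handed back through *newtid, so the caller can
 * insert it as an ordinary tuple to the right of the posting list.
 */
static void
posting_split(Tid *posting, int ntids, int offset, Tid *newtid)
{
    Tid rightmost;

    assert(ntids > 0);
    assert(offset > 0 && offset < ntids);

    /* Remember the TID that is about to be pushed out. */
    rightmost = posting[ntids - 1];

    /* Open a gap at 'offset'; regions overlap, so memmove is required. */
    memmove(&posting[offset + 1], &posting[offset],
            (size_t) (ntids - offset - 1) * sizeof(Tid));

    /* Fill the gap with the incoming TID, then swap in the old maximum. */
    posting[offset] = *newtid;
    *newtid = rightmost;
}
```

Because the list keeps the same number of TIDs, the replacement tuple has the same size as the original, which is the property the page space accounting relies on.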
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index b84bf1c..605865e 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -47,21 +47,26 @@ static void _bt_insertonpg(Relation rel, BTScanInsert itup_key,
BTStack stack,
IndexTuple itup,
OffsetNumber newitemoff,
+ int in_posting_offset,
bool split_only_page);
static Buffer _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf,
Buffer cbuf, OffsetNumber newitemoff, Size newitemsz,
- IndexTuple newitem);
+ IndexTuple newitem, IndexTuple original_newitem, IndexTuple nposting,
+ OffsetNumber in_posting_offset);
static void _bt_insert_parent(Relation rel, Buffer buf, Buffer rbuf,
BTStack stack, bool is_root, bool is_only);
static bool _bt_pgaddtup(Page page, Size itemsize, IndexTuple itup,
OffsetNumber itup_off);
static void _bt_vacuum_one_page(Relation rel, Buffer buffer, Relation heapRel);
+static void _bt_dedup_one_page(Relation rel, Buffer buffer, Relation heapRel,
+ Size itemsz);
/*
* _bt_doinsert() -- Handle insertion of a single index tuple in the tree.
*
* This routine is called by the public interface routine, btinsert.
- * By here, itup is filled in, including the TID.
+ * By here, itup is filled in, including the TID. Caller should be
+ * prepared for us to scribble on 'itup'.
*
* If checkUnique is UNIQUE_CHECK_NO or UNIQUE_CHECK_PARTIAL, this
* will allow duplicates. Otherwise (UNIQUE_CHECK_YES or
@@ -123,6 +128,7 @@ _bt_doinsert(Relation rel, IndexTuple itup,
/* PageAddItem will MAXALIGN(), but be consistent */
insertstate.itemsz = MAXALIGN(IndexTupleSize(itup));
insertstate.itup_key = itup_key;
+ insertstate.in_posting_offset = 0;
insertstate.bounds_valid = false;
insertstate.buf = InvalidBuffer;
@@ -300,7 +306,7 @@ top:
newitemoff = _bt_findinsertloc(rel, &insertstate, checkingunique,
stack, heapRel);
_bt_insertonpg(rel, itup_key, insertstate.buf, InvalidBuffer, stack,
- itup, newitemoff, false);
+ itup, newitemoff, insertstate.in_posting_offset, false);
}
else
{
@@ -435,6 +441,7 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
/* okay, we gotta fetch the heap tuple ... */
curitup = (IndexTuple) PageGetItem(page, curitemid);
+ Assert(!BTreeTupleIsPosting(curitup));
htid = curitup->t_tid;
/*
@@ -689,6 +696,7 @@ _bt_findinsertloc(Relation rel,
BTScanInsert itup_key = insertstate->itup_key;
Page page = BufferGetPage(insertstate->buf);
BTPageOpaque lpageop;
+ OffsetNumber location;
lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -751,13 +759,23 @@ _bt_findinsertloc(Relation rel,
/*
* If the target page is full, see if we can obtain enough space by
- * erasing LP_DEAD items
+ * erasing LP_DEAD items. If that doesn't work out, and if the index
+ * isn't a unique index, try deduplication.
*/
- if (PageGetFreeSpace(page) < insertstate->itemsz &&
- P_HAS_GARBAGE(lpageop))
+ if (PageGetFreeSpace(page) < insertstate->itemsz)
{
- _bt_vacuum_one_page(rel, insertstate->buf, heapRel);
- insertstate->bounds_valid = false;
+ if (P_HAS_GARBAGE(lpageop))
+ {
+ _bt_vacuum_one_page(rel, insertstate->buf, heapRel);
+ insertstate->bounds_valid = false;
+ }
+
+ if (!checkingunique && PageGetFreeSpace(page) < insertstate->itemsz)
+ {
+ _bt_dedup_one_page(rel, insertstate->buf, heapRel,
+ insertstate->itemsz);
+ insertstate->bounds_valid = false; /* paranoia */
+ }
}
}
else
@@ -839,7 +857,31 @@ _bt_findinsertloc(Relation rel,
Assert(P_RIGHTMOST(lpageop) ||
_bt_compare(rel, itup_key, page, P_HIKEY) <= 0);
- return _bt_binsrch_insert(rel, insertstate);
+ location = _bt_binsrch_insert(rel, insertstate);
+
+ /*
+ * Insertion is not prepared for the case where an LP_DEAD posting list
+ * tuple must be split. In the unlikely event that this happens, call
+ * _bt_dedup_one_page() to force it to kill all LP_DEAD items.
+ */
+ if (unlikely(insertstate->in_posting_offset == -1))
+ {
+ _bt_dedup_one_page(rel, insertstate->buf, heapRel, 0);
+ Assert(!P_HAS_GARBAGE(lpageop));
+
+ /* Must reset insertstate ahead of new _bt_binsrch_insert() call */
+ insertstate->bounds_valid = false;
+ insertstate->in_posting_offset = 0;
+ location = _bt_binsrch_insert(rel, insertstate);
+
+ /*
+ * Might still have to split some other posting list now, but that
+ * should never be LP_DEAD
+ */
+ Assert(insertstate->in_posting_offset >= 0);
+ }
+
+ return location;
}
/*
@@ -900,15 +942,65 @@ _bt_stepright(Relation rel, BTInsertState insertstate, BTStack stack)
insertstate->bounds_valid = false;
}
+/*
+ * If the new tuple 'itup' is a duplicate with a heap TID that falls inside
+ * the range of an existing posting list tuple 'oposting', generate new
+ * posting tuple to replace original one and update new tuple so that
+ * its heap TID is the rightmost heap TID of original posting tuple.
+ */
+IndexTuple
+_bt_form_newposting(IndexTuple itup, IndexTuple oposting,
+ OffsetNumber in_posting_offset)
+{
+ int nipd;
+ char *replacepos;
+ char *rightpos;
+ Size nbytes;
+ IndexTuple nposting;
+
+ Assert(BTreeTupleIsPosting(oposting));
+ nipd = BTreeTupleGetNPosting(oposting);
+ Assert(in_posting_offset < nipd);
+
+ nposting = CopyIndexTuple(oposting);
+ replacepos = (char *) BTreeTupleGetPostingN(nposting, in_posting_offset);
+ rightpos = replacepos + sizeof(ItemPointerData);
+ nbytes = (nipd - in_posting_offset - 1) * sizeof(ItemPointerData);
+
+ /*
+ * Move item pointers in posting list to make a gap for the new item's
+ * heap TID (shift TIDs one place to the right, losing original
+ * rightmost TID).
+ */
+ memmove(rightpos, replacepos, nbytes);
+
+ /*
+ * Fill the gap with the TID of the new item.
+ */
+ ItemPointerCopy(&itup->t_tid, (ItemPointer) replacepos);
+
+ /*
+ * Copy original posting list's rightmost TID (taken from oposting, not
+ * from the just-modified nposting) into new item
+ */
+ ItemPointerCopy(BTreeTupleGetPostingN(oposting, nipd - 1), &itup->t_tid);
+ Assert(ItemPointerCompare(BTreeTupleGetMaxTID(nposting),
+ BTreeTupleGetHeapTID(itup)) < 0);
+
+ return nposting;
+}
+
/*----------
* _bt_insertonpg() -- Insert a tuple on a particular page in the index.
*
* This recursive procedure does the following things:
*
+ * + if necessary, splits an existing posting list on page.
+ * This is only needed when 'in_posting_offset' is non-zero.
* + if necessary, splits the target page, using 'itup_key' for
* suffix truncation on leaf pages (caller passes NULL for
* non-leaf pages).
- * + inserts the tuple.
+ * + inserts the new tuple (could be from split posting list).
* + if the page was split, pops the parent stack, and finds the
* right place to insert the new child pointer (by walking
* right using information stored in the parent stack).
@@ -918,7 +1010,8 @@ _bt_stepright(Relation rel, BTInsertState insertstate, BTStack stack)
*
* On entry, we must have the correct buffer in which to do the
* insertion, and the buffer must be pinned and write-locked. On return,
- * we will have dropped both the pin and the lock on the buffer.
+ * we will have dropped both the pin and the lock on the buffer. Caller
+ * should be prepared for us to scribble on 'itup'.
*
* This routine only performs retail tuple insertions. 'itup' should
* always be either a non-highkey leaf item, or a downlink (new high
@@ -936,11 +1029,15 @@ _bt_insertonpg(Relation rel,
BTStack stack,
IndexTuple itup,
OffsetNumber newitemoff,
+ int in_posting_offset,
bool split_only_page)
{
Page page;
BTPageOpaque lpageop;
Size itemsz;
+ IndexTuple nposting = NULL;
+ IndexTuple oposting;
+ IndexTuple original_itup = NULL;
page = BufferGetPage(buf);
lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -954,6 +1051,8 @@ _bt_insertonpg(Relation rel,
Assert(P_ISLEAF(lpageop) ||
BTreeTupleGetNAtts(itup, rel) <=
IndexRelationGetNumberOfKeyAttributes(rel));
+ /* retail insertions of posting list tuples are disallowed */
+ Assert(!BTreeTupleIsPosting(itup));
/* The caller should've finished any incomplete splits already. */
if (P_INCOMPLETE_SPLIT(lpageop))
@@ -965,6 +1064,47 @@ _bt_insertonpg(Relation rel,
* need to be consistent */
/*
+ * Do we need to split an existing posting list item?
+ */
+ if (in_posting_offset != 0)
+ {
+ ItemId itemid = PageGetItemId(page, newitemoff);
+
+ /*
+ * The new tuple is a duplicate with a heap TID that falls inside the
+ * range of an existing posting list tuple, so split posting list.
+ *
+ * Posting list splits always replace some existing TID in the posting
+ * list with the new item's heap TID (based on a posting list offset
+ * from caller) by removing rightmost heap TID from posting list. The
+ * new item's heap TID is swapped with that rightmost heap TID, almost
+ * as if the tuple inserted never overlapped with a posting list in
+ * the first place. This allows the insertion and page split code to
+ * have minimal special case handling of posting lists.
+ *
+ * The only extra handling required is to overwrite the original
+ * posting list with nposting, which is guaranteed to be the same size
+ * as the original, keeping the page space accounting simple. This
+ * takes place in either the page insert or page split critical
+ * section.
+ */
+ Assert(P_ISLEAF(lpageop));
+ Assert(!ItemIdIsDead(itemid));
+ Assert(in_posting_offset > 0);
+ oposting = (IndexTuple) PageGetItem(page, itemid);
+
+ /* save a copy of itup with unchanged TID to write it into xlog record */
+ original_itup = CopyIndexTuple(itup);
+
+ nposting = _bt_form_newposting(itup, oposting, in_posting_offset);
+
+ Assert(BTreeTupleGetNPosting(nposting) == BTreeTupleGetNPosting(oposting));
+
+ /* Alter new item offset, since effective new item changed */
+ newitemoff = OffsetNumberNext(newitemoff);
+ }
+
+ /*
* Do we need to split the page to fit the item on it?
*
* Note: PageGetFreeSpace() subtracts sizeof(ItemIdData) from its result,
@@ -996,7 +1136,8 @@ _bt_insertonpg(Relation rel,
BlockNumberIsValid(RelationGetTargetBlock(rel))));
/* split the buffer into left and right halves */
- rbuf = _bt_split(rel, itup_key, buf, cbuf, newitemoff, itemsz, itup);
+ rbuf = _bt_split(rel, itup_key, buf, cbuf, newitemoff, itemsz, itup,
+ original_itup, nposting, in_posting_offset);
PredicateLockPageSplit(rel,
BufferGetBlockNumber(buf),
BufferGetBlockNumber(rbuf));
@@ -1075,6 +1216,18 @@ _bt_insertonpg(Relation rel,
elog(PANIC, "failed to add new item to block %u in index \"%s\"",
itup_blkno, RelationGetRelationName(rel));
+ if (nposting)
+ {
+ /*
+ * Handle a posting list split by performing an in-place update of
+ * the existing posting list
+ */
+ Assert(P_ISLEAF(lpageop));
+ Assert(MAXALIGN(IndexTupleSize(oposting)) ==
+ MAXALIGN(IndexTupleSize(nposting)));
+ memcpy(oposting, nposting, MAXALIGN(IndexTupleSize(nposting)));
+ }
+
MarkBufferDirty(buf);
if (BufferIsValid(metabuf))
@@ -1116,6 +1269,7 @@ _bt_insertonpg(Relation rel,
XLogRecPtr recptr;
xlrec.offnum = itup_off;
+ xlrec.in_posting_offset = in_posting_offset;
XLogBeginInsert();
XLogRegisterData((char *) &xlrec, SizeOfBtreeInsert);
@@ -1152,7 +1306,10 @@ _bt_insertonpg(Relation rel,
}
XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
- XLogRegisterBufData(0, (char *) itup, IndexTupleSize(itup));
+ if (original_itup)
+ XLogRegisterBufData(0, (char *) original_itup, IndexTupleSize(original_itup));
+ else
+ XLogRegisterBufData(0, (char *) itup, IndexTupleSize(itup));
recptr = XLogInsert(RM_BTREE_ID, xlinfo);
@@ -1194,6 +1351,13 @@ _bt_insertonpg(Relation rel,
_bt_getrootheight(rel) >= BTREE_FASTPATH_MIN_LEVEL)
RelationSetTargetBlock(rel, cachedBlock);
}
+
+ /* be tidy */
+ if (nposting)
+ pfree(nposting);
+ if (original_itup)
+ pfree(original_itup);
+
}
/*
@@ -1211,10 +1375,17 @@ _bt_insertonpg(Relation rel,
*
* Returns the new right sibling of buf, pinned and write-locked.
* The pin and lock on buf are maintained.
+ *
+ * nposting is a replacement posting for the posting list at the
+ * offset immediately before the new item's offset. This is needed
+ * when caller performed "posting list split", and corresponds to the
+ * same step for retail insertions that don't split the page.
*/
static Buffer
_bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
- OffsetNumber newitemoff, Size newitemsz, IndexTuple newitem)
+ OffsetNumber newitemoff, Size newitemsz, IndexTuple newitem,
+ IndexTuple original_newitem,
+ IndexTuple nposting, OffsetNumber in_posting_offset)
{
Buffer rbuf;
Page origpage;
@@ -1236,6 +1407,7 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
OffsetNumber firstright;
OffsetNumber maxoff;
OffsetNumber i;
+ OffsetNumber replacepostingoff = InvalidOffsetNumber;
bool newitemonleft,
isleaf;
IndexTuple lefthikey;
@@ -1243,6 +1415,13 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
int indnkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
/*
+ * Determine offset number of posting list that will be updated in place
+ * as part of split that follows a posting list split
+ */
+ if (nposting != NULL)
+ replacepostingoff = OffsetNumberPrev(newitemoff);
+
+ /*
* origpage is the original page to be split. leftpage is a temporary
* buffer that receives the left-sibling data, which will be copied back
* into origpage on success. rightpage is the new page that will receive
@@ -1273,6 +1452,13 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
* newitemoff == firstright. In all other cases it's clear which side of
* the split every tuple goes on from context. newitemonleft is usually
* (but not always) redundant information.
+ *
+ * Note: In theory, the split point choice logic should operate against a
+ * version of the page that already replaced the posting list at offset
+ * replacepostingoff with nposting where applicable. We don't bother with
+ * that, though. Both versions of the posting list must be the same size
+ * and have the same key values, so this omission can't affect the split
+ * point chosen in practice.
*/
firstright = _bt_findsplitloc(rel, origpage, newitemoff, newitemsz,
newitem, &newitemonleft);
@@ -1340,6 +1526,9 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
itemid = PageGetItemId(origpage, firstright);
itemsz = ItemIdGetLength(itemid);
item = (IndexTuple) PageGetItem(origpage, itemid);
+ /* Behave as if origpage posting list has already been swapped */
+ if (firstright == replacepostingoff)
+ item = nposting;
}
/*
@@ -1373,6 +1562,9 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
Assert(lastleftoff >= P_FIRSTDATAKEY(oopaque));
itemid = PageGetItemId(origpage, lastleftoff);
lastleft = (IndexTuple) PageGetItem(origpage, itemid);
+ /* Behave as if origpage posting list has already been swapped */
+ if (lastleftoff == replacepostingoff)
+ lastleft = nposting;
}
Assert(lastleft != item);
@@ -1480,8 +1672,23 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
itemsz = ItemIdGetLength(itemid);
item = (IndexTuple) PageGetItem(origpage, itemid);
+ /*
+ * did caller pass new replacement posting list tuple due to posting
+ * list split?
+ */
+ if (i == replacepostingoff)
+ {
+ /*
+ * swap origpage posting list with post-posting-list-split version
+ * from caller
+ */
+ Assert(isleaf);
+ Assert(itemsz == MAXALIGN(IndexTupleSize(nposting)));
+ item = nposting;
+ }
+
/* does new item belong before this one? */
- if (i == newitemoff)
+ else if (i == newitemoff)
{
if (newitemonleft)
{
@@ -1653,6 +1860,17 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
xlrec.firstright = firstright;
xlrec.newitemoff = newitemoff;
+ /*
+ * If the replacement posting list tuple ends up on the right page,
+ * we don't need to WAL-log it explicitly, because it's included
+ * with all the other items on the right page.
+ * Otherwise, save in_posting_offset and newitem so that the
+ * replacement tuple can be reconstructed.
+ */
+ xlrec.in_posting_offset = InvalidOffsetNumber;
+ if (replacepostingoff < firstright)
+ xlrec.in_posting_offset = in_posting_offset;
+
XLogBeginInsert();
XLogRegisterData((char *) &xlrec, SizeOfBtreeSplit);
@@ -1672,9 +1890,23 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
* is not stored if XLogInsert decides it needs a full-page image of
* the left page. We store the offset anyway, though, to support
* archive compression of these records.
+ *
+ * Also log the original newitem when a posting list split took
+ * place, since it's needed to construct the new posting list.
*/
- if (newitemonleft)
- XLogRegisterBufData(0, (char *) newitem, MAXALIGN(newitemsz));
+ if (newitemonleft || xlrec.in_posting_offset)
+ {
+ if (xlrec.in_posting_offset)
+ {
+ Assert(original_newitem != NULL);
+ Assert(ItemPointerCompare(&original_newitem->t_tid, &newitem->t_tid) != 0);
+
+ XLogRegisterBufData(0, (char *) original_newitem,
+ MAXALIGN(IndexTupleSize(original_newitem)));
+ }
+ else
+ XLogRegisterBufData(0, (char *) newitem, MAXALIGN(newitemsz));
+ }
/* Log the left page's new high key */
itemid = PageGetItemId(origpage, P_HIKEY);
@@ -1834,7 +2066,7 @@ _bt_insert_parent(Relation rel,
/* Recursively insert into the parent */
_bt_insertonpg(rel, NULL, pbuf, buf, stack->bts_parent,
- new_item, stack->bts_offset + 1,
+ new_item, stack->bts_offset + 1, 0,
is_only);
/* be tidy */
@@ -2304,6 +2536,277 @@ _bt_vacuum_one_page(Relation rel, Buffer buffer, Relation heapRel)
* Note: if we didn't find any LP_DEAD items, then the page's
* BTP_HAS_GARBAGE hint bit is falsely set. We do not bother expending a
* separate write to clear it, however. We will clear it when we split
- * the page.
+ * the page (or when deduplication runs).
+ */
+}
+
+/*
+ * Try to deduplicate items to free some space. If we don't proceed with
+ * deduplication, buffer will contain old state of the page.
+ *
+ * 'itemsz' is the size of the inserter caller's incoming/new tuple, not
+ * including line pointer overhead. This is the amount of space we'll need to
+ * free in order to let caller avoid splitting the page.
+ *
+ * This function should be called after LP_DEAD items were removed by
+ * _bt_vacuum_one_page() to prevent a page split. (It's possible that we'll
+ * have to kill additional LP_DEAD items, but that should be rare.)
+ */
+static void
+_bt_dedup_one_page(Relation rel, Buffer buffer, Relation heapRel, Size itemsz)
+{
+ OffsetNumber offnum,
+ minoff,
+ maxoff;
+ Page page = BufferGetPage(buffer);
+ Page newpage;
+ BTPageOpaque oopaque,
+ nopaque;
+ bool deduplicate = false;
+ BTDedupState *dedupState = NULL;
+ int natts = IndexRelationGetNumberOfAttributes(rel);
+ OffsetNumber deletable[MaxOffsetNumber];
+ int ndeletable = 0;
+
+ /*
+ * Don't use deduplication for unique indexes or for indexes with
+ * INCLUDEd columns
+ */
+ deduplicate = (IndexRelationGetNumberOfKeyAttributes(rel) ==
+ IndexRelationGetNumberOfAttributes(rel) &&
+ !rel->rd_index->indisunique);
+ if (!deduplicate)
+ return;
+
+ oopaque = (BTPageOpaque) PageGetSpecialPointer(page);
+ /* init deduplication state needed to build posting tuples */
+ dedupState = (BTDedupState *) palloc0(sizeof(BTDedupState));
+ dedupState->ipd = NULL;
+ dedupState->ntuples = 0;
+ dedupState->itupprev = NULL;
+ dedupState->maxitemsize = BTMaxItemSize(page);
+ dedupState->maxpostingsize = 0;
+
+ minoff = P_FIRSTDATAKEY(oopaque);
+ maxoff = PageGetMaxOffsetNumber(page);
+
+ /*
+ * Delete dead tuples if any. We cannot simply skip them in the loop
+ * below, because it's necessary to generate a special WAL record
+ * containing such tuples to compute latestRemovedXid on a standby server
+ * later.
+ *
+ * This should not affect performance, since it only can happen in a rare
+ * situation when BTP_HAS_GARBAGE flag was not set and _bt_vacuum_one_page
+ * was not called, or _bt_vacuum_one_page didn't remove all dead items.
*/
+ for (offnum = minoff;
+ offnum <= maxoff;
+ offnum = OffsetNumberNext(offnum))
+ {
+ ItemId itemid = PageGetItemId(page, offnum);
+
+ if (ItemIdIsDead(itemid))
+ deletable[ndeletable++] = offnum;
+ }
+
+ if (ndeletable > 0)
+ {
+ /*
+ * Skip deduplication in rare cases where there were LP_DEAD items
+ * encountered here when that frees sufficient space for caller to
+ * avoid a page split
+ */
+ _bt_delitems_delete(rel, buffer, deletable, ndeletable, heapRel);
+ if (PageGetFreeSpace(page) >= itemsz)
+ {
+ pfree(dedupState);
+ return;
+ }
+
+ /* Continue with deduplication */
+ minoff = P_FIRSTDATAKEY(oopaque);
+ maxoff = PageGetMaxOffsetNumber(page);
+ }
+
+ /*
+ * Scan over all items to see which ones can be deduplicated
+ */
+ newpage = PageGetTempPageCopySpecial(page);
+ nopaque = (BTPageOpaque) PageGetSpecialPointer(newpage);
+
+ /* Make sure that new page won't have garbage flag set */
+ nopaque->btpo_flags &= ~BTP_HAS_GARBAGE;
+
+ /* Copy High Key if any */
+ if (!P_RIGHTMOST(oopaque))
+ {
+ ItemId hitemid = PageGetItemId(page, P_HIKEY);
+ Size hitemsz = ItemIdGetLength(hitemid);
+ IndexTuple hitem = (IndexTuple) PageGetItem(page, hitemid);
+
+ if (PageAddItem(newpage, (Item) hitem, hitemsz, P_HIKEY,
+ false, false) == InvalidOffsetNumber)
+ elog(ERROR, "failed to add highkey during deduplication");
+ }
+
+ /*
+ * Iterate over tuples on the page, try to deduplicate them into posting
+ * lists and insert into new page.
+ */
+ for (offnum = minoff;
+ offnum <= maxoff;
+ offnum = OffsetNumberNext(offnum))
+ {
+ ItemId itemid = PageGetItemId(page, offnum);
+ IndexTuple itup = (IndexTuple) PageGetItem(page, itemid);
+
+ Assert(!ItemIdIsDead(itemid));
+
+ if (dedupState->itupprev == NULL)
+ {
+ /* Just set up base/first item in first iteration */
+ Assert(offnum == minoff);
+ dedupState->itupprev = CopyIndexTuple(itup);
+ dedupState->itupprev_off = offnum;
+ continue;
+ }
+
+ if (deduplicate &&
+ _bt_keep_natts_fast(rel, dedupState->itupprev, itup) > natts)
+ {
+ int itup_ntuples;
+ Size projpostingsz;
+
+ /*
+ * Tuples are equal.
+ *
+ * If posting list does not exceed tuple size limit then append
+ * the tuple to the pending posting list. Otherwise, insert it on
+ * page and continue with this tuple as new pending posting list.
+ */
+ itup_ntuples = BTreeTupleIsPosting(itup) ?
+ BTreeTupleGetNPosting(itup) : 1;
+
+ /*
+ * Project size of new posting list that would result from merging
+ * current tup with pending posting list (could just be prev item
+ * that's "pending").
+ *
+ * This accounting looks odd, but it's correct because ...
+ */
+ projpostingsz = MAXALIGN(IndexTupleSize(dedupState->itupprev) +
+ (dedupState->ntuples + itup_ntuples + 1) *
+ sizeof(ItemPointerData));
+
+ if (projpostingsz <= dedupState->maxitemsize)
+ _bt_stash_item_tid(dedupState, itup, offnum);
+ else
+ _bt_dedup_insert(newpage, dedupState);
+ }
+ else
+ {
+ /*
+ * Tuples are not equal, or we're done deduplicating this page.
+ *
+ * Insert pending posting list on page. This could just be a
+ * regular tuple.
+ */
+ _bt_dedup_insert(newpage, dedupState);
+ }
+
+ pfree(dedupState->itupprev);
+ dedupState->itupprev = CopyIndexTuple(itup);
+ dedupState->itupprev_off = offnum;
+
+ Assert(IndexTupleSize(dedupState->itupprev) <= dedupState->maxitemsize);
+ }
+
+ /* Handle the last item */
+ _bt_dedup_insert(newpage, dedupState);
+
+ /*
+ * If no items suitable for deduplication were found, newpage must be
+ * exactly the same as the original page, so just return from function.
+ */
+ if (dedupState->n_intervals == 0)
+ {
+ pfree(dedupState);
+ return;
+ }
+
+ START_CRIT_SECTION();
+
+ PageRestoreTempPage(newpage, page);
+ MarkBufferDirty(buffer);
+
+ /* WAL-log the deduplication intervals for this page */
+ if (RelationNeedsWAL(rel))
+ {
+ XLogRecPtr recptr;
+ xl_btree_dedup xlrec_dedup;
+
+ xlrec_dedup.n_intervals = dedupState->n_intervals;
+
+ XLogBeginInsert();
+ XLogRegisterBuffer(0, buffer, REGBUF_STANDARD);
+ XLogRegisterData((char *) &xlrec_dedup, SizeOfBtreeDedup);
+
+ /* only save the non-empty part of the array */
+ if (dedupState->n_intervals > 0)
+ XLogRegisterData((char *) dedupState->dedup_intervals,
+ dedupState->n_intervals * sizeof(dedupInterval));
+
+ recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_DEDUP_PAGE);
+
+ PageSetLSN(page, recptr);
+ }
+
+ END_CRIT_SECTION();
+
+ /* be tidy */
+ pfree(dedupState);
+}
+
+/*
+ * Add new posting tuple item to the page based on itupprev and saved list of
+ * heap TIDs.
+ */
+void
+_bt_dedup_insert(Page page, BTDedupState *dedupState)
+{
+ IndexTuple to_insert;
+ OffsetNumber offnum = PageGetMaxOffsetNumber(page);
+
+ if (dedupState->ntuples == 0)
+ {
+ /*
+ * Use original itupprev, which may or may not be a posting list
+ * already from some earlier dedup attempt
+ */
+ to_insert = dedupState->itupprev;
+ }
+ else
+ {
+ IndexTuple postingtuple;
+
+ /* form a tuple with a posting list */
+ postingtuple = BTreeFormPostingTuple(dedupState->itupprev,
+ dedupState->ipd,
+ dedupState->ntuples);
+ to_insert = postingtuple;
+ pfree(dedupState->ipd);
+ }
+
+ Assert(IndexTupleSize(dedupState->itupprev) <= dedupState->maxitemsize);
+ /* Add the new item into the page */
+ offnum = OffsetNumberNext(offnum);
+
+ if (PageAddItem(page, (Item) to_insert, IndexTupleSize(to_insert),
+ offnum, false, false) == InvalidOffsetNumber)
+ elog(ERROR, "deduplication failed to add tuple to page");
+
+ if (dedupState->ntuples > 0)
+ pfree(to_insert);
+ dedupState->ntuples = 0;
}
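A note on the size check in _bt_dedup_one_page() above. The following is only a standalone sketch of the arithmetic, not code from the patch. It assumes an 8-byte maximum alignment and a 6-byte heap TID, which are typical but platform dependent, and every name prefixed with sketch_ or SKETCH_ is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed values for illustration only; PostgreSQL derives these per build. */
#define SKETCH_MAXIMUM_ALIGNOF 8
#define SKETCH_TID_SIZE 6		/* assumed sizeof(ItemPointerData) */

/* Round len up to the next multiple of the assumed maximum alignment. */
#define SKETCH_MAXALIGN(len) \
	(((size_t) (len) + (SKETCH_MAXIMUM_ALIGNOF - 1)) & \
	 ~((size_t) (SKETCH_MAXIMUM_ALIGNOF - 1)))

/*
 * Mirror of the projpostingsz expression: size of the pending base tuple
 * plus room for the pending TIDs, the incoming TIDs, and one extra TID,
 * rounded up to the alignment boundary.
 */
static size_t
sketch_projected_posting_size(size_t base_tuple_size,
							  int pending_ntuples, int incoming_ntuples)
{
	assert(pending_ntuples >= 0);
	assert(incoming_ntuples >= 1);

	return SKETCH_MAXALIGN(base_tuple_size +
						   (size_t) (pending_ntuples + incoming_ntuples + 1) *
						   SKETCH_TID_SIZE);
}

/* Nonzero when the merged posting list would still fit in one item. */
static int
sketch_posting_fits(size_t base_tuple_size, int pending_ntuples,
					int incoming_ntuples, size_t maxitemsize)
{
	return sketch_projected_posting_size(base_tuple_size, pending_ntuples,
										 incoming_ntuples) <= maxitemsize;
}
```

With these assumed sizes, a 16-byte base tuple with nothing pending and one incoming TID projects to 32 bytes, so the merge is accepted whenever maxitemsize is at least 32.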
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 268f869..5314bbe 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -24,6 +24,7 @@
#include "access/nbtree.h"
#include "access/nbtxlog.h"
+#include "access/tableam.h"
#include "access/transam.h"
#include "access/xlog.h"
#include "access/xloginsert.h"
@@ -42,6 +43,11 @@ static bool _bt_lock_branch_parent(Relation rel, BlockNumber child,
BlockNumber *target, BlockNumber *rightsib);
static void _bt_log_reuse_page(Relation rel, BlockNumber blkno,
TransactionId latestRemovedXid);
+static TransactionId _bt_compute_xid_horizon_for_tuples(Relation rel,
+ Relation heapRel,
+ Buffer buf,
+ OffsetNumber *itemnos,
+ int nitems);
/*
* _bt_initmetapage() -- Fill a page buffer with a correct metapage image
@@ -983,14 +989,52 @@ _bt_page_recyclable(Page page)
void
_bt_delitems_vacuum(Relation rel, Buffer buf,
OffsetNumber *itemnos, int nitems,
+ OffsetNumber *remainingoffset,
+ IndexTuple *remaining, int nremaining,
BlockNumber lastBlockVacuumed)
{
Page page = BufferGetPage(buf);
BTPageOpaque opaque;
+ Size itemsz;
+ Size remaining_sz = 0;
+ char *remaining_buf = NULL;
+
+ /* XLOG stuff: build a buffer holding the remaining (rewritten) tuples */
+ if (nremaining && RelationNeedsWAL(rel))
+ {
+ Size offset = 0;
+
+ for (int i = 0; i < nremaining; i++)
+ remaining_sz += MAXALIGN(IndexTupleSize(remaining[i]));
+
+ remaining_buf = palloc0(remaining_sz);
+ for (int i = 0; i < nremaining; i++)
+ {
+ itemsz = IndexTupleSize(remaining[i]);
+ memcpy(remaining_buf + offset, (char *) remaining[i], itemsz);
+ offset += MAXALIGN(itemsz);
+ }
+ Assert(offset == remaining_sz);
+ }
/* No ereport(ERROR) until changes are logged */
START_CRIT_SECTION();
+ /* Handle posting tuples here */
+ for (int i = 0; i < nremaining; i++)
+ {
+ /* First, delete the old tuple. */
+ PageIndexTupleDelete(page, remainingoffset[i]);
+
+ itemsz = IndexTupleSize(remaining[i]);
+ itemsz = MAXALIGN(itemsz);
+
+ /* Add tuple with remaining ItemPointers to the page. */
+ if (PageAddItem(page, (Item) remaining[i], itemsz, remainingoffset[i],
+ false, false) == InvalidOffsetNumber)
+ elog(ERROR, "failed to rewrite posting list item in index while doing vacuum");
+ }
+
/* Fix the page */
if (nitems > 0)
PageIndexMultiDelete(page, itemnos, nitems);
@@ -1020,6 +1064,8 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
xl_btree_vacuum xlrec_vacuum;
xlrec_vacuum.lastBlockVacuumed = lastBlockVacuumed;
+ xlrec_vacuum.nremaining = nremaining;
+ xlrec_vacuum.ndeleted = nitems;
XLogBeginInsert();
XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
@@ -1033,6 +1079,19 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
if (nitems > 0)
XLogRegisterBufData(0, (char *) itemnos, nitems * sizeof(OffsetNumber));
+ /*
+ * Save the offsets of the rewritten posting tuples along with the
+ * tuples themselves. Replay must apply them in the correct order:
+ * first the remaining tuples, and only then the deleted items.
+ */
+ if (nremaining > 0)
+ {
+ Assert(remaining_buf != NULL);
+ XLogRegisterBufData(0, (char *) remainingoffset,
+ nremaining * sizeof(OffsetNumber));
+ XLogRegisterBufData(0, remaining_buf, remaining_sz);
+ }
+
recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_VACUUM);
PageSetLSN(page, recptr);
@@ -1042,6 +1101,91 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
}
/*
+ * Get the latestRemovedXid from the table entries pointed at by the index
+ * tuples being deleted.
+ *
+ * This is a version of index_compute_xid_horizon_for_tuples() specialized to
+ * nbtree, which can handle posting lists.
+ */
+static TransactionId
+_bt_compute_xid_horizon_for_tuples(Relation rel, Relation heapRel,
+ Buffer buf, OffsetNumber *itemnos,
+ int nitems)
+{
+ ItemPointerData *ttids;
+ TransactionId latestRemovedXid = InvalidTransactionId;
+ Page page = BufferGetPage(buf);
+ int arraynitems;
+ int finalnitems;
+
+ /*
+ * Initial size of array can fit everything when it turns out that there
+ * are no posting lists
+ */
+ arraynitems = nitems;
+ ttids = (ItemPointerData *) palloc(sizeof(ItemPointerData) * arraynitems);
+
+ finalnitems = 0;
+ /* identify what the index tuples about to be deleted point to */
+ for (int i = 0; i < nitems; i++)
+ {
+ ItemId itemid;
+ IndexTuple itup;
+
+ itemid = PageGetItemId(page, itemnos[i]);
+ itup = (IndexTuple) PageGetItem(page, itemid);
+
+ Assert(ItemIdIsDead(itemid));
+
+ if (!BTreeTupleIsPosting(itup))
+ {
+ /* Make sure that we have space for additional heap TID */
+ if (finalnitems + 1 > arraynitems)
+ {
+ arraynitems = arraynitems * 2;
+ ttids = (ItemPointerData *)
+ repalloc(ttids, sizeof(ItemPointerData) * arraynitems);
+ }
+
+ Assert(ItemPointerIsValid(&itup->t_tid));
+ ItemPointerCopy(&itup->t_tid, &ttids[finalnitems]);
+ finalnitems++;
+ }
+ else
+ {
+ int nposting = BTreeTupleGetNPosting(itup);
+
+ /* Make sure that we have space for additional heap TIDs */
+ if (finalnitems + nposting > arraynitems)
+ {
+ arraynitems = Max(arraynitems * 2, finalnitems + nposting);
+ ttids = (ItemPointerData *)
+ repalloc(ttids, sizeof(ItemPointerData) * arraynitems);
+ }
+
+ for (int j = 0; j < nposting; j++)
+ {
+ ItemPointer htid = BTreeTupleGetPostingN(itup, j);
+
+ Assert(ItemPointerIsValid(htid));
+ ItemPointerCopy(htid, &ttids[finalnitems]);
+ finalnitems++;
+ }
+ }
+ }
+
+ Assert(finalnitems >= nitems);
+
+ /* determine the actual xid horizon */
+ latestRemovedXid =
+ table_compute_xid_horizon_for_tuples(heapRel, ttids, finalnitems);
+
+ pfree(ttids);
+
+ return latestRemovedXid;
+}
+
+/*
* Delete item(s) from a btree page during single-page cleanup.
*
* As above, must only be used on leaf pages.
@@ -1067,8 +1211,8 @@ _bt_delitems_delete(Relation rel, Buffer buf,
if (XLogStandbyInfoActive() && RelationNeedsWAL(rel))
latestRemovedXid =
- index_compute_xid_horizon_for_tuples(rel, heapRel, buf,
- itemnos, nitems);
+ _bt_compute_xid_horizon_for_tuples(rel, heapRel, buf,
+ itemnos, nitems);
/* No ereport(ERROR) until changes are logged */
START_CRIT_SECTION();
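To illustrate how _bt_compute_xid_horizon_for_tuples() above flattens plain and posting tuples into one TID array, here is a minimal standalone sketch. It is not the patch's code: it uses malloc/realloc instead of palloc/repalloc, and SketchTid, SketchItem and sketch_collect_tids are hypothetical stand-ins for ItemPointerData, an index tuple, and the collection loop.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a heap TID. */
typedef struct SketchTid
{
	unsigned	block;
	unsigned short offset;
} SketchTid;

/* Hypothetical stand-in for an index tuple: one TID, or a posting list. */
typedef struct SketchItem
{
	const SketchTid *tids;
	int			ntids;
} SketchItem;

/*
 * Collect every heap TID referenced by the given items into one array.
 * The array starts out sized for "no posting lists" and grows to at least
 * double, or to exactly what is needed if that is larger.
 * Returns a malloc'd array (caller frees), or NULL on allocation failure.
 */
static SketchTid *
sketch_collect_tids(const SketchItem *items, int nitems, int *nout)
{
	int			capacity = nitems > 0 ? nitems : 1;
	int			count = 0;
	SketchTid  *out = malloc(sizeof(SketchTid) * capacity);

	*nout = 0;
	if (out == NULL)
		return NULL;

	for (int i = 0; i < nitems; i++)
	{
		int			needed = count + items[i].ntids;

		assert(items[i].ntids >= 1);

		if (needed > capacity)
		{
			int			newcap = capacity * 2 > needed ? capacity * 2 : needed;
			SketchTid  *grown = realloc(out, sizeof(SketchTid) * newcap);

			if (grown == NULL)
			{
				free(out);
				return NULL;
			}
			out = grown;
			capacity = newcap;
		}

		memcpy(out + count, items[i].tids, sizeof(SketchTid) * items[i].ntids);
		count = needed;
	}

	*nout = count;
	return out;
}
```

The resulting count is always at least the number of items, which is what the Assert(finalnitems >= nitems) in the patch checks.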
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 4cfd528..6759531 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -97,6 +97,8 @@ static void btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
BTCycleId cycleid, TransactionId *oldestBtpoXact);
static void btvacuumpage(BTVacState *vstate, BlockNumber blkno,
BlockNumber orig_blkno);
+static ItemPointer btreevacuumPosting(BTVacState *vstate, IndexTuple itup,
+ int *nremaining);
/*
@@ -263,8 +265,8 @@ btgettuple(IndexScanDesc scan, ScanDirection dir)
*/
if (so->killedItems == NULL)
so->killedItems = (int *)
- palloc(MaxIndexTuplesPerPage * sizeof(int));
- if (so->numKilled < MaxIndexTuplesPerPage)
+ palloc(MaxPostingIndexTuplesPerPage * sizeof(int));
+ if (so->numKilled < MaxPostingIndexTuplesPerPage)
so->killedItems[so->numKilled++] = so->currPos.itemIndex;
}
@@ -1069,7 +1071,8 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
RBM_NORMAL, info->strategy);
LockBufferForCleanup(buf);
_bt_checkpage(rel, buf);
- _bt_delitems_vacuum(rel, buf, NULL, 0, vstate.lastBlockVacuumed);
+ _bt_delitems_vacuum(rel, buf, NULL, 0, NULL, NULL, 0,
+ vstate.lastBlockVacuumed);
_bt_relbuf(rel, buf);
}
@@ -1193,6 +1196,9 @@ restart:
OffsetNumber offnum,
minoff,
maxoff;
+ IndexTuple remaining[MaxOffsetNumber];
+ OffsetNumber remainingoffset[MaxOffsetNumber];
+ int nremaining;
/*
* Trade in the initial read lock for a super-exclusive write lock on
@@ -1229,6 +1235,7 @@ restart:
* callback function.
*/
ndeletable = 0;
+ nremaining = 0;
minoff = P_FIRSTDATAKEY(opaque);
maxoff = PageGetMaxOffsetNumber(page);
if (callback)
@@ -1242,31 +1249,79 @@ restart:
itup = (IndexTuple) PageGetItem(page,
PageGetItemId(page, offnum));
- htup = &(itup->t_tid);
- /*
- * During Hot Standby we currently assume that
- * XLOG_BTREE_VACUUM records do not produce conflicts. That is
- * only true as long as the callback function depends only
- * upon whether the index tuple refers to heap tuples removed
- * in the initial heap scan. When vacuum starts it derives a
- * value of OldestXmin. Backends taking later snapshots could
- * have a RecentGlobalXmin with a later xid than the vacuum's
- * OldestXmin, so it is possible that row versions deleted
- * after OldestXmin could be marked as killed by other
- * backends. The callback function *could* look at the index
- * tuple state in isolation and decide to delete the index
- * tuple, though currently it does not. If it ever did, we
- * would need to reconsider whether XLOG_BTREE_VACUUM records
- * should cause conflicts. If they did cause conflicts they
- * would be fairly harsh conflicts, since we haven't yet
- * worked out a way to pass a useful value for
- * latestRemovedXid on the XLOG_BTREE_VACUUM records. This
- * applies to *any* type of index that marks index tuples as
- * killed.
- */
- if (callback(htup, callback_state))
- deletable[ndeletable++] = offnum;
+ if (BTreeTupleIsPosting(itup))
+ {
+ int nnewipd = 0;
+ ItemPointer newipd = NULL;
+
+ newipd = btreevacuumPosting(vstate, itup, &nnewipd);
+
+ if (nnewipd == 0)
+ {
+ /*
+ * All TIDs from posting list must be deleted, so we can
+ * delete the whole tuple in the regular way.
+ */
+ deletable[ndeletable++] = offnum;
+ }
+ else if (nnewipd == BTreeTupleGetNPosting(itup))
+ {
+ /*
+ * All TIDs from posting tuple must remain. Do
+ * nothing, just clean up.
+ */
+ pfree(newipd);
+ }
+ else if (nnewipd < BTreeTupleGetNPosting(itup))
+ {
+ /* Some TIDs from posting tuple must remain. */
+ Assert(nnewipd > 0);
+ Assert(newipd != NULL);
+
+ /*
+ * Form new tuple that contains only remaining TIDs.
+ * Remember this tuple and the offset of the old tuple
+ * to update it in place.
+ */
+ remainingoffset[nremaining] = offnum;
+ remaining[nremaining] =
+ BTreeFormPostingTuple(itup, newipd, nnewipd);
+ nremaining++;
+ pfree(newipd);
+
+ Assert(IndexTupleSize(itup) <= BTMaxItemSize(page));
+ }
+ }
+ else
+ {
+ htup = &(itup->t_tid);
+
+ /*
+ * During Hot Standby we currently assume that
+ * XLOG_BTREE_VACUUM records do not produce conflicts.
+ * That is only true as long as the callback function
+ * depends only upon whether the index tuple refers to
+ * heap tuples removed in the initial heap scan. When
+ * vacuum starts it derives a value of OldestXmin.
+ * Backends taking later snapshots could have a
+ * RecentGlobalXmin with a later xid than the vacuum's
+ * OldestXmin, so it is possible that row versions deleted
+ * after OldestXmin could be marked as killed by other
+ * backends. The callback function *could* look at the
+ * index tuple state in isolation and decide to delete the
+ * index tuple, though currently it does not. If it ever
+ * did, we would need to reconsider whether
+ * XLOG_BTREE_VACUUM records should cause conflicts. If
+ * they did cause conflicts they would be fairly harsh
+ * conflicts, since we haven't yet worked out a way to
+ * pass a useful value for latestRemovedXid on the
+ * XLOG_BTREE_VACUUM records. This applies to *any* type
+ * of index that marks index tuples as killed.
+ */
+ if (callback(htup, callback_state))
+ deletable[ndeletable++] = offnum;
+ }
}
}
@@ -1274,7 +1329,7 @@ restart:
* Apply any needed deletes. We issue just one _bt_delitems_vacuum()
* call per page, so as to minimize WAL traffic.
*/
- if (ndeletable > 0)
+ if (ndeletable > 0 || nremaining > 0)
{
/*
* Notice that the issued XLOG_BTREE_VACUUM WAL record includes
@@ -1291,6 +1346,7 @@ restart:
* that.
*/
_bt_delitems_vacuum(rel, buf, deletable, ndeletable,
+ remainingoffset, remaining, nremaining,
vstate->lastBlockVacuumed);
/*
@@ -1376,6 +1432,41 @@ restart:
}
/*
+ * btreevacuumPosting() -- vacuums a posting tuple.
+ *
+ * Returns new palloc'd posting list with remaining items.
+ * Posting list size is returned via nremaining.
+ *
+ * If all items are dead,
+ * nremaining is 0 and resulting posting list is NULL.
+ */
+static ItemPointer
+btreevacuumPosting(BTVacState *vstate, IndexTuple itup, int *nremaining)
+{
+ int remaining = 0;
+ int nitem = BTreeTupleGetNPosting(itup);
+ ItemPointer tmpitems = NULL,
+ items = BTreeTupleGetPosting(itup);
+
+ /*
+ * Check each TID in the posting list, and save live TIDs into tmpitems
+ */
+ for (int i = 0; i < nitem; i++)
+ {
+ if (vstate->callback(items + i, vstate->callback_state))
+ continue;
+
+ if (tmpitems == NULL)
+ tmpitems = palloc(sizeof(ItemPointerData) * nitem);
+
+ tmpitems[remaining++] = items[i];
+ }
+
+ *nremaining = remaining;
+ return tmpitems;
+}
+
+/*
* btcanreturn() -- Check whether btree indexes support index-only scans.
*
* btrees always do, so this is trivial.
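The three outcomes that btvacuumpage() distinguishes for a posting tuple (delete it, leave it alone, or rewrite it with fewer TIDs) can be shown with a small standalone sketch of the filtering step. This is only an illustration under simplifying assumptions: it uses malloc instead of palloc, aborts on allocation failure, and all sketch_ names, including the sample callback, are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for a heap TID. */
typedef struct SketchTid
{
	unsigned	block;
	unsigned short offset;
} SketchTid;

/* Returns true when the heap tuple behind tid is dead. */
typedef bool (*SketchDeadCallback) (const SketchTid *tid, void *state);

/* Sample callback: treat every TID on one given heap block as dead. */
static bool
sketch_dead_if_block_matches(const SketchTid *tid, void *state)
{
	return tid->block == *(unsigned *) state;
}

/*
 * Keep only the live TIDs of a posting list.
 *
 * Returns a malloc'd array of survivors and stores their count in
 * *nremaining.  When every TID is dead the result is NULL and
 * *nremaining is 0, matching the contract described for
 * btreevacuumPosting().
 */
static SketchTid *
sketch_vacuum_posting(const SketchTid *items, int nitems,
					  SketchDeadCallback is_dead, void *state,
					  int *nremaining)
{
	SketchTid  *kept = NULL;
	int			remaining = 0;

	assert(nitems > 0);

	for (int i = 0; i < nitems; i++)
	{
		if (is_dead(&items[i], state))
			continue;

		/* Allocate lazily so the all-dead case never allocates. */
		if (kept == NULL)
		{
			kept = malloc(sizeof(SketchTid) * nitems);
			if (kept == NULL)
				abort();
		}
		kept[remaining++] = items[i];
	}

	*nremaining = remaining;
	return kept;
}
```

A caller would then compare *nremaining with the original count: 0 means the whole tuple is deletable, an unchanged count means nothing to do, and anything in between means the tuple has to be rewritten in place.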
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 8e51246..c78c8e6 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -26,10 +26,18 @@
static void _bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp);
static OffsetNumber _bt_binsrch(Relation rel, BTScanInsert key, Buffer buf);
+static int _bt_binsrch_posting(BTScanInsert key, Page page,
+ OffsetNumber offnum);
static bool _bt_readpage(IndexScanDesc scan, ScanDirection dir,
OffsetNumber offnum);
static void _bt_saveitem(BTScanOpaque so, int itemIndex,
OffsetNumber offnum, IndexTuple itup);
+static void _bt_setuppostingitems(BTScanOpaque so, int itemIndex,
+ OffsetNumber offnum, ItemPointer iptr,
+ IndexTuple itup);
+static inline void _bt_savepostingitem(BTScanOpaque so, int itemIndex,
+ OffsetNumber offnum, ItemPointer iptr,
+ IndexTuple itup);
static bool _bt_steppage(IndexScanDesc scan, ScanDirection dir);
static bool _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir);
static bool _bt_parallel_readpage(IndexScanDesc scan, BlockNumber blkno,
@@ -434,7 +442,10 @@ _bt_binsrch(Relation rel,
* low) makes bounds invalid.
*
* Caller is responsible for invalidating bounds when it modifies the page
- * before calling here a second time.
+ * before calling here a second time, and for dealing with posting list
+ * tuple matches (callers can use insertstate's in_posting_offset field to
+ * determine which existing heap TID will need to be replaced by their
+ * scantid/new heap TID).
*/
OffsetNumber
_bt_binsrch_insert(Relation rel, BTInsertState insertstate)
@@ -453,6 +464,7 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
Assert(P_ISLEAF(opaque));
Assert(!key->nextkey);
+ Assert(insertstate->in_posting_offset == 0);
if (!insertstate->bounds_valid)
{
@@ -509,6 +521,17 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
if (result != 0)
stricthigh = high;
}
+
+ /*
+ * If tuple at offset located by binary search is a posting list whose
+ * TID range overlaps with caller's scantid, perform posting list
+ * binary search to set in_posting_offset for caller. Caller must
+ * split the posting list when in_posting_offset is set. This should
+ * happen infrequently.
+ */
+ if (unlikely(result == 0 && key->scantid != NULL))
+ insertstate->in_posting_offset =
+ _bt_binsrch_posting(key, page, mid);
}
/*
@@ -529,6 +552,68 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
}
/*----------
+ * _bt_binsrch_posting() -- posting list binary search.
+ *
+ * Returns offset into posting list where caller's scantid belongs.
+ *----------
+ */
+static int
+_bt_binsrch_posting(BTScanInsert key, Page page, OffsetNumber offnum)
+{
+ IndexTuple itup;
+ ItemId itemid;
+ int low,
+ high,
+ mid,
+ res;
+
+ /*
+ * If this isn't a posting tuple, then the index must be corrupt (if it is
+ * an ordinary non-pivot tuple then there must be an existing tuple with a
+ * heap TID that equals inserter's new heap TID/scantid). Defensively
+ * check that tuple is a posting list tuple whose posting list range
+ * includes caller's scantid.
+ *
+ * (This is also needed because contrib/amcheck's rootdescend option needs
+ * to be able to relocate a non-pivot tuple using _bt_binsrch_insert().)
+ */
+ Assert(P_ISLEAF((BTPageOpaque) PageGetSpecialPointer(page)));
+ Assert(!key->nextkey);
+ Assert(key->scantid != NULL);
+ itemid = PageGetItemId(page, offnum);
+ itup = (IndexTuple) PageGetItem(page, itemid);
+ if (!BTreeTupleIsPosting(itup))
+ return 0;
+
+ /*
+ * In the unlikely event that posting list tuple has LP_DEAD bit set,
+ * signal to caller that it should kill the item and restart its binary
+ * search.
+ */
+ if (ItemIdIsDead(itemid))
+ return -1;
+
+ /* "high" is past end of posting list for loop invariant */
+ low = 0;
+ high = BTreeTupleGetNPosting(itup);
+ Assert(high >= 2);
+
+ while (high > low)
+ {
+ mid = low + ((high - low) / 2);
+ res = ItemPointerCompare(key->scantid,
+ BTreeTupleGetPostingN(itup, mid));
+
+ if (res >= 1)
+ low = mid + 1;
+ else
+ high = mid;
+ }
+
+ return low;
+}
+
+/*----------
* _bt_compare() -- Compare insertion-type scankey to tuple on a page.
*
* page/offnum: location of btree item to be compared to.
@@ -537,9 +622,18 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
* <0 if scankey < tuple at offnum;
* 0 if scankey == tuple at offnum;
* >0 if scankey > tuple at offnum.
- * NULLs in the keys are treated as sortable values. Therefore
- * "equality" does not necessarily mean that the item should be
- * returned to the caller as a matching key!
+ *
+ * NULLs in the keys are treated as sortable values. Therefore
+ * "equality" does not necessarily mean that the item should be returned
+ * to the caller as a matching key. Similarly, an insertion scankey
+ * with its scantid set is treated as equal to a posting tuple whose TID
+ * range overlaps with their scantid. There generally won't be a
+ * matching TID in the posting tuple, which caller must handle
+ * themselves (e.g., by splitting the posting list tuple).
+ *
+ * It is generally guaranteed that any possible scankey with scantid set
+ * will have zero or one tuples in the index that are considered equal
+ * here.
*
* CRUCIAL NOTE: on a non-leaf page, the first data key is assumed to be
* "minus infinity": this routine will always claim it is less than the
@@ -563,6 +657,7 @@ _bt_compare(Relation rel,
ScanKey scankey;
int ncmpkey;
int ntupatts;
+ int32 result;
Assert(_bt_check_natts(rel, key->heapkeyspace, page, offnum));
Assert(key->keysz <= IndexRelationGetNumberOfKeyAttributes(rel));
@@ -597,7 +692,6 @@ _bt_compare(Relation rel,
{
Datum datum;
bool isNull;
- int32 result;
datum = index_getattr(itup, scankey->sk_attno, itupdesc, &isNull);
@@ -713,8 +807,24 @@ _bt_compare(Relation rel,
if (heapTid == NULL)
return 1;
+ /*
+ * scankey must be treated as equal to a posting list tuple if its scantid
+ * value falls within the range of the posting list. In all other cases
+ * there can only be a single heap TID value, which is compared directly
+ * as a simple scalar value.
+ */
Assert(ntupatts >= IndexRelationGetNumberOfKeyAttributes(rel));
- return ItemPointerCompare(key->scantid, heapTid);
+ result = ItemPointerCompare(key->scantid, heapTid);
+ if (!BTreeTupleIsPosting(itup) || result <= 0)
+ return result;
+ else
+ {
+ result = ItemPointerCompare(key->scantid, BTreeTupleGetMaxTID(itup));
+ if (result > 0)
+ return 1;
+ }
+
+ return 0;
}
/*
@@ -1451,6 +1561,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
/* initialize tuple workspace to empty */
so->currPos.nextTupleOffset = 0;
+ so->currPos.postingTupleOffset = 0;
/*
* Now that the current page has been made consistent, the macro should be
@@ -1485,8 +1596,30 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
if (_bt_checkkeys(scan, itup, indnatts, dir, &continuescan))
{
/* tuple passes all scan key conditions, so remember it */
- _bt_saveitem(so, itemIndex, offnum, itup);
- itemIndex++;
+ if (!BTreeTupleIsPosting(itup))
+ {
+ _bt_saveitem(so, itemIndex, offnum, itup);
+ itemIndex++;
+ }
+ else
+ {
+ /*
+ * Set up state to return posting list, and save first
+ * "logical" tuple
+ */
+ _bt_setuppostingitems(so, itemIndex, offnum,
+ BTreeTupleGetPostingN(itup, 0),
+ itup);
+ itemIndex++;
+ /* Save additional posting list "logical" tuples */
+ for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
+ {
+ _bt_savepostingitem(so, itemIndex, offnum,
+ BTreeTupleGetPostingN(itup, i),
+ itup);
+ itemIndex++;
+ }
+ }
}
/* When !continuescan, there can't be any more matches, so stop */
if (!continuescan)
@@ -1519,7 +1652,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
if (!continuescan)
so->currPos.moreRight = false;
- Assert(itemIndex <= MaxIndexTuplesPerPage);
+ Assert(itemIndex <= MaxPostingIndexTuplesPerPage);
so->currPos.firstItem = 0;
so->currPos.lastItem = itemIndex - 1;
so->currPos.itemIndex = 0;
@@ -1527,7 +1660,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
else
{
/* load items[] in descending order */
- itemIndex = MaxIndexTuplesPerPage;
+ itemIndex = MaxPostingIndexTuplesPerPage;
offnum = Min(offnum, maxoff);
@@ -1569,8 +1702,37 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
if (passes_quals && tuple_alive)
{
/* tuple passes all scan key conditions, so remember it */
- itemIndex--;
- _bt_saveitem(so, itemIndex, offnum, itup);
+ if (!BTreeTupleIsPosting(itup))
+ {
+ itemIndex--;
+ _bt_saveitem(so, itemIndex, offnum, itup);
+ }
+ else
+ {
+ int i = BTreeTupleGetNPosting(itup) - 1;
+
+ /*
+ * Set up state to return posting list, and save last
+ * "logical" tuple from posting list (since it's the first
+ * that will be returned to scan).
+ */
+ itemIndex--;
+ _bt_setuppostingitems(so, itemIndex, offnum,
+ BTreeTupleGetPostingN(itup, i--),
+ itup);
+
+ /*
+ * Return posting list "logical" tuples -- do this in
+ * descending order, to match overall scan order
+ */
+ for (; i >= 0; i--)
+ {
+ itemIndex--;
+ _bt_savepostingitem(so, itemIndex, offnum,
+ BTreeTupleGetPostingN(itup, i),
+ itup);
+ }
+ }
}
if (!continuescan)
{
@@ -1584,8 +1746,8 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
Assert(itemIndex >= 0);
so->currPos.firstItem = itemIndex;
- so->currPos.lastItem = MaxIndexTuplesPerPage - 1;
- so->currPos.itemIndex = MaxIndexTuplesPerPage - 1;
+ so->currPos.lastItem = MaxPostingIndexTuplesPerPage - 1;
+ so->currPos.itemIndex = MaxPostingIndexTuplesPerPage - 1;
}
return (so->currPos.firstItem <= so->currPos.lastItem);
@@ -1598,6 +1760,8 @@ _bt_saveitem(BTScanOpaque so, int itemIndex,
{
BTScanPosItem *currItem = &so->currPos.items[itemIndex];
+ Assert(!BTreeTupleIsPosting(itup));
+
currItem->heapTid = itup->t_tid;
currItem->indexOffset = offnum;
if (so->currTuples)
@@ -1611,6 +1775,61 @@ _bt_saveitem(BTScanOpaque so, int itemIndex,
}
/*
+ * Set up state to save posting items from a single posting list tuple. Saves
+ * the logical tuple that will be returned to scan first in passing.
+ *
+ * Saves an index item into so->currPos.items[itemIndex] for logical tuple
+ * that is returned to scan first. Second or subsequent heap TID for posting
+ * list should be saved by calling _bt_savepostingitem().
+ */
+static void
+_bt_setuppostingitems(BTScanOpaque so, int itemIndex, OffsetNumber offnum,
+ ItemPointer iptr, IndexTuple itup)
+{
+ BTScanPosItem *currItem = &so->currPos.items[itemIndex];
+
+ currItem->heapTid = *iptr;
+ currItem->indexOffset = offnum;
+
+ if (so->currTuples)
+ {
+ /* Save a truncated version of the IndexTuple */
+ Size itupsz = BTreeTupleGetPostingOffset(itup);
+
+ itupsz = MAXALIGN(itupsz);
+ currItem->tupleOffset = so->currPos.nextTupleOffset;
+ memcpy(so->currTuples + so->currPos.nextTupleOffset, itup, itupsz);
+ so->currPos.nextTupleOffset += itupsz;
+ so->currPos.postingTupleOffset = currItem->tupleOffset;
+ }
+}
+
+/*
+ * Save an index item into so->currPos.items[itemIndex] for posting tuple.
+ *
+ * Assumes that _bt_setuppostingitems() has already been called for current
+ * posting list tuple.
+ */
+static inline void
+_bt_savepostingitem(BTScanOpaque so, int itemIndex, OffsetNumber offnum,
+ ItemPointer iptr, IndexTuple itup)
+{
+ BTScanPosItem *currItem = &so->currPos.items[itemIndex];
+
+ currItem->heapTid = *iptr;
+ currItem->indexOffset = offnum;
+
+ if (so->currTuples)
+ {
+ /*
+ * Have index-only scans return the same truncated IndexTuple for
+ * every logical tuple that originates from the same posting list
+ */
+ currItem->tupleOffset = so->currPos.postingTupleOffset;
+ }
+}
+
+/*
* _bt_steppage() -- Step to next page containing valid data for scan
*
* On entry, if so->currPos.buf is valid the buffer is pinned but not locked;
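To make the interaction between the _bt_compare() change and _bt_binsrch_posting() easier to follow, here is a standalone sketch of both steps on a plain sorted array of TIDs. It is not the patch's code: SketchTid and the sketch_ functions are hypothetical stand-ins for ItemPointerData, ItemPointerCompare(), the tail of _bt_compare(), and the posting list search.

```c
#include <assert.h>

/* Hypothetical stand-in for a heap TID. */
typedef struct SketchTid
{
	unsigned	block;
	unsigned short offset;
} SketchTid;

/* Order TIDs by block number, then by offset; returns -1, 0 or 1. */
static int
sketch_tid_compare(const SketchTid *a, const SketchTid *b)
{
	if (a->block != b->block)
		return a->block < b->block ? -1 : 1;
	if (a->offset != b->offset)
		return a->offset < b->offset ? -1 : 1;
	return 0;
}

/*
 * Compare scantid with a tuple's TIDs.  A tuple with one TID is compared
 * directly.  A posting list is considered "equal" whenever scantid falls
 * inside the range [first TID, last TID], even if no TID matches exactly.
 */
static int
sketch_compare_to_posting(const SketchTid *scantid,
						  const SketchTid *posting, int nposting)
{
	int			result;

	assert(nposting >= 1);

	result = sketch_tid_compare(scantid, &posting[0]);
	if (nposting == 1 || result <= 0)
		return result;

	if (sketch_tid_compare(scantid, &posting[nposting - 1]) > 0)
		return 1;

	return 0;
}

/*
 * Lower-bound binary search: offset into the posting list where scantid
 * belongs.  "high" starts one past the end to keep the loop invariant.
 */
static int
sketch_binsrch_posting(const SketchTid *scantid,
					   const SketchTid *posting, int nposting)
{
	int			low = 0;
	int			high = nposting;

	while (high > low)
	{
		int			mid = low + ((high - low) / 2);

		if (sketch_tid_compare(scantid, &posting[mid]) >= 1)
			low = mid + 1;
		else
			high = mid;
	}

	return low;
}
```

For a posting list holding (1,1), (1,5) and (2,3), a scantid of (1,3) compares as equal to the tuple and the search returns offset 1, which is the position a caller would use when splitting the posting list.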
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index ab19692..4198770 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -288,6 +288,8 @@ static void _bt_sortaddtup(Page page, Size itemsize,
static void _bt_buildadd(BTWriteState *wstate, BTPageState *state,
IndexTuple itup);
static void _bt_uppershutdown(BTWriteState *wstate, BTPageState *state);
+static void _bt_buildadd_posting(BTWriteState *wstate, BTPageState *state,
+ BTDedupState *dedupState);
static void _bt_load(BTWriteState *wstate,
BTSpool *btspool, BTSpool *btspool2);
static void _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent,
@@ -830,6 +832,8 @@ _bt_sortaddtup(Page page,
* the high key is to be truncated, offset 1 is deleted, and we insert
* the truncated high key at offset 1.
*
+ * Note that itup may be a posting list tuple.
+ *
* 'last' pointer indicates the last offset added to the page.
*----------
*/
@@ -963,6 +967,11 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
* Overwrite the old item with new truncated high key directly.
* oitup is already located at the physical beginning of tuple
* space, so this should directly reuse the existing tuple space.
+ *
+ * If lastleft tuple was a posting tuple, we'll truncate its
+ * posting list in _bt_truncate as well. Note that it is also
+ * applicable only to leaf pages, since internal pages never
+ * contain posting tuples.
*/
ii = PageGetItemId(opage, OffsetNumberPrev(last_off));
lastleft = (IndexTuple) PageGetItem(opage, ii);
@@ -1002,6 +1011,7 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
* the minimum key for the new page.
*/
state->btps_minkey = CopyIndexTuple(oitup);
+ Assert(BTreeTupleIsPivot(state->btps_minkey));
/*
* Set the sibling links for both pages.
@@ -1043,6 +1053,7 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
Assert(state->btps_minkey == NULL);
state->btps_minkey = CopyIndexTuple(itup);
/* _bt_sortaddtup() will perform full truncation later */
+ BTreeTupleClearBtIsPosting(state->btps_minkey);
BTreeTupleSetNAtts(state->btps_minkey, 0);
}
@@ -1128,6 +1139,136 @@ _bt_uppershutdown(BTWriteState *wstate, BTPageState *state)
}
/*
+ * Add new tuple (posting or non-posting) to the page while building index.
+ */
+static void
+_bt_buildadd_posting(BTWriteState *wstate, BTPageState *state,
+ BTDedupState *dedupState)
+{
+ IndexTuple to_insert;
+
+ /* Return, if there is no tuple to insert */
+ if (state == NULL)
+ return;
+
+ if (dedupState->ntuples == 0)
+ to_insert = dedupState->itupprev;
+ else
+ {
+ IndexTuple postingtuple;
+
+ /* form a tuple with a posting list */
+ postingtuple = BTreeFormPostingTuple(dedupState->itupprev,
+ dedupState->ipd,
+ dedupState->ntuples);
+ to_insert = postingtuple;
+ pfree(dedupState->ipd);
+ }
+
+ _bt_buildadd(wstate, state, to_insert);
+
+ if (dedupState->ntuples > 0)
+ pfree(to_insert);
+ dedupState->ntuples = 0;
+}
+
+/*
+ * Save item pointer(s) of itup to the posting list in dedupState.
+ *
+ * 'itup' is current tuple on page, which comes immediately after equal
+ * 'itupprev' tuple stashed in dedup state at the point we're called.
+ *
+ * Helper function for _bt_load() and _bt_dedup_one_page(), called when it
+ * becomes clear that pending itupprev item will be part of a new/pending
+ * posting list, or when a pending/new posting list will contain a new heap
+ * TID from itup.
+ *
+ * Note: caller is responsible for the BTMaxItemSize() check.
+ */
+void
+_bt_stash_item_tid(BTDedupState *dedupState, IndexTuple itup,
+ OffsetNumber itup_offnum)
+{
+ int nposting = 0;
+
+ if (dedupState->ntuples == 0)
+ {
+ dedupState->ipd = palloc0(dedupState->maxitemsize);
+
+ /*
+ * itupprev hasn't had its posting list TIDs copied into ipd yet (must
+ * have been first on page and/or in new posting list?). Do so now.
+ *
+ * This is delayed because it wasn't initially clear whether or not
+ * itupprev would be merged with the next tuple, or stay as-is. By
+ * now caller compared it against itup and found that it was equal, so
+ * we can go ahead and add its TIDs.
+ */
+ if (!BTreeTupleIsPosting(dedupState->itupprev))
+ {
+ memcpy(dedupState->ipd, dedupState->itupprev,
+ sizeof(ItemPointerData));
+ dedupState->ntuples++;
+ }
+ else
+ {
+ /* if itupprev is posting, add all its TIDs to the posting list */
+ nposting = BTreeTupleGetNPosting(dedupState->itupprev);
+ memcpy(dedupState->ipd,
+ BTreeTupleGetPosting(dedupState->itupprev),
+ sizeof(ItemPointerData) * nposting);
+ dedupState->ntuples += nposting;
+ }
+
+ /* Save info about deduplicated items for future xlog record */
+ dedupState->n_intervals++;
+ /* Save offnum of the first item belonging to the group */
+ dedupState->dedup_intervals[dedupState->n_intervals - 1].from = dedupState->itupprev_off;
+ /*
+ * Update the number of deduplicated items, belonging to this group.
+ * Count each item just once, no matter if it was posting tuple or not
+ */
+ dedupState->dedup_intervals[dedupState->n_intervals - 1].ntups++;
+ }
+
+ /*
+ * Add current tup to ipd for pending posting list for new version of
+ * page.
+ */
+ if (!BTreeTupleIsPosting(itup))
+ {
+ memcpy(dedupState->ipd + dedupState->ntuples, itup,
+ sizeof(ItemPointerData));
+ dedupState->ntuples++;
+ }
+ else
+ {
+ /*
+ * if tuple is posting, add all its TIDs to the pending list that will
+ * become new posting list later on
+ */
+ nposting = BTreeTupleGetNPosting(itup);
+ memcpy(dedupState->ipd + dedupState->ntuples,
+ BTreeTupleGetPosting(itup),
+ sizeof(ItemPointerData) * nposting);
+ dedupState->ntuples += nposting;
+ }
+
+ /*
+ * Update the number of deduplicated items, belonging to this group.
+ * Count each item just once, no matter if it was posting tuple or not
+ */
+ dedupState->dedup_intervals[dedupState->n_intervals - 1].ntups++;
+
+ /* TODO just a debug message. delete it in final version of the patch */
+ if (itup_offnum != InvalidOffsetNumber)
+ elog(DEBUG4, "_bt_stash_item_tid. N %d : from %u ntups %u",
+ dedupState->n_intervals,
+ dedupState->dedup_intervals[dedupState->n_intervals - 1].from,
+ dedupState->dedup_intervals[dedupState->n_intervals - 1].ntups);
+}
+
+/*
* Read tuples in correct sort order from tuplesort, and load them into
* btree leaves.
*/
@@ -1141,9 +1282,20 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
bool load1;
TupleDesc tupdes = RelationGetDescr(wstate->index);
int i,
- keysz = IndexRelationGetNumberOfKeyAttributes(wstate->index);
+ keysz = IndexRelationGetNumberOfKeyAttributes(wstate->index),
+ natts = IndexRelationGetNumberOfAttributes(wstate->index);
SortSupport sortKeys;
int64 tuples_done = 0;
+ bool deduplicate = false;
+ BTDedupState *dedupState = NULL;
+
+ /*
+ * Don't use deduplication for indexes with INCLUDEd columns and unique
+ * indexes
+ */
+ deduplicate = (IndexRelationGetNumberOfKeyAttributes(wstate->index) ==
+ IndexRelationGetNumberOfAttributes(wstate->index) &&
+ !wstate->index->rd_index->indisunique);
if (merge)
{
@@ -1257,19 +1409,88 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
}
else
{
- /* merge is unnecessary */
- while ((itup = tuplesort_getindextuple(btspool->sortstate,
- true)) != NULL)
+ if (!deduplicate)
{
- /* When we see first tuple, create first index page */
- if (state == NULL)
- state = _bt_pagestate(wstate, 0);
+ /* merge is unnecessary */
+ while ((itup = tuplesort_getindextuple(btspool->sortstate,
+ true)) != NULL)
+ {
+ /* When we see first tuple, create first index page */
+ if (state == NULL)
+ state = _bt_pagestate(wstate, 0);
- _bt_buildadd(wstate, state, itup);
+ _bt_buildadd(wstate, state, itup);
- /* Report progress */
- pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
- ++tuples_done);
+ /* Report progress */
+ pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
+ ++tuples_done);
+ }
+ }
+ else
+ {
+ /* init deduplication state needed to build posting tuples */
+ dedupState = (BTDedupState *) palloc0(sizeof(BTDedupState));
+ dedupState->ipd = NULL;
+ dedupState->ntuples = 0;
+ dedupState->itupprev = NULL;
+ dedupState->maxitemsize = 0;
+ dedupState->maxpostingsize = 0;
+
+ while ((itup = tuplesort_getindextuple(btspool->sortstate,
+ true)) != NULL)
+ {
+ /* When we see first tuple, create first index page */
+ if (state == NULL)
+ {
+ state = _bt_pagestate(wstate, 0);
+ dedupState->maxitemsize = BTMaxItemSize(state->btps_page);
+ }
+
+ if (dedupState->itupprev != NULL)
+ {
+ int n_equal_atts = _bt_keep_natts_fast(wstate->index,
+ dedupState->itupprev, itup);
+
+ if (n_equal_atts > natts)
+ {
+ /*
+ * Tuples are equal. Create or update posting.
+ *
+ * If the posting list would become too big, insert it on
+ * the page and continue.
+ */
+ if ((dedupState->ntuples + 1) * sizeof(ItemPointerData) <
+ dedupState->maxpostingsize)
+ _bt_stash_item_tid(dedupState, itup, InvalidOffsetNumber);
+ else
+ _bt_buildadd_posting(wstate, state, dedupState);
+ }
+ else
+ {
+ /*
+ * Tuples are not equal. Insert itupprev into index.
+ * Save current tuple for the next iteration.
+ */
+ _bt_buildadd_posting(wstate, state, dedupState);
+ }
+ }
+
+ /*
+ * Save the tuple to compare it with the next one and maybe
+ * unite them into a posting tuple.
+ */
+ if (dedupState->itupprev)
+ pfree(dedupState->itupprev);
+ dedupState->itupprev = CopyIndexTuple(itup);
+
+ /* compute max size of posting list */
+ dedupState->maxpostingsize = dedupState->maxitemsize -
+ IndexInfoFindDataOffset(dedupState->itupprev->t_info) -
+ MAXALIGN(IndexTupleSize(dedupState->itupprev));
+ }
+
+ /* Handle the last item */
+ _bt_buildadd_posting(wstate, state, dedupState);
}
}
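
Reviewer's note on the _bt_load() loop above: the following is a minimal standalone sketch of the grouping logic, not the patch's code. It assumes integer keys in place of index tuples and a plain cap on the number of TIDs per posting list in place of the maxpostingsize byte check; the name group_sorted_keys() and its parameters are hypothetical. A pending run is flushed either when the key changes or when the cap is reached, which is roughly what the calls to _bt_buildadd_posting() do.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical model of the dedup loop in _bt_load().
 *
 * 'keys' must be sorted.  Each output run corresponds to one tuple that
 * would be added to the page: a plain tuple when the run length is 1, a
 * posting tuple otherwise.  'max_tids' (>= 1) stands in for the size limit
 * on a posting list.  Returns the number of runs written to 'run_lengths',
 * which must have room for 'nkeys' entries.
 */
static size_t
group_sorted_keys(const int *keys, size_t nkeys, size_t max_tids,
                  size_t *run_lengths)
{
    size_t nruns = 0;
    size_t pending = 0;     /* TIDs in the pending posting list */

    for (size_t i = 0; i < nkeys; i++)
    {
        /* Equal to the previous key and still room: extend pending run */
        if (pending > 0 && keys[i] == keys[i - 1] && pending < max_tids)
        {
            pending++;
            continue;
        }

        /* Key changed or run is full: flush, then start a new run */
        if (pending > 0)
            run_lengths[nruns++] = pending;
        pending = 1;
    }

    /* Handle the last run */
    if (pending > 0)
        run_lengths[nruns++] = pending;

    return nruns;
}
```

With keys {1, 1, 1, 2, 3, 3} and a cap of two TIDs, the third duplicate of key 1 starts a run of its own, because the pending run is already full when it arrives.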
diff --git a/src/backend/access/nbtree/nbtsplitloc.c b/src/backend/access/nbtree/nbtsplitloc.c
index 1c1029b..54cecc8 100644
--- a/src/backend/access/nbtree/nbtsplitloc.c
+++ b/src/backend/access/nbtree/nbtsplitloc.c
@@ -183,6 +183,9 @@ _bt_findsplitloc(Relation rel,
state.minfirstrightsz = SIZE_MAX;
state.newitemoff = newitemoff;
+ /* newitem cannot be a posting list item */
+ Assert(!BTreeTupleIsPosting(newitem));
+
/*
* maxsplits should never exceed maxoff because there will be at most as
* many candidate split points as there are points _between_ tuples, once
@@ -459,17 +462,52 @@ _bt_recsplitloc(FindSplitData *state,
int16 leftfree,
rightfree;
Size firstrightitemsz;
+ Size postingsubhikey = 0;
bool newitemisfirstonright;
/* Is the new item going to be the first item on the right page? */
newitemisfirstonright = (firstoldonright == state->newitemoff
&& !newitemonleft);
+ /*
+ * FIXME: Accessing every single tuple like this adds cycles to cases that
+ * cannot possibly benefit (i.e. cases where we know that there cannot be
+ * posting lists). Maybe we should add a way to not bother when we are
+ * certain that this is the case.
+ *
+ * We could either have _bt_split() pass us a flag, or invent a page flag
+ * that indicates that the page might have posting lists, as an
+ * optimization. There is no shortage of btpo_flags bits for stuff like
+ * this.
+ */
if (newitemisfirstonright)
+ {
firstrightitemsz = state->newitemsz;
+
+ /* Calculate posting list overhead, if any */
+ if (state->is_leaf && BTreeTupleIsPosting(state->newitem))
+ postingsubhikey = IndexTupleSize(state->newitem) -
+ BTreeTupleGetPostingOffset(state->newitem);
+ }
else
+ {
firstrightitemsz = firstoldonrightsz;
+ /* Calculate posting list overhead, if any */
+ if (state->is_leaf)
+ {
+ ItemId itemid;
+ IndexTuple newhighkey;
+
+ itemid = PageGetItemId(state->page, firstoldonright);
+ newhighkey = (IndexTuple) PageGetItem(state->page, itemid);
+
+ if (BTreeTupleIsPosting(newhighkey))
+ postingsubhikey = IndexTupleSize(newhighkey) -
+ BTreeTupleGetPostingOffset(newhighkey);
+ }
+ }
+
/* Account for all the old tuples */
leftfree = state->leftspace - olddataitemstoleft;
rightfree = state->rightspace -
@@ -492,9 +530,13 @@ _bt_recsplitloc(FindSplitData *state,
* adding a heap TID to the left half's new high key when splitting at the
* leaf level. In practice the new high key will often be smaller and
* will rarely be larger, but conservatively assume the worst case.
+ * Truncation always truncates away any posting list that appears in the
+ * first right tuple, though, so it's safe to subtract that overhead
+ * (while still conservatively assuming that truncation might have to add
+ * back a single heap TID using the pivot tuple heap TID representation).
*/
if (state->is_leaf)
- leftfree -= (int16) (firstrightitemsz +
+ leftfree -= (int16) ((firstrightitemsz - postingsubhikey) +
MAXALIGN(sizeof(ItemPointerData)));
else
leftfree -= (int16) firstrightitemsz;
@@ -691,7 +733,8 @@ _bt_afternewitemoff(FindSplitData *state, OffsetNumber maxoff,
itemid = PageGetItemId(state->page, OffsetNumberPrev(state->newitemoff));
tup = (IndexTuple) PageGetItem(state->page, itemid);
/* Do cheaper test first */
- if (!_bt_adjacenthtid(&tup->t_tid, &state->newitem->t_tid))
+ if (BTreeTupleIsPosting(tup) ||
+ !_bt_adjacenthtid(&tup->t_tid, &state->newitem->t_tid))
return false;
/* Check same conditions as rightmost item case, too */
keepnatts = _bt_keep_natts_fast(state->rel, tup, state->newitem);
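
Reviewer's note on the _bt_recsplitloc() change above: the arithmetic can be checked with a small standalone sketch. It assumes MAXALIGN rounds up to 8 bytes (true on common 64-bit platforms, but platform-dependent) and a 6-byte ItemPointerData; the function name leaf_leftfree() and the SKETCH_ macros are hypothetical and only mirror the expression in the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: 8-byte maximum alignment, 6-byte heap TID */
#define SKETCH_MAXALIGN(len) (((size_t) (len) + 7) & ~(size_t) 7)
#define SKETCH_TID_SIZE 6

/*
 * Free space left on the left page of a leaf split once the new high key
 * is accounted for.  The posting list of the first right tuple never
 * survives truncation, so its size (postingsubhikey) is subtracted from
 * the worst-case high key size; one aligned heap TID is still added back.
 */
static long
leaf_leftfree(long leftfree, size_t firstrightitemsz, size_t postingsubhikey)
{
    return leftfree - (long) ((firstrightitemsz - postingsubhikey) +
                              SKETCH_MAXALIGN(SKETCH_TID_SIZE));
}
```

For a 96-byte first right tuple that carries a 72-byte posting list, the high key is charged 24 + 8 = 32 bytes instead of 96 + 8 = 104 bytes.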
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 4c7b2d0..e3d7f4f 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -97,8 +97,6 @@ _bt_mkscankey(Relation rel, IndexTuple itup)
indoption = rel->rd_indoption;
tupnatts = itup ? BTreeTupleGetNAtts(itup, rel) : 0;
- Assert(tupnatts <= IndexRelationGetNumberOfAttributes(rel));
-
/*
* We'll execute search using scan key constructed on key columns.
* Truncated attributes and non-key attributes are omitted from the final
@@ -110,9 +108,20 @@ _bt_mkscankey(Relation rel, IndexTuple itup)
key->anynullkeys = false; /* initial assumption */
key->nextkey = false;
key->pivotsearch = false;
+ key->scantid = NULL;
key->keysz = Min(indnkeyatts, tupnatts);
- key->scantid = key->heapkeyspace && itup ?
- BTreeTupleGetHeapTID(itup) : NULL;
+
+ Assert(tupnatts <= IndexRelationGetNumberOfAttributes(rel));
+ Assert(!itup || !BTreeTupleIsPosting(itup) || key->heapkeyspace);
+
+ /*
+ * When caller passes a tuple with a heap TID, use it to set scantid. Note
+ * that this handles posting list tuples by setting scantid to the lowest
+ * heap TID in the posting list.
+ */
+ if (itup && key->heapkeyspace)
+ key->scantid = BTreeTupleGetHeapTID(itup);
+
skey = key->scankeys;
for (i = 0; i < indnkeyatts; i++)
{
@@ -1786,10 +1795,35 @@ _bt_killitems(IndexScanDesc scan)
{
ItemId iid = PageGetItemId(page, offnum);
IndexTuple ituple = (IndexTuple) PageGetItem(page, iid);
+ bool killtuple = false;
- if (ItemPointerEquals(&ituple->t_tid, &kitem->heapTid))
+ if (BTreeTupleIsPosting(ituple))
{
- /* found the item */
+ int pi = i + 1;
+ int nposting = BTreeTupleGetNPosting(ituple);
+ int j;
+
+ for (j = 0; j < nposting; j++)
+ {
+ ItemPointer item = BTreeTupleGetPostingN(ituple, j);
+
+ if (!ItemPointerEquals(item, &kitem->heapTid))
+ break; /* out of posting list loop */
+
+ /* Read-ahead to later kitems */
+ if (pi < numKilled)
+ kitem = &so->currPos.items[so->killedItems[pi++]];
+ }
+
+ if (j == nposting)
+ killtuple = true;
+ }
+ else if (ItemPointerEquals(&ituple->t_tid, &kitem->heapTid))
+ killtuple = true;
+
+ if (killtuple)
+ {
+ /* found the item/all posting list items */
ItemIdMarkDead(iid);
killedsomething = true;
break; /* out of inner search loop */
@@ -2145,6 +2179,24 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
pivot = index_truncate_tuple(itupdesc, firstright, keepnatts);
+ if (BTreeTupleIsPosting(firstright))
+ {
+ BTreeTupleClearBtIsPosting(pivot);
+ BTreeTupleSetNAtts(pivot, keepnatts);
+ if (keepnatts == natts)
+ {
+ /*
+ * index_truncate_tuple() just returned a copy of the
+ * original, so make sure that the size of the new pivot tuple
+ * doesn't have posting list overhead
+ */
+ pivot->t_info &= ~INDEX_SIZE_MASK;
+ pivot->t_info |= MAXALIGN(BTreeTupleGetPostingOffset(firstright));
+ }
+ }
+
+ Assert(!BTreeTupleIsPosting(pivot));
+
/*
* If there is a distinguishing key attribute within new pivot tuple,
* there is no need to add an explicit heap TID attribute
@@ -2161,6 +2213,8 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
* attribute to the new pivot tuple.
*/
Assert(natts != nkeyatts);
+ Assert(!BTreeTupleIsPosting(lastleft) &&
+ !BTreeTupleIsPosting(firstright));
newsize = IndexTupleSize(pivot) + MAXALIGN(sizeof(ItemPointerData));
tidpivot = palloc0(newsize);
memcpy(tidpivot, pivot, IndexTupleSize(pivot));
@@ -2168,6 +2222,24 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
pfree(pivot);
pivot = tidpivot;
}
+ else if (BTreeTupleIsPosting(firstright))
+ {
+ /*
+ * No truncation was possible, since key attributes are all equal. We
+ * can always truncate away a posting list, though.
+ *
+ * It's necessary to add a heap TID attribute to the new pivot tuple.
+ */
+ newsize = MAXALIGN(BTreeTupleGetPostingOffset(firstright)) +
+ MAXALIGN(sizeof(ItemPointerData));
+ pivot = palloc0(newsize);
+ memcpy(pivot, firstright, BTreeTupleGetPostingOffset(firstright));
+
+ pivot->t_info &= ~INDEX_SIZE_MASK;
+ pivot->t_info |= newsize;
+ BTreeTupleClearBtIsPosting(pivot);
+ BTreeTupleSetAltHeapTID(pivot);
+ }
else
{
/*
@@ -2175,7 +2247,8 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
* It's necessary to add a heap TID attribute to the new pivot tuple.
*/
Assert(natts == nkeyatts);
- newsize = IndexTupleSize(firstright) + MAXALIGN(sizeof(ItemPointerData));
+ newsize = MAXALIGN(IndexTupleSize(firstright)) +
+ MAXALIGN(sizeof(ItemPointerData));
pivot = palloc0(newsize);
memcpy(pivot, firstright, IndexTupleSize(firstright));
}
@@ -2193,6 +2266,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
* nbtree (e.g., there is no pg_attribute entry).
*/
Assert(itup_key->heapkeyspace);
+ Assert(!BTreeTupleIsPosting(pivot));
pivot->t_info &= ~INDEX_SIZE_MASK;
pivot->t_info |= newsize;
@@ -2205,7 +2279,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
*/
pivotheaptid = (ItemPointer) ((char *) pivot + newsize -
sizeof(ItemPointerData));
- ItemPointerCopy(&lastleft->t_tid, pivotheaptid);
+ ItemPointerCopy(BTreeTupleGetMaxTID(lastleft), pivotheaptid);
/*
* Lehman and Yao require that the downlink to the right page, which is to
@@ -2216,9 +2290,12 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
* tiebreaker.
*/
#ifndef DEBUG_NO_TRUNCATE
- Assert(ItemPointerCompare(&lastleft->t_tid, &firstright->t_tid) < 0);
- Assert(ItemPointerCompare(pivotheaptid, &lastleft->t_tid) >= 0);
- Assert(ItemPointerCompare(pivotheaptid, &firstright->t_tid) < 0);
+ Assert(ItemPointerCompare(BTreeTupleGetMaxTID(lastleft),
+ BTreeTupleGetHeapTID(firstright)) < 0);
+ Assert(ItemPointerCompare(pivotheaptid,
+ BTreeTupleGetHeapTID(lastleft)) >= 0);
+ Assert(ItemPointerCompare(pivotheaptid,
+ BTreeTupleGetHeapTID(firstright)) < 0);
#else
/*
@@ -2231,7 +2308,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
* attribute values along with lastleft's heap TID value when lastleft's
* TID happens to be greater than firstright's TID.
*/
- ItemPointerCopy(&firstright->t_tid, pivotheaptid);
+ ItemPointerCopy(BTreeTupleGetHeapTID(firstright), pivotheaptid);
/*
* Pivot heap TID should never be fully equal to firstright. Note that
@@ -2240,7 +2317,8 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
*/
ItemPointerSetOffsetNumber(pivotheaptid,
OffsetNumberPrev(ItemPointerGetOffsetNumber(pivotheaptid)));
- Assert(ItemPointerCompare(pivotheaptid, &firstright->t_tid) < 0);
+ Assert(ItemPointerCompare(pivotheaptid,
+ BTreeTupleGetHeapTID(firstright)) < 0);
#endif
BTreeTupleSetNAtts(pivot, nkeyatts);
@@ -2321,15 +2399,25 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
* The approach taken here usually provides the same answer as _bt_keep_natts
* will (for the same pair of tuples from a heapkeyspace index), since the
* majority of btree opclasses can never indicate that two datums are equal
- * unless they're bitwise equal (once detoasted). Similarly, result may
- * differ from the _bt_keep_natts result when either tuple has TOASTed datums,
- * though this is barely possible in practice.
+ * unless they're bitwise equal after detoasting.
*
* These issues must be acceptable to callers, typically because they're only
* concerned about making suffix truncation as effective as possible without
* leaving excessive amounts of free space on either side of page split.
* Callers can rely on the fact that attributes considered equal here are
* definitely also equal according to _bt_keep_natts.
+ *
+ * When an index only uses opclasses where equality is "precise", this
+ * function is guaranteed to give the same result as _bt_keep_natts(). This
+ * makes it safe to use this function to determine whether or not two tuples
+ * can be folded together into a single posting tuple. Posting list
+ * deduplication cannot be used with nondeterministic collations for this
+ * reason.
+ *
+ * FIXME: Actually invent the needed "equality-is-precise" opclass
+ * infrastructure. See dedicated -hackers thread:
+ *
+ * https://postgr.es/m/CAH2-Wzn3Ee49Gmxb7V1VJ3-AC8fWn-Fr8pfWQebHe8rYRxt5OQ@mail.gmail.com
*/
int
_bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
@@ -2354,8 +2442,38 @@ _bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
if (isNull1 != isNull2)
break;
+ /*
+ * XXX: The ideal outcome from the point of view of the posting list
+ * patch is that the definition of an opclass with "precise equality"
+ * becomes: "equality operator function must give exactly the same
+ * answer as datum_image_eq() would, provided that we aren't using a
+ * nondeterministic collation". (Nondeterministic collations are
+ * clearly not compatible with deduplication.)
+ *
+ * This will be a lot faster than actually using the authoritative
+ * insertion scankey in some cases. This approach also seems more
+ * elegant, since suffix truncation gets to follow exactly the same
+ * definition of "equal" as posting list deduplication -- there is a
+ * subtle interplay between deduplication and suffix truncation, and
+ * it would be nice to know for sure that they have exactly the same
+ * idea about what equality is.
+ *
+ * This ideal outcome still avoids problems with TOAST. We cannot
+ * repeat bugs like the amcheck bug that was fixed in bugfix commit
+ * eba775345d23d2c999bbb412ae658b6dab36e3e8. datum_image_eq()
+ * considers binary equality, though only _after_ each datum is
+ * decompressed.
+ *
+ * If this ideal solution isn't possible, then we can fall back on
+ * defining "precise equality" as: "type's output function must
+ * produce identical textual output for any two datums that compare
+ * equal when using a safe/equality-is-precise operator class (unless
+ * using a nondeterministic collation)". That would mean that we'd
+ * have to make deduplication call _bt_keep_natts() instead (or some
+ * other function that uses authoritative insertion scankey).
+ */
if (!isNull1 &&
- !datumIsEqual(datum1, datum2, att->attbyval, att->attlen))
+ !datum_image_eq(datum1, datum2, att->attbyval, att->attlen))
break;
keepnatts++;
@@ -2407,22 +2525,30 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
tupnatts = BTreeTupleGetNAtts(itup, rel);
+ /* !heapkeyspace indexes do not support deduplication */
+ if (!heapkeyspace && BTreeTupleIsPosting(itup))
+ return false;
+
+ /* INCLUDE indexes do not support deduplication */
+ if (natts != nkeyatts && BTreeTupleIsPosting(itup))
+ return false;
+
if (P_ISLEAF(opaque))
{
if (offnum >= P_FIRSTDATAKEY(opaque))
{
/*
- * Non-pivot tuples currently never use alternative heap TID
- * representation -- even those within heapkeyspace indexes
+ * Non-pivot tuple should never be explicitly marked as a pivot
+ * tuple
*/
- if ((itup->t_info & INDEX_ALT_TID_MASK) != 0)
+ if (BTreeTupleIsPivot(itup))
return false;
/*
* Leaf tuples that are not the page high key (non-pivot tuples)
* should never be truncated. (Note that tupnatts must have been
- * inferred, rather than coming from an explicit on-disk
- * representation.)
+ * inferred, even with a posting list tuple, because only pivot
+ * tuples store tupnatts directly.)
*/
return tupnatts == natts;
}
@@ -2466,12 +2592,12 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
* non-zero, or when there is no explicit representation and the
* tuple is evidently not a pre-pg_upgrade tuple.
*
- * Prior to v11, downlinks always had P_HIKEY as their offset. Use
- * that to decide if the tuple is a pre-v11 tuple.
+ * Prior to v11, downlinks always had P_HIKEY as their offset.
+ * Accept that as an alternative indication of a valid
+ * !heapkeyspace negative infinity tuple.
*/
return tupnatts == 0 ||
- ((itup->t_info & INDEX_ALT_TID_MASK) == 0 &&
- ItemPointerGetOffsetNumber(&(itup->t_tid)) == P_HIKEY);
+ ItemPointerGetOffsetNumber(&(itup->t_tid)) == P_HIKEY;
}
else
{
@@ -2497,7 +2623,11 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
* heapkeyspace index pivot tuples, regardless of whether or not there are
* non-key attributes.
*/
- if ((itup->t_info & INDEX_ALT_TID_MASK) == 0)
+ if (!BTreeTupleIsPivot(itup))
+ return false;
+
+ /* Pivot tuple should not use posting list representation (redundant) */
+ if (BTreeTupleIsPosting(itup))
return false;
/*
@@ -2567,11 +2697,87 @@ _bt_check_third_page(Relation rel, Relation heap, bool needheaptidspace,
BTMaxItemSizeNoHeapTid(page),
RelationGetRelationName(rel)),
errdetail("Index row references tuple (%u,%u) in relation \"%s\".",
- ItemPointerGetBlockNumber(&newtup->t_tid),
- ItemPointerGetOffsetNumber(&newtup->t_tid),
+ ItemPointerGetBlockNumber(BTreeTupleGetHeapTID(newtup)),
+ ItemPointerGetOffsetNumber(BTreeTupleGetHeapTID(newtup)),
RelationGetRelationName(heap)),
errhint("Values larger than 1/3 of a buffer page cannot be indexed.\n"
"Consider a function index of an MD5 hash of the value, "
"or use full text indexing."),
errtableconstraint(heap, RelationGetRelationName(rel))));
}
+
+/*
+ * Given a basic tuple that contains key datum and posting list,
+ * build a posting tuple.
+ *
+ * Basic tuple can be a posting tuple, but we only use the key part of it;
+ * all ItemPointers must be passed via ipd.
+ *
+ * If nipd == 1, fall back to building a non-posting tuple.
+ * This avoids storage overhead after a posting tuple has been vacuumed.
+ */
+IndexTuple
+BTreeFormPostingTuple(IndexTuple tuple, ItemPointerData *ipd, int nipd)
+{
+ uint32 keysize,
+ newsize = 0;
+ IndexTuple itup;
+
+ /* We only need key part of the tuple */
+ if (BTreeTupleIsPosting(tuple))
+ keysize = BTreeTupleGetPostingOffset(tuple);
+ else
+ keysize = IndexTupleSize(tuple);
+
+ Assert(nipd > 0);
+
+ /* Add space needed for posting list */
+ if (nipd > 1)
+ newsize = SHORTALIGN(keysize) + sizeof(ItemPointerData) * nipd;
+ else
+ newsize = keysize;
+
+ newsize = MAXALIGN(newsize);
+ itup = palloc0(newsize);
+ memcpy(itup, tuple, keysize);
+ itup->t_info &= ~INDEX_SIZE_MASK;
+ itup->t_info |= newsize;
+
+ if (nipd > 1)
+ {
+ /* Form posting tuple, fill posting fields */
+
+ /* Set meta info about the posting list */
+ itup->t_info |= INDEX_ALT_TID_MASK;
+ BTreeSetPostingMeta(itup, nipd, SHORTALIGN(keysize));
+
+ /* sort the list to preserve TID order invariant */
+ qsort((void *) ipd, nipd, sizeof(ItemPointerData),
+ (int (*) (const void *, const void *)) ItemPointerCompare);
+
+ /* Copy posting list into the posting tuple */
+ memcpy(BTreeTupleGetPosting(itup), ipd,
+ sizeof(ItemPointerData) * nipd);
+ }
+ else
+ {
+ /* To finish building of a non-posting tuple, copy TID from ipd */
+ itup->t_info &= ~INDEX_ALT_TID_MASK;
+ ItemPointerCopy(ipd, &itup->t_tid);
+ }
+
+ return itup;
+}
+
+/*
+ * Opposite of BTreeFormPostingTuple.
+ * Returns a regular tuple that contains the key. The TID of the new tuple
+ * is the nth TID of the original tuple's posting list. The result tuple is
+ * palloc'd in the caller's context.
+ */
+IndexTuple
+BTreeGetNthTupleOfPosting(IndexTuple tuple, int n)
+{
+ Assert(BTreeTupleIsPosting(tuple));
+ return BTreeFormPostingTuple(tuple, BTreeTupleGetPostingN(tuple, n), 1);
+}
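
Reviewer's note on BTreeFormPostingTuple() above: the size computation is easy to get wrong by one alignment step, so here is a standalone sketch of just that part. It assumes 2-byte SHORTALIGN, 8-byte MAXALIGN and a 6-byte ItemPointerData; posting_tuple_size() and the SKETCH_ macros are hypothetical names, not part of the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Assumptions: 2-byte short alignment, 8-byte max alignment, 6-byte TID */
#define SKETCH_SHORTALIGN(len) (((size_t) (len) + 1) & ~(size_t) 1)
#define SKETCH_MAXALIGN(len)   (((size_t) (len) + 7) & ~(size_t) 7)
#define SKETCH_TID_SIZE 6

/*
 * Size of the tuple built from a key part of 'keysize' bytes and 'nipd'
 * heap TIDs.  With a single TID no posting list is stored, so the result
 * is just the aligned key size.
 */
static size_t
posting_tuple_size(size_t keysize, int nipd)
{
    size_t newsize;

    if (nipd > 1)
        newsize = SKETCH_SHORTALIGN(keysize) +
            SKETCH_TID_SIZE * (size_t) nipd;
    else
        newsize = keysize;

    return SKETCH_MAXALIGN(newsize);
}
```

Under these assumptions a 16-byte key part with three TIDs needs 16 + 18 = 34 bytes, rounded up to 40, whereas three separate 16-byte tuples would take 48 bytes of tuple space before line pointers are counted.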
diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index dd5315c..98ce964 100644
--- a/src/backend/access/nbtree/nbtxlog.c
+++ b/src/backend/access/nbtree/nbtxlog.c
@@ -181,9 +181,35 @@ btree_xlog_insert(bool isleaf, bool ismeta, XLogReaderState *record)
page = BufferGetPage(buffer);
- if (PageAddItem(page, (Item) datapos, datalen, xlrec->offnum,
- false, false) == InvalidOffsetNumber)
- elog(PANIC, "btree_xlog_insert: failed to add item");
+ if (xlrec->in_posting_offset != InvalidOffsetNumber)
+ {
+ /* oposting must be at offset before new item */
+ ItemId itemid = PageGetItemId(page, OffsetNumberPrev(xlrec->offnum));
+ IndexTuple oposting = (IndexTuple) PageGetItem(page, itemid);
+ IndexTuple newitem = (IndexTuple) datapos;
+ IndexTuple nposting;
+
+ nposting = _bt_form_newposting(newitem, oposting,
+ xlrec->in_posting_offset);
+ Assert(isleaf);
+
+ Assert(MAXALIGN(IndexTupleSize(oposting)) ==
+ MAXALIGN(IndexTupleSize(nposting)));
+
+ /* replace existing posting */
+ memcpy(oposting, nposting, MAXALIGN(IndexTupleSize(nposting)));
+
+ /* insert new item */
+ if (PageAddItem(page, (Item) newitem, MAXALIGN(IndexTupleSize(newitem)),
+ xlrec->offnum, false, false) == InvalidOffsetNumber)
+ elog(PANIC, "btree_xlog_insert: failed to add item");
+ }
+ else
+ {
+ if (PageAddItem(page, (Item) datapos, datalen, xlrec->offnum,
+ false, false) == InvalidOffsetNumber)
+ elog(PANIC, "btree_xlog_insert: failed to add item");
+ }
PageSetLSN(page, lsn);
MarkBufferDirty(buffer);
@@ -265,20 +291,45 @@ btree_xlog_split(bool onleft, XLogReaderState *record)
BTPageOpaque lopaque = (BTPageOpaque) PageGetSpecialPointer(lpage);
OffsetNumber off;
IndexTuple newitem = NULL,
- left_hikey = NULL;
+ left_hikey = NULL,
+ nposting = NULL;
Size newitemsz = 0,
left_hikeysz = 0;
Page newlpage;
- OffsetNumber leftoff;
+ OffsetNumber leftoff,
+ replacepostingoff = InvalidOffsetNumber;
datapos = XLogRecGetBlockData(record, 0, &datalen);
- if (onleft)
+ if (onleft || xlrec->in_posting_offset)
{
newitem = (IndexTuple) datapos;
newitemsz = MAXALIGN(IndexTupleSize(newitem));
datapos += newitemsz;
datalen -= newitemsz;
+
+ /*
+ * Repeat logic implemented in _bt_insertonpg():
+ *
+ * If the new tuple is a duplicate with a heap TID that falls
+ * inside the range of an existing posting list tuple,
+ * generate new posting tuple to replace original one
+ * and update new tuple so that its heap TID contains
+ * the rightmost heap TID of original posting tuple.
+ */
+ if (xlrec->in_posting_offset != 0)
+ {
+ ItemId itemid = PageGetItemId(lpage, OffsetNumberPrev(xlrec->newitemoff));
+ IndexTuple oposting = (IndexTuple) PageGetItem(lpage, itemid);
+
+ nposting = _bt_form_newposting(newitem, oposting,
+ xlrec->in_posting_offset);
+
+ /* Alter new item offset, since effective new item changed */
+ replacepostingoff = OffsetNumberPrev(xlrec->newitemoff);
+
+ Assert(BTreeTupleGetNPosting(nposting) == BTreeTupleGetNPosting(oposting));
+ }
}
/* Extract left hikey and its size (assuming 16-bit alignment) */
@@ -304,6 +355,15 @@ btree_xlog_split(bool onleft, XLogReaderState *record)
Size itemsz;
IndexTuple item;
+ if (off == replacepostingoff)
+ {
+ if (PageAddItem(newlpage, (Item) nposting, MAXALIGN(IndexTupleSize(nposting)),
+ leftoff, false, false) == InvalidOffsetNumber)
+ elog(ERROR, "failed to add new item to left page after split");
+ leftoff = OffsetNumberNext(leftoff);
+ continue;
+ }
+
/* add the new item if it was inserted on left page */
if (onleft && off == xlrec->newitemoff)
{
@@ -380,14 +440,146 @@ btree_xlog_split(bool onleft, XLogReaderState *record)
}
static void
+btree_xlog_dedup(XLogReaderState *record)
+{
+ XLogRecPtr lsn = record->EndRecPtr;
+ Buffer buf;
+ Page newpage;
+ xl_btree_dedup *xlrec = (xl_btree_dedup *) XLogRecGetData(record);
+
+ if (XLogReadBufferForRedo(record, 0, &buf) == BLK_NEEDS_REDO)
+ {
+ /*
+ * Initialize a temporary empty page and copy all the items
+ * to that in item number order.
+ */
+ Page page = (Page) BufferGetPage(buf);
+ BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+ BTPageOpaque nopaque;
+ OffsetNumber offnum, minoff, maxoff;
+ BTDedupState *dedupState = NULL;
+ char *data = ((char *) xlrec + SizeOfBtreeDedup);
+ dedupInterval dedup_intervals[MaxOffsetNumber];
+ int nth_interval = 0;
+ OffsetNumber n_dedup_tups = 0;
+
+ dedupState = (BTDedupState *) palloc0(sizeof(BTDedupState));
+ dedupState->ipd = NULL;
+ dedupState->ntuples = 0;
+ dedupState->itupprev = NULL;
+ dedupState->maxitemsize = BTMaxItemSize(page);
+ dedupState->maxpostingsize = 0;
+
+ memcpy(dedup_intervals, data,
+ xlrec->n_intervals*sizeof(dedupInterval));
+
+ /* Scan over all items to see which ones can be deduplicated */
+ minoff = P_FIRSTDATAKEY(opaque);
+ maxoff = PageGetMaxOffsetNumber(page);
+ newpage = PageGetTempPageCopySpecial(page);
+ nopaque = (BTPageOpaque) PageGetSpecialPointer(newpage);
+
+ /* Make sure that new page won't have garbage flag set */
+ nopaque->btpo_flags &= ~BTP_HAS_GARBAGE;
+
+ /* Copy High Key if any */
+ if (!P_RIGHTMOST(opaque))
+ {
+ ItemId itemid = PageGetItemId(page, P_HIKEY);
+ Size itemsz = ItemIdGetLength(itemid);
+ IndexTuple item = (IndexTuple) PageGetItem(page, itemid);
+
+ if (PageAddItem(newpage, (Item) item, itemsz, P_HIKEY,
+ false, false) == InvalidOffsetNumber)
+ elog(ERROR, "failed to add highkey during deduplication");
+ }
+
+ /*
+ * Iterate over tuples on the page to deduplicate them into posting
+ * lists and insert into new page
+ */
+ for (offnum = minoff;
+ offnum <= maxoff;
+ offnum = OffsetNumberNext(offnum))
+ {
+ ItemId itemId = PageGetItemId(page, offnum);
+ IndexTuple itup = (IndexTuple) PageGetItem(page, itemId);
+
+ elog(DEBUG4, "btree_xlog_dedup. offnum %u, n_intervals %u, from %u ntups %u",
+ offnum,
+ nth_interval,
+ dedup_intervals[nth_interval].from,
+ dedup_intervals[nth_interval].ntups);
+
+ if (dedupState->itupprev == NULL)
+ {
+ /* Just set up base/first item in first iteration */
+ Assert(offnum == minoff);
+ dedupState->itupprev = CopyIndexTuple(itup);
+ dedupState->itupprev_off = offnum;
+ continue;
+ }
+
+ /*
+ * Instead of comparing tuple's keys, which may be costly, use
+ * information from xlog record. If current tuple belongs to the
+ * group of deduplicated items, repeat logic of _bt_dedup_one_page
+ * and stash it to form a posting list afterwards.
+ */
+ if (dedupState->itupprev_off >= dedup_intervals[nth_interval].from
+ && n_dedup_tups < dedup_intervals[nth_interval].ntups)
+ {
+ _bt_stash_item_tid(dedupState, itup, InvalidOffsetNumber);
+
+ elog(DEBUG4, "btree_xlog_dedup. stash offnum %u, nth_interval %u, from %u ntups %u",
+ offnum,
+ nth_interval,
+ dedup_intervals[nth_interval].from,
+ dedup_intervals[nth_interval].ntups);
+
+ /* count first tuple in the group */
+ if (dedupState->itupprev_off == dedup_intervals[nth_interval].from)
+ n_dedup_tups++;
+
+ /* count added tuple */
+ n_dedup_tups++;
+ }
+ else
+ {
+ _bt_dedup_insert(newpage, dedupState);
+
+ /* reset state */
+ if (n_dedup_tups > 0)
+ nth_interval++;
+ n_dedup_tups = 0;
+ }
+
+ pfree(dedupState->itupprev);
+ dedupState->itupprev = CopyIndexTuple(itup);
+ dedupState->itupprev_off = offnum;
+ }
+
+ /* Handle the last item */
+ _bt_dedup_insert(newpage, dedupState);
+
+ PageRestoreTempPage(newpage, page);
+ PageSetLSN(page, lsn);
+ MarkBufferDirty(buf);
+ }
+
+ if (BufferIsValid(buf))
+ UnlockReleaseBuffer(buf);
+}
+
+static void
btree_xlog_vacuum(XLogReaderState *record)
{
XLogRecPtr lsn = record->EndRecPtr;
Buffer buffer;
Page page;
BTPageOpaque opaque;
-#ifdef UNUSED
xl_btree_vacuum *xlrec = (xl_btree_vacuum *) XLogRecGetData(record);
+#ifdef UNUSED
/*
* This section of code is thought to be no longer needed, after analysis
@@ -478,14 +670,34 @@ btree_xlog_vacuum(XLogReaderState *record)
if (len > 0)
{
- OffsetNumber *unused;
- OffsetNumber *unend;
+ if (xlrec->nremaining)
+ {
+ OffsetNumber *remainingoffset;
+ IndexTuple remaining;
+ Size itemsz;
+
+ remainingoffset = (OffsetNumber *)
+ (ptr + xlrec->ndeleted * sizeof(OffsetNumber));
+ remaining = (IndexTuple) ((char *) remainingoffset +
+ xlrec->nremaining * sizeof(OffsetNumber));
+
+ /* Handle posting tuples */
+ for (int i = 0; i < xlrec->nremaining; i++)
+ {
+ PageIndexTupleDelete(page, remainingoffset[i]);
- unused = (OffsetNumber *) ptr;
- unend = (OffsetNumber *) ((char *) ptr + len);
+ itemsz = MAXALIGN(IndexTupleSize(remaining));
- if ((unend - unused) > 0)
- PageIndexMultiDelete(page, unused, unend - unused);
+ if (PageAddItem(page, (Item) remaining, itemsz, remainingoffset[i],
+ false, false) == InvalidOffsetNumber)
+ elog(PANIC, "btree_xlog_vacuum: failed to add remaining item");
+
+ remaining = (IndexTuple) ((char *) remaining + itemsz);
+ }
+ }
+
+ if (xlrec->ndeleted)
+ PageIndexMultiDelete(page, (OffsetNumber *) ptr, xlrec->ndeleted);
}
/*
@@ -838,6 +1050,9 @@ btree_redo(XLogReaderState *record)
case XLOG_BTREE_SPLIT_R:
btree_xlog_split(false, record);
break;
+ case XLOG_BTREE_DEDUP_PAGE:
+ btree_xlog_dedup(record);
+ break;
case XLOG_BTREE_VACUUM:
btree_xlog_vacuum(record);
break;
diff --git a/src/backend/access/rmgrdesc/nbtdesc.c b/src/backend/access/rmgrdesc/nbtdesc.c
index a14eb79..802e27b 100644
--- a/src/backend/access/rmgrdesc/nbtdesc.c
+++ b/src/backend/access/rmgrdesc/nbtdesc.c
@@ -30,7 +30,8 @@ btree_desc(StringInfo buf, XLogReaderState *record)
{
xl_btree_insert *xlrec = (xl_btree_insert *) rec;
- appendStringInfo(buf, "off %u", xlrec->offnum);
+ appendStringInfo(buf, "off %u; in_posting_offset %u",
+ xlrec->offnum, xlrec->in_posting_offset);
break;
}
case XLOG_BTREE_SPLIT_L:
@@ -38,16 +39,27 @@ btree_desc(StringInfo buf, XLogReaderState *record)
{
xl_btree_split *xlrec = (xl_btree_split *) rec;
+ /* FIXME: even master doesn't have newitemoff */
appendStringInfo(buf, "level %u, firstright %d",
xlrec->level, xlrec->firstright);
break;
}
+ case XLOG_BTREE_DEDUP_PAGE:
+ {
+ xl_btree_dedup *xlrec = (xl_btree_dedup *) rec;
+
+ appendStringInfo(buf, "items were deduplicated to %d items",
+ xlrec->n_intervals);
+ break;
+ }
case XLOG_BTREE_VACUUM:
{
xl_btree_vacuum *xlrec = (xl_btree_vacuum *) rec;
- appendStringInfo(buf, "lastBlockVacuumed %u",
- xlrec->lastBlockVacuumed);
+ appendStringInfo(buf, "lastBlockVacuumed %u; nremaining %u; ndeleted %u",
+ xlrec->lastBlockVacuumed,
+ xlrec->nremaining,
+ xlrec->ndeleted);
break;
}
case XLOG_BTREE_DELETE:
@@ -131,6 +143,9 @@ btree_identify(uint8 info)
case XLOG_BTREE_SPLIT_R:
id = "SPLIT_R";
break;
+ case XLOG_BTREE_DEDUP_PAGE:
+ id = "DEDUPLICATE";
+ break;
case XLOG_BTREE_VACUUM:
id = "VACUUM";
break;
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 52eafe6..d1af18f 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -234,8 +234,7 @@ typedef struct BTMetaPageData
* t_tid | t_info | key values | INCLUDE columns, if any
*
* t_tid points to the heap TID, which is a tiebreaker key column as of
- * BTREE_VERSION 4. Currently, the INDEX_ALT_TID_MASK status bit is never
- * set for non-pivot tuples.
+ * BTREE_VERSION 4.
*
* All other types of index tuples ("pivot" tuples) only have key columns,
* since pivot tuples only exist to represent how the key space is
@@ -252,6 +251,38 @@ typedef struct BTMetaPageData
* omitted rather than truncated, since its representation is different to
* the non-pivot representation.)
*
+ * Non-pivot posting tuple format:
+ * t_tid | t_info | key values | INCLUDE columns, if any | posting_list[]
+ *
+ * In order to store duplicated keys more effectively, we use a special
+ * tuple format: posting tuples. posting_list is an array of ItemPointerData.
+ *
+ * Deduplication never applies to unique indexes or indexes with INCLUDEd
+ * columns.
+ *
+ * To distinguish posting tuples we use the INDEX_ALT_TID_MASK flag in t_info
+ * and the BT_IS_POSTING flag in t_tid.
+ * These flags redefine the content of the posting tuple's tid:
+ * - t_tid.ip_blkid contains the offset of the posting list.
+ * - t_tid offset field contains the number of posting items this tuple
+ * contains.
+ *
+ * The 12 least significant offset bits from t_tid are used to represent
+ * the number of posting items in posting tuples, leaving 4 status
+ * bits (BT_RESERVED_OFFSET_MASK bits), 3 of which are reserved for
+ * future use.
+ * BT_N_POSTING_OFFSET_MASK is large enough to store any number of posting
+ * items, which is constrained by BTMaxItemSize.
+ *
+ * If page contains so many duplicates, that they do not fit into one posting
+ * tuple (bounded by BTMaxItemSize and ), page may contain several posting
+ * tuples with the same key.
+ * Also page can contain both posting and non-posting tuples with the same key.
+ * Currently, posting tuples always contain at least two TIDs in the posting
+ * list.
+ *
+ * Posting tuples always have the same number of attributes as the index has
+ * generally.
+ *
* Pivot tuple format:
*
* t_tid | t_info | key values | [heap TID]
@@ -281,23 +312,145 @@ typedef struct BTMetaPageData
* bits (BT_RESERVED_OFFSET_MASK bits), 3 of which that are reserved for
* future use. BT_N_KEYS_OFFSET_MASK should be large enough to store any
* number of columns/attributes <= INDEX_MAX_KEYS.
+ * BT_IS_POSTING bit must be unset for pivot tuples, since we use it
+ * to distinguish posting tuples from pivot tuples.
*
* Note well: The macros that deal with the number of attributes in tuples
- * assume that a tuple with INDEX_ALT_TID_MASK set must be a pivot tuple,
- * and that a tuple without INDEX_ALT_TID_MASK set must be a non-pivot
- * tuple (or must have the same number of attributes as the index has
- * generally in the case of !heapkeyspace indexes). They will need to be
- * updated if non-pivot tuples ever get taught to use INDEX_ALT_TID_MASK
- * for something else.
+ * assume that a tuple with INDEX_ALT_TID_MASK set must be a pivot tuple or
+ * non-pivot posting tuple, and that a tuple without INDEX_ALT_TID_MASK set
+ * must be a non-pivot tuple (or must have the same number of attributes as
+ * the index has generally in the case of !heapkeyspace indexes).
*/
#define INDEX_ALT_TID_MASK INDEX_AM_RESERVED_BIT
/* Item pointer offset bits */
#define BT_RESERVED_OFFSET_MASK 0xF000
#define BT_N_KEYS_OFFSET_MASK 0x0FFF
+#define BT_N_POSTING_OFFSET_MASK 0x0FFF
#define BT_HEAP_TID_ATTR 0x1000
+#define BT_IS_POSTING 0x2000
+
+/*
+ * MaxPostingIndexTuplesPerPage is an upper bound on the number of tuples
+ * that can fit on one btree leaf page.
+ *
+ * Btree leaf pages may contain posting tuples, which store duplicates
+ * in a more effective way, so MaxPostingIndexTuplesPerPage is larger than
+ * MaxIndexTuplesPerPage.
+ *
+ * Each leaf page must contain at least three items, so estimate it as
+ * if we have three posting tuples with minimal size keys.
+ */
+#define MaxPostingIndexTuplesPerPage \
+ ((int) ((BLCKSZ - SizeOfPageHeaderData - \
+ 3*((MAXALIGN(sizeof(IndexTupleData) + 1) + sizeof(ItemIdData))) )) / \
+ (sizeof(ItemPointerData)))
+
+/*
+ * Helper for BTDedupState.
+ * Each entry represents a group of 'ntups' consecutive items starting on
+ * 'from' offset that were deduplicated into a single posting tuple.
+ */
+typedef struct dedupInterval
+{
+ OffsetNumber from;
+ OffsetNumber ntups;
+} dedupInterval;
+
+/*
+ * Btree-private state needed to build posting tuples.
+ * ipd is a posting list - an array of ItemPointerData.
+ *
+ * Iterating over tuples during index build or applying deduplication to a
+ * single page, we remember a tuple in itupprev, then compare the next one
+ * with it. If tuples are equal, save their TIDs in the posting list.
+ * ntuples contains the size of the posting list.
+ *
+ * Use maxitemsize and maxpostingsize to ensure that resulting posting tuple
+ * will satisfy BTMaxItemSize.
+ */
+typedef struct BTDedupState
+{
+ Size maxitemsize;
+ Size maxpostingsize;
+ IndexTuple itupprev;
+
+ /*
+ * array with info about deduplicated items on the page.
+ *
+ * It contains one entry for each group of consecutive items that
+ * were deduplicated into a single posting tuple.
+ *
+ * This array is saved in the xlog record, which allows deduplication to
+ * be replayed faster, without actually comparing tuples' keys.
+ */
+ dedupInterval dedup_intervals[MaxOffsetNumber];
+ /* current number of items in dedup_intervals array */
+ int n_intervals;
+ /* temp state variable to keep a 'possible' start of dedup interval */
+ OffsetNumber itupprev_off;
+
+ int ntuples;
+ ItemPointerData *ipd;
+} BTDedupState;
+
+/*
+ * N.B.: BTreeTupleIsPivot() should only be used in code that deals with
+ * heapkeyspace indexes specifically. BTreeTupleIsPosting() works with all
+ * nbtree indexes, though.
+ */
+#define BTreeTupleIsPivot(itup) \
+ ( \
+ ((itup)->t_info & INDEX_ALT_TID_MASK && \
+ ((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) == 0))\
+ )
+#define BTreeTupleIsPosting(itup) \
+ ( \
+ ((itup)->t_info & INDEX_ALT_TID_MASK && \
+ ((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) != 0))\
+ )
+
+#define BTreeTupleClearBtIsPosting(itup) \
+ do { \
+ ItemPointerSetOffsetNumber(&(itup)->t_tid, \
+ ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & ~BT_IS_POSTING); \
+ } while(0)
+
+#define BTreeTupleGetNPosting(itup) \
+ ( \
+ AssertMacro(BTreeTupleIsPosting(itup)), \
+ ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_N_POSTING_OFFSET_MASK \
+ )
+#define BTreeTupleSetNPosting(itup, n) \
+ do { \
+ ItemPointerSetOffsetNumber(&(itup)->t_tid, (n) & BT_N_POSTING_OFFSET_MASK); \
+ Assert((itup)->t_info & INDEX_ALT_TID_MASK); \
+ Assert(!((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) != 0)); \
+ ItemPointerSetOffsetNumber(&(itup)->t_tid, \
+ ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) | BT_IS_POSTING); \
+ } while(0)
-/* Get/set downlink block number */
+/*
+ * If tuple is posting, t_tid.ip_blkid contains offset of the posting list
+ */
+#define BTreeTupleGetPostingOffset(itup) \
+ ( \
+ AssertMacro(BTreeTupleIsPosting(itup)), \
+ ItemPointerGetBlockNumberNoCheck(&((itup)->t_tid)) \
+ )
+#define BTreeSetPostingMeta(itup, nposting, off) \
+ do { \
+ BTreeTupleSetNPosting(itup, nposting); \
+ Assert(BTreeTupleIsPosting(itup)); \
+ ItemPointerSetBlockNumber(&((itup)->t_tid), (off)); \
+ } while(0)
+
+#define BTreeTupleGetPosting(itup) \
+ (ItemPointer) ((char*) (itup) + BTreeTupleGetPostingOffset(itup))
+#define BTreeTupleGetPostingN(itup,n) \
+ (BTreeTupleGetPosting(itup) + (n))
+
+/* Get/set downlink block number */
#define BTreeInnerTupleGetDownLink(itup) \
ItemPointerGetBlockNumberNoCheck(&((itup)->t_tid))
#define BTreeInnerTupleSetDownLink(itup, blkno) \
@@ -326,40 +479,73 @@ typedef struct BTMetaPageData
*/
#define BTreeTupleGetNAtts(itup, rel) \
( \
- (itup)->t_info & INDEX_ALT_TID_MASK ? \
+ ((itup)->t_info & INDEX_ALT_TID_MASK && \
+ ((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) == 0)) ? \
( \
ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_N_KEYS_OFFSET_MASK \
) \
: \
IndexRelationGetNumberOfAttributes(rel) \
)
-#define BTreeTupleSetNAtts(itup, n) \
- do { \
- (itup)->t_info |= INDEX_ALT_TID_MASK; \
- ItemPointerSetOffsetNumber(&(itup)->t_tid, (n) & BT_N_KEYS_OFFSET_MASK); \
- } while(0)
+
+static inline void
+BTreeTupleSetNAtts(IndexTuple itup, int n)
+{
+ Assert(!BTreeTupleIsPosting(itup));
+ itup->t_info |= INDEX_ALT_TID_MASK;
+ ItemPointerSetOffsetNumber(&itup->t_tid, n & BT_N_KEYS_OFFSET_MASK);
+}
/*
- * Get tiebreaker heap TID attribute, if any. Macro works with both pivot
- * and non-pivot tuples, despite differences in how heap TID is represented.
+ * Get tiebreaker heap TID attribute, if any. Works with both pivot and
+ * non-pivot tuples, despite differences in how heap TID is represented.
+ *
+ * This returns the first/lowest heap TID in the case of a posting list tuple.
*/
-#define BTreeTupleGetHeapTID(itup) \
- ( \
- (itup)->t_info & INDEX_ALT_TID_MASK && \
- (ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_HEAP_TID_ATTR) != 0 ? \
- ( \
- (ItemPointer) (((char *) (itup) + IndexTupleSize(itup)) - \
- sizeof(ItemPointerData)) \
- ) \
- : (itup)->t_info & INDEX_ALT_TID_MASK ? NULL : (ItemPointer) &((itup)->t_tid) \
- )
+static inline ItemPointer
+BTreeTupleGetHeapTID(IndexTuple itup)
+{
+ if (BTreeTupleIsPivot(itup))
+ {
+ /* Pivot tuple heap TID representation? */
+ if ((ItemPointerGetOffsetNumberNoCheck(&itup->t_tid) &
+ BT_HEAP_TID_ATTR) != 0)
+ return (ItemPointer) ((char *) itup + IndexTupleSize(itup) -
+ sizeof(ItemPointerData));
+
+ /* Heap TID attribute was truncated */
+ return NULL;
+ }
+ else if (BTreeTupleIsPosting(itup))
+ return BTreeTupleGetPosting(itup);
+
+ return &(itup->t_tid);
+}
+
+/*
+ * Get maximum heap TID attribute, which could be the only TID in the case of
+ * a non-pivot tuple that does not have a posting list. Works with
+ * non-pivot tuples only.
+ */
+static inline ItemPointer
+BTreeTupleGetMaxTID(IndexTuple itup)
+{
+ Assert(!BTreeTupleIsPivot(itup));
+
+ if (BTreeTupleIsPosting(itup))
+ return (ItemPointer) (BTreeTupleGetPosting(itup) +
+ (BTreeTupleGetNPosting(itup) - 1));
+
+ return &(itup->t_tid);
+}
+
/*
* Set the heap TID attribute for a tuple that uses the INDEX_ALT_TID_MASK
- * representation (currently limited to pivot tuples)
+ * representation
*/
#define BTreeTupleSetAltHeapTID(itup) \
do { \
- Assert((itup)->t_info & INDEX_ALT_TID_MASK); \
+ Assert(BTreeTupleIsPivot(itup)); \
ItemPointerSetOffsetNumber(&(itup)->t_tid, \
ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) | BT_HEAP_TID_ATTR); \
} while(0)
@@ -500,6 +686,13 @@ typedef struct BTInsertStateData
Buffer buf;
/*
+ * if _bt_binsrch_insert() found the location inside existing posting
+ * list, save the position inside the list. This will be -1 in rare cases
+ * where the overlapping posting list is LP_DEAD.
+ */
+ int in_posting_offset;
+
+ /*
* Cache of bounds within the current buffer. Only used for insertions
* where _bt_check_unique is called. See _bt_binsrch_insert and
* _bt_findinsertloc for details.
@@ -534,7 +727,9 @@ typedef BTInsertStateData *BTInsertState;
* If we are doing an index-only scan, we save the entire IndexTuple for each
* matched item, otherwise only its heap TID and offset. The IndexTuples go
* into a separate workspace array; each BTScanPosItem stores its tuple's
- * offset within that array.
+ * offset within that array. Posting list tuples store a version of the
+ * tuple that does not include the posting list, allowing the same key to be
+ * returned for each logical tuple associated with the posting list.
*/
typedef struct BTScanPosItem /* what we remember about each match */
@@ -563,9 +758,13 @@ typedef struct BTScanPosData
/*
* If we are doing an index-only scan, nextTupleOffset is the first free
- * location in the associated tuple storage workspace.
+ * location in the associated tuple storage workspace. Posting list
+ * tuples need postingTupleOffset to store the current location of the
+ * tuple that is returned multiple times (once per heap TID in posting
+ * list).
*/
int nextTupleOffset;
+ int postingTupleOffset;
/*
* The items array is always ordered in index order (ie, increasing
@@ -578,7 +777,7 @@ typedef struct BTScanPosData
int lastItem; /* last valid index in items[] */
int itemIndex; /* current index in items[] */
- BTScanPosItem items[MaxIndexTuplesPerPage]; /* MUST BE LAST */
+ BTScanPosItem items[MaxPostingIndexTuplesPerPage]; /* MUST BE LAST */
} BTScanPosData;
typedef BTScanPosData *BTScanPos;
@@ -732,6 +931,9 @@ extern bool _bt_doinsert(Relation rel, IndexTuple itup,
IndexUniqueCheck checkUnique, Relation heapRel);
extern Buffer _bt_getstackbuf(Relation rel, BTStack stack, BlockNumber child);
extern void _bt_finish_split(Relation rel, Buffer bbuf, BTStack stack);
+extern IndexTuple _bt_form_newposting(IndexTuple itup, IndexTuple oposting,
+ OffsetNumber in_posting_offset);
+extern void _bt_dedup_insert(Page page, BTDedupState *dedupState);
/*
* prototypes for functions in nbtsplitloc.c
@@ -762,6 +964,8 @@ extern void _bt_delitems_delete(Relation rel, Buffer buf,
OffsetNumber *itemnos, int nitems, Relation heapRel);
extern void _bt_delitems_vacuum(Relation rel, Buffer buf,
OffsetNumber *itemnos, int nitems,
+ OffsetNumber *remainingoffset,
+ IndexTuple *remaining, int nremaining,
BlockNumber lastBlockVacuumed);
extern int _bt_pagedel(Relation rel, Buffer buf);
@@ -812,6 +1016,9 @@ extern bool _bt_check_natts(Relation rel, bool heapkeyspace, Page page,
OffsetNumber offnum);
extern void _bt_check_third_page(Relation rel, Relation heap,
bool needheaptidspace, Page page, IndexTuple newtup);
+extern IndexTuple BTreeFormPostingTuple(IndexTuple tuple, ItemPointerData *ipd,
+ int nipd);
+extern IndexTuple BTreeGetNthTupleOfPosting(IndexTuple tuple, int n);
/*
* prototypes for functions in nbtvalidate.c
@@ -824,5 +1031,7 @@ extern bool btvalidate(Oid opclassoid);
extern IndexBuildResult *btbuild(Relation heap, Relation index,
struct IndexInfo *indexInfo);
extern void _bt_parallel_build_main(dsm_segment *seg, shm_toc *toc);
+extern void _bt_stash_item_tid(BTDedupState *dedupState, IndexTuple itup,
+ OffsetNumber itup_offnum);
#endif /* NBTREE_H */
diff --git a/src/include/access/nbtxlog.h b/src/include/access/nbtxlog.h
index afa614d..075baaf 100644
--- a/src/include/access/nbtxlog.h
+++ b/src/include/access/nbtxlog.h
@@ -28,7 +28,8 @@
#define XLOG_BTREE_INSERT_META 0x20 /* same, plus update metapage */
#define XLOG_BTREE_SPLIT_L 0x30 /* add index tuple with split */
#define XLOG_BTREE_SPLIT_R 0x40 /* as above, new item on right */
-/* 0x50 and 0x60 are unused */
+#define XLOG_BTREE_DEDUP_PAGE 0x50 /* compactify tuples on the page */
+/* 0x60 is unused */
#define XLOG_BTREE_DELETE 0x70 /* delete leaf index tuples for a page */
#define XLOG_BTREE_UNLINK_PAGE 0x80 /* delete a half-dead page */
#define XLOG_BTREE_UNLINK_PAGE_META 0x90 /* same, and update metapage */
@@ -61,16 +62,21 @@ typedef struct xl_btree_metadata
* This data record is used for INSERT_LEAF, INSERT_UPPER, INSERT_META.
* Note that INSERT_META implies it's not a leaf page.
*
- * Backup Blk 0: original page (data contains the inserted tuple)
+ * Backup Blk 0: original page (data contains the inserted tuple);
+ * if in_posting_offset is valid, this is an insertion
+ * into an existing posting tuple at offnum;
+ * redo must repeat the logic of _bt_insertonpg().
* Backup Blk 1: child's left sibling, if INSERT_UPPER or INSERT_META
* Backup Blk 2: xl_btree_metadata, if INSERT_META
+ *
*/
typedef struct xl_btree_insert
{
OffsetNumber offnum;
+ OffsetNumber in_posting_offset;
} xl_btree_insert;
-#define SizeOfBtreeInsert (offsetof(xl_btree_insert, offnum) + sizeof(OffsetNumber))
+#define SizeOfBtreeInsert (offsetof(xl_btree_insert, in_posting_offset) + sizeof(OffsetNumber))
/*
* On insert with split, we save all the items going into the right sibling
@@ -96,6 +102,11 @@ typedef struct xl_btree_insert
* An IndexTuple representing the high key of the left page must follow with
* either variant.
*
+ * If the split included an insertion into the middle of a posting tuple, and
+ * thus required posting tuple replacement, the record also contains
+ * 'in_posting_offset', which is used to form the replacement tuple and repeat
+ * the _bt_insertonpg() logic. It is added to the xlog record only if the
+ * replaced item remains on the left page.
+ *
* Backup Blk 1: new right page
*
* The right page's data portion contains the right page's tuples in the form
@@ -113,9 +124,26 @@ typedef struct xl_btree_split
uint32 level; /* tree level of page being split */
OffsetNumber firstright; /* first item moved to right page */
OffsetNumber newitemoff; /* new item's offset (if placed on left page) */
+ OffsetNumber in_posting_offset; /* offset inside posting tuple */
} xl_btree_split;
-#define SizeOfBtreeSplit (offsetof(xl_btree_split, newitemoff) + sizeof(OffsetNumber))
+#define SizeOfBtreeSplit (offsetof(xl_btree_split, in_posting_offset) + sizeof(OffsetNumber))
+
+/*
+ * When a page is deduplicated, consecutive groups of tuples with equal keys
+ * are compacted into posting tuples.
+ * The WAL record stores the number of resulting posting tuples (n_intervals),
+ * followed by an array of dedupInterval structures that hold the information
+ * needed to replay page deduplication without extra comparisons of tuple keys.
+ */
+typedef struct xl_btree_dedup
+{
+ int n_intervals;
+
+ /* TARGET DEDUP INTERVALS FOLLOW AT THE END */
+} xl_btree_dedup;
+#define SizeOfBtreeDedup (sizeof(int))
+
/*
* This is what we need to know about delete of individual leaf index tuples.
@@ -173,10 +201,19 @@ typedef struct xl_btree_vacuum
{
BlockNumber lastBlockVacuumed;
- /* TARGET OFFSET NUMBERS FOLLOW */
+ /*
+ * These fields help us find the beginning of the remaining tuples from
+ * posting lists, which follow the arrays of offset numbers.
+ */
+ uint32 nremaining;
+ uint32 ndeleted;
+
+ /* TARGET OFFSET NUMBERS FOLLOW (ndeleted values) */
+ /* REMAINING OFFSET NUMBERS FOLLOW (nremaining values) */
+ /* REMAINING TUPLES TO INSERT FOLLOW (if nremaining > 0) */
} xl_btree_vacuum;
-#define SizeOfBtreeVacuum (offsetof(xl_btree_vacuum, lastBlockVacuumed) + sizeof(BlockNumber))
+#define SizeOfBtreeVacuum (offsetof(xl_btree_vacuum, ndeleted) + sizeof(uint32))
/*
* This is what we need to know about marking an empty branch for deletion.
diff --git a/src/tools/valgrind.supp b/src/tools/valgrind.supp
index ec47a22..71a03e3 100644
--- a/src/tools/valgrind.supp
+++ b/src/tools/valgrind.supp
@@ -212,3 +212,24 @@
Memcheck:Cond
fun:PyObject_Realloc
}
+
+# Temporarily work around bug in datum_image_eq's handling of the cstring
+# (typLen == -2) case. datumIsEqual() is not affected, but also doesn't handle
+# TOAST'ed values correctly.
+#
+# FIXME: Remove both suppressions when bug is fixed on master branch
+{
+ temporary_workaround_1
+ Memcheck:Addr1
+ fun:bcmp
+ fun:datum_image_eq
+ fun:_bt_keep_natts_fast
+}
+
+{
+ temporary_workaround_8
+ Memcheck:Addr8
+ fun:bcmp
+ fun:datum_image_eq
+ fun:_bt_keep_natts_fast
+}
On Mon, Sep 16, 2019 at 8:48 AM Anastasia Lubennikova
<a.lubennikova@postgrespro.ru> wrote:
Attached is v14 based on v12 (v13 changes are not merged).
In this version, I fixed the bug you mentioned and also fixed nbtinsert,
so that it doesn't save newposting in xlog record anymore.
Cool.
I tested patch with nbtree_wal_test, and found out that the real issue is
not the dedup WAL records themselves, but the full page writes that they trigger.
Here are test results (config is standard, except fsync=off to speed up tests):
'FPW on' and 'FPW off' are tests on v14.
NO_IMAGE is the test on v14 with REGBUF_NO_IMAGE in _bt_dedup_one_page().
I think that it makes sense to focus on synthetic cases without
FPWs/FPIs from checkpoints. At least for now.
With random insertions into a btree it's highly possible that deduplication will often be
the first write after a checkpoint, and thus will trigger an FPW, even if only a few tuples were compressed.
I find that hard to believe. Deduplication only occurs when we're
about to split the page. If that's almost as likely to occur as a
simple insert, then we're in big trouble (maybe it's actually true,
but if it is then that's the real problem). Also, fewer pages for the
index naturally leads to far fewer FPIs after a checkpoint.
I used "pg_waldump -z" and "pg_waldump --stats=record" to evaluate the
same case on v13. It was practically the same as the master branch,
apart from the huge difference in FPIs for the XLOG rmgr. Aside from
that one huge difference, there was a similar volume of the same types
of WAL records in each case. Mostly leaf inserts, and far fewer
internal page inserts. I suppose this isn't surprising.
It probably makes sense for the final version of the patch to increase
the volume of WAL a little overall, since the savings for internal
page stuff cannot make up for the cost of having to WAL log something
extra (deduplication operations) on leaf pages, regardless of the size
of those extra dedup WAL records (I am ignoring FPIs after a
checkpoint in this analysis). So the patch is more or less certain to
add *some* WAL overhead in cases that benefit, and that's okay. But,
it adds way too much WAL overhead today (even in v14), for reasons
that we don't understand yet, which is not okay.
I may have misunderstood your approach to WAL-logging in v12. I
thought that you were WAL-logging things that didn't change, which
doesn't seem to be the case with v14. I thought that v12 was very
similar to v11 (and my v13) in terms of how _bt_dedup_one_page() does
its WAL-logging. v14 looks good, though.
"pg_waldump -z" and "pg_waldump --stats=record" will break down the
contributing factor of FPIs, so it should be possible to account for
the overhead in the test case exactly. We can debug the problem by
using pg_waldump to count the absolute number of each type of record,
and the size of each type of record.
(Thinks some more...)
I think that the problem here is that you didn't copy this old code
from _bt_split() over to _bt_dedup_one_page():
/*
* Copy the original page's LSN into leftpage, which will become the
* updated version of the page. We need this because XLogInsert will
* examine the LSN and possibly dump it in a page image.
*/
PageSetLSN(leftpage, PageGetLSN(origpage));
isleaf = P_ISLEAF(oopaque);
Note that this happens at the start of _bt_split() -- the temp page
buffer based on origpage starts out with the same LSN as origpage.
This is an important step of the WAL volume optimization used by
_bt_split().
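To make the effect of that PageSetLSN() call concrete, here is a minimal standalone sketch. It is not PostgreSQL code: ToyPage, toy_make_temp_page() and toy_needs_fpi() are hypothetical names, and the model assumes the simplified rule that a full-page image is taken when full page writes are on and the page's LSN is not newer than the last checkpoint's redo pointer. Under that assumption, a temp page that starts out with a zero LSN always looks like it needs an image, while a temp page that inherits the original page's LSN only needs one when the original page would.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for XLogRecPtr; hypothetical, for illustration only. */
typedef uint64_t ToyLSN;

/* A page reduced to the one field that matters here: its LSN. */
typedef struct ToyPage
{
	ToyLSN		lsn;
} ToyPage;

/*
 * Simplified model (an assumption, not the real XLogInsert logic): a
 * full-page image is needed when full page writes are enabled and the page
 * has not been modified since the last checkpoint's redo pointer.
 */
static bool
toy_needs_fpi(const ToyPage *page, ToyLSN redo_ptr, bool full_page_writes)
{
	return full_page_writes && page->lsn <= redo_ptr;
}

/*
 * Build a temp copy of a page. A freshly initialized temp page has LSN 0
 * unless the original page's LSN is copied over, as _bt_split() does.
 */
static ToyPage
toy_make_temp_page(const ToyPage *orig, bool copy_lsn)
{
	ToyPage		temp = {0};

	if (copy_lsn)
		temp.lsn = orig->lsn;
	return temp;
}
```

With an original page at LSN 500 and a redo pointer at 400, the temp page without the copied LSN is reported as needing an image, and the one with the copied LSN is not.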
That's why there is no significant difference with the log_newpage_buffer() approach.
And that's why "lazy" deduplication doesn't help to decrease the amount of WAL.
The term "lazy deduplication" is seriously overloaded here. I think
that this could cause miscommunications. Let me list the possible
meanings of that term here:
1. First of all, the basic approach to deduplication is already lazy,
unlike GIN, in the sense that _bt_dedup_one_page() is called to avoid
a page split. I'm 100% sure that we both think that that works well
compared to an eager approach (like GIN's).
2. Second of all, there is the need to incrementally WAL log. It looks
like v14 does that well, in that it doesn't create
"xlrec_dedup.n_intervals" space when it isn't truly needed. That's
good.
3. Third, there is incremental writing of the page itself -- avoiding
using a temp buffer. Not sure where I stand on this.
4. Finally, there is the possibility that we could make deduplication
incremental, in order to avoid work that won't be needed altogether --
this would probably be combined with 3. Not sure where I stand on
this, either.
We should try to be careful when using these terms, as there is a very
real danger of talking past each other.
Another, more realistic approach is to make deduplication less intensive:
if the freed space is less than some threshold, fall back to not changing the page at all and not generating an xlog record.
I see that v14 uses the "dedupInterval" struct, which provides a
logical description of a deduplicated set of tuples. That general
approach is at least 95% of what I wanted from the
_bt_dedup_one_page() WAL-logging.
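As an illustration of why a logical description is enough for replay, the following standalone sketch walks a page's offsets and merges each logged interval into a single item, without looking at any keys. It is only a toy model: ToyInterval mirrors the from/ntups fields of dedupInterval, toy_replay_dedup() is a hypothetical helper, and it assumes the intervals are sorted by starting offset and do not overlap.

```c
#include <assert.h>
#include <stdint.h>

/* Mirrors dedupInterval's fields; the type name itself is hypothetical. */
typedef struct ToyInterval
{
	uint16_t	from;			/* 1-based offset of the first merged item */
	uint16_t	ntups;			/* number of consecutive items merged */
} ToyInterval;

/*
 * Return the number of items left on a page of 'nitems' items after each
 * interval is collapsed into one posting tuple. Assumes intervals are
 * sorted by 'from' and non-overlapping. No key comparisons are needed.
 */
static int
toy_replay_dedup(int nitems, const ToyInterval *intervals, int n_intervals)
{
	int			result = 0;
	int			cur = 0;
	int			off = 1;

	while (off <= nitems)
	{
		if (cur < n_intervals && intervals[cur].from == off)
		{
			/* Whole group becomes a single posting tuple. */
			assert(intervals[cur].ntups > 0);
			off += intervals[cur].ntups;
			cur++;
		}
		else
		{
			/* Item is outside any interval and is copied as-is. */
			off++;
		}
		result++;
	}
	return result;
}
```

For a page of 10 items with intervals {2, 3} and {7, 2}, this yields 7 items: items 2 to 4 and items 7 to 8 each collapse into one.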
That was probably the reason why the patch became faster after I added BT_COMPRESS_THRESHOLD in early versions:
not because deduplication itself is CPU bound or something, but because the WAL load decreased.
I think so too -- BT_COMPRESS_THRESHOLD definitely makes compression
faster as things are. I am not against bringing back
BT_COMPRESS_THRESHOLD. I just don't want to do it right now because I
think that it's a distraction. It may hide problems that we want to
fix. Like the PageSetLSN() problem I mentioned just now, and maybe
others.
We will definitely need to have page space accounting that's a bit
similar to nbtsplitloc.c, to avoid the case where a leaf page is 100%
full (or has 4 bytes left, or something). That happens regularly now.
That must start with teaching _bt_dedup_one_page() about how much
space it will free. Basing it on the number of items on the page or
whatever is not going to work that well.
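A rough sketch of the kind of byte-level accounting meant here follows. It is a standalone toy, not the patch's logic: the toy_* helpers are hypothetical, and it assumes 8-byte MAXALIGN, a 6-byte ItemPointerData, a 4-byte line pointer per item, and a posting tuple laid out as the base tuple followed by one TID per merged item.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed constants; hypothetical names, typical 64-bit values. */
#define TOY_MAXALIGN(len)	(((size_t) (len) + 7) & ~(size_t) 7)
#define TOY_TID_SIZE		6	/* sizeof(ItemPointerData) */
#define TOY_LINE_PTR_SIZE	4	/* sizeof(ItemIdData) */

/* Space used by 'ntuples' separate tuples, including line pointers. */
static size_t
toy_space_before(size_t tuple_size, int ntuples)
{
	return (size_t) ntuples * (TOY_MAXALIGN(tuple_size) + TOY_LINE_PTR_SIZE);
}

/* Space used by one posting tuple holding 'ntuples' TIDs. */
static size_t
toy_space_after(size_t tuple_size, int ntuples)
{
	return TOY_MAXALIGN(tuple_size + (size_t) ntuples * TOY_TID_SIZE) +
		TOY_LINE_PTR_SIZE;
}

/* Bytes freed by the merge; negative if merging would use more space. */
static long
toy_space_freed(size_t tuple_size, int ntuples)
{
	return (long) toy_space_before(tuple_size, ntuples) -
		(long) toy_space_after(tuple_size, ntuples);
}
```

Under these assumptions, merging 10 tuples of 16 bytes frees 116 bytes, while merging only 2 frees 4 bytes, which is why a count of items on the page is a poor proxy for the space actually gained.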
I think that it would be possible to have something like
BT_COMPRESS_THRESHOLD to prevent thrashing, and *also* make the
deduplication incremental, in the sense that it can give up on
deduplication when it frees enough space (i.e. something like v13's
0002-* patch). I said that these two things are closely related, which
is true, but it's also true that they don't overlap.
Don't forget the reason why I removed BT_COMPRESS_THRESHOLD: Doing so
made a handful of specific indexes (mostly from TPC-H) significantly
smaller. I never tried to debug the problem. It's possible that we
could bring back BT_COMPRESS_THRESHOLD (or something fillfactor-like),
but not use it on rightmost pages, and get the best of both worlds.
IIRC right-heavy low cardinality indexes (e.g. a low cardinality date
column) were improved by removing BT_COMPRESS_THRESHOLD, but we can
debug that when the time comes.
So I propose to develop this idea. The question is how to choose the threshold.
I wouldn't like to introduce new user settings. Any ideas?
I think that there should be a target fill factor that sometimes makes
deduplication leave a small amount of free space. Maybe that means
that the last posting list on the page is made a bit smaller than the
other ones. It should be "goal orientated".
The loop within _bt_dedup_one_page() is very confusing in both v13 and
v14 -- I couldn't figure out why the accounting worked like this:
+ /*
+ * Project size of new posting list that would result from merging
+ * current tup with pending posting list (could just be prev item
+ * that's "pending").
+ *
+ * This accounting looks odd, but it's correct because ...
+ */
+ projpostingsz = MAXALIGN(IndexTupleSize(dedupState->itupprev) +
+ (dedupState->ntuples + itup_ntuples + 1) *
+ sizeof(ItemPointerData));
Why the "+1" here?
I have significantly refactored the _bt_dedup_one_page() loop in a way
that seems like a big improvement. It allowed me to remove all of the
small palloc() calls inside the loop, apart from the
BTreeFormPostingTuple() palloc()s. It's also a lot faster -- it seems
to have shaved about 2 seconds off the "land" unlogged table test,
which was originally about 1 minute 2 seconds with v13's 0001-* patch
(and without v13's 0002-* patch).
It seems like this can easily be integrated with the approach to WAL
logging taken in v14, so everything can be integrated soon. I'll work
on that.
I also noticed that the number of checkpoints differs between tests:
select checkpoints_req from pg_stat_bgwriter ;
And I struggle to explain the reason for this.
Do you understand what can cause the difference?
I imagine that the additional WAL volume triggered a checkpoint
earlier than in the more favorable test, which indirectly triggered
more FPIs, which contributed to triggering a checkpoint even
earlier...and so on. Synthetic test cases can avoid this. A useful
synthetic test should have no checkpoints at all, so that we can see
the broken down costs, without any second order effects that add more
cost in weird ways.
--
Peter Geoghegan
On Mon, Sep 16, 2019 at 11:58 AM Peter Geoghegan <pg@bowt.ie> wrote:
I think that the problem here is that you didn't copy this old code
from _bt_split() over to _bt_dedup_one_page():
/*
* Copy the original page's LSN into leftpage, which will become the
* updated version of the page. We need this because XLogInsert will
* examine the LSN and possibly dump it in a page image.
*/
PageSetLSN(leftpage, PageGetLSN(origpage));
isleaf = P_ISLEAF(oopaque);
I can confirm that this is what the problem was. Attached are two patches:
* A version of your v14 from today with a couple of tiny changes to
make it work against the current master branch -- I had to rebase the
patch, but the changes made while rebasing were totally trivial. (I
like to keep CFTester green.)
* The second patch actually fixes the PageSetLSN() thing, setting the
temp page buffer's LSN to match the original page before any real work
is done, and before XLogInsert() is called. Just like _bt_split().
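To make the PageSetLSN() point concrete, here is a minimal toy model of the decision involved. It assumes the simplified rule that a registered page gets a full-page image when its LSN is not newer than the latest checkpoint's redo pointer, and that a freshly initialized temp page starts out with LSN 0. ToyLSN, ToyPage, toy_needs_fpi() and toy_copy_lsn() are hypothetical names invented for this sketch; none of them are PostgreSQL APIs, and the real logic in XLogInsert() handles more cases than this.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t ToyLSN;        /* hypothetical stand-in for XLogRecPtr */

typedef struct ToyPage
{
    ToyLSN      lsn;            /* hypothetical stand-in for the page LSN */
} ToyPage;

/*
 * Simplified rule (an assumption of this sketch): a full-page image is
 * taken when the page has not been modified since the redo pointer of
 * the most recent checkpoint.
 */
static bool
toy_needs_fpi(const ToyPage *page, ToyLSN redo_ptr)
{
    assert(page != NULL);
    return page->lsn <= redo_ptr;
}

/* Analogous in spirit to PageSetLSN(newpage, PageGetLSN(page)) */
static void
toy_copy_lsn(ToyPage *dst, const ToyPage *src)
{
    assert(dst != NULL && src != NULL);
    dst->lsn = src->lsn;
}
```

Under these assumptions, a temp page whose LSN was never copied always compares as "older than the checkpoint", so every record would carry a page image. Copying the LSN first lets the comparison reflect the real state of the original page.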
The test case now shows exactly what you reported for "FPWs off" when
FPWs are turned on, at least on my machine and with my checkpoint
settings. That is, there are *zero* FPIs/FPWs, so the final nbtree
volume is 2128 MB. This means that the volume of additional WAL
required over what the master branch requires for the same test case
is very small (2128 MB compares well with master's 2011 MB of WAL).
Maybe we could do better than 2128 MB with more work, but this is
definitely already low enough overhead to be acceptable. This also
passes "make check-world" testing.
However, my usual wal_consistency_checking smoke test fails pretty
quickly with the two patches applied:
3634/2019-09-16 13:53:22 PDT FATAL: inconsistent page found, rel
1663/16385/2673, forknum 0, blkno 13
3634/2019-09-16 13:53:22 PDT CONTEXT: WAL redo at 0/3202370 for
Btree/DEDUPLICATE: items were deduplicated to 12 items
3633/2019-09-16 13:53:22 PDT LOG: startup process (PID 3634) exited
with exit code 1
Maybe the lack of the PageSetLSN() thing masked a bug in WAL replay,
since without that we effectively always just replay FPIs, never truly
relying on redo. (I didn't try wal_consistency_checking without the
second patch, but I assume that you did, and found no problems for
this reason.)
Can you produce a new version that integrates the PageSetLSN() thing,
and fixes this bug?
Thanks
--
Peter Geoghegan
Attachments:
v141-0002-Add-_bt_split-style-WAL-optimization.patch (application/octet-stream)
From d39f41ff50e8a72e5228a92102434e600d65a943 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Mon, 16 Sep 2019 13:39:21 -0700
Subject: [PATCH v141 2/2] Add _bt_split() style WAL optimization.
---
src/backend/access/nbtree/nbtinsert.c | 7 +++++++
1 file changed, 7 insertions(+)
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index 605865e85e..a3b7cee0c5 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -2635,6 +2635,13 @@ _bt_dedup_one_page(Relation rel, Buffer buffer, Relation heapRel, Size itemsz)
newpage = PageGetTempPageCopySpecial(page);
nopaque = (BTPageOpaque) PageGetSpecialPointer(newpage);
+ /*
+ * Copy the original page's LSN into newpage, which will become the
+ * updated version of the page. We need this because XLogInsert will
+ * examine the LSN and possibly dump it in a page image.
+ */
+ PageSetLSN(newpage, PageGetLSN(page));
+
/* Make sure that new page won't have garbage flag set */
nopaque->btpo_flags &= ~BTP_HAS_GARBAGE;
--
2.17.1
v141-0001-v14-0001-Add-deduplication-to-nbtree.patch-from.patch (application/octet-stream)
From a4d17804d9980f845f6a64f61629e0bfde0906bd Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Mon, 16 Sep 2019 13:26:58 -0700
Subject: [PATCH v141 1/2] v14-0001-Add-deduplication-to-nbtree.patch from
Anastasia
---
contrib/amcheck/verify_nbtree.c | 128 +++++-
src/backend/access/nbtree/README | 76 +++-
src/backend/access/nbtree/nbtinsert.c | 541 +++++++++++++++++++++++-
src/backend/access/nbtree/nbtpage.c | 148 ++++++-
src/backend/access/nbtree/nbtree.c | 147 +++++--
src/backend/access/nbtree/nbtsearch.c | 247 ++++++++++-
src/backend/access/nbtree/nbtsort.c | 243 ++++++++++-
src/backend/access/nbtree/nbtsplitloc.c | 47 +-
src/backend/access/nbtree/nbtutils.c | 264 ++++++++++--
src/backend/access/nbtree/nbtxlog.c | 241 ++++++++++-
src/backend/access/rmgrdesc/nbtdesc.c | 27 +-
src/include/access/nbtree.h | 275 ++++++++++--
src/include/access/nbtxlog.h | 49 ++-
src/tools/valgrind.supp | 21 +
14 files changed, 2268 insertions(+), 186 deletions(-)
diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index 05e7d678ed..399743d4d6 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -924,6 +924,7 @@ bt_target_page_check(BtreeCheckState *state)
size_t tupsize;
BTScanInsert skey;
bool lowersizelimit;
+ ItemPointer scantid;
CHECK_FOR_INTERRUPTS();
@@ -994,29 +995,73 @@ bt_target_page_check(BtreeCheckState *state)
/*
* Readonly callers may optionally verify that non-pivot tuples can
- * each be found by an independent search that starts from the root
+ * each be found by an independent search that starts from the root.
+ * Note that we deliberately don't do individual searches for each
+ * "logical" posting list tuple, since the posting list itself is
+ * validated by other checks.
*/
if (state->rootdescend && P_ISLEAF(topaque) &&
!bt_rootdescend(state, itup))
{
char *itid,
*htid;
+ ItemPointer tid = BTreeTupleGetHeapTID(itup);
itid = psprintf("(%u,%u)", state->targetblock, offset);
htid = psprintf("(%u,%u)",
- ItemPointerGetBlockNumber(&(itup->t_tid)),
- ItemPointerGetOffsetNumber(&(itup->t_tid)));
+ ItemPointerGetBlockNumber(tid),
+ ItemPointerGetOffsetNumber(tid));
ereport(ERROR,
(errcode(ERRCODE_INDEX_CORRUPTED),
errmsg("could not find tuple using search from root page in index \"%s\"",
RelationGetRelationName(state->rel)),
- errdetail_internal("Index tid=%s points to heap tid=%s page lsn=%X/%X.",
+ errdetail_internal("Index tid=%s min heap tid=%s page lsn=%X/%X.",
itid, htid,
(uint32) (state->targetlsn >> 32),
(uint32) state->targetlsn)));
}
+ /*
+ * If tuple is actually a posting list, make sure posting list TIDs
+ * are in order.
+ */
+ if (BTreeTupleIsPosting(itup))
+ {
+ ItemPointerData last;
+ ItemPointer current;
+
+ ItemPointerCopy(BTreeTupleGetHeapTID(itup), &last);
+
+ for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
+ {
+
+ current = BTreeTupleGetPostingN(itup, i);
+
+ if (ItemPointerCompare(current, &last) <= 0)
+ {
+ char *itid,
+ *htid;
+
+ itid = psprintf("(%u,%u)", state->targetblock, offset);
+ htid = psprintf("(%u,%u)",
+ ItemPointerGetBlockNumberNoCheck(current),
+ ItemPointerGetOffsetNumberNoCheck(current));
+
+ ereport(ERROR,
+ (errcode(ERRCODE_INDEX_CORRUPTED),
+ errmsg("posting list heap TIDs out of order in index \"%s\"",
+ RelationGetRelationName(state->rel)),
+ errdetail_internal("Index tid=%s min heap tid=%s page lsn=%X/%X.",
+ itid, htid,
+ (uint32) (state->targetlsn >> 32),
+ (uint32) state->targetlsn)));
+ }
+
+ ItemPointerCopy(current, &last);
+ }
+ }
+
/* Build insertion scankey for current page offset */
skey = bt_mkscankey_pivotsearch(state->rel, itup);
@@ -1074,12 +1119,33 @@ bt_target_page_check(BtreeCheckState *state)
{
IndexTuple norm;
- norm = bt_normalize_tuple(state, itup);
- bloom_add_element(state->filter, (unsigned char *) norm,
- IndexTupleSize(norm));
- /* Be tidy */
- if (norm != itup)
- pfree(norm);
+ if (BTreeTupleIsPosting(itup))
+ {
+ IndexTuple onetup;
+
+ /* Fingerprint all elements of posting tuple one by one */
+ for (int i = 0; i < BTreeTupleGetNPosting(itup); i++)
+ {
+ onetup = BTreeGetNthTupleOfPosting(itup, i);
+
+ norm = bt_normalize_tuple(state, onetup);
+ bloom_add_element(state->filter, (unsigned char *) norm,
+ IndexTupleSize(norm));
+ /* Be tidy */
+ if (norm != onetup)
+ pfree(norm);
+ pfree(onetup);
+ }
+ }
+ else
+ {
+ norm = bt_normalize_tuple(state, itup);
+ bloom_add_element(state->filter, (unsigned char *) norm,
+ IndexTupleSize(norm));
+ /* Be tidy */
+ if (norm != itup)
+ pfree(norm);
+ }
}
/*
@@ -1087,7 +1153,8 @@ bt_target_page_check(BtreeCheckState *state)
*
* If there is a high key (if this is not the rightmost page on its
* entire level), check that high key actually is upper bound on all
- * page items.
+ * page items. If this is a posting list tuple, we'll need to set
+ * scantid to be highest TID in posting list.
*
* We prefer to check all items against high key rather than checking
* just the last and trusting that the operator class obeys the
@@ -1127,6 +1194,9 @@ bt_target_page_check(BtreeCheckState *state)
* tuple. (See also: "Notes About Data Representation" in the nbtree
* README.)
*/
+ scantid = skey->scantid;
+ if (state->heapkeyspace && !BTreeTupleIsPivot(itup))
+ skey->scantid = BTreeTupleGetMaxTID(itup);
if (!P_RIGHTMOST(topaque) &&
!(P_ISLEAF(topaque) ? invariant_leq_offset(state, skey, P_HIKEY) :
invariant_l_offset(state, skey, P_HIKEY)))
@@ -1150,6 +1220,7 @@ bt_target_page_check(BtreeCheckState *state)
(uint32) (state->targetlsn >> 32),
(uint32) state->targetlsn)));
}
+ skey->scantid = scantid;
/*
* * Item order check *
@@ -1164,11 +1235,13 @@ bt_target_page_check(BtreeCheckState *state)
*htid,
*nitid,
*nhtid;
+ ItemPointer tid;
itid = psprintf("(%u,%u)", state->targetblock, offset);
+ tid = BTreeTupleGetHeapTID(itup);
htid = psprintf("(%u,%u)",
- ItemPointerGetBlockNumberNoCheck(&(itup->t_tid)),
- ItemPointerGetOffsetNumberNoCheck(&(itup->t_tid)));
+ ItemPointerGetBlockNumberNoCheck(tid),
+ ItemPointerGetOffsetNumberNoCheck(tid));
nitid = psprintf("(%u,%u)", state->targetblock,
OffsetNumberNext(offset));
@@ -1177,9 +1250,11 @@ bt_target_page_check(BtreeCheckState *state)
state->target,
OffsetNumberNext(offset));
itup = (IndexTuple) PageGetItem(state->target, itemid);
+
+ tid = BTreeTupleGetHeapTID(itup);
nhtid = psprintf("(%u,%u)",
- ItemPointerGetBlockNumberNoCheck(&(itup->t_tid)),
- ItemPointerGetOffsetNumberNoCheck(&(itup->t_tid)));
+ ItemPointerGetBlockNumberNoCheck(tid),
+ ItemPointerGetOffsetNumberNoCheck(tid));
ereport(ERROR,
(errcode(ERRCODE_INDEX_CORRUPTED),
@@ -1189,10 +1264,10 @@ bt_target_page_check(BtreeCheckState *state)
"higher index tid=%s (points to %s tid=%s) "
"page lsn=%X/%X.",
itid,
- P_ISLEAF(topaque) ? "heap" : "index",
+ P_ISLEAF(topaque) ? "min heap" : "index",
htid,
nitid,
- P_ISLEAF(topaque) ? "heap" : "index",
+ P_ISLEAF(topaque) ? "min heap" : "index",
nhtid,
(uint32) (state->targetlsn >> 32),
(uint32) state->targetlsn)));
@@ -1953,10 +2028,10 @@ bt_tuple_present_callback(Relation index, HeapTuple htup, Datum *values,
* verification. In particular, it won't try to normalize opclass-equal
* datums with potentially distinct representations (e.g., btree/numeric_ops
* index datums will not get their display scale normalized-away here).
- * Normalization may need to be expanded to handle more cases in the future,
- * though. For example, it's possible that non-pivot tuples could in the
- * future have alternative logically equivalent representations due to using
- * the INDEX_ALT_TID_MASK bit to implement intelligent deduplication.
+ * Caller does normalization for non-pivot tuples that have a posting list,
+ * since dummy CREATE INDEX callback code generates new tuples with the same
+ * normalized representation. Deduplication is performed opportunistically,
+ * and in general there is no guarantee about how or when it will be applied.
*/
static IndexTuple
bt_normalize_tuple(BtreeCheckState *state, IndexTuple itup)
@@ -2087,6 +2162,7 @@ bt_rootdescend(BtreeCheckState *state, IndexTuple itup)
insertstate.itup = itup;
insertstate.itemsz = MAXALIGN(IndexTupleSize(itup));
insertstate.itup_key = key;
+ insertstate.in_posting_offset = 0;
insertstate.bounds_valid = false;
insertstate.buf = lbuf;
@@ -2094,7 +2170,9 @@ bt_rootdescend(BtreeCheckState *state, IndexTuple itup)
offnum = _bt_binsrch_insert(state->rel, &insertstate);
/* Compare first >= matching item on leaf page, if any */
page = BufferGetPage(lbuf);
+ /* Should match on first heap TID when tuple has a posting list */
if (offnum <= PageGetMaxOffsetNumber(page) &&
+ insertstate.in_posting_offset <= 0 &&
_bt_compare(state->rel, key, page, offnum) == 0)
exists = true;
_bt_relbuf(state->rel, lbuf);
@@ -2560,14 +2638,18 @@ static inline ItemPointer
BTreeTupleGetHeapTIDCareful(BtreeCheckState *state, IndexTuple itup,
bool nonpivot)
{
- ItemPointer result = BTreeTupleGetHeapTID(itup);
+ ItemPointer result;
BlockNumber targetblock = state->targetblock;
- if (result == NULL && nonpivot)
+ /* Shouldn't be called with a !heapkeyspace index */
+ Assert(state->heapkeyspace);
+ if (BTreeTupleIsPivot(itup) == nonpivot)
ereport(ERROR,
(errcode(ERRCODE_INDEX_CORRUPTED),
errmsg("block %u or its right sibling block or child block in index \"%s\" contains non-pivot tuple that lacks a heap TID",
targetblock, RelationGetRelationName(state->rel))));
+ result = BTreeTupleGetHeapTID(itup);
+
return result;
}
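The posting list ordering check added to bt_target_page_check() in the amcheck hunks above can be sketched standalone as follows. This is only an illustration of the invariant (each heap TID must be strictly greater than its predecessor), not the patch's code. ToyTid, toy_tid_compare() and toy_posting_first_misordered() are hypothetical names; the (block, offset) layout is a simplified stand-in for ItemPointerData.

```c
#include <stdint.h>

/* Hypothetical, simplified stand-in for ItemPointerData */
typedef struct ToyTid
{
    uint32_t    block;          /* heap block number */
    uint16_t    offset;         /* line pointer offset within the block */
} ToyTid;

/* Three-way comparison: block number first, then offset number */
static int
toy_tid_compare(const ToyTid *a, const ToyTid *b)
{
    if (a->block != b->block)
        return (a->block < b->block) ? -1 : 1;
    if (a->offset != b->offset)
        return (a->offset < b->offset) ? -1 : 1;
    return 0;
}

/*
 * Return the index of the first TID that is not strictly greater than
 * its predecessor, or -1 when the whole posting list is in order.
 * Equal neighbours count as a violation, matching the "<= 0" test in
 * the patch.
 */
static int
toy_posting_first_misordered(const ToyTid *tids, int ntids)
{
    for (int i = 1; i < ntids; i++)
    {
        if (toy_tid_compare(&tids[i], &tids[i - 1]) <= 0)
            return i;
    }
    return -1;
}
```

Returning the offending index rather than a boolean mirrors how the amcheck error reports which heap TID broke the ordering.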
diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index 6db203e75c..50ec9ef48c 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -432,7 +432,10 @@ because we allow LP_DEAD to be set with only a share lock (it's exactly
like a hint bit for a heap tuple), but physically removing tuples requires
exclusive lock. In the current code we try to remove LP_DEAD tuples when
we are otherwise faced with having to split a page to do an insertion (and
-hence have exclusive lock on it already).
+hence have exclusive lock on it already). Deduplication can also prevent
+a page split, but removing LP_DEAD tuples is the preferred approach.
+(Note that posting list tuples can only have their LP_DEAD bit set when
+every "logical" tuple represented within the posting list is known dead.)
This leaves the index in a state where it has no entry for a dead tuple
that still exists in the heap. This is not a problem for the current
@@ -710,6 +713,77 @@ the fallback strategy assumes that duplicates are mostly inserted in
ascending heap TID order. The page is split in a way that leaves the left
half of the page mostly full, and the right half of the page mostly empty.
+Notes about deduplication
+-------------------------
+
+We deduplicate non-pivot tuples in non-unique indexes to reduce storage
+overhead, and to avoid or at least delay page splits. Deduplication alters
+the physical representation of tuples without changing the logical contents
+of the index, and without adding overhead to read queries. Non-pivot
+tuples are folded together into a single physical tuple with a posting list
+(a simple array of heap TIDs with the standard item pointer format).
+Deduplication is always applied lazily, at the point where it would
+otherwise be necessary to perform a page split. It occurs only when
+LP_DEAD items have been removed, as our last line of defense against
+splitting a leaf page. We can set the LP_DEAD bit with posting list
+tuples, though only when all table tuples are known dead. (Bitmap scans
+cannot perform LP_DEAD bit setting, and are the common case with indexes
+that contain lots of duplicates, so this downside is considered
+acceptable.)
+
+Large groups of logical duplicates tend to appear together on the same leaf
+page due to the special duplicate logic used when choosing a split point.
+This facilitates lazy/dynamic deduplication. Deduplication can reliably
+deduplicate a large localized group of duplicates before it can span
+multiple leaf pages. Posting list tuples are subject to the same 1/3 of a
+page restriction as any other tuple.
+
+Lazy deduplication allows the page space accounting used during page splits
+to have absolutely minimal special case logic for posting lists. A posting
+list can be thought of as extra payload that suffix truncation will
+reliably truncate away as needed during page splits, just like non-key
+columns from an INCLUDE index tuple. An incoming tuple (which might cause
+a page split) can always be thought of as a non-posting-list tuple that
+must be inserted alongside existing items, without needing to consider
+deduplication. Most of the time, that's what actually happens: incoming
+tuples are either not duplicates, or are duplicates with a heap TID that
+doesn't overlap with any existing posting list tuple (lazy deduplication
+avoids rewriting posting lists repeatedly when heap TIDs are inserted
+slightly out of order by concurrent inserters). When the incoming tuple
+really does overlap with an existing posting list, a posting list split is
+performed. Posting list splits work in a way that more or less preserves
+the illusion that all incoming tuples do not need to be merged with any
+existing posting list tuple.
+
+Posting list splits work by "overriding" the details of the incoming tuple.
+The heap TID of the incoming tuple is altered to make it match the
+rightmost heap TID from the existing/originally overlapping posting list.
+The offset number that the new/incoming tuple is to be inserted at is
+incremented so that it will be inserted to the right of the existing
+posting list. The insertion (or page split) operation that completes the
+insert does one extra step: an in-place update of the posting list. The
+update changes the posting list such that the "true" heap TID from the
+original incoming tuple is now contained in the posting list. We make
+space in the posting list by removing the heap TID that became the new
+item. The size of the posting list won't change, and so the page split
+space accounting does not need to care about posting lists. Also, overall
+space utilization is improved by keeping existing posting lists large.
+
+The representation of posting lists is identical to the posting lists used
+by GIN, so it would be straightforward to apply GIN's varbyte encoding
+compression scheme to individual posting lists. Posting list compression
+would break the assumptions made by posting list splits about page space
+accounting, though, so it's not clear how compression could be integrated
+with nbtree. Besides, posting list compression does not offer a compelling
+trade-off for nbtree, since in general nbtree is optimized for consistent
+performance with many concurrent readers and writers. A major goal of
+nbtree's lazy approach to deduplication is to limit the performance impact
+of deduplication with random updates. Even concurrent append-only inserts
+of the same key value will tend to have inserts of individual index tuples
+in an order that doesn't quite match heap TID order. In general, delaying
+deduplication avoids many unnecessary posting list splits, and minimizes
+page level fragmentation.
+
Notes About Data Representation
-------------------------------
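The posting list split described in the "Notes about deduplication" text above can be sketched on a plain array. This is a toy under stated assumptions, not the patch's implementation: ToyTid and toy_posting_split() are hypothetical names, the posting list is a bare array rather than part of an index tuple, and the caller is assumed to have already found the offset where the incoming heap TID belongs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified stand-in for ItemPointerData */
typedef struct ToyTid
{
    uint32_t    block;
    uint16_t    offset;
} ToyTid;

/*
 * Split a posting list in place.
 *
 * posting[0..ntids-1] is sorted.  *newitem_tid is the incoming heap TID,
 * which sorts at position in_posting_offset inside the list.  On return
 * the list still holds ntids entries and now contains the incoming TID,
 * while *newitem_tid holds the list's original rightmost TID, so the
 * incoming tuple can be inserted to the right of the posting list.
 */
static void
toy_posting_split(ToyTid *posting, int ntids, int in_posting_offset,
                  ToyTid *newitem_tid)
{
    ToyTid      orig_rightmost;

    assert(in_posting_offset > 0 && in_posting_offset < ntids);

    orig_rightmost = posting[ntids - 1];

    /* Shift TIDs one place to the right, dropping the rightmost one */
    memmove(&posting[in_posting_offset + 1],
            &posting[in_posting_offset],
            (size_t) (ntids - in_posting_offset - 1) * sizeof(ToyTid));

    /* Fill the gap with the incoming TID */
    posting[in_posting_offset] = *newitem_tid;

    /* The incoming item now carries the original rightmost TID */
    *newitem_tid = orig_rightmost;
}
```

Because the number of TIDs in the list never changes, the list's size stays the same, which is the property the README relies on to keep page space accounting simple.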
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index b84bf1c3df..605865e85e 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -47,21 +47,26 @@ static void _bt_insertonpg(Relation rel, BTScanInsert itup_key,
BTStack stack,
IndexTuple itup,
OffsetNumber newitemoff,
+ int in_posting_offset,
bool split_only_page);
static Buffer _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf,
Buffer cbuf, OffsetNumber newitemoff, Size newitemsz,
- IndexTuple newitem);
+ IndexTuple newitem, IndexTuple original_newitem, IndexTuple nposting,
+ OffsetNumber in_posting_offset);
static void _bt_insert_parent(Relation rel, Buffer buf, Buffer rbuf,
BTStack stack, bool is_root, bool is_only);
static bool _bt_pgaddtup(Page page, Size itemsize, IndexTuple itup,
OffsetNumber itup_off);
static void _bt_vacuum_one_page(Relation rel, Buffer buffer, Relation heapRel);
+static void _bt_dedup_one_page(Relation rel, Buffer buffer, Relation heapRel,
+ Size itemsz);
/*
* _bt_doinsert() -- Handle insertion of a single index tuple in the tree.
*
* This routine is called by the public interface routine, btinsert.
- * By here, itup is filled in, including the TID.
+ * By here, itup is filled in, including the TID. Caller should be
+ * prepared for us to scribble on 'itup'.
*
* If checkUnique is UNIQUE_CHECK_NO or UNIQUE_CHECK_PARTIAL, this
* will allow duplicates. Otherwise (UNIQUE_CHECK_YES or
@@ -123,6 +128,7 @@ _bt_doinsert(Relation rel, IndexTuple itup,
/* PageAddItem will MAXALIGN(), but be consistent */
insertstate.itemsz = MAXALIGN(IndexTupleSize(itup));
insertstate.itup_key = itup_key;
+ insertstate.in_posting_offset = 0;
insertstate.bounds_valid = false;
insertstate.buf = InvalidBuffer;
@@ -300,7 +306,7 @@ top:
newitemoff = _bt_findinsertloc(rel, &insertstate, checkingunique,
stack, heapRel);
_bt_insertonpg(rel, itup_key, insertstate.buf, InvalidBuffer, stack,
- itup, newitemoff, false);
+ itup, newitemoff, insertstate.in_posting_offset, false);
}
else
{
@@ -435,6 +441,7 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
/* okay, we gotta fetch the heap tuple ... */
curitup = (IndexTuple) PageGetItem(page, curitemid);
+ Assert(!BTreeTupleIsPosting(curitup));
htid = curitup->t_tid;
/*
@@ -689,6 +696,7 @@ _bt_findinsertloc(Relation rel,
BTScanInsert itup_key = insertstate->itup_key;
Page page = BufferGetPage(insertstate->buf);
BTPageOpaque lpageop;
+ OffsetNumber location;
lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -751,13 +759,23 @@ _bt_findinsertloc(Relation rel,
/*
* If the target page is full, see if we can obtain enough space by
- * erasing LP_DEAD items
+ * erasing LP_DEAD items. If that doesn't work out, and if the index
+ * isn't a unique index, try deduplication.
*/
- if (PageGetFreeSpace(page) < insertstate->itemsz &&
- P_HAS_GARBAGE(lpageop))
+ if (PageGetFreeSpace(page) < insertstate->itemsz)
{
- _bt_vacuum_one_page(rel, insertstate->buf, heapRel);
- insertstate->bounds_valid = false;
+ if (P_HAS_GARBAGE(lpageop))
+ {
+ _bt_vacuum_one_page(rel, insertstate->buf, heapRel);
+ insertstate->bounds_valid = false;
+ }
+
+ if (!checkingunique && PageGetFreeSpace(page) < insertstate->itemsz)
+ {
+ _bt_dedup_one_page(rel, insertstate->buf, heapRel,
+ insertstate->itemsz);
+ insertstate->bounds_valid = false; /* paranoia */
+ }
}
}
else
@@ -839,7 +857,31 @@ _bt_findinsertloc(Relation rel,
Assert(P_RIGHTMOST(lpageop) ||
_bt_compare(rel, itup_key, page, P_HIKEY) <= 0);
- return _bt_binsrch_insert(rel, insertstate);
+ location = _bt_binsrch_insert(rel, insertstate);
+
+ /*
+ * Insertion is not prepared for the case where an LP_DEAD posting list
+ * tuple must be split. In the unlikely event that this happens, call
+ * _bt_dedup_one_page() to force it to kill all LP_DEAD items.
+ */
+ if (unlikely(insertstate->in_posting_offset == -1))
+ {
+ _bt_dedup_one_page(rel, insertstate->buf, heapRel, 0);
+ Assert(!P_HAS_GARBAGE(lpageop));
+
+ /* Must reset insertstate ahead of new _bt_binsrch_insert() call */
+ insertstate->bounds_valid = false;
+ insertstate->in_posting_offset = 0;
+ location = _bt_binsrch_insert(rel, insertstate);
+
+ /*
+ * Might still have to split some other posting list now, but that
+ * should never be LP_DEAD
+ */
+ Assert(insertstate->in_posting_offset >= 0);
+ }
+
+ return location;
}
/*
@@ -900,15 +942,65 @@ _bt_stepright(Relation rel, BTInsertState insertstate, BTStack stack)
insertstate->bounds_valid = false;
}
+/*
+ * If the new tuple 'itup' is a duplicate with a heap TID that falls inside
+ * the range of an existing posting list tuple 'oposting', generate new
+ * posting tuple to replace original one and update new tuple so that
+ * its heap TID contains the rightmost heap TID of original posting tuple.
+ */
+IndexTuple
+_bt_form_newposting(IndexTuple itup, IndexTuple oposting,
+ OffsetNumber in_posting_offset)
+{
+ int nipd;
+ char *replacepos;
+ char *rightpos;
+ Size nbytes;
+ IndexTuple nposting;
+
+ Assert(BTreeTupleIsPosting(oposting));
+ nipd = BTreeTupleGetNPosting(oposting);
+ Assert(in_posting_offset < nipd);
+
+ nposting = CopyIndexTuple(oposting);
+ replacepos = (char *) BTreeTupleGetPostingN(nposting, in_posting_offset);
+ rightpos = replacepos + sizeof(ItemPointerData);
+ nbytes = (nipd - in_posting_offset - 1) * sizeof(ItemPointerData);
+
+ /*
+ * Move item pointers in posting list to make a gap for the new item's
+ * heap TID (shift TIDs one place to the right, losing original
+ * rightmost TID).
+ */
+ memmove(rightpos, replacepos, nbytes);
+
+ /*
+ * Fill the gap with the TID of the new item.
+ */
+ ItemPointerCopy(&itup->t_tid, (ItemPointer) replacepos);
+
+ /*
+ * Copy original (not new original) posting list's last TID into new
+ * item
+ */
+ ItemPointerCopy(BTreeTupleGetPostingN(oposting, nipd - 1), &itup->t_tid);
+ Assert(ItemPointerCompare(BTreeTupleGetMaxTID(nposting),
+ BTreeTupleGetHeapTID(itup)) < 0);
+
+ return nposting;
+}
+
/*----------
* _bt_insertonpg() -- Insert a tuple on a particular page in the index.
*
* This recursive procedure does the following things:
*
+ * + if necessary, splits an existing posting list on page.
+ * This is only needed when 'in_posting_offset' is non-zero.
* + if necessary, splits the target page, using 'itup_key' for
* suffix truncation on leaf pages (caller passes NULL for
* non-leaf pages).
- * + inserts the tuple.
+ * + inserts the new tuple (could be from split posting list).
* + if the page was split, pops the parent stack, and finds the
* right place to insert the new child pointer (by walking
* right using information stored in the parent stack).
@@ -918,7 +1010,8 @@ _bt_stepright(Relation rel, BTInsertState insertstate, BTStack stack)
*
* On entry, we must have the correct buffer in which to do the
* insertion, and the buffer must be pinned and write-locked. On return,
- * we will have dropped both the pin and the lock on the buffer.
+ * we will have dropped both the pin and the lock on the buffer. Caller
+ * should be prepared for us to scribble on 'itup'.
*
* This routine only performs retail tuple insertions. 'itup' should
* always be either a non-highkey leaf item, or a downlink (new high
@@ -936,11 +1029,15 @@ _bt_insertonpg(Relation rel,
BTStack stack,
IndexTuple itup,
OffsetNumber newitemoff,
+ int in_posting_offset,
bool split_only_page)
{
Page page;
BTPageOpaque lpageop;
Size itemsz;
+ IndexTuple nposting = NULL;
+ IndexTuple oposting;
+ IndexTuple original_itup = NULL;
page = BufferGetPage(buf);
lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -954,6 +1051,8 @@ _bt_insertonpg(Relation rel,
Assert(P_ISLEAF(lpageop) ||
BTreeTupleGetNAtts(itup, rel) <=
IndexRelationGetNumberOfKeyAttributes(rel));
+ /* retail insertions of posting list tuples are disallowed */
+ Assert(!BTreeTupleIsPosting(itup));
/* The caller should've finished any incomplete splits already. */
if (P_INCOMPLETE_SPLIT(lpageop))
@@ -964,6 +1063,47 @@ _bt_insertonpg(Relation rel,
itemsz = MAXALIGN(itemsz); /* be safe, PageAddItem will do this but we
* need to be consistent */
+ /*
+ * Do we need to split an existing posting list item?
+ */
+ if (in_posting_offset != 0)
+ {
+ ItemId itemid = PageGetItemId(page, newitemoff);
+
+ /*
+ * The new tuple is a duplicate with a heap TID that falls inside the
+ * range of an existing posting list tuple, so split posting list.
+ *
+ * Posting list splits always replace some existing TID in the posting
+ * list with the new item's heap TID (based on a posting list offset
+ * from caller) by removing rightmost heap TID from posting list. The
+ * new item's heap TID is swapped with that rightmost heap TID, almost
+ * as if the tuple inserted never overlapped with a posting list in
+ * the first place. This allows the insertion and page split code to
+ * have minimal special case handling of posting lists.
+ *
+ * The only extra handling required is to overwrite the original
+ * posting list with nposting, which is guaranteed to be the same size
+ * as the original, keeping the page space accounting simple. This
+ * takes place in either the page insert or page split critical
+ * section.
+ */
+ Assert(P_ISLEAF(lpageop));
+ Assert(!ItemIdIsDead(itemid));
+ Assert(in_posting_offset > 0);
+ oposting = (IndexTuple) PageGetItem(page, itemid);
+
+ /* save a copy of itup with unchanged TID to write it into xlog record */
+ original_itup = CopyIndexTuple(itup);
+
+ nposting = _bt_form_newposting(itup, oposting, in_posting_offset);
+
+ Assert(BTreeTupleGetNPosting(nposting) == BTreeTupleGetNPosting(oposting));
+
+ /* Alter new item offset, since effective new item changed */
+ newitemoff = OffsetNumberNext(newitemoff);
+ }
+
/*
* Do we need to split the page to fit the item on it?
*
@@ -996,7 +1136,8 @@ _bt_insertonpg(Relation rel,
BlockNumberIsValid(RelationGetTargetBlock(rel))));
/* split the buffer into left and right halves */
- rbuf = _bt_split(rel, itup_key, buf, cbuf, newitemoff, itemsz, itup);
+ rbuf = _bt_split(rel, itup_key, buf, cbuf, newitemoff, itemsz, itup,
+ original_itup, nposting, in_posting_offset);
PredicateLockPageSplit(rel,
BufferGetBlockNumber(buf),
BufferGetBlockNumber(rbuf));
@@ -1075,6 +1216,18 @@ _bt_insertonpg(Relation rel,
elog(PANIC, "failed to add new item to block %u in index \"%s\"",
itup_blkno, RelationGetRelationName(rel));
+ if (nposting)
+ {
+ /*
+ * Handle a posting list split by performing an in-place update of
+ * the existing posting list
+ */
+ Assert(P_ISLEAF(lpageop));
+ Assert(MAXALIGN(IndexTupleSize(oposting)) ==
+ MAXALIGN(IndexTupleSize(nposting)));
+ memcpy(oposting, nposting, MAXALIGN(IndexTupleSize(nposting)));
+ }
+
MarkBufferDirty(buf);
if (BufferIsValid(metabuf))
@@ -1116,6 +1269,7 @@ _bt_insertonpg(Relation rel,
XLogRecPtr recptr;
xlrec.offnum = itup_off;
+ xlrec.in_posting_offset = in_posting_offset;
XLogBeginInsert();
XLogRegisterData((char *) &xlrec, SizeOfBtreeInsert);
@@ -1152,7 +1306,10 @@ _bt_insertonpg(Relation rel,
}
XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
- XLogRegisterBufData(0, (char *) itup, IndexTupleSize(itup));
+ if (original_itup)
+ XLogRegisterBufData(0, (char *) original_itup, IndexTupleSize(original_itup));
+ else
+ XLogRegisterBufData(0, (char *) itup, IndexTupleSize(itup));
recptr = XLogInsert(RM_BTREE_ID, xlinfo);
@@ -1194,6 +1351,13 @@ _bt_insertonpg(Relation rel,
_bt_getrootheight(rel) >= BTREE_FASTPATH_MIN_LEVEL)
RelationSetTargetBlock(rel, cachedBlock);
}
+
+ /* be tidy */
+ if (nposting)
+ pfree(nposting);
+ if (original_itup)
+ pfree(original_itup);
+
}
/*
@@ -1211,10 +1375,17 @@ _bt_insertonpg(Relation rel,
*
* Returns the new right sibling of buf, pinned and write-locked.
* The pin and lock on buf are maintained.
+ *
+ * nposting is a replacement posting for the posting list at the
+ * offset immediately before the new item's offset. This is needed
+ * when caller performed "posting list split", and corresponds to the
+ * same step for retail insertions that don't split the page.
*/
static Buffer
_bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
- OffsetNumber newitemoff, Size newitemsz, IndexTuple newitem)
+ OffsetNumber newitemoff, Size newitemsz, IndexTuple newitem,
+ IndexTuple original_newitem,
+ IndexTuple nposting, OffsetNumber in_posting_offset)
{
Buffer rbuf;
Page origpage;
@@ -1236,12 +1407,20 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
OffsetNumber firstright;
OffsetNumber maxoff;
OffsetNumber i;
+ OffsetNumber replacepostingoff = InvalidOffsetNumber;
bool newitemonleft,
isleaf;
IndexTuple lefthikey;
int indnatts = IndexRelationGetNumberOfAttributes(rel);
int indnkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
+ /*
+ * Determine offset number of posting list that will be updated in place
+ * as part of split that follows a posting list split
+ */
+ if (nposting != NULL)
+ replacepostingoff = OffsetNumberPrev(newitemoff);
+
/*
* origpage is the original page to be split. leftpage is a temporary
* buffer that receives the left-sibling data, which will be copied back
@@ -1273,6 +1452,13 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
* newitemoff == firstright. In all other cases it's clear which side of
* the split every tuple goes on from context. newitemonleft is usually
* (but not always) redundant information.
+ *
+ * Note: In theory, the split point choice logic should operate against a
+ * version of the page that already replaced the posting list at offset
+ * replacepostingoff with nposting where applicable. We don't bother with
+ * that, though. Both versions of the posting list must be the same size
+ * and have the same key values, so this omission can't affect the split
+ * point chosen in practice.
*/
firstright = _bt_findsplitloc(rel, origpage, newitemoff, newitemsz,
newitem, &newitemonleft);
@@ -1340,6 +1526,9 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
itemid = PageGetItemId(origpage, firstright);
itemsz = ItemIdGetLength(itemid);
item = (IndexTuple) PageGetItem(origpage, itemid);
+ /* Behave as if origpage posting list has already been swapped */
+ if (firstright == replacepostingoff)
+ item = nposting;
}
/*
@@ -1373,6 +1562,9 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
Assert(lastleftoff >= P_FIRSTDATAKEY(oopaque));
itemid = PageGetItemId(origpage, lastleftoff);
lastleft = (IndexTuple) PageGetItem(origpage, itemid);
+ /* Behave as if origpage posting list has already been swapped */
+ if (lastleftoff == replacepostingoff)
+ lastleft = nposting;
}
Assert(lastleft != item);
@@ -1480,8 +1672,23 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
itemsz = ItemIdGetLength(itemid);
item = (IndexTuple) PageGetItem(origpage, itemid);
+ /*
+ * did caller pass new replacement posting list tuple due to posting
+ * list split?
+ */
+ if (i == replacepostingoff)
+ {
+ /*
+ * swap origpage posting list with post-posting-list-split version
+ * from caller
+ */
+ Assert(isleaf);
+ Assert(itemsz == MAXALIGN(IndexTupleSize(nposting)));
+ item = nposting;
+ }
+
/* does new item belong before this one? */
- if (i == newitemoff)
+ else if (i == newitemoff)
{
if (newitemonleft)
{
@@ -1653,6 +1860,17 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
xlrec.firstright = firstright;
xlrec.newitemoff = newitemoff;
+ /*
+ * If the replacement posting tuple was put on the right page, we
+ * don't need to WAL-log it explicitly, because it's included with
+ * all the other items on the right page. Otherwise, save
+ * in_posting_offset and the original newitem so that the
+ * replacement tuple can be constructed.
+ */
+ xlrec.in_posting_offset = InvalidOffsetNumber;
+ if (replacepostingoff < firstright)
+ xlrec.in_posting_offset = in_posting_offset;
+
XLogBeginInsert();
XLogRegisterData((char *) &xlrec, SizeOfBtreeSplit);
@@ -1672,9 +1890,23 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
* is not stored if XLogInsert decides it needs a full-page image of
* the left page. We store the offset anyway, though, to support
* archive compression of these records.
+ *
+ * Also save the original newitem when a posting list split was
+ * required, since it is needed to construct the new posting list.
*/
- if (newitemonleft)
- XLogRegisterBufData(0, (char *) newitem, MAXALIGN(newitemsz));
+ if (newitemonleft || xlrec.in_posting_offset)
+ {
+ if (xlrec.in_posting_offset)
+ {
+ Assert(original_newitem != NULL);
+ Assert(ItemPointerCompare(&original_newitem->t_tid, &newitem->t_tid) != 0);
+
+ XLogRegisterBufData(0, (char *) original_newitem,
+ MAXALIGN(IndexTupleSize(original_newitem)));
+ }
+ else
+ XLogRegisterBufData(0, (char *) newitem, MAXALIGN(newitemsz));
+ }
/* Log the left page's new high key */
itemid = PageGetItemId(origpage, P_HIKEY);
@@ -1834,7 +2066,7 @@ _bt_insert_parent(Relation rel,
/* Recursively insert into the parent */
_bt_insertonpg(rel, NULL, pbuf, buf, stack->bts_parent,
- new_item, stack->bts_offset + 1,
+ new_item, stack->bts_offset + 1, 0,
is_only);
/* be tidy */
@@ -2304,6 +2536,277 @@ _bt_vacuum_one_page(Relation rel, Buffer buffer, Relation heapRel)
* Note: if we didn't find any LP_DEAD items, then the page's
* BTP_HAS_GARBAGE hint bit is falsely set. We do not bother expending a
* separate write to clear it, however. We will clear it when we split
- * the page.
+ * the page (or when deduplication runs).
*/
}
+
+/*
+ * Try to deduplicate items to free some space. If we don't proceed with
+ * deduplication, buffer will contain old state of the page.
+ *
+ * 'itemsz' is the size of the inserter caller's incoming/new tuple, not
+ * including line pointer overhead. This is the amount of space we'll need to
+ * free in order to let caller avoid splitting the page.
+ *
+ * This function should be called after LP_DEAD items were removed by
+ * _bt_vacuum_one_page() to prevent a page split. (It's possible that we'll
+ * have to kill additional LP_DEAD items, but that should be rare.)
+ */
+static void
+_bt_dedup_one_page(Relation rel, Buffer buffer, Relation heapRel, Size itemsz)
+{
+ OffsetNumber offnum,
+ minoff,
+ maxoff;
+ Page page = BufferGetPage(buffer);
+ Page newpage;
+ BTPageOpaque oopaque,
+ nopaque;
+ bool deduplicate = false;
+ BTDedupState *dedupState = NULL;
+ int natts = IndexRelationGetNumberOfAttributes(rel);
+ OffsetNumber deletable[MaxOffsetNumber];
+ int ndeletable = 0;
+
+ /*
+ * Don't use deduplication for indexes with INCLUDEd columns or for
+ * unique indexes.
+ */
+ deduplicate = (IndexRelationGetNumberOfKeyAttributes(rel) ==
+ IndexRelationGetNumberOfAttributes(rel) &&
+ !rel->rd_index->indisunique);
+ if (!deduplicate)
+ return;
+
+ oopaque = (BTPageOpaque) PageGetSpecialPointer(page);
+ /* init deduplication state needed to build posting tuples */
+ dedupState = (BTDedupState *) palloc0(sizeof(BTDedupState));
+ dedupState->ipd = NULL;
+ dedupState->ntuples = 0;
+ dedupState->itupprev = NULL;
+ dedupState->maxitemsize = BTMaxItemSize(page);
+ dedupState->maxpostingsize = 0;
+
+ minoff = P_FIRSTDATAKEY(oopaque);
+ maxoff = PageGetMaxOffsetNumber(page);
+
+ /*
+ * Delete dead tuples, if any. We cannot simply skip them in the loop
+ * below, because it's necessary to generate a special WAL record
+ * containing such tuples in order to compute latestRemovedXid on a
+ * standby server later.
+ *
+ * This should not affect performance, since it can only happen in the
+ * rare situation where the BTP_HAS_GARBAGE flag was not set and
+ * _bt_vacuum_one_page was not called, or didn't remove all dead items.
+ */
+ for (offnum = minoff;
+ offnum <= maxoff;
+ offnum = OffsetNumberNext(offnum))
+ {
+ ItemId itemid = PageGetItemId(page, offnum);
+
+ if (ItemIdIsDead(itemid))
+ deletable[ndeletable++] = offnum;
+ }
+
+ if (ndeletable > 0)
+ {
+ /*
+ * Skip deduplication in the rare cases where LP_DEAD items were
+ * encountered here and deleting them frees sufficient space for
+ * caller to avoid a page split
+ */
+ _bt_delitems_delete(rel, buffer, deletable, ndeletable, heapRel);
+ if (PageGetFreeSpace(page) >= itemsz)
+ {
+ pfree(dedupState);
+ return;
+ }
+
+ /* Continue with deduplication */
+ minoff = P_FIRSTDATAKEY(oopaque);
+ maxoff = PageGetMaxOffsetNumber(page);
+ }
+
+ /*
+ * Scan over all items to see which ones can be deduplicated
+ */
+ newpage = PageGetTempPageCopySpecial(page);
+ nopaque = (BTPageOpaque) PageGetSpecialPointer(newpage);
+
+ /* Make sure that new page won't have garbage flag set */
+ nopaque->btpo_flags &= ~BTP_HAS_GARBAGE;
+
+ /* Copy High Key if any */
+ if (!P_RIGHTMOST(oopaque))
+ {
+ ItemId hitemid = PageGetItemId(page, P_HIKEY);
+ Size hitemsz = ItemIdGetLength(hitemid);
+ IndexTuple hitem = (IndexTuple) PageGetItem(page, hitemid);
+
+ if (PageAddItem(newpage, (Item) hitem, hitemsz, P_HIKEY,
+ false, false) == InvalidOffsetNumber)
+ elog(ERROR, "failed to add highkey during deduplication");
+ }
+
+ /*
+ * Iterate over tuples on the page, try to deduplicate them into posting
+ * lists and insert into new page.
+ */
+ for (offnum = minoff;
+ offnum <= maxoff;
+ offnum = OffsetNumberNext(offnum))
+ {
+ ItemId itemid = PageGetItemId(page, offnum);
+ IndexTuple itup = (IndexTuple) PageGetItem(page, itemid);
+
+ Assert(!ItemIdIsDead(itemid));
+
+ if (dedupState->itupprev == NULL)
+ {
+ /* Just set up base/first item in first iteration */
+ Assert(offnum == minoff);
+ dedupState->itupprev = CopyIndexTuple(itup);
+ dedupState->itupprev_off = offnum;
+ continue;
+ }
+
+ if (deduplicate &&
+ _bt_keep_natts_fast(rel, dedupState->itupprev, itup) > natts)
+ {
+ int itup_ntuples;
+ Size projpostingsz;
+
+ /*
+ * Tuples are equal.
+ *
+ * If posting list does not exceed tuple size limit then append
+ * the tuple to the pending posting list. Otherwise, insert it on
+ * page and continue with this tuple as new pending posting list.
+ */
+ itup_ntuples = BTreeTupleIsPosting(itup) ?
+ BTreeTupleGetNPosting(itup) : 1;
+
+ /*
+ * Project size of new posting list that would result from merging
+ * current tup with pending posting list (could just be prev item
+ * that's "pending").
+ *
+ * This accounting looks odd, but it's correct because ...
+ */
+ projpostingsz = MAXALIGN(IndexTupleSize(dedupState->itupprev) +
+ (dedupState->ntuples + itup_ntuples + 1) *
+ sizeof(ItemPointerData));
+
+ if (projpostingsz <= dedupState->maxitemsize)
+ _bt_stash_item_tid(dedupState, itup, offnum);
+ else
+ _bt_dedup_insert(newpage, dedupState);
+ }
+ else
+ {
+ /*
+ * Tuples are not equal, or we're done deduplicating this page.
+ *
+ * Insert pending posting list on page. This could just be a
+ * regular tuple.
+ */
+ _bt_dedup_insert(newpage, dedupState);
+ }
+
+ pfree(dedupState->itupprev);
+ dedupState->itupprev = CopyIndexTuple(itup);
+ dedupState->itupprev_off = offnum;
+
+ Assert(IndexTupleSize(dedupState->itupprev) <= dedupState->maxitemsize);
+ }
+
+ /* Handle the last item */
+ _bt_dedup_insert(newpage, dedupState);
+
+ /*
+ * If no items suitable for deduplication were found, newpage must be
+ * exactly the same as the original page, so just return from function.
+ */
+ if (dedupState->n_intervals == 0)
+ {
+ pfree(dedupState);
+ return;
+ }
+
+ START_CRIT_SECTION();
+
+ PageRestoreTempPage(newpage, page);
+ MarkBufferDirty(buffer);
+
+ /* XLOG stuff: log the deduplicated intervals */
+ if (RelationNeedsWAL(rel))
+ {
+ XLogRecPtr recptr;
+ xl_btree_dedup xlrec_dedup;
+
+ xlrec_dedup.n_intervals = dedupState->n_intervals;
+
+ XLogBeginInsert();
+ XLogRegisterBuffer(0, buffer, REGBUF_STANDARD);
+ XLogRegisterData((char *) &xlrec_dedup, SizeOfBtreeDedup);
+
+ /* only save non-empty part of the array */
+ if (dedupState->n_intervals > 0)
+ XLogRegisterData((char *) dedupState->dedup_intervals,
+ dedupState->n_intervals * sizeof(dedupInterval));
+
+ recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_DEDUP_PAGE);
+
+ PageSetLSN(page, recptr);
+ }
+
+ END_CRIT_SECTION();
+
+ /* be tidy */
+ pfree(dedupState);
+}
+
+/*
+ * Add new posting tuple item to the page based on itupprev and saved list of
+ * heap TIDs.
+ */
+void
+_bt_dedup_insert(Page page, BTDedupState *dedupState)
+{
+ IndexTuple to_insert;
+ OffsetNumber offnum = PageGetMaxOffsetNumber(page);
+
+ if (dedupState->ntuples == 0)
+ {
+ /*
+ * Use original itupprev, which may or may not be a posting list
+ * already from some earlier dedup attempt
+ */
+ to_insert = dedupState->itupprev;
+ }
+ else
+ {
+ IndexTuple postingtuple;
+
+ /* form a tuple with a posting list */
+ postingtuple = BTreeFormPostingTuple(dedupState->itupprev,
+ dedupState->ipd,
+ dedupState->ntuples);
+ to_insert = postingtuple;
+ pfree(dedupState->ipd);
+ }
+
+ Assert(IndexTupleSize(dedupState->itupprev) <= dedupState->maxitemsize);
+ /* Add the new item into the page */
+ offnum = OffsetNumberNext(offnum);
+
+ if (PageAddItem(page, (Item) to_insert, IndexTupleSize(to_insert),
+ offnum, false, false) == InvalidOffsetNumber)
+ elog(ERROR, "deduplication failed to add tuple to page");
+
+ if (dedupState->ntuples > 0)
+ pfree(to_insert);
+ dedupState->ntuples = 0;
+}
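
For readers who want the shape of the deduplication pass without the page
machinery, here is a rough standalone sketch. It is not code from the patch:
ToyItem, ToyPosting and toy_dedup are hypothetical names, keys and TIDs are
plain integers, and the size limit is expressed as a TID count (maxtids,
assumed to be at least 1) rather than as a byte budget like maxitemsize. It
only shows the idea of merging runs of equal keys into a pending posting list
and flushing it when the key changes or the limit is reached.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for index tuples; not PostgreSQL types. */
typedef struct
{
	int			key;		/* index key */
	int			tid;		/* heap TID, simplified to an int */
} ToyItem;

typedef struct
{
	int			key;		/* key shared by every TID in the posting */
	size_t		first;		/* index of first merged item in the input */
	size_t		ntids;		/* number of TIDs merged into this posting */
} ToyPosting;

/*
 * Merge runs of equal keys in the sorted array items[0..nitems) into
 * postings, never letting one posting hold more than maxtids TIDs
 * (maxtids must be >= 1).  Returns the number of postings written to
 * out[], which must have room for nitems entries.
 */
static size_t
toy_dedup(const ToyItem *items, size_t nitems, size_t maxtids,
		  ToyPosting *out)
{
	size_t		nout = 0;

	for (size_t i = 0; i < nitems; i++)
	{
		if (nout > 0 &&
			out[nout - 1].key == items[i].key &&
			out[nout - 1].ntids < maxtids)
		{
			/* Equal key and still room: append to the pending posting */
			out[nout - 1].ntids++;
		}
		else
		{
			/* Different key, or pending posting is full: start a new one */
			out[nout].key = items[i].key;
			out[nout].first = i;
			out[nout].ntids = 1;
			nout++;
		}
	}

	return nout;
}
```

A posting with ntids == 1 corresponds to a tuple that stays a regular tuple,
which mirrors the ntuples == 0 branch of _bt_dedup_insert() above.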
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 268f869a36..5314bbe2a9 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -24,6 +24,7 @@
#include "access/nbtree.h"
#include "access/nbtxlog.h"
+#include "access/tableam.h"
#include "access/transam.h"
#include "access/xlog.h"
#include "access/xloginsert.h"
@@ -42,6 +43,11 @@ static bool _bt_lock_branch_parent(Relation rel, BlockNumber child,
BlockNumber *target, BlockNumber *rightsib);
static void _bt_log_reuse_page(Relation rel, BlockNumber blkno,
TransactionId latestRemovedXid);
+static TransactionId _bt_compute_xid_horizon_for_tuples(Relation rel,
+ Relation heapRel,
+ Buffer buf,
+ OffsetNumber *itemnos,
+ int nitems);
/*
* _bt_initmetapage() -- Fill a page buffer with a correct metapage image
@@ -983,14 +989,52 @@ _bt_page_recyclable(Page page)
void
_bt_delitems_vacuum(Relation rel, Buffer buf,
OffsetNumber *itemnos, int nitems,
+ OffsetNumber *remainingoffset,
+ IndexTuple *remaining, int nremaining,
BlockNumber lastBlockVacuumed)
{
Page page = BufferGetPage(buf);
BTPageOpaque opaque;
+ Size itemsz;
+ Size remaining_sz = 0;
+ char *remaining_buf = NULL;
+
+ /* XLOG stuff: build a buffer holding the remaining tuples */
+ if (nremaining && RelationNeedsWAL(rel))
+ {
+ Size offset = 0;
+
+ for (int i = 0; i < nremaining; i++)
+ remaining_sz += MAXALIGN(IndexTupleSize(remaining[i]));
+
+ remaining_buf = palloc0(remaining_sz);
+ for (int i = 0; i < nremaining; i++)
+ {
+ itemsz = IndexTupleSize(remaining[i]);
+ memcpy(remaining_buf + offset, (char *) remaining[i], itemsz);
+ offset += MAXALIGN(itemsz);
+ }
+ Assert(offset == remaining_sz);
+ }
/* No ereport(ERROR) until changes are logged */
START_CRIT_SECTION();
+ /* Handle posting tuples here */
+ for (int i = 0; i < nremaining; i++)
+ {
+ /* First, delete the old tuple. */
+ PageIndexTupleDelete(page, remainingoffset[i]);
+
+ itemsz = IndexTupleSize(remaining[i]);
+ itemsz = MAXALIGN(itemsz);
+
+ /* Add tuple with remaining ItemPointers to the page. */
+ if (PageAddItem(page, (Item) remaining[i], itemsz, remainingoffset[i],
+ false, false) == InvalidOffsetNumber)
+ elog(ERROR, "failed to rewrite posting list item in index while doing vacuum");
+ }
+
/* Fix the page */
if (nitems > 0)
PageIndexMultiDelete(page, itemnos, nitems);
@@ -1020,6 +1064,8 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
xl_btree_vacuum xlrec_vacuum;
xlrec_vacuum.lastBlockVacuumed = lastBlockVacuumed;
+ xlrec_vacuum.nremaining = nremaining;
+ xlrec_vacuum.ndeleted = nitems;
XLogBeginInsert();
XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
@@ -1033,6 +1079,19 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
if (nitems > 0)
XLogRegisterBufData(0, (char *) itemnos, nitems * sizeof(OffsetNumber));
+ /*
+ * Here we save the offset numbers and the remaining tuples
+ * themselves. It's important to restore them in the correct order:
+ * first the remaining tuples, and only then the other deleted items.
+ */
+ if (nremaining > 0)
+ {
+ Assert(remaining_buf != NULL);
+ XLogRegisterBufData(0, (char *) remainingoffset,
+ nremaining * sizeof(OffsetNumber));
+ XLogRegisterBufData(0, remaining_buf, remaining_sz);
+ }
+
recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_VACUUM);
PageSetLSN(page, recptr);
@@ -1041,6 +1100,91 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
END_CRIT_SECTION();
}
+/*
+ * Get the latestRemovedXid from the table entries pointed at by the index
+ * tuples being deleted.
+ *
+ * This is a version of index_compute_xid_horizon_for_tuples() specialized to
+ * nbtree, which can handle posting lists.
+ */
+static TransactionId
+_bt_compute_xid_horizon_for_tuples(Relation rel, Relation heapRel,
+ Buffer buf, OffsetNumber *itemnos,
+ int nitems)
+{
+ ItemPointerData *ttids;
+ TransactionId latestRemovedXid = InvalidTransactionId;
+ Page page = BufferGetPage(buf);
+ int arraynitems;
+ int finalnitems;
+
+ /*
+ * Initial size of array can fit everything when it turns out that there
+ * are no posting lists
+ */
+ arraynitems = nitems;
+ ttids = (ItemPointerData *) palloc(sizeof(ItemPointerData) * arraynitems);
+
+ finalnitems = 0;
+ /* identify what the index tuples about to be deleted point to */
+ for (int i = 0; i < nitems; i++)
+ {
+ ItemId itemid;
+ IndexTuple itup;
+
+ itemid = PageGetItemId(page, itemnos[i]);
+ itup = (IndexTuple) PageGetItem(page, itemid);
+
+ Assert(ItemIdIsDead(itemid));
+
+ if (!BTreeTupleIsPosting(itup))
+ {
+ /* Make sure that we have space for additional heap TID */
+ if (finalnitems + 1 > arraynitems)
+ {
+ arraynitems = arraynitems * 2;
+ ttids = (ItemPointerData *)
+ repalloc(ttids, sizeof(ItemPointerData) * arraynitems);
+ }
+
+ Assert(ItemPointerIsValid(&itup->t_tid));
+ ItemPointerCopy(&itup->t_tid, &ttids[finalnitems]);
+ finalnitems++;
+ }
+ else
+ {
+ int nposting = BTreeTupleGetNPosting(itup);
+
+ /* Make sure that we have space for additional heap TIDs */
+ if (finalnitems + nposting > arraynitems)
+ {
+ arraynitems = Max(arraynitems * 2, finalnitems + nposting);
+ ttids = (ItemPointerData *)
+ repalloc(ttids, sizeof(ItemPointerData) * arraynitems);
+ }
+
+ for (int j = 0; j < nposting; j++)
+ {
+ ItemPointer htid = BTreeTupleGetPostingN(itup, j);
+
+ Assert(ItemPointerIsValid(htid));
+ ItemPointerCopy(htid, &ttids[finalnitems]);
+ finalnitems++;
+ }
+ }
+ }
+
+ Assert(finalnitems >= nitems);
+
+ /* determine the actual xid horizon */
+ latestRemovedXid =
+ table_compute_xid_horizon_for_tuples(heapRel, ttids, finalnitems);
+
+ pfree(ttids);
+
+ return latestRemovedXid;
+}
+
/*
* Delete item(s) from a btree page during single-page cleanup.
*
@@ -1067,8 +1211,8 @@ _bt_delitems_delete(Relation rel, Buffer buf,
if (XLogStandbyInfoActive() && RelationNeedsWAL(rel))
latestRemovedXid =
- index_compute_xid_horizon_for_tuples(rel, heapRel, buf,
- itemnos, nitems);
+ _bt_compute_xid_horizon_for_tuples(rel, heapRel, buf,
+ itemnos, nitems);
/* No ereport(ERROR) until changes are logged */
START_CRIT_SECTION();
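
The array growth in _bt_compute_xid_horizon_for_tuples() may be easier to
follow in isolation. Below is a minimal standalone sketch, assuming TIDs are
plain integers and using malloc/realloc in place of palloc/repalloc. ToyTuple
and toy_collect_tids are hypothetical names and are not part of the patch. The
point is only the growth rule: start with one slot per tuple, and when a
posting list does not fit, grow to the larger of twice the capacity and the
size actually needed.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical tuple model: a plain tuple simply has ntids == 1. */
typedef struct
{
	const int  *tids;		/* heap TIDs, simplified to ints */
	size_t		ntids;
} ToyTuple;

/*
 * Flatten the TIDs of all tuples into one malloc'd array.  Sets *nout to
 * the number of TIDs collected.  Returns NULL on allocation failure.
 */
static int *
toy_collect_tids(const ToyTuple *tuples, size_t ntuples, size_t *nout)
{
	size_t		cap = ntuples > 0 ? ntuples : 1;
	size_t		n = 0;
	int		   *result = malloc(cap * sizeof(int));

	if (result == NULL)
		return NULL;

	for (size_t i = 0; i < ntuples; i++)
	{
		size_t		need = n + tuples[i].ntids;

		/* Make sure that we have space for the additional TIDs */
		if (need > cap)
		{
			size_t		newcap = (cap * 2 > need) ? cap * 2 : need;
			int		   *tmp = realloc(result, newcap * sizeof(int));

			if (tmp == NULL)
			{
				free(result);
				return NULL;
			}
			result = tmp;
			cap = newcap;
		}

		for (size_t j = 0; j < tuples[i].ntids; j++)
			result[n++] = tuples[i].tids[j];
	}

	*nout = n;
	return result;
}
```

The flattened array plays the role of ttids, which the patch hands to
table_compute_xid_horizon_for_tuples().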
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 4cfd5289ad..67595319d7 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -97,6 +97,8 @@ static void btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
BTCycleId cycleid, TransactionId *oldestBtpoXact);
static void btvacuumpage(BTVacState *vstate, BlockNumber blkno,
BlockNumber orig_blkno);
+static ItemPointer btreevacuumPosting(BTVacState *vstate, IndexTuple itup,
+ int *nremaining);
/*
@@ -263,8 +265,8 @@ btgettuple(IndexScanDesc scan, ScanDirection dir)
*/
if (so->killedItems == NULL)
so->killedItems = (int *)
- palloc(MaxIndexTuplesPerPage * sizeof(int));
- if (so->numKilled < MaxIndexTuplesPerPage)
+ palloc(MaxPostingIndexTuplesPerPage * sizeof(int));
+ if (so->numKilled < MaxPostingIndexTuplesPerPage)
so->killedItems[so->numKilled++] = so->currPos.itemIndex;
}
@@ -1069,7 +1071,8 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
RBM_NORMAL, info->strategy);
LockBufferForCleanup(buf);
_bt_checkpage(rel, buf);
- _bt_delitems_vacuum(rel, buf, NULL, 0, vstate.lastBlockVacuumed);
+ _bt_delitems_vacuum(rel, buf, NULL, 0, NULL, NULL, 0,
+ vstate.lastBlockVacuumed);
_bt_relbuf(rel, buf);
}
@@ -1193,6 +1196,9 @@ restart:
OffsetNumber offnum,
minoff,
maxoff;
+ IndexTuple remaining[MaxOffsetNumber];
+ OffsetNumber remainingoffset[MaxOffsetNumber];
+ int nremaining;
/*
* Trade in the initial read lock for a super-exclusive write lock on
@@ -1229,6 +1235,7 @@ restart:
* callback function.
*/
ndeletable = 0;
+ nremaining = 0;
minoff = P_FIRSTDATAKEY(opaque);
maxoff = PageGetMaxOffsetNumber(page);
if (callback)
@@ -1242,31 +1249,79 @@ restart:
itup = (IndexTuple) PageGetItem(page,
PageGetItemId(page, offnum));
- htup = &(itup->t_tid);
- /*
- * During Hot Standby we currently assume that
- * XLOG_BTREE_VACUUM records do not produce conflicts. That is
- * only true as long as the callback function depends only
- * upon whether the index tuple refers to heap tuples removed
- * in the initial heap scan. When vacuum starts it derives a
- * value of OldestXmin. Backends taking later snapshots could
- * have a RecentGlobalXmin with a later xid than the vacuum's
- * OldestXmin, so it is possible that row versions deleted
- * after OldestXmin could be marked as killed by other
- * backends. The callback function *could* look at the index
- * tuple state in isolation and decide to delete the index
- * tuple, though currently it does not. If it ever did, we
- * would need to reconsider whether XLOG_BTREE_VACUUM records
- * should cause conflicts. If they did cause conflicts they
- * would be fairly harsh conflicts, since we haven't yet
- * worked out a way to pass a useful value for
- * latestRemovedXid on the XLOG_BTREE_VACUUM records. This
- * applies to *any* type of index that marks index tuples as
- * killed.
- */
- if (callback(htup, callback_state))
- deletable[ndeletable++] = offnum;
+ if (BTreeTupleIsPosting(itup))
+ {
+ int nnewipd = 0;
+ ItemPointer newipd = NULL;
+
+ newipd = btreevacuumPosting(vstate, itup, &nnewipd);
+
+ if (nnewipd == 0)
+ {
+ /*
+ * All TIDs from posting list must be deleted, so we
+ * can delete the whole tuple in the regular way.
+ */
+ deletable[ndeletable++] = offnum;
+ }
+ else if (nnewipd == BTreeTupleGetNPosting(itup))
+ {
+ /*
+ * All TIDs from posting tuple must remain. Do
+ * nothing, just clean up.
+ */
+ pfree(newipd);
+ }
+ else if (nnewipd < BTreeTupleGetNPosting(itup))
+ {
+ /* Some TIDs from posting tuple must remain. */
+ Assert(nnewipd > 0);
+ Assert(newipd != NULL);
+
+ /*
+ * Form new tuple that contains only remaining TIDs.
+ * Remember this tuple and the offset of the old tuple
+ * to update it in place.
+ */
+ remainingoffset[nremaining] = offnum;
+ remaining[nremaining] =
+ BTreeFormPostingTuple(itup, newipd, nnewipd);
+ nremaining++;
+ pfree(newipd);
+
+ Assert(IndexTupleSize(itup) <= BTMaxItemSize(page));
+ }
+ }
+ else
+ {
+ htup = &(itup->t_tid);
+
+ /*
+ * During Hot Standby we currently assume that
+ * XLOG_BTREE_VACUUM records do not produce conflicts.
+ * That is only true as long as the callback function
+ * depends only upon whether the index tuple refers to
+ * heap tuples removed in the initial heap scan. When
+ * vacuum starts it derives a value of OldestXmin.
+ * Backends taking later snapshots could have a
+ * RecentGlobalXmin with a later xid than the vacuum's
+ * OldestXmin, so it is possible that row versions deleted
+ * after OldestXmin could be marked as killed by other
+ * backends. The callback function *could* look at the
+ * index tuple state in isolation and decide to delete the
+ * index tuple, though currently it does not. If it ever
+ * did, we would need to reconsider whether
+ * XLOG_BTREE_VACUUM records should cause conflicts. If
+ * they did cause conflicts they would be fairly harsh
+ * conflicts, since we haven't yet worked out a way to
+ * pass a useful value for latestRemovedXid on the
+ * XLOG_BTREE_VACUUM records. This applies to *any* type
+ * of index that marks index tuples as killed.
+ */
+ if (callback(htup, callback_state))
+ deletable[ndeletable++] = offnum;
+ }
}
}
@@ -1274,7 +1329,7 @@ restart:
* Apply any needed deletes. We issue just one _bt_delitems_vacuum()
* call per page, so as to minimize WAL traffic.
*/
- if (ndeletable > 0)
+ if (ndeletable > 0 || nremaining > 0)
{
/*
* Notice that the issued XLOG_BTREE_VACUUM WAL record includes
@@ -1291,6 +1346,7 @@ restart:
* that.
*/
_bt_delitems_vacuum(rel, buf, deletable, ndeletable,
+ remainingoffset, remaining, nremaining,
vstate->lastBlockVacuumed);
/*
@@ -1375,6 +1431,41 @@ restart:
}
}
+/*
+ * btreevacuumPosting() -- vacuums a posting tuple.
+ *
+ * Returns new palloc'd posting list with remaining items.
+ * Posting list size is returned via nremaining.
+ *
+ * If all items are dead,
+ * nremaining is 0 and resulting posting list is NULL.
+ */
+static ItemPointer
+btreevacuumPosting(BTVacState *vstate, IndexTuple itup, int *nremaining)
+{
+ int remaining = 0;
+ int nitem = BTreeTupleGetNPosting(itup);
+ ItemPointer tmpitems = NULL,
+ items = BTreeTupleGetPosting(itup);
+
+ /*
+ * Check each TID in the posting list, saving live TIDs into tmpitems
+ */
+ for (int i = 0; i < nitem; i++)
+ {
+ if (vstate->callback(items + i, vstate->callback_state))
+ continue;
+
+ if (tmpitems == NULL)
+ tmpitems = palloc(sizeof(ItemPointerData) * nitem);
+
+ tmpitems[remaining++] = items[i];
+ }
+
+ *nremaining = remaining;
+ return tmpitems;
+}
+
/*
* btcanreturn() -- Check whether btree indexes support index-only scans.
*
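
As an illustration of the contract of btreevacuumPosting(), here is a small
standalone sketch. It is not code from the patch: TIDs are plain integers, the
callback type ToyDeadCallback and the sample callback toy_tid_is_odd are
hypothetical, and malloc stands in for palloc. It keeps the same convention as
above: the result array is allocated lazily, so when every TID is dead the
function returns NULL with *nremaining set to 0.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical callback: returns true when the TID should be removed. */
typedef bool (*ToyDeadCallback) (int tid, void *state);

/*
 * Return a malloc'd array holding the TIDs that survive the callback, and
 * store their count in *nremaining.  The array is only allocated once the
 * first live TID is seen, so the result is NULL when all TIDs are dead.
 */
static int *
toy_vacuum_posting(const int *tids, int ntids,
				   ToyDeadCallback is_dead, void *state,
				   int *nremaining)
{
	int		   *kept = NULL;
	int			n = 0;

	for (int i = 0; i < ntids; i++)
	{
		if (is_dead(tids[i], state))
			continue;

		if (kept == NULL)
		{
			kept = malloc(sizeof(int) * (size_t) ntids);
			if (kept == NULL)
				abort();	/* keep the sketch simple */
		}

		kept[n++] = tids[i];
	}

	*nremaining = n;
	return kept;
}

/* Sample callback for experiments: treats odd TIDs as dead. */
static bool
toy_tid_is_odd(int tid, void *state)
{
	(void) state;
	return (tid % 2) != 0;
}
```

The caller then has three cases, as in btvacuumpage(): zero remaining TIDs
means the whole tuple can be deleted, an unchanged count means nothing needs
to be done, and anything in between means the tuple is rewritten with only the
remaining TIDs.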
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 8e512461a0..c78c8e67b5 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -26,10 +26,18 @@
static void _bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp);
static OffsetNumber _bt_binsrch(Relation rel, BTScanInsert key, Buffer buf);
+static int _bt_binsrch_posting(BTScanInsert key, Page page,
+ OffsetNumber offnum);
static bool _bt_readpage(IndexScanDesc scan, ScanDirection dir,
OffsetNumber offnum);
static void _bt_saveitem(BTScanOpaque so, int itemIndex,
OffsetNumber offnum, IndexTuple itup);
+static void _bt_setuppostingitems(BTScanOpaque so, int itemIndex,
+ OffsetNumber offnum, ItemPointer iptr,
+ IndexTuple itup);
+static inline void _bt_savepostingitem(BTScanOpaque so, int itemIndex,
+ OffsetNumber offnum, ItemPointer iptr,
+ IndexTuple itup);
static bool _bt_steppage(IndexScanDesc scan, ScanDirection dir);
static bool _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir);
static bool _bt_parallel_readpage(IndexScanDesc scan, BlockNumber blkno,
@@ -434,7 +442,10 @@ _bt_binsrch(Relation rel,
* low) makes bounds invalid.
*
* Caller is responsible for invalidating bounds when it modifies the page
- * before calling here a second time.
+ * before calling here a second time, and for dealing with posting list
+ * tuple matches (callers can use insertstate's in_posting_offset field to
+ * determine which existing heap TID will need to be replaced by their
+ * scantid/new heap TID).
*/
OffsetNumber
_bt_binsrch_insert(Relation rel, BTInsertState insertstate)
@@ -453,6 +464,7 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
Assert(P_ISLEAF(opaque));
Assert(!key->nextkey);
+ Assert(insertstate->in_posting_offset == 0);
if (!insertstate->bounds_valid)
{
@@ -509,6 +521,17 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
if (result != 0)
stricthigh = high;
}
+
+ /*
+ * If tuple at offset located by binary search is a posting list whose
+ * TID range overlaps with caller's scantid, perform posting list
+ * binary search to set in_posting_offset for caller. Caller must
+ * split the posting list when in_posting_offset is set. This should
+ * happen infrequently.
+ */
+ if (unlikely(result == 0 && key->scantid != NULL))
+ insertstate->in_posting_offset =
+ _bt_binsrch_posting(key, page, mid);
}
/*
@@ -528,6 +551,68 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
return low;
}
+/*----------
+ * _bt_binsrch_posting() -- posting list binary search.
+ *
+ * Returns offset into posting list where caller's scantid belongs.
+ *----------
+ */
+static int
+_bt_binsrch_posting(BTScanInsert key, Page page, OffsetNumber offnum)
+{
+ IndexTuple itup;
+ ItemId itemid;
+ int low,
+ high,
+ mid,
+ res;
+
+ /*
+ * If this isn't a posting tuple, then the index must be corrupt (if it is
+ * an ordinary non-pivot tuple then there must be an existing tuple with a
+ * heap TID that equals inserter's new heap TID/scantid). Defensively
+ * check that tuple is a posting list tuple whose posting list range
+ * includes caller's scantid.
+ *
+ * (This is also needed because contrib/amcheck's rootdescend option needs
+ * to be able to relocate a non-pivot tuple using _bt_binsrch_insert().)
+ */
+ Assert(P_ISLEAF((BTPageOpaque) PageGetSpecialPointer(page)));
+ Assert(!key->nextkey);
+ Assert(key->scantid != NULL);
+ itemid = PageGetItemId(page, offnum);
+ itup = (IndexTuple) PageGetItem(page, itemid);
+ if (!BTreeTupleIsPosting(itup))
+ return 0;
+
+ /*
+ * In the unlikely event that posting list tuple has LP_DEAD bit set,
+ * signal to caller that it should kill the item and restart its binary
+ * search.
+ */
+ if (ItemIdIsDead(itemid))
+ return -1;
+
+ /* "high" is past end of posting list for loop invariant */
+ low = 0;
+ high = BTreeTupleGetNPosting(itup);
+ Assert(high >= 2);
+
+ while (high > low)
+ {
+ mid = low + ((high - low) / 2);
+ res = ItemPointerCompare(key->scantid,
+ BTreeTupleGetPostingN(itup, mid));
+
+ if (res >= 1)
+ low = mid + 1;
+ else
+ high = mid;
+ }
+
+ return low;
+}
+
/*----------
* _bt_compare() -- Compare insertion-type scankey to tuple on a page.
*
@@ -537,9 +622,18 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
* <0 if scankey < tuple at offnum;
* 0 if scankey == tuple at offnum;
* >0 if scankey > tuple at offnum.
- * NULLs in the keys are treated as sortable values. Therefore
- * "equality" does not necessarily mean that the item should be
- * returned to the caller as a matching key!
+ *
+ * NULLs in the keys are treated as sortable values. Therefore
+ * "equality" does not necessarily mean that the item should be returned
+ * to the caller as a matching key. Similarly, an insertion scankey
+ * with its scantid set is treated as equal to a posting tuple whose TID
+ * range overlaps with its scantid. There generally won't be a
+ * matching TID in the posting tuple, which caller must handle
+ * itself (e.g., by splitting the posting list tuple).
+ *
+ * It is generally guaranteed that any possible scankey with scantid set
+ * will have zero or one tuples in the index that are considered equal
+ * here.
*
* CRUCIAL NOTE: on a non-leaf page, the first data key is assumed to be
* "minus infinity": this routine will always claim it is less than the
@@ -563,6 +657,7 @@ _bt_compare(Relation rel,
ScanKey scankey;
int ncmpkey;
int ntupatts;
+ int32 result;
Assert(_bt_check_natts(rel, key->heapkeyspace, page, offnum));
Assert(key->keysz <= IndexRelationGetNumberOfKeyAttributes(rel));
@@ -597,7 +692,6 @@ _bt_compare(Relation rel,
{
Datum datum;
bool isNull;
- int32 result;
datum = index_getattr(itup, scankey->sk_attno, itupdesc, &isNull);
@@ -713,8 +807,24 @@ _bt_compare(Relation rel,
if (heapTid == NULL)
return 1;
+ /*
+ * scankey must be treated as equal to a posting list tuple if its scantid
+ * value falls within the range of the posting list. In all other cases
+ * there can only be a single heap TID value, which is compared directly
+ * as a simple scalar value.
+ */
Assert(ntupatts >= IndexRelationGetNumberOfKeyAttributes(rel));
- return ItemPointerCompare(key->scantid, heapTid);
+ result = ItemPointerCompare(key->scantid, heapTid);
+ if (!BTreeTupleIsPosting(itup) || result <= 0)
+ return result;
+ else
+ {
+ result = ItemPointerCompare(key->scantid, BTreeTupleGetMaxTID(itup));
+ if (result > 0)
+ return 1;
+ }
+
+ return 0;
}
/*
@@ -1451,6 +1561,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
/* initialize tuple workspace to empty */
so->currPos.nextTupleOffset = 0;
+ so->currPos.postingTupleOffset = 0;
/*
* Now that the current page has been made consistent, the macro should be
@@ -1485,8 +1596,30 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
if (_bt_checkkeys(scan, itup, indnatts, dir, &continuescan))
{
/* tuple passes all scan key conditions, so remember it */
- _bt_saveitem(so, itemIndex, offnum, itup);
- itemIndex++;
+ if (!BTreeTupleIsPosting(itup))
+ {
+ _bt_saveitem(so, itemIndex, offnum, itup);
+ itemIndex++;
+ }
+ else
+ {
+ /*
+ * Set up state to return posting list, and save first
+ * "logical" tuple
+ */
+ _bt_setuppostingitems(so, itemIndex, offnum,
+ BTreeTupleGetPostingN(itup, 0),
+ itup);
+ itemIndex++;
+ /* Save additional posting list "logical" tuples */
+ for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
+ {
+ _bt_savepostingitem(so, itemIndex, offnum,
+ BTreeTupleGetPostingN(itup, i),
+ itup);
+ itemIndex++;
+ }
+ }
}
/* When !continuescan, there can't be any more matches, so stop */
if (!continuescan)
@@ -1519,7 +1652,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
if (!continuescan)
so->currPos.moreRight = false;
- Assert(itemIndex <= MaxIndexTuplesPerPage);
+ Assert(itemIndex <= MaxPostingIndexTuplesPerPage);
so->currPos.firstItem = 0;
so->currPos.lastItem = itemIndex - 1;
so->currPos.itemIndex = 0;
@@ -1527,7 +1660,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
else
{
/* load items[] in descending order */
- itemIndex = MaxIndexTuplesPerPage;
+ itemIndex = MaxPostingIndexTuplesPerPage;
offnum = Min(offnum, maxoff);
@@ -1569,8 +1702,37 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
if (passes_quals && tuple_alive)
{
/* tuple passes all scan key conditions, so remember it */
- itemIndex--;
- _bt_saveitem(so, itemIndex, offnum, itup);
+ if (!BTreeTupleIsPosting(itup))
+ {
+ itemIndex--;
+ _bt_saveitem(so, itemIndex, offnum, itup);
+ }
+ else
+ {
+ int i = BTreeTupleGetNPosting(itup) - 1;
+
+ /*
+ * Set up state to return posting list, and save last
+ * "logical" tuple from posting list (since it's the first
+ * that will be returned to scan).
+ */
+ itemIndex--;
+ _bt_setuppostingitems(so, itemIndex, offnum,
+ BTreeTupleGetPostingN(itup, i--),
+ itup);
+
+ /*
+ * Return posting list "logical" tuples -- do this in
+ * descending order, to match overall scan order
+ */
+ for (; i >= 0; i--)
+ {
+ itemIndex--;
+ _bt_savepostingitem(so, itemIndex, offnum,
+ BTreeTupleGetPostingN(itup, i),
+ itup);
+ }
+ }
}
if (!continuescan)
{
@@ -1584,8 +1746,8 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
Assert(itemIndex >= 0);
so->currPos.firstItem = itemIndex;
- so->currPos.lastItem = MaxIndexTuplesPerPage - 1;
- so->currPos.itemIndex = MaxIndexTuplesPerPage - 1;
+ so->currPos.lastItem = MaxPostingIndexTuplesPerPage - 1;
+ so->currPos.itemIndex = MaxPostingIndexTuplesPerPage - 1;
}
return (so->currPos.firstItem <= so->currPos.lastItem);
@@ -1598,6 +1760,8 @@ _bt_saveitem(BTScanOpaque so, int itemIndex,
{
BTScanPosItem *currItem = &so->currPos.items[itemIndex];
+ Assert(!BTreeTupleIsPosting(itup));
+
currItem->heapTid = itup->t_tid;
currItem->indexOffset = offnum;
if (so->currTuples)
@@ -1610,6 +1774,61 @@ _bt_saveitem(BTScanOpaque so, int itemIndex,
}
}
+/*
+ * Set up state to save posting items from a single posting list tuple.
+ *
+ * In passing, saves an index item into so->currPos.items[itemIndex] for the
+ * logical tuple that is returned to the scan first. The second and subsequent
+ * heap TIDs from the posting list should be saved by calling
+ * _bt_savepostingitem().
+ */
+static void
+_bt_setuppostingitems(BTScanOpaque so, int itemIndex, OffsetNumber offnum,
+ ItemPointer iptr, IndexTuple itup)
+{
+ BTScanPosItem *currItem = &so->currPos.items[itemIndex];
+
+ currItem->heapTid = *iptr;
+ currItem->indexOffset = offnum;
+
+ if (so->currTuples)
+ {
+ /* Save a truncated version of the IndexTuple */
+ Size itupsz = BTreeTupleGetPostingOffset(itup);
+
+ itupsz = MAXALIGN(itupsz);
+ currItem->tupleOffset = so->currPos.nextTupleOffset;
+ memcpy(so->currTuples + so->currPos.nextTupleOffset, itup, itupsz);
+ so->currPos.nextTupleOffset += itupsz;
+ so->currPos.postingTupleOffset = currItem->tupleOffset;
+ }
+}
+
+/*
+ * Save an index item into so->currPos.items[itemIndex] for posting tuple.
+ *
+ * Assumes that _bt_setuppostingitems() has already been called for current
+ * posting list tuple.
+ */
+static inline void
+_bt_savepostingitem(BTScanOpaque so, int itemIndex, OffsetNumber offnum,
+ ItemPointer iptr, IndexTuple itup)
+{
+ BTScanPosItem *currItem = &so->currPos.items[itemIndex];
+
+ currItem->heapTid = *iptr;
+ currItem->indexOffset = offnum;
+
+ if (so->currTuples)
+ {
+ /*
+ * Have index-only scans return the same truncated IndexTuple for
+ * every logical tuple that originates from the same posting list
+ */
+ currItem->tupleOffset = so->currPos.postingTupleOffset;
+ }
+}
+
/*
* _bt_steppage() -- Step to next page containing valid data for scan
*
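To make the _bt_compare() change in this file easier to review: a posting list tuple is treated as a heap TID range, and scantid compares as equal when it falls anywhere inside [lowest TID, highest TID]. Below is a minimal standalone sketch of that rule. ToyTid, toy_tid_compare() and toy_compare_posting() are hypothetical names made up for illustration, they are not part of the patch, and a one-element list stands in for a plain non-posting tuple.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for ItemPointerData: (block, offset) pair */
typedef struct ToyTid
{
	uint32_t	block;
	uint16_t	offset;
} ToyTid;

/* Three-way comparison: block number first, then offset number */
int
toy_tid_compare(const ToyTid *a, const ToyTid *b)
{
	if (a->block != b->block)
		return (a->block < b->block) ? -1 : 1;
	if (a->offset != b->offset)
		return (a->offset < b->offset) ? -1 : 1;
	return 0;
}

/*
 * Compare scantid against a sorted TID list tids[0 .. ntids - 1].
 *
 * Returns <0 when scantid is below the lowest TID, >0 when it is above the
 * highest TID, and 0 when it falls within the range (both ends inclusive).
 */
int
toy_compare_posting(const ToyTid *scantid, const ToyTid *tids, int ntids)
{
	int			result;

	assert(ntids > 0);

	result = toy_tid_compare(scantid, &tids[0]);
	if (ntids == 1 || result <= 0)
		return result;

	/* scantid is above the lowest TID, so check it against the highest */
	if (toy_tid_compare(scantid, &tids[ntids - 1]) > 0)
		return 1;

	return 0;
}
```

Note that a result of 0 here only means "somewhere in the range", so a caller that needs an exact TID still has to look inside the posting list.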
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index ab19692006..4198770303 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -288,6 +288,8 @@ static void _bt_sortaddtup(Page page, Size itemsize,
static void _bt_buildadd(BTWriteState *wstate, BTPageState *state,
IndexTuple itup);
static void _bt_uppershutdown(BTWriteState *wstate, BTPageState *state);
+static void _bt_buildadd_posting(BTWriteState *wstate, BTPageState *state,
+ BTDedupState *dedupState);
static void _bt_load(BTWriteState *wstate,
BTSpool *btspool, BTSpool *btspool2);
static void _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent,
@@ -830,6 +832,8 @@ _bt_sortaddtup(Page page,
* the high key is to be truncated, offset 1 is deleted, and we insert
* the truncated high key at offset 1.
*
+ * Note that itup may be a posting list tuple.
+ *
* 'last' pointer indicates the last offset added to the page.
*----------
*/
@@ -963,6 +967,11 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
* Overwrite the old item with new truncated high key directly.
* oitup is already located at the physical beginning of tuple
* space, so this should directly reuse the existing tuple space.
+ *
+ * If the lastleft tuple was a posting tuple, its posting list is
+ * truncated away in _bt_truncate as well. Note that this applies
+ * only to leaf pages, since internal pages never contain posting
+ * tuples.
*/
ii = PageGetItemId(opage, OffsetNumberPrev(last_off));
lastleft = (IndexTuple) PageGetItem(opage, ii);
@@ -1002,6 +1011,7 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
* the minimum key for the new page.
*/
state->btps_minkey = CopyIndexTuple(oitup);
+ Assert(BTreeTupleIsPivot(state->btps_minkey));
/*
* Set the sibling links for both pages.
@@ -1043,6 +1053,7 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
Assert(state->btps_minkey == NULL);
state->btps_minkey = CopyIndexTuple(itup);
/* _bt_sortaddtup() will perform full truncation later */
+ BTreeTupleClearBtIsPosting(state->btps_minkey);
BTreeTupleSetNAtts(state->btps_minkey, 0);
}
@@ -1127,6 +1138,136 @@ _bt_uppershutdown(BTWriteState *wstate, BTPageState *state)
_bt_blwritepage(wstate, metapage, BTREE_METAPAGE);
}
+/*
+ * Add new tuple (posting or non-posting) to the page while building index.
+ */
+static void
+_bt_buildadd_posting(BTWriteState *wstate, BTPageState *state,
+ BTDedupState *dedupState)
+{
+ IndexTuple to_insert;
+
+ /* Return if there is no tuple to insert */
+ if (state == NULL)
+ return;
+
+ if (dedupState->ntuples == 0)
+ to_insert = dedupState->itupprev;
+ else
+ {
+ IndexTuple postingtuple;
+
+ /* form a tuple with a posting list */
+ postingtuple = BTreeFormPostingTuple(dedupState->itupprev,
+ dedupState->ipd,
+ dedupState->ntuples);
+ to_insert = postingtuple;
+ pfree(dedupState->ipd);
+ }
+
+ _bt_buildadd(wstate, state, to_insert);
+
+ if (dedupState->ntuples > 0)
+ pfree(to_insert);
+ dedupState->ntuples = 0;
+}
+
+/*
+ * Save item pointer(s) of itup to the posting list in dedupState.
+ *
+ * 'itup' is current tuple on page, which comes immediately after equal
+ * 'itupprev' tuple stashed in dedup state at the point we're called.
+ *
+ * Helper function for _bt_load() and _bt_dedup_one_page(), called when it
+ * becomes clear that pending itupprev item will be part of a new/pending
+ * posting list, or when a pending/new posting list will contain a new heap
+ * TID from itup.
+ *
+ * Note: caller is responsible for the BTMaxItemSize() check.
+ */
+void
+_bt_stash_item_tid(BTDedupState *dedupState, IndexTuple itup,
+ OffsetNumber itup_offnum)
+{
+ int nposting = 0;
+
+ if (dedupState->ntuples == 0)
+ {
+ dedupState->ipd = palloc0(dedupState->maxitemsize);
+
+ /*
+ * itupprev hasn't had its posting list TIDs copied into ipd yet (must
+ * have been first on page and/or in new posting list?). Do so now.
+ *
+ * This is delayed because it wasn't initially clear whether or not
+ * itupprev would be merged with the next tuple, or stay as-is. By
+ * now caller compared it against itup and found that it was equal, so
+ * we can go ahead and add its TIDs.
+ */
+ if (!BTreeTupleIsPosting(dedupState->itupprev))
+ {
+ memcpy(dedupState->ipd, dedupState->itupprev,
+ sizeof(ItemPointerData));
+ dedupState->ntuples++;
+ }
+ else
+ {
+ /* if itupprev is posting, add all its TIDs to the posting list */
+ nposting = BTreeTupleGetNPosting(dedupState->itupprev);
+ memcpy(dedupState->ipd,
+ BTreeTupleGetPosting(dedupState->itupprev),
+ sizeof(ItemPointerData) * nposting);
+ dedupState->ntuples += nposting;
+ }
+
+ /* Save info about deduplicated items for future xlog record */
+ dedupState->n_intervals++;
+ /* Save offnum of the first item belonging to the group */
+ dedupState->dedup_intervals[dedupState->n_intervals - 1].from = dedupState->itupprev_off;
+ /*
+ * Update the number of deduplicated items, belonging to this group.
+ * Count each item just once, no matter if it was posting tuple or not
+ */
+ dedupState->dedup_intervals[dedupState->n_intervals - 1].ntups++;
+ }
+
+ /*
+ * Add current tup to ipd for pending posting list for new version of
+ * page.
+ */
+ if (!BTreeTupleIsPosting(itup))
+ {
+ memcpy(dedupState->ipd + dedupState->ntuples, itup,
+ sizeof(ItemPointerData));
+ dedupState->ntuples++;
+ }
+ else
+ {
+ /*
+ * if tuple is posting, add all its TIDs to the pending list that will
+ * become new posting list later on
+ */
+ nposting = BTreeTupleGetNPosting(itup);
+ memcpy(dedupState->ipd + dedupState->ntuples,
+ BTreeTupleGetPosting(itup),
+ sizeof(ItemPointerData) * nposting);
+ dedupState->ntuples += nposting;
+ }
+
+ /*
+ * Update the number of deduplicated items, belonging to this group.
+ * Count each item just once, no matter if it was posting tuple or not
+ */
+ dedupState->dedup_intervals[dedupState->n_intervals - 1].ntups++;
+
+ /* TODO: just a debug message. Delete it in the final version of the patch. */
+ if (itup_offnum != InvalidOffsetNumber)
+ elog(DEBUG4, "_bt_stash_item_tid. N %d : from %u ntups %u",
+ dedupState->n_intervals,
+ dedupState->dedup_intervals[dedupState->n_intervals - 1].from,
+ dedupState->dedup_intervals[dedupState->n_intervals - 1].ntups);
+}
+
/*
* Read tuples in correct sort order from tuplesort, and load them into
* btree leaves.
@@ -1141,9 +1282,20 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
bool load1;
TupleDesc tupdes = RelationGetDescr(wstate->index);
int i,
- keysz = IndexRelationGetNumberOfKeyAttributes(wstate->index);
+ keysz = IndexRelationGetNumberOfKeyAttributes(wstate->index),
+ natts = IndexRelationGetNumberOfAttributes(wstate->index);
SortSupport sortKeys;
int64 tuples_done = 0;
+ bool deduplicate = false;
+ BTDedupState *dedupState = NULL;
+
+ /*
+ * Don't use deduplication for indexes with INCLUDEd columns or for unique
+ * indexes
+ */
+ deduplicate = (IndexRelationGetNumberOfKeyAttributes(wstate->index) ==
+ IndexRelationGetNumberOfAttributes(wstate->index) &&
+ !wstate->index->rd_index->indisunique);
if (merge)
{
@@ -1257,19 +1409,88 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
}
else
{
- /* merge is unnecessary */
- while ((itup = tuplesort_getindextuple(btspool->sortstate,
- true)) != NULL)
+ if (!deduplicate)
{
- /* When we see first tuple, create first index page */
- if (state == NULL)
- state = _bt_pagestate(wstate, 0);
+ /* merge is unnecessary */
+ while ((itup = tuplesort_getindextuple(btspool->sortstate,
+ true)) != NULL)
+ {
+ /* When we see first tuple, create first index page */
+ if (state == NULL)
+ state = _bt_pagestate(wstate, 0);
- _bt_buildadd(wstate, state, itup);
+ _bt_buildadd(wstate, state, itup);
- /* Report progress */
- pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
- ++tuples_done);
+ /* Report progress */
+ pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
+ ++tuples_done);
+ }
+ }
+ else
+ {
+ /* init deduplication state needed to build posting tuples */
+ dedupState = (BTDedupState *) palloc0(sizeof(BTDedupState));
+ dedupState->ipd = NULL;
+ dedupState->ntuples = 0;
+ dedupState->itupprev = NULL;
+ dedupState->maxitemsize = 0;
+ dedupState->maxpostingsize = 0;
+
+ while ((itup = tuplesort_getindextuple(btspool->sortstate,
+ true)) != NULL)
+ {
+ /* When we see first tuple, create first index page */
+ if (state == NULL)
+ {
+ state = _bt_pagestate(wstate, 0);
+ dedupState->maxitemsize = BTMaxItemSize(state->btps_page);
+ }
+
+ if (dedupState->itupprev != NULL)
+ {
+ int n_equal_atts = _bt_keep_natts_fast(wstate->index,
+ dedupState->itupprev, itup);
+
+ if (n_equal_atts > natts)
+ {
+ /*
+ * Tuples are equal. Create or update the posting list.
+ *
+ * Otherwise, if the posting list would become too big,
+ * insert the pending posting tuple on the page and continue.
+ */
+ if ((dedupState->ntuples + 1) * sizeof(ItemPointerData) <
+ dedupState->maxpostingsize)
+ _bt_stash_item_tid(dedupState, itup, InvalidOffsetNumber);
+ else
+ _bt_buildadd_posting(wstate, state, dedupState);
+ }
+ else
+ {
+ /*
+ * Tuples are not equal. Insert itupprev into index.
+ * Save current tuple for the next iteration.
+ */
+ _bt_buildadd_posting(wstate, state, dedupState);
+ }
+ }
+
+ /*
+ * Save the tuple to compare it with the next one and maybe
+ * unite them into a posting tuple.
+ */
+ if (dedupState->itupprev)
+ pfree(dedupState->itupprev);
+ dedupState->itupprev = CopyIndexTuple(itup);
+
+ /* compute max size of posting list */
+ dedupState->maxpostingsize = dedupState->maxitemsize -
+ IndexInfoFindDataOffset(dedupState->itupprev->t_info) -
+ MAXALIGN(IndexTupleSize(dedupState->itupprev));
+ }
+
+ /* Handle the last item */
+ _bt_buildadd_posting(wstate, state, dedupState);
}
}
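For readers skimming the _bt_load() changes above, here is a simplified standalone model of the deduplication loop: walk the sorted input, keep extending the pending group while the key is equal and the group still has room, otherwise flush and start a new group. This is only a sketch under assumptions. ToyItem, ToyGroup and toy_dedup() are hypothetical names, keys are plain ints, and the size limit is a TID count rather than the byte-based maxpostingsize the patch computes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical input item: one (key, heap TID) pair, with ints for brevity */
typedef struct ToyItem
{
	int			key;
	int			tid;
} ToyItem;

/* Hypothetical output group: one "posting tuple" covering a run of items */
typedef struct ToyGroup
{
	int			key;
	size_t		first;			/* index of first member in the input */
	int			ntids;			/* number of TIDs folded into the group */
} ToyGroup;

/*
 * Fold runs of equal keys from sorted input into groups of at most
 * max_tids members each.  'out' must have room for nitems groups.
 * Returns the number of groups produced.
 */
size_t
toy_dedup(const ToyItem *items, size_t nitems, int max_tids, ToyGroup *out)
{
	size_t		ngroups = 0;

	assert(max_tids >= 1);

	for (size_t i = 0; i < nitems; i++)
	{
		if (ngroups > 0 &&
			out[ngroups - 1].key == items[i].key &&
			out[ngroups - 1].ntids < max_tids)
		{
			/* Equal key and still room: extend the pending group */
			out[ngroups - 1].ntids++;
		}
		else
		{
			/* Different key, or pending group is full: start a new one */
			out[ngroups].key = items[i].key;
			out[ngroups].first = i;
			out[ngroups].ntids = 1;
			ngroups++;
		}
	}

	return ngroups;
}
```

With keys {1, 1, 1, 2, 3, 3} and max_tids = 2, this yields four groups: two for key 1 (sizes 2 and 1), one for key 2, and one for key 3 (size 2).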
diff --git a/src/backend/access/nbtree/nbtsplitloc.c b/src/backend/access/nbtree/nbtsplitloc.c
index 1c1029b6c4..54cecc85c5 100644
--- a/src/backend/access/nbtree/nbtsplitloc.c
+++ b/src/backend/access/nbtree/nbtsplitloc.c
@@ -183,6 +183,9 @@ _bt_findsplitloc(Relation rel,
state.minfirstrightsz = SIZE_MAX;
state.newitemoff = newitemoff;
+ /* newitem cannot be a posting list item */
+ Assert(!BTreeTupleIsPosting(newitem));
+
/*
* maxsplits should never exceed maxoff because there will be at most as
* many candidate split points as there are points _between_ tuples, once
@@ -459,17 +462,52 @@ _bt_recsplitloc(FindSplitData *state,
int16 leftfree,
rightfree;
Size firstrightitemsz;
+ Size postingsubhikey = 0;
bool newitemisfirstonright;
/* Is the new item going to be the first item on the right page? */
newitemisfirstonright = (firstoldonright == state->newitemoff
&& !newitemonleft);
+ /*
+ * FIXME: Accessing every single tuple like this adds cycles to cases that
+ * cannot possibly benefit (i.e. cases where we know that there cannot be
+ * posting lists). Maybe we should add a way to not bother when we are
+ * certain that this is the case.
+ *
+ * We could either have _bt_split() pass us a flag, or invent a page flag
+ * that indicates that the page might have posting lists, as an
+ * optimization. There is no shortage of btpo_flags bits for stuff like
+ * this.
+ */
if (newitemisfirstonright)
+ {
firstrightitemsz = state->newitemsz;
+
+ /* Calculate posting list overhead, if any */
+ if (state->is_leaf && BTreeTupleIsPosting(state->newitem))
+ postingsubhikey = IndexTupleSize(state->newitem) -
+ BTreeTupleGetPostingOffset(state->newitem);
+ }
else
+ {
firstrightitemsz = firstoldonrightsz;
+ /* Calculate posting list overhead, if any */
+ if (state->is_leaf)
+ {
+ ItemId itemid;
+ IndexTuple newhighkey;
+
+ itemid = PageGetItemId(state->page, firstoldonright);
+ newhighkey = (IndexTuple) PageGetItem(state->page, itemid);
+
+ if (BTreeTupleIsPosting(newhighkey))
+ postingsubhikey = IndexTupleSize(newhighkey) -
+ BTreeTupleGetPostingOffset(newhighkey);
+ }
+ }
+
/* Account for all the old tuples */
leftfree = state->leftspace - olddataitemstoleft;
rightfree = state->rightspace -
@@ -492,9 +530,13 @@ _bt_recsplitloc(FindSplitData *state,
* adding a heap TID to the left half's new high key when splitting at the
* leaf level. In practice the new high key will often be smaller and
* will rarely be larger, but conservatively assume the worst case.
+ * Truncation always truncates away any posting list that appears in the
+ * first right tuple, though, so it's safe to subtract that overhead
+ * (while still conservatively assuming that truncation might have to add
+ * back a single heap TID using the pivot tuple heap TID representation).
*/
if (state->is_leaf)
- leftfree -= (int16) (firstrightitemsz +
+ leftfree -= (int16) ((firstrightitemsz - postingsubhikey) +
MAXALIGN(sizeof(ItemPointerData)));
else
leftfree -= (int16) firstrightitemsz;
@@ -691,7 +733,8 @@ _bt_afternewitemoff(FindSplitData *state, OffsetNumber maxoff,
itemid = PageGetItemId(state->page, OffsetNumberPrev(state->newitemoff));
tup = (IndexTuple) PageGetItem(state->page, itemid);
/* Do cheaper test first */
- if (!_bt_adjacenthtid(&tup->t_tid, &state->newitem->t_tid))
+ if (BTreeTupleIsPosting(tup) ||
+ !_bt_adjacenthtid(&tup->t_tid, &state->newitem->t_tid))
return false;
/* Check same conditions as rightmost item case, too */
keepnatts = _bt_keep_natts_fast(state->rel, tup, state->newitem);
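The _bt_recsplitloc() accounting above can be checked with a little arithmetic: the space charged to the left page for its new high key is the first right tuple's size minus any posting list it carries, plus one MAXALIGN'd heap TID. A rough standalone sketch follows. toy_hikey_charge() is a hypothetical helper, and it assumes 8-byte MAXALIGN and a 6-byte ItemPointerData, which is typical on 64-bit builds but is not guaranteed everywhere.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: 8-byte maximum alignment, as on typical 64-bit platforms */
#define TOY_MAXALIGN(x)	(((x) + 7u) & ~(size_t) 7u)
/* Assumption: heap TID is 6 bytes (4-byte block number + 2-byte offset) */
#define TOY_TID_SIZE	6u

/*
 * Worst-case space charged to the left page for its new high key at the
 * leaf level.  'postingoffset' is where the posting list starts within the
 * first right tuple; it is ignored when the tuple is not a posting tuple.
 */
size_t
toy_hikey_charge(size_t firstrightitemsz, size_t postingoffset, int is_posting)
{
	size_t		postingsubhikey = 0;

	assert(postingoffset <= firstrightitemsz);

	/* Truncation always removes the posting list, so don't charge for it */
	if (is_posting)
		postingsubhikey = firstrightitemsz - postingoffset;

	return (firstrightitemsz - postingsubhikey) + TOY_MAXALIGN(TOY_TID_SIZE);
}
```

For an 80-byte posting tuple whose posting list starts at offset 16, the charge is 16 + 8 = 24 bytes instead of 80 + 8 = 88.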
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index bc855dd25d..d4710501a1 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -97,8 +97,6 @@ _bt_mkscankey(Relation rel, IndexTuple itup)
indoption = rel->rd_indoption;
tupnatts = itup ? BTreeTupleGetNAtts(itup, rel) : 0;
- Assert(tupnatts <= IndexRelationGetNumberOfAttributes(rel));
-
/*
* We'll execute search using scan key constructed on key columns.
* Truncated attributes and non-key attributes are omitted from the final
@@ -110,9 +108,20 @@ _bt_mkscankey(Relation rel, IndexTuple itup)
key->anynullkeys = false; /* initial assumption */
key->nextkey = false;
key->pivotsearch = false;
+ key->scantid = NULL;
key->keysz = Min(indnkeyatts, tupnatts);
- key->scantid = key->heapkeyspace && itup ?
- BTreeTupleGetHeapTID(itup) : NULL;
+
+ Assert(tupnatts <= IndexRelationGetNumberOfAttributes(rel));
+ Assert(!itup || !BTreeTupleIsPosting(itup) || key->heapkeyspace);
+
+ /*
+ * When caller passes a tuple with a heap TID, use it to set scantid. Note
+ * that this handles posting list tuples by setting scantid to the lowest
+ * heap TID in the posting list.
+ */
+ if (itup && key->heapkeyspace)
+ key->scantid = BTreeTupleGetHeapTID(itup);
+
skey = key->scankeys;
for (i = 0; i < indnkeyatts; i++)
{
@@ -1786,10 +1795,35 @@ _bt_killitems(IndexScanDesc scan)
{
ItemId iid = PageGetItemId(page, offnum);
IndexTuple ituple = (IndexTuple) PageGetItem(page, iid);
+ bool killtuple = false;
- if (ItemPointerEquals(&ituple->t_tid, &kitem->heapTid))
+ if (BTreeTupleIsPosting(ituple))
{
- /* found the item */
+ int pi = i + 1;
+ int nposting = BTreeTupleGetNPosting(ituple);
+ int j;
+
+ for (j = 0; j < nposting; j++)
+ {
+ ItemPointer item = BTreeTupleGetPostingN(ituple, j);
+
+ if (!ItemPointerEquals(item, &kitem->heapTid))
+ break; /* out of posting list loop */
+
+ /* Read-ahead to later kitems */
+ if (pi < numKilled)
+ kitem = &so->currPos.items[so->killedItems[pi++]];
+ }
+
+ if (j == nposting)
+ killtuple = true;
+ }
+ else if (ItemPointerEquals(&ituple->t_tid, &kitem->heapTid))
+ killtuple = true;
+
+ if (killtuple)
+ {
+ /* found the item/all posting list items */
ItemIdMarkDead(iid);
killedsomething = true;
break; /* out of inner search loop */
@@ -2140,6 +2174,24 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
pivot = index_truncate_tuple(itupdesc, firstright, keepnatts);
+ if (BTreeTupleIsPosting(firstright))
+ {
+ BTreeTupleClearBtIsPosting(pivot);
+ BTreeTupleSetNAtts(pivot, keepnatts);
+ if (keepnatts == natts)
+ {
+ /*
+ * index_truncate_tuple() just returned a copy of the
+ * original, so make sure that the size of the new pivot tuple
+ * doesn't have posting list overhead
+ */
+ pivot->t_info &= ~INDEX_SIZE_MASK;
+ pivot->t_info |= MAXALIGN(BTreeTupleGetPostingOffset(firstright));
+ }
+ }
+
+ Assert(!BTreeTupleIsPosting(pivot));
+
/*
* If there is a distinguishing key attribute within new pivot tuple,
* there is no need to add an explicit heap TID attribute
@@ -2156,6 +2208,8 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
* attribute to the new pivot tuple.
*/
Assert(natts != nkeyatts);
+ Assert(!BTreeTupleIsPosting(lastleft) &&
+ !BTreeTupleIsPosting(firstright));
newsize = IndexTupleSize(pivot) + MAXALIGN(sizeof(ItemPointerData));
tidpivot = palloc0(newsize);
memcpy(tidpivot, pivot, IndexTupleSize(pivot));
@@ -2163,6 +2217,24 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
pfree(pivot);
pivot = tidpivot;
}
+ else if (BTreeTupleIsPosting(firstright))
+ {
+ /*
+ * No truncation was possible, since key attributes are all equal. We
+ * can always truncate away a posting list, though.
+ *
+ * It's necessary to add a heap TID attribute to the new pivot tuple.
+ */
+ newsize = MAXALIGN(BTreeTupleGetPostingOffset(firstright)) +
+ MAXALIGN(sizeof(ItemPointerData));
+ pivot = palloc0(newsize);
+ memcpy(pivot, firstright, BTreeTupleGetPostingOffset(firstright));
+
+ pivot->t_info &= ~INDEX_SIZE_MASK;
+ pivot->t_info |= newsize;
+ BTreeTupleClearBtIsPosting(pivot);
+ BTreeTupleSetAltHeapTID(pivot);
+ }
else
{
/*
@@ -2170,7 +2242,8 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
* It's necessary to add a heap TID attribute to the new pivot tuple.
*/
Assert(natts == nkeyatts);
- newsize = IndexTupleSize(firstright) + MAXALIGN(sizeof(ItemPointerData));
+ newsize = MAXALIGN(IndexTupleSize(firstright)) +
+ MAXALIGN(sizeof(ItemPointerData));
pivot = palloc0(newsize);
memcpy(pivot, firstright, IndexTupleSize(firstright));
}
@@ -2188,6 +2261,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
* nbtree (e.g., there is no pg_attribute entry).
*/
Assert(itup_key->heapkeyspace);
+ Assert(!BTreeTupleIsPosting(pivot));
pivot->t_info &= ~INDEX_SIZE_MASK;
pivot->t_info |= newsize;
@@ -2200,7 +2274,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
*/
pivotheaptid = (ItemPointer) ((char *) pivot + newsize -
sizeof(ItemPointerData));
- ItemPointerCopy(&lastleft->t_tid, pivotheaptid);
+ ItemPointerCopy(BTreeTupleGetMaxTID(lastleft), pivotheaptid);
/*
* Lehman and Yao require that the downlink to the right page, which is to
@@ -2211,9 +2285,12 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
* tiebreaker.
*/
#ifndef DEBUG_NO_TRUNCATE
- Assert(ItemPointerCompare(&lastleft->t_tid, &firstright->t_tid) < 0);
- Assert(ItemPointerCompare(pivotheaptid, &lastleft->t_tid) >= 0);
- Assert(ItemPointerCompare(pivotheaptid, &firstright->t_tid) < 0);
+ Assert(ItemPointerCompare(BTreeTupleGetMaxTID(lastleft),
+ BTreeTupleGetHeapTID(firstright)) < 0);
+ Assert(ItemPointerCompare(pivotheaptid,
+ BTreeTupleGetHeapTID(lastleft)) >= 0);
+ Assert(ItemPointerCompare(pivotheaptid,
+ BTreeTupleGetHeapTID(firstright)) < 0);
#else
/*
@@ -2226,7 +2303,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
* attribute values along with lastleft's heap TID value when lastleft's
* TID happens to be greater than firstright's TID.
*/
- ItemPointerCopy(&firstright->t_tid, pivotheaptid);
+ ItemPointerCopy(BTreeTupleGetHeapTID(firstright), pivotheaptid);
/*
* Pivot heap TID should never be fully equal to firstright. Note that
@@ -2235,7 +2312,8 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
*/
ItemPointerSetOffsetNumber(pivotheaptid,
OffsetNumberPrev(ItemPointerGetOffsetNumber(pivotheaptid)));
- Assert(ItemPointerCompare(pivotheaptid, &firstright->t_tid) < 0);
+ Assert(ItemPointerCompare(pivotheaptid,
+ BTreeTupleGetHeapTID(firstright)) < 0);
#endif
BTreeTupleSetNAtts(pivot, nkeyatts);
@@ -2316,15 +2394,25 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
* The approach taken here usually provides the same answer as _bt_keep_natts
* will (for the same pair of tuples from a heapkeyspace index), since the
* majority of btree opclasses can never indicate that two datums are equal
- * unless they're bitwise equal (once detoasted). Similarly, result may
- * differ from the _bt_keep_natts result when either tuple has TOASTed datums,
- * though this is barely possible in practice.
+ * unless they're bitwise equal after detoasting.
*
* These issues must be acceptable to callers, typically because they're only
* concerned about making suffix truncation as effective as possible without
* leaving excessive amounts of free space on either side of page split.
* Callers can rely on the fact that attributes considered equal here are
* definitely also equal according to _bt_keep_natts.
+ *
+ * When an index only uses opclasses where equality is "precise", this
+ * function is guaranteed to give the same result as _bt_keep_natts(). This
+ * makes it safe to use this function to determine whether or not two tuples
+ * can be folded together into a single posting tuple. Posting list
+ * deduplication cannot be used with nondeterministic collations for this
+ * reason.
+ *
+ * FIXME: Actually invent the needed "equality-is-precise" opclass
+ * infrastructure. See dedicated -hackers thread:
+ *
+ * https://postgr.es/m/CAH2-Wzn3Ee49Gmxb7V1VJ3-AC8fWn-Fr8pfWQebHe8rYRxt5OQ@mail.gmail.com
*/
int
_bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
@@ -2349,8 +2437,38 @@ _bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
if (isNull1 != isNull2)
break;
+ /*
+ * XXX: The ideal outcome from the point of view of the posting list
+ * patch is that the definition of an opclass with "precise equality"
+ * becomes: "equality operator function must give exactly the same
+ * answer as datum_image_eq() would, provided that we aren't using a
+ * nondeterministic collation". (Nondeterministic collations are
+ * clearly not compatible with deduplication.)
+ *
+ * This will be a lot faster than actually using the authoritative
+ * insertion scankey in some cases. This approach also seems more
+ * elegant, since suffix truncation gets to follow exactly the same
+ * definition of "equal" as posting list deduplication -- there is a
+ * subtle interplay between deduplication and suffix truncation, and
+ * it would be nice to know for sure that they have exactly the same
+ * idea about what equality is.
+ *
+ * This ideal outcome still avoids problems with TOAST. We cannot
+ * repeat bugs like the amcheck bug that was fixed in bugfix commit
+ * eba775345d23d2c999bbb412ae658b6dab36e3e8. datum_image_eq()
+ * considers binary equality, though only _after_ each datum is
+ * decompressed.
+ *
+ * If this ideal solution isn't possible, then we can fall back on
+ * defining "precise equality" as: "type's output function must
+ * produce identical textual output for any two datums that compare
+ * equal when using a safe/equality-is-precise operator class (unless
+ * using a nondeterministic collation)". That would mean that we'd
+ * have to make deduplication call _bt_keep_natts() instead (or some
+ * other function that uses authoritative insertion scankey).
+ */
if (!isNull1 &&
- !datumIsEqual(datum1, datum2, att->attbyval, att->attlen))
+ !datum_image_eq(datum1, datum2, att->attbyval, att->attlen))
break;
keepnatts++;
@@ -2402,22 +2520,30 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
tupnatts = BTreeTupleGetNAtts(itup, rel);
+ /* !heapkeyspace indexes do not support deduplication */
+ if (!heapkeyspace && BTreeTupleIsPosting(itup))
+ return false;
+
+ /* INCLUDE indexes do not support deduplication */
+ if (natts != nkeyatts && BTreeTupleIsPosting(itup))
+ return false;
+
if (P_ISLEAF(opaque))
{
if (offnum >= P_FIRSTDATAKEY(opaque))
{
/*
- * Non-pivot tuples currently never use alternative heap TID
- * representation -- even those within heapkeyspace indexes
+ * Non-pivot tuple should never be explicitly marked as a pivot
+ * tuple
*/
- if ((itup->t_info & INDEX_ALT_TID_MASK) != 0)
+ if (BTreeTupleIsPivot(itup))
return false;
/*
* Leaf tuples that are not the page high key (non-pivot tuples)
* should never be truncated. (Note that tupnatts must have been
- * inferred, rather than coming from an explicit on-disk
- * representation.)
+ * inferred, even with a posting list tuple, because only pivot
+ * tuples store tupnatts directly.)
*/
return tupnatts == natts;
}
@@ -2461,12 +2587,12 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
* non-zero, or when there is no explicit representation and the
* tuple is evidently not a pre-pg_upgrade tuple.
*
- * Prior to v11, downlinks always had P_HIKEY as their offset. Use
- * that to decide if the tuple is a pre-v11 tuple.
+ * Prior to v11, downlinks always had P_HIKEY as their offset.
+ * Accept that as an alternative indication of a valid
+ * !heapkeyspace negative infinity tuple.
*/
return tupnatts == 0 ||
- ((itup->t_info & INDEX_ALT_TID_MASK) == 0 &&
- ItemPointerGetOffsetNumber(&(itup->t_tid)) == P_HIKEY);
+ ItemPointerGetOffsetNumber(&(itup->t_tid)) == P_HIKEY;
}
else
{
@@ -2492,7 +2618,11 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
* heapkeyspace index pivot tuples, regardless of whether or not there are
* non-key attributes.
*/
- if ((itup->t_info & INDEX_ALT_TID_MASK) == 0)
+ if (!BTreeTupleIsPivot(itup))
+ return false;
+
+ /* Pivot tuple should not use posting list representation (redundant) */
+ if (BTreeTupleIsPosting(itup))
return false;
/*
@@ -2562,11 +2692,87 @@ _bt_check_third_page(Relation rel, Relation heap, bool needheaptidspace,
BTMaxItemSizeNoHeapTid(page),
RelationGetRelationName(rel)),
errdetail("Index row references tuple (%u,%u) in relation \"%s\".",
- ItemPointerGetBlockNumber(&newtup->t_tid),
- ItemPointerGetOffsetNumber(&newtup->t_tid),
+ ItemPointerGetBlockNumber(BTreeTupleGetHeapTID(newtup)),
+ ItemPointerGetOffsetNumber(BTreeTupleGetHeapTID(newtup)),
RelationGetRelationName(heap)),
errhint("Values larger than 1/3 of a buffer page cannot be indexed.\n"
"Consider a function index of an MD5 hash of the value, "
"or use full text indexing."),
errtableconstraint(heap, RelationGetRelationName(rel))));
}
+
+/*
+ * Given a base tuple that contains the key datum(s), and a posting list
+ * passed via ipd, build a posting tuple.
+ *
+ * The base tuple may itself be a posting tuple, but only its key part is
+ * used; all ItemPointers must be passed via ipd.
+ *
+ * If nipd == 1, fall back to building a non-posting tuple. This is necessary
+ * to avoid storage overhead after a posting tuple has been vacuumed.
+ */
+IndexTuple
+BTreeFormPostingTuple(IndexTuple tuple, ItemPointerData *ipd, int nipd)
+{
+ uint32 keysize,
+ newsize = 0;
+ IndexTuple itup;
+
+ /* We only need key part of the tuple */
+ if (BTreeTupleIsPosting(tuple))
+ keysize = BTreeTupleGetPostingOffset(tuple);
+ else
+ keysize = IndexTupleSize(tuple);
+
+ Assert(nipd > 0);
+
+ /* Add space needed for posting list */
+ if (nipd > 1)
+ newsize = SHORTALIGN(keysize) + sizeof(ItemPointerData) * nipd;
+ else
+ newsize = keysize;
+
+ newsize = MAXALIGN(newsize);
+ itup = palloc0(newsize);
+ memcpy(itup, tuple, keysize);
+ itup->t_info &= ~INDEX_SIZE_MASK;
+ itup->t_info |= newsize;
+
+ if (nipd > 1)
+ {
+ /* Form posting tuple, fill posting fields */
+
+ /* Set meta info about the posting list */
+ itup->t_info |= INDEX_ALT_TID_MASK;
+ BTreeSetPostingMeta(itup, nipd, SHORTALIGN(keysize));
+
+ /* sort the list to preserve TID order invariant */
+ qsort((void *) ipd, nipd, sizeof(ItemPointerData),
+ (int (*) (const void *, const void *)) ItemPointerCompare);
+
+ /* Copy posting list into the posting tuple */
+ memcpy(BTreeTupleGetPosting(itup), ipd,
+ sizeof(ItemPointerData) * nipd);
+ }
+ else
+ {
+ /* To finish building a non-posting tuple, copy the TID from ipd */
+ itup->t_info &= ~INDEX_ALT_TID_MASK;
+ ItemPointerCopy(ipd, &itup->t_tid);
+ }
+
+ return itup;
+}
+
+/*
+ * Opposite of BTreeFormPostingTuple.
+ * Returns a regular tuple that contains the key; the TID of the new tuple
+ * is the nth TID of the original tuple's posting list.
+ * The result tuple is palloc'd in the caller's context.
+ */
+IndexTuple
+BTreeGetNthTupleOfPosting(IndexTuple tuple, int n)
+{
+ Assert(BTreeTupleIsPosting(tuple));
+ return BTreeFormPostingTuple(tuple, BTreeTupleGetPostingN(tuple, n), 1);
+}
diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index dd5315c1aa..98ce964ea9 100644
--- a/src/backend/access/nbtree/nbtxlog.c
+++ b/src/backend/access/nbtree/nbtxlog.c
@@ -181,9 +181,35 @@ btree_xlog_insert(bool isleaf, bool ismeta, XLogReaderState *record)
page = BufferGetPage(buffer);
- if (PageAddItem(page, (Item) datapos, datalen, xlrec->offnum,
- false, false) == InvalidOffsetNumber)
- elog(PANIC, "btree_xlog_insert: failed to add item");
+ if (xlrec->in_posting_offset != InvalidOffsetNumber)
+ {
+ /* oposting must be at offset before new item */
+ ItemId itemid = PageGetItemId(page, OffsetNumberPrev(xlrec->offnum));
+ IndexTuple oposting = (IndexTuple) PageGetItem(page, itemid);
+ IndexTuple newitem = (IndexTuple) datapos;
+ IndexTuple nposting;
+
+ nposting = _bt_form_newposting(newitem, oposting,
+ xlrec->in_posting_offset);
+ Assert(isleaf);
+
+ Assert(MAXALIGN(IndexTupleSize(oposting)) ==
+ MAXALIGN(IndexTupleSize(nposting)));
+
+ /* replace existing posting */
+ memcpy(oposting, nposting, MAXALIGN(IndexTupleSize(nposting)));
+
+ /* insert new item */
+ if (PageAddItem(page, (Item) newitem, MAXALIGN(IndexTupleSize(newitem)),
+ xlrec->offnum, false, false) == InvalidOffsetNumber)
+ elog(PANIC, "btree_xlog_insert: failed to add item");
+ }
+ else
+ {
+ if (PageAddItem(page, (Item) datapos, datalen, xlrec->offnum,
+ false, false) == InvalidOffsetNumber)
+ elog(PANIC, "btree_xlog_insert: failed to add item");
+ }
PageSetLSN(page, lsn);
MarkBufferDirty(buffer);
@@ -265,20 +291,45 @@ btree_xlog_split(bool onleft, XLogReaderState *record)
BTPageOpaque lopaque = (BTPageOpaque) PageGetSpecialPointer(lpage);
OffsetNumber off;
IndexTuple newitem = NULL,
- left_hikey = NULL;
+ left_hikey = NULL,
+ nposting = NULL;
Size newitemsz = 0,
left_hikeysz = 0;
Page newlpage;
- OffsetNumber leftoff;
+ OffsetNumber leftoff,
+ replacepostingoff = InvalidOffsetNumber;
datapos = XLogRecGetBlockData(record, 0, &datalen);
- if (onleft)
+ if (onleft || xlrec->in_posting_offset)
{
newitem = (IndexTuple) datapos;
newitemsz = MAXALIGN(IndexTupleSize(newitem));
datapos += newitemsz;
datalen -= newitemsz;
+
+ /*
+ * Repeat logic implemented in _bt_insertonpg():
+ *
+ * If the new tuple is a duplicate with a heap TID that falls
+ * inside the range of an existing posting list tuple,
+ * generate new posting tuple to replace original one
+ * and update new tuple so that its heap TID contains
+ * the rightmost heap TID of original posting tuple.
+ */
+ if (xlrec->in_posting_offset != 0)
+ {
+ ItemId itemid = PageGetItemId(lpage, OffsetNumberPrev(xlrec->newitemoff));
+ IndexTuple oposting = (IndexTuple) PageGetItem(lpage, itemid);
+
+ nposting = _bt_form_newposting(newitem, oposting,
+ xlrec->in_posting_offset);
+
+ /* Alter new item offset, since effective new item changed */
+ replacepostingoff = OffsetNumberPrev(xlrec->newitemoff);
+
+ Assert(BTreeTupleGetNPosting(nposting) == BTreeTupleGetNPosting(oposting));
+ }
}
/* Extract left hikey and its size (assuming 16-bit alignment) */
@@ -304,6 +355,15 @@ btree_xlog_split(bool onleft, XLogReaderState *record)
Size itemsz;
IndexTuple item;
+ if (off == replacepostingoff)
+ {
+ if (PageAddItem(newlpage, (Item) nposting, MAXALIGN(IndexTupleSize(nposting)),
+ leftoff, false, false) == InvalidOffsetNumber)
+ elog(ERROR, "failed to add new item to left page after split");
+ leftoff = OffsetNumberNext(leftoff);
+ continue;
+ }
+
/* add the new item if it was inserted on left page */
if (onleft && off == xlrec->newitemoff)
{
@@ -379,6 +439,138 @@ btree_xlog_split(bool onleft, XLogReaderState *record)
}
}
+static void
+btree_xlog_dedup(XLogReaderState *record)
+{
+ XLogRecPtr lsn = record->EndRecPtr;
+ Buffer buf;
+ Page newpage;
+ xl_btree_dedup *xlrec = (xl_btree_dedup *) XLogRecGetData(record);
+
+ if (XLogReadBufferForRedo(record, 0, &buf) == BLK_NEEDS_REDO)
+ {
+ /*
+ * Initialize a temporary empty page and copy all the items
+ * to that in item number order.
+ */
+ Page page = (Page) BufferGetPage(buf);
+ BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+ BTPageOpaque nopaque;
+ OffsetNumber offnum, minoff, maxoff;
+ BTDedupState *dedupState = NULL;
+ char *data = ((char *) xlrec + SizeOfBtreeDedup);
+ dedupInterval dedup_intervals[MaxOffsetNumber];
+ int nth_interval = 0;
+ OffsetNumber n_dedup_tups = 0;
+
+ dedupState = (BTDedupState *) palloc0(sizeof(BTDedupState));
+ dedupState->ipd = NULL;
+ dedupState->ntuples = 0;
+ dedupState->itupprev = NULL;
+ dedupState->maxitemsize = BTMaxItemSize(page);
+ dedupState->maxpostingsize = 0;
+
+ memcpy(dedup_intervals, data,
+ xlrec->n_intervals*sizeof(dedupInterval));
+
+ /* Scan over all items to see which ones can be deduplicated */
+ minoff = P_FIRSTDATAKEY(opaque);
+ maxoff = PageGetMaxOffsetNumber(page);
+ newpage = PageGetTempPageCopySpecial(page);
+ nopaque = (BTPageOpaque) PageGetSpecialPointer(newpage);
+
+ /* Make sure that new page won't have garbage flag set */
+ nopaque->btpo_flags &= ~BTP_HAS_GARBAGE;
+
+ /* Copy High Key if any */
+ if (!P_RIGHTMOST(opaque))
+ {
+ ItemId itemid = PageGetItemId(page, P_HIKEY);
+ Size itemsz = ItemIdGetLength(itemid);
+ IndexTuple item = (IndexTuple) PageGetItem(page, itemid);
+
+ if (PageAddItem(newpage, (Item) item, itemsz, P_HIKEY,
+ false, false) == InvalidOffsetNumber)
+ elog(ERROR, "failed to add highkey during deduplication");
+ }
+
+ /*
+ * Iterate over tuples on the page to deduplicate them into posting
+ * lists and insert into new page
+ */
+ for (offnum = minoff;
+ offnum <= maxoff;
+ offnum = OffsetNumberNext(offnum))
+ {
+ ItemId itemId = PageGetItemId(page, offnum);
+ IndexTuple itup = (IndexTuple) PageGetItem(page, itemId);
+
+ elog(DEBUG4, "btree_xlog_dedup. offnum %u, nth_interval %u, from %u ntups %u",
+ offnum,
+ nth_interval,
+ dedup_intervals[nth_interval].from,
+ dedup_intervals[nth_interval].ntups);
+
+ if (dedupState->itupprev == NULL)
+ {
+ /* Just set up base/first item in first iteration */
+ Assert(offnum == minoff);
+ dedupState->itupprev = CopyIndexTuple(itup);
+ dedupState->itupprev_off = offnum;
+ continue;
+ }
+
+ /*
+ * Instead of comparing tuple's keys, which may be costly, use
+ * information from xlog record. If current tuple belongs to the
+ * group of deduplicated items, repeat logic of _bt_dedup_one_page
+ * and stash it to form a posting list afterwards.
+ */
+ if (dedupState->itupprev_off >= dedup_intervals[nth_interval].from
+ && n_dedup_tups < dedup_intervals[nth_interval].ntups)
+ {
+ _bt_stash_item_tid(dedupState, itup, InvalidOffsetNumber);
+
+ elog(DEBUG4, "btree_xlog_dedup. stash offnum %u, nth_interval %u, from %u ntups %u",
+ offnum,
+ nth_interval,
+ dedup_intervals[nth_interval].from,
+ dedup_intervals[nth_interval].ntups);
+
+ /* count first tuple in the group */
+ if (dedupState->itupprev_off == dedup_intervals[nth_interval].from)
+ n_dedup_tups++;
+
+ /* count added tuple */
+ n_dedup_tups++;
+ }
+ else
+ {
+ _bt_dedup_insert(newpage, dedupState);
+
+ /* reset state */
+ if (n_dedup_tups > 0)
+ nth_interval++;
+ n_dedup_tups = 0;
+ }
+
+ pfree(dedupState->itupprev);
+ dedupState->itupprev = CopyIndexTuple(itup);
+ dedupState->itupprev_off = offnum;
+ }
+
+ /* Handle the last item */
+ _bt_dedup_insert(newpage, dedupState);
+
+ PageRestoreTempPage(newpage, page);
+ PageSetLSN(page, lsn);
+ MarkBufferDirty(buf);
+ }
+
+ if (BufferIsValid(buf))
+ UnlockReleaseBuffer(buf);
+}
+
static void
btree_xlog_vacuum(XLogReaderState *record)
{
@@ -386,8 +578,8 @@ btree_xlog_vacuum(XLogReaderState *record)
Buffer buffer;
Page page;
BTPageOpaque opaque;
-#ifdef UNUSED
xl_btree_vacuum *xlrec = (xl_btree_vacuum *) XLogRecGetData(record);
+#ifdef UNUSED
/*
* This section of code is thought to be no longer needed, after analysis
@@ -478,14 +670,34 @@ btree_xlog_vacuum(XLogReaderState *record)
if (len > 0)
{
- OffsetNumber *unused;
- OffsetNumber *unend;
+ if (xlrec->nremaining)
+ {
+ OffsetNumber *remainingoffset;
+ IndexTuple remaining;
+ Size itemsz;
- unused = (OffsetNumber *) ptr;
- unend = (OffsetNumber *) ((char *) ptr + len);
+ remainingoffset = (OffsetNumber *)
+ (ptr + xlrec->ndeleted * sizeof(OffsetNumber));
+ remaining = (IndexTuple) ((char *) remainingoffset +
+ xlrec->nremaining * sizeof(OffsetNumber));
- if ((unend - unused) > 0)
- PageIndexMultiDelete(page, unused, unend - unused);
+ /* Handle posting tuples */
+ for (int i = 0; i < xlrec->nremaining; i++)
+ {
+ PageIndexTupleDelete(page, remainingoffset[i]);
+
+ itemsz = MAXALIGN(IndexTupleSize(remaining));
+
+ if (PageAddItem(page, (Item) remaining, itemsz, remainingoffset[i],
+ false, false) == InvalidOffsetNumber)
+ elog(PANIC, "btree_xlog_vacuum: failed to add remaining item");
+
+ remaining = (IndexTuple) ((char *) remaining + itemsz);
+ }
+ }
+
+ if (xlrec->ndeleted)
+ PageIndexMultiDelete(page, (OffsetNumber *) ptr, xlrec->ndeleted);
}
/*
@@ -838,6 +1050,9 @@ btree_redo(XLogReaderState *record)
case XLOG_BTREE_SPLIT_R:
btree_xlog_split(false, record);
break;
+ case XLOG_BTREE_DEDUP_PAGE:
+ btree_xlog_dedup(record);
+ break;
case XLOG_BTREE_VACUUM:
btree_xlog_vacuum(record);
break;
diff --git a/src/backend/access/rmgrdesc/nbtdesc.c b/src/backend/access/rmgrdesc/nbtdesc.c
index 4ee6d04a68..7351cad1d2 100644
--- a/src/backend/access/rmgrdesc/nbtdesc.c
+++ b/src/backend/access/rmgrdesc/nbtdesc.c
@@ -30,7 +30,8 @@ btree_desc(StringInfo buf, XLogReaderState *record)
{
xl_btree_insert *xlrec = (xl_btree_insert *) rec;
- appendStringInfo(buf, "off %u", xlrec->offnum);
+ appendStringInfo(buf, "off %u; in_posting_offset %u",
+ xlrec->offnum, xlrec->in_posting_offset);
break;
}
case XLOG_BTREE_SPLIT_L:
@@ -38,16 +39,29 @@ btree_desc(StringInfo buf, XLogReaderState *record)
{
xl_btree_split *xlrec = (xl_btree_split *) rec;
- appendStringInfo(buf, "level %u, firstright %d, newitemoff %d",
- xlrec->level, xlrec->firstright, xlrec->newitemoff);
+ appendStringInfo(buf, "level %u, firstright %d, newitemoff %d, in_posting_offset %d",
+ xlrec->level,
+ xlrec->firstright,
+ xlrec->newitemoff,
+ xlrec->in_posting_offset);
+ break;
+ }
+ case XLOG_BTREE_DEDUP_PAGE:
+ {
+ xl_btree_dedup *xlrec = (xl_btree_dedup *) rec;
+
+ appendStringInfo(buf, "items were deduplicated to %d items",
+ xlrec->n_intervals);
break;
}
case XLOG_BTREE_VACUUM:
{
xl_btree_vacuum *xlrec = (xl_btree_vacuum *) rec;
- appendStringInfo(buf, "lastBlockVacuumed %u",
- xlrec->lastBlockVacuumed);
+ appendStringInfo(buf, "lastBlockVacuumed %u; nremaining %u; ndeleted %u",
+ xlrec->lastBlockVacuumed,
+ xlrec->nremaining,
+ xlrec->ndeleted);
break;
}
case XLOG_BTREE_DELETE:
@@ -131,6 +145,9 @@ btree_identify(uint8 info)
case XLOG_BTREE_SPLIT_R:
id = "SPLIT_R";
break;
+ case XLOG_BTREE_DEDUP_PAGE:
+ id = "DEDUPLICATE";
+ break;
case XLOG_BTREE_VACUUM:
id = "VACUUM";
break;
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 4a80e84aa7..d1af18f864 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -234,8 +234,7 @@ typedef struct BTMetaPageData
* t_tid | t_info | key values | INCLUDE columns, if any
*
* t_tid points to the heap TID, which is a tiebreaker key column as of
- * BTREE_VERSION 4. Currently, the INDEX_ALT_TID_MASK status bit is never
- * set for non-pivot tuples.
+ * BTREE_VERSION 4.
*
* All other types of index tuples ("pivot" tuples) only have key columns,
* since pivot tuples only exist to represent how the key space is
@@ -252,6 +251,38 @@ typedef struct BTMetaPageData
* omitted rather than truncated, since its representation is different to
* the non-pivot representation.)
*
+ * Non-pivot posting tuple format:
+ * t_tid | t_info | key values | INCLUDE columns, if any | posting_list[]
+ *
+ * In order to store duplicated keys more effectively, we use special format
+ * of tuples - posting tuples. posting_list is an array of ItemPointerData.
+ *
+ * Deduplication never applies to unique indexes or indexes with INCLUDEd
+ * columns.
+ *
+ * To distinguish posting tuples we use the INDEX_ALT_TID_MASK flag in t_info
+ * and the BT_IS_POSTING flag in t_tid.
+ * These flags redefine the content of the posting tuple's tid:
+ * - t_tid.ip_blkid contains the offset of the posting list.
+ * - t_tid offset field contains the number of posting items this tuple contains
+ *
+ * The 12 least significant offset bits from t_tid are used to represent
+ * the number of posting items in posting tuples, leaving 4 status
+ * bits (BT_RESERVED_OFFSET_MASK bits), 3 of which are reserved for
+ * future use.
+ * BT_N_POSTING_OFFSET_MASK is large enough to store any number of posting
+ * items, which is constrained by BTMaxItemSize.
+ *
+ * If page contains so many duplicates, that they do not fit into one posting
+ * tuple (bounded by BTMaxItemSize and ), page may contain several posting
+ * tuples with the same key.
+ * Also page can contain both posting and non-posting tuples with the same key.
+ * Currently, posting tuples always contain at least two TIDs in the posting
+ * list.
+ *
+ * Posting tuples always have the same number of attributes as the index has
+ * generally.
+ *
* Pivot tuple format:
*
* t_tid | t_info | key values | [heap TID]
@@ -281,23 +312,145 @@ typedef struct BTMetaPageData
* bits (BT_RESERVED_OFFSET_MASK bits), 3 of which that are reserved for
* future use. BT_N_KEYS_OFFSET_MASK should be large enough to store any
* number of columns/attributes <= INDEX_MAX_KEYS.
+ * BT_IS_POSTING bit must be unset for pivot tuples, since we use it
+ * to distinguish posting tuples from pivot tuples.
*
* Note well: The macros that deal with the number of attributes in tuples
- * assume that a tuple with INDEX_ALT_TID_MASK set must be a pivot tuple,
- * and that a tuple without INDEX_ALT_TID_MASK set must be a non-pivot
- * tuple (or must have the same number of attributes as the index has
- * generally in the case of !heapkeyspace indexes). They will need to be
- * updated if non-pivot tuples ever get taught to use INDEX_ALT_TID_MASK
- * for something else.
+ * assume that a tuple with INDEX_ALT_TID_MASK set must be a pivot tuple or
+ * non-pivot posting tuple, and that a tuple without INDEX_ALT_TID_MASK set
+ * must be a non-pivot tuple (or must have the same number of attributes as
+ * the index has generally in the case of !heapkeyspace indexes).
*/
#define INDEX_ALT_TID_MASK INDEX_AM_RESERVED_BIT
/* Item pointer offset bits */
#define BT_RESERVED_OFFSET_MASK 0xF000
#define BT_N_KEYS_OFFSET_MASK 0x0FFF
+#define BT_N_POSTING_OFFSET_MASK 0x0FFF
#define BT_HEAP_TID_ATTR 0x1000
+#define BT_IS_POSTING 0x2000
-/* Get/set downlink block number */
+/*
+ * MaxPostingIndexTuplesPerPage is an upper bound on the number of tuples
+ * that can fit on one btree leaf page.
+ *
+ * Btree leaf pages may contain posting tuples, which store duplicates
+ * in a more effective way, so MaxPostingIndexTuplesPerPage is larger than
+ * MaxIndexTuplesPerPage.
+ *
+ * Each leaf page must contain at least three items, so estimate it as
+ * if we have three posting tuples with minimal size keys.
+ */
+#define MaxPostingIndexTuplesPerPage \
+ ((int) ((BLCKSZ - SizeOfPageHeaderData - \
+ 3*((MAXALIGN(sizeof(IndexTupleData) + 1) + sizeof(ItemIdData))) )) / \
+ (sizeof(ItemPointerData)))
+
+/*
+ * Helper for BTDedupState.
+ * Each entry represents a group of 'ntups' consecutive items starting on
+ * 'from' offset that were deduplicated into a single posting tuple.
+ */
+typedef struct dedupInterval
+{
+ OffsetNumber from;
+ OffsetNumber ntups;
+} dedupInterval;
+
+/*
+ * Btree-private state needed to build posting tuples.
+ * ipd is a posting list - an array of ItemPointerData.
+ *
+ * Iterating over tuples during index build or applying deduplication to a
+ * single page, we remember a tuple in itupprev, then compare the next one
+ * with it. If tuples are equal, save their TIDs in the posting list.
+ * ntuples contains the size of the posting list.
+ *
+ * Use maxitemsize and maxpostingsize to ensure that resulting posting tuple
+ * will satisfy BTMaxItemSize.
+ */
+typedef struct BTDedupState
+{
+ Size maxitemsize;
+ Size maxpostingsize;
+ IndexTuple itupprev;
+
+ /*
+ * array with info about deduplicated items on the page.
+ *
+ * It contains one entry for each group of consecutive items that
+ * were deduplicated into a single posting tuple.
+ *
+ * This array is saved to xlog entry, which allows to replay
+ * deduplication faster without actually comparing tuple's keys.
+ */
+ dedupInterval dedup_intervals[MaxOffsetNumber];
+ /* current number of items in dedup_intervals array */
+ int n_intervals;
+ /* temp state variable to keep a 'possible' start of dedup interval */
+ OffsetNumber itupprev_off;
+
+ int ntuples;
+ ItemPointerData *ipd;
+} BTDedupState;
+
+/*
+ * N.B.: BTreeTupleIsPivot() should only be used in code that deals with
+ * heapkeyspace indexes specifically. BTreeTupleIsPosting() works with all
+ * nbtree indexes, though.
+ */
+#define BTreeTupleIsPivot(itup) \
+ ( \
+ ((itup)->t_info & INDEX_ALT_TID_MASK && \
+ ((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) == 0))\
+ )
+#define BTreeTupleIsPosting(itup) \
+ ( \
+ ((itup)->t_info & INDEX_ALT_TID_MASK && \
+ ((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) != 0))\
+ )
+
+#define BTreeTupleClearBtIsPosting(itup) \
+ do { \
+ ItemPointerSetOffsetNumber(&(itup)->t_tid, \
+ ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & ~BT_IS_POSTING); \
+ } while(0)
+
+#define BTreeTupleGetNPosting(itup) \
+ ( \
+ AssertMacro(BTreeTupleIsPosting(itup)), \
+ ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_N_POSTING_OFFSET_MASK \
+ )
+#define BTreeTupleSetNPosting(itup, n) \
+ do { \
+ ItemPointerSetOffsetNumber(&(itup)->t_tid, (n) & BT_N_POSTING_OFFSET_MASK); \
+ Assert((itup)->t_info & INDEX_ALT_TID_MASK); \
+ Assert(!((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) != 0)); \
+ ItemPointerSetOffsetNumber(&(itup)->t_tid, \
+ ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) | BT_IS_POSTING); \
+ } while(0)
+
+/*
+ * If tuple is posting, t_tid.ip_blkid contains offset of the posting list
+ */
+#define BTreeTupleGetPostingOffset(itup) \
+ ( \
+ AssertMacro(BTreeTupleIsPosting(itup)), \
+ ItemPointerGetBlockNumberNoCheck(&((itup)->t_tid)) \
+ )
+#define BTreeSetPostingMeta(itup, nposting, off) \
+ do { \
+ BTreeTupleSetNPosting(itup, nposting); \
+ Assert(BTreeTupleIsPosting(itup)); \
+ ItemPointerSetBlockNumber(&((itup)->t_tid), (off)); \
+ } while(0)
+
+#define BTreeTupleGetPosting(itup) \
+ (ItemPointer) ((char*) (itup) + BTreeTupleGetPostingOffset(itup))
+#define BTreeTupleGetPostingN(itup,n) \
+ (BTreeTupleGetPosting(itup) + (n))
+
+/* Get/set downlink block number */
#define BTreeInnerTupleGetDownLink(itup) \
ItemPointerGetBlockNumberNoCheck(&((itup)->t_tid))
#define BTreeInnerTupleSetDownLink(itup, blkno) \
@@ -326,40 +479,73 @@ typedef struct BTMetaPageData
*/
#define BTreeTupleGetNAtts(itup, rel) \
( \
- (itup)->t_info & INDEX_ALT_TID_MASK ? \
+ ((itup)->t_info & INDEX_ALT_TID_MASK && \
+ ((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) == 0)) ? \
( \
ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_N_KEYS_OFFSET_MASK \
) \
: \
IndexRelationGetNumberOfAttributes(rel) \
)
-#define BTreeTupleSetNAtts(itup, n) \
- do { \
- (itup)->t_info |= INDEX_ALT_TID_MASK; \
- ItemPointerSetOffsetNumber(&(itup)->t_tid, (n) & BT_N_KEYS_OFFSET_MASK); \
- } while(0)
+
+static inline void
+BTreeTupleSetNAtts(IndexTuple itup, int n)
+{
+ Assert(!BTreeTupleIsPosting(itup));
+ itup->t_info |= INDEX_ALT_TID_MASK;
+ ItemPointerSetOffsetNumber(&itup->t_tid, n & BT_N_KEYS_OFFSET_MASK);
+}
/*
- * Get tiebreaker heap TID attribute, if any. Macro works with both pivot
- * and non-pivot tuples, despite differences in how heap TID is represented.
+ * Get tiebreaker heap TID attribute, if any. Works with both pivot and
+ * non-pivot tuples, despite differences in how heap TID is represented.
+ *
+ * This returns the first/lowest heap TID in the case of a posting list tuple.
*/
-#define BTreeTupleGetHeapTID(itup) \
- ( \
- (itup)->t_info & INDEX_ALT_TID_MASK && \
- (ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_HEAP_TID_ATTR) != 0 ? \
- ( \
- (ItemPointer) (((char *) (itup) + IndexTupleSize(itup)) - \
- sizeof(ItemPointerData)) \
- ) \
- : (itup)->t_info & INDEX_ALT_TID_MASK ? NULL : (ItemPointer) &((itup)->t_tid) \
- )
+static inline ItemPointer
+BTreeTupleGetHeapTID(IndexTuple itup)
+{
+ if (BTreeTupleIsPivot(itup))
+ {
+ /* Pivot tuple heap TID representation? */
+ if ((ItemPointerGetOffsetNumberNoCheck(&itup->t_tid) &
+ BT_HEAP_TID_ATTR) != 0)
+ return (ItemPointer) ((char *) itup + IndexTupleSize(itup) -
+ sizeof(ItemPointerData));
+
+ /* Heap TID attribute was truncated */
+ return NULL;
+ }
+ else if (BTreeTupleIsPosting(itup))
+ return BTreeTupleGetPosting(itup);
+
+ return &(itup->t_tid);
+}
+
+/*
+ * Get maximum heap TID attribute, which could be the only TID in the case of
+ * a non-pivot tuple that does not have a posting list. Works with
+ * non-pivot tuples only.
+ */
+static inline ItemPointer
+BTreeTupleGetMaxTID(IndexTuple itup)
+{
+ Assert(!BTreeTupleIsPivot(itup));
+
+ if (BTreeTupleIsPosting(itup))
+ return (ItemPointer) (BTreeTupleGetPosting(itup) +
+ (BTreeTupleGetNPosting(itup) - 1));
+
+ return &(itup->t_tid);
+}
+
/*
* Set the heap TID attribute for a tuple that uses the INDEX_ALT_TID_MASK
- * representation (currently limited to pivot tuples)
+ * representation
*/
#define BTreeTupleSetAltHeapTID(itup) \
do { \
- Assert((itup)->t_info & INDEX_ALT_TID_MASK); \
+ Assert(BTreeTupleIsPivot(itup)); \
ItemPointerSetOffsetNumber(&(itup)->t_tid, \
ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) | BT_HEAP_TID_ATTR); \
} while(0)
@@ -499,6 +685,13 @@ typedef struct BTInsertStateData
/* Buffer containing leaf page we're likely to insert itup on */
Buffer buf;
+ /*
+ * if _bt_binsrch_insert() found the location inside existing posting
+ * list, save the position inside the list. This will be -1 in rare cases
+ * where the overlapping posting list is LP_DEAD.
+ */
+ int in_posting_offset;
+
/*
* Cache of bounds within the current buffer. Only used for insertions
* where _bt_check_unique is called. See _bt_binsrch_insert and
@@ -534,7 +727,9 @@ typedef BTInsertStateData *BTInsertState;
* If we are doing an index-only scan, we save the entire IndexTuple for each
* matched item, otherwise only its heap TID and offset. The IndexTuples go
* into a separate workspace array; each BTScanPosItem stores its tuple's
- * offset within that array.
+ * offset within that array. Posting list tuples store a version of the
+ * tuple that does not include the posting list, allowing the same key to be
+ * returned for each logical tuple associated with the posting list.
*/
typedef struct BTScanPosItem /* what we remember about each match */
@@ -563,9 +758,13 @@ typedef struct BTScanPosData
/*
* If we are doing an index-only scan, nextTupleOffset is the first free
- * location in the associated tuple storage workspace.
+ * location in the associated tuple storage workspace. Posting list
+ * tuples need postingTupleOffset to store the current location of the
+ * tuple that is returned multiple times (once per heap TID in posting
+ * list).
*/
int nextTupleOffset;
+ int postingTupleOffset;
/*
* The items array is always ordered in index order (ie, increasing
@@ -578,7 +777,7 @@ typedef struct BTScanPosData
int lastItem; /* last valid index in items[] */
int itemIndex; /* current index in items[] */
- BTScanPosItem items[MaxIndexTuplesPerPage]; /* MUST BE LAST */
+ BTScanPosItem items[MaxPostingIndexTuplesPerPage]; /* MUST BE LAST */
} BTScanPosData;
typedef BTScanPosData *BTScanPos;
@@ -730,8 +929,11 @@ extern void _bt_parallel_advance_array_keys(IndexScanDesc scan);
*/
extern bool _bt_doinsert(Relation rel, IndexTuple itup,
IndexUniqueCheck checkUnique, Relation heapRel);
-extern void _bt_finish_split(Relation rel, Buffer bbuf, BTStack stack);
extern Buffer _bt_getstackbuf(Relation rel, BTStack stack, BlockNumber child);
+extern void _bt_finish_split(Relation rel, Buffer bbuf, BTStack stack);
+extern IndexTuple _bt_form_newposting(IndexTuple itup, IndexTuple oposting,
+ OffsetNumber in_posting_offset);
+extern void _bt_dedup_insert(Page page, BTDedupState *dedupState);
/*
* prototypes for functions in nbtsplitloc.c
@@ -762,6 +964,8 @@ extern void _bt_delitems_delete(Relation rel, Buffer buf,
OffsetNumber *itemnos, int nitems, Relation heapRel);
extern void _bt_delitems_vacuum(Relation rel, Buffer buf,
OffsetNumber *itemnos, int nitems,
+ OffsetNumber *remainingoffset,
+ IndexTuple *remaining, int nremaining,
BlockNumber lastBlockVacuumed);
extern int _bt_pagedel(Relation rel, Buffer buf);
@@ -812,6 +1016,9 @@ extern bool _bt_check_natts(Relation rel, bool heapkeyspace, Page page,
OffsetNumber offnum);
extern void _bt_check_third_page(Relation rel, Relation heap,
bool needheaptidspace, Page page, IndexTuple newtup);
+extern IndexTuple BTreeFormPostingTuple(IndexTuple tuple, ItemPointerData *ipd,
+ int nipd);
+extern IndexTuple BTreeGetNthTupleOfPosting(IndexTuple tuple, int n);
/*
* prototypes for functions in nbtvalidate.c
@@ -824,5 +1031,7 @@ extern bool btvalidate(Oid opclassoid);
extern IndexBuildResult *btbuild(Relation heap, Relation index,
struct IndexInfo *indexInfo);
extern void _bt_parallel_build_main(dsm_segment *seg, shm_toc *toc);
+extern void _bt_stash_item_tid(BTDedupState *dedupState, IndexTuple itup,
+ OffsetNumber itup_offnum);
#endif /* NBTREE_H */
diff --git a/src/include/access/nbtxlog.h b/src/include/access/nbtxlog.h
index 91b9ee00cf..7d41adccac 100644
--- a/src/include/access/nbtxlog.h
+++ b/src/include/access/nbtxlog.h
@@ -28,7 +28,8 @@
#define XLOG_BTREE_INSERT_META 0x20 /* same, plus update metapage */
#define XLOG_BTREE_SPLIT_L 0x30 /* add index tuple with split */
#define XLOG_BTREE_SPLIT_R 0x40 /* as above, new item on right */
-/* 0x50 and 0x60 are unused */
+#define XLOG_BTREE_DEDUP_PAGE 0x50 /* compactify tuples on the page */
+/* 0x60 is unused */
#define XLOG_BTREE_DELETE 0x70 /* delete leaf index tuples for a page */
#define XLOG_BTREE_UNLINK_PAGE 0x80 /* delete a half-dead page */
#define XLOG_BTREE_UNLINK_PAGE_META 0x90 /* same, and update metapage */
@@ -61,16 +62,21 @@ typedef struct xl_btree_metadata
* This data record is used for INSERT_LEAF, INSERT_UPPER, INSERT_META.
* Note that INSERT_META implies it's not a leaf page.
*
- * Backup Blk 0: original page (data contains the inserted tuple)
+ * Backup Blk 0: original page (data contains the inserted tuple);
+ * if in_posting_offset is valid, this is an insertion
+ * into existing posting tuple at offnum.
+ * redo must repeat logic of _bt_insertonpg().
* Backup Blk 1: child's left sibling, if INSERT_UPPER or INSERT_META
* Backup Blk 2: xl_btree_metadata, if INSERT_META
+ *
*/
typedef struct xl_btree_insert
{
OffsetNumber offnum;
+ OffsetNumber in_posting_offset;
} xl_btree_insert;
-#define SizeOfBtreeInsert (offsetof(xl_btree_insert, offnum) + sizeof(OffsetNumber))
+#define SizeOfBtreeInsert (offsetof(xl_btree_insert, in_posting_offset) + sizeof(OffsetNumber))
/*
* On insert with split, we save all the items going into the right sibling
@@ -95,6 +101,11 @@ typedef struct xl_btree_insert
* An IndexTuple representing the high key of the left page must follow with
* either variant.
*
+ * If the split included an insertion into the middle of a posting tuple, and
+ * thus required posting tuple replacement, the record also contains
+ * 'in_posting_offset', which is used to form the replacement tuple and repeat
+ * the _bt_insertonpg() logic.
+ * It is added to xlog only if the replaced item remains on the left page.
+ *
* Backup Blk 1: new right page
*
* The right page's data portion contains the right page's tuples in the form
@@ -112,9 +123,26 @@ typedef struct xl_btree_split
uint32 level; /* tree level of page being split */
OffsetNumber firstright; /* first item moved to right page */
OffsetNumber newitemoff; /* new item's offset (useful for _L variant) */
+ OffsetNumber in_posting_offset; /* offset inside posting tuple */
} xl_btree_split;
-#define SizeOfBtreeSplit (offsetof(xl_btree_split, newitemoff) + sizeof(OffsetNumber))
+#define SizeOfBtreeSplit (offsetof(xl_btree_split, in_posting_offset) + sizeof(OffsetNumber))
+
+/*
+ * When page is deduplicated, consecutive groups of tuples with equal keys
+ * are compactified into posting tuples.
+ * The WAL record keeps number of resulting posting tuples - n_intervals
+ * followed by array of dedupInterval structures, that hold information
+ * needed to replay page deduplication without extra comparisons of tuples keys.
+ */
+typedef struct xl_btree_dedup
+{
+ int n_intervals;
+
+ /* TARGET DEDUP INTERVALS FOLLOW AT THE END */
+} xl_btree_dedup;
+#define SizeOfBtreeDedup (sizeof(int))
+
/*
* This is what we need to know about delete of individual leaf index tuples.
@@ -172,10 +200,19 @@ typedef struct xl_btree_vacuum
{
BlockNumber lastBlockVacuumed;
- /* TARGET OFFSET NUMBERS FOLLOW */
+ /*
+ * This field helps us to find beginning of the remaining tuples from
+ * postings which follow array of offset numbers.
+ */
+ uint32 nremaining;
+ uint32 ndeleted;
+
+ /* REMAINING OFFSET NUMBERS FOLLOW (nremaining values) */
+ /* REMAINING TUPLES TO INSERT FOLLOW (if nremaining > 0) */
+ /* TARGET OFFSET NUMBERS FOLLOW (if any) */
} xl_btree_vacuum;
-#define SizeOfBtreeVacuum (offsetof(xl_btree_vacuum, lastBlockVacuumed) + sizeof(BlockNumber))
+#define SizeOfBtreeVacuum (offsetof(xl_btree_vacuum, ndeleted) + sizeof(uint32))
/*
* This is what we need to know about marking an empty branch for deletion.
diff --git a/src/tools/valgrind.supp b/src/tools/valgrind.supp
index ec47a228ae..71a03e39d3 100644
--- a/src/tools/valgrind.supp
+++ b/src/tools/valgrind.supp
@@ -212,3 +212,24 @@
Memcheck:Cond
fun:PyObject_Realloc
}
+
+# Temporarily work around bug in datum_image_eq's handling of the cstring
+# (typLen == -2) case. datumIsEqual() is not affected, but also doesn't handle
+# TOAST'ed values correctly.
+#
+# FIXME: Remove both suppressions when bug is fixed on master branch
+{
+ temporary_workaround_1
+ Memcheck:Addr1
+ fun:bcmp
+ fun:datum_image_eq
+ fun:_bt_keep_natts_fast
+}
+
+{
+ temporary_workaround_8
+ Memcheck:Addr8
+ fun:bcmp
+ fun:datum_image_eq
+ fun:_bt_keep_natts_fast
+}
--
2.17.1
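For readers skimming the nbtree.h hunk above, here is a minimal standalone sketch of how the patch appears to pack posting-list metadata into the 16-bit offset field of t_tid. The three mask values are copied from the patch; the toy_* names and the plain uint16_t stand-in for the offset field are assumptions made for illustration only and are not PostgreSQL APIs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Mask values as defined in the patch's nbtree.h hunk. */
#define BT_N_POSTING_OFFSET_MASK 0x0FFF	/* low 12 bits: number of TIDs */
#define BT_HEAP_TID_ATTR         0x1000	/* status bit used by pivot tuples */
#define BT_IS_POSTING            0x2000	/* status bit marking a posting tuple */

/* Hypothetical stand-in for the 16-bit offset part of t_tid. */
typedef uint16_t toy_offset;

/* Roughly what BTreeTupleSetNPosting does: store the count, set the flag. */
static toy_offset
toy_make_posting_offset(int nposting)
{
	return (toy_offset) ((nposting & BT_N_POSTING_OFFSET_MASK) | BT_IS_POSTING);
}

/* Roughly the offset-field half of the BTreeTupleIsPosting test. */
static bool
toy_is_posting(toy_offset off)
{
	return (off & BT_IS_POSTING) != 0;
}

/* Roughly what BTreeTupleGetNPosting does: mask off the status bits. */
static int
toy_get_nposting(toy_offset off)
{
	return off & BT_N_POSTING_OFFSET_MASK;
}
```

With these helpers, toy_make_posting_offset(300) yields 0x212C: the count 300 (0x012C) in the low 12 bits plus the BT_IS_POSTING bit. The real macros additionally require INDEX_ALT_TID_MASK to be set in t_info, which this sketch leaves out. Twelve bits cap the count at 4095, which is why the patch comment ties the limit to BTMaxItemSize.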
16.09.2019 21:58, Peter Geoghegan wrote:
On Mon, Sep 16, 2019 at 8:48 AM Anastasia Lubennikova
<a.lubennikova@postgrespro.ru> wrote:
I tested patch with nbtree_wal_test, and found out that the real issue is
not the dedup WAL records themselves, but the full page writes that they trigger.
Here are test results (config is standard, except fsync=off to speed up tests):
'FPW on' and 'FPW off' are tests on v14.
NO_IMAGE is the test on v14 with REGBUF_NO_IMAGE in bt_dedup_one_page().
I think that it makes sense to focus on synthetic cases without
FPWs/FPIs from checkpoints. At least for now.
With random insertions into btree it's highly possible that deduplication will often be
the first write after checkpoint, and thus will trigger FPW, even if only a few tuples were compressed.
<...>
I think that the problem here is that you didn't copy this old code
from _bt_split() over to _bt_dedup_one_page():
/*
* Copy the original page's LSN into leftpage, which will become the
* updated version of the page. We need this because XLogInsert will
* examine the LSN and possibly dump it in a page image.
*/
PageSetLSN(leftpage, PageGetLSN(origpage));
isleaf = P_ISLEAF(oopaque);
Note that this happens at the start of _bt_split() -- the temp page
buffer based on origpage starts out with the same LSN as origpage.
This is an important step of the WAL volume optimization used by
_bt_split().
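To make the point about the LSN concrete, here is a toy model of the decision being described. It assumes, in simplified form, that a full-page image is emitted when the page's LSN is not newer than the redo pointer of the last checkpoint; all toy_* names and types are hypothetical, and this is a sketch of the idea rather than the actual XLogInsert code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t toy_lsn;		/* hypothetical stand-in for XLogRecPtr */

typedef struct toy_page
{
	toy_lsn		lsn;			/* LSN of the last WAL record that touched the page */
} toy_page;

/*
 * Assumed, simplified rule: with full_page_writes on, a page needs a
 * full-page image if it has not been WAL-logged since the last checkpoint.
 */
static bool
toy_needs_full_page_image(const toy_page *page, toy_lsn redo_ptr)
{
	return page->lsn <= redo_ptr;
}

/*
 * Build the temp page that will replace the original.  A freshly
 * initialized page starts with LSN 0 unless the caller copies the
 * original page's LSN, as _bt_split() does.
 */
static toy_page
toy_make_temp_page(const toy_page *orig, bool copy_lsn)
{
	toy_page	temp = {0};

	if (copy_lsn)
		temp.lsn = orig->lsn;
	return temp;
}
```

Under this model, a page with LSN 5000 and a redo pointer of 4000 was already logged since the checkpoint, so no image is needed. A temp page built without copying the LSN looks like it was never logged and triggers an image on every deduplication, which would match the unexpectedly large FPW volume reported above.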
That's it. I suspected that such an enormous amount of FPWs reflects some bug.
That's why there is no significant difference with the log_newpage_buffer() approach.
And that's why "lazy" deduplication doesn't help to decrease the amount of WAL.
My point was that the problem is extra FPWs, so it doesn't matter
whether we deduplicate just several entries to free enough space or all
of them.
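To make the effect of the missing PageSetLSN() call concrete, here is a minimal sketch of the full-page-image decision. It is a toy model, not PostgreSQL code: it assumes the simplified rule that a registered buffer gets a full-page image when its page LSN is not newer than the redo pointer of the last checkpoint. The names toy_lsn, needs_full_page_image() and temp_page_lsn() are hypothetical, and the LSN values are made up.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for XLogRecPtr; not the PostgreSQL type. */
typedef uint64_t toy_lsn;

/*
 * Simplified model of the check: a page needs a full-page image if it has
 * not been WAL-logged since the redo pointer of the last checkpoint.
 */
static bool
needs_full_page_image(toy_lsn page_lsn, toy_lsn redo_ptr, bool fpw_enabled)
{
    return fpw_enabled && page_lsn <= redo_ptr;
}

/*
 * A temp page built from scratch carries LSN 0 unless the original page's
 * LSN is copied into it first, which is what _bt_split() does.
 */
static toy_lsn
temp_page_lsn(toy_lsn orig_lsn, bool copy_lsn)
{
    return copy_lsn ? orig_lsn : 0;
}
```

With an original page LSN of 500 and a redo pointer of 400, the temp page without the copied LSN looks "older than the checkpoint" and triggers an image every time, while the temp page with the copied LSN does not. That would explain why every dedup pass produced an FPW regardless of how few tuples it merged.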
The term "lazy deduplication" is seriously overloaded here. I think
that this could cause miscommunications. Let me list the possible
meanings of that term here:

1. First of all, the basic approach to deduplication is already lazy,
unlike GIN, in the sense that _bt_dedup_one_page() is called to avoid
a page split. I'm 100% sure that we both think that that works well
compared to an eager approach (like GIN's).
Sure.
2. Second of all, there is the need to incrementally WAL log. It looks
like v14 does that well, in that it doesn't create
"xlrec_dedup.n_intervals" space when it isn't truly needed. That's
good.
In v12-v15 I mostly concentrated on this feature.
The last version looks good to me.
3. Third, there is incremental writing of the page itself -- avoiding
using a temp buffer. Not sure where I stand on this.
I think it's a good idea. memmove must be much faster than copying
items tuple by tuple.
I'll send the next patch by the end of the week.
4. Finally, there is the possibility that we could make deduplication
incremental, in order to avoid work that won't be needed altogether --
this would probably be combined with 3. Not sure where I stand on
this, either.

We should try to be careful when using these terms, as there is a very
real danger of talking past each other.

Another, and more realistic approach is to make deduplication less intensive:
if freed space is less than some threshold, fall back to not changing page at all and not generating xlog record.

I see that v14 uses the "dedupInterval" struct, which provides a
logical description of a deduplicated set of tuples. That general
approach is at least 95% of what I wanted from the
_bt_dedup_one_page() WAL-logging.

Probably that was the reason why the patch became faster after I added BT_COMPRESS_THRESHOLD in early versions:
not because deduplication itself is CPU-bound or something, but because WAL load decreased.

I think so too -- BT_COMPRESS_THRESHOLD definitely makes compression
faster as things are. I am not against bringing back
BT_COMPRESS_THRESHOLD. I just don't want to do it right now because I
think that it's a distraction. It may hide problems that we want to
fix. Like the PageSetLSN() problem I mentioned just now, and maybe
others.

We will definitely need to have page space accounting that's a bit
similar to nbtsplitloc.c, to avoid the case where a leaf page is 100%
full (or has 4 bytes left, or something). That happens regularly now.
That must start with teaching _bt_dedup_one_page() about how much
space it will free. Basing it on the number of items on the page or
whatever is not going to work that well.

I think that it would be possible to have something like
BT_COMPRESS_THRESHOLD to prevent thrashing, and *also* make the
deduplication incremental, in the sense that it can give up on
deduplication when it frees enough space (i.e. something like v13's
0002-* patch). I said that these two things are closely related, which
is true, but it's also true that they don't overlap.

Don't forget the reason why I removed BT_COMPRESS_THRESHOLD: Doing so
made a handful of specific indexes (mostly from TPC-H) significantly
smaller. I never tried to debug the problem. It's possible that we
could bring back BT_COMPRESS_THRESHOLD (or something fillfactor-like),
but not use it on rightmost pages, and get the best of both worlds.
IIRC right-heavy low cardinality indexes (e.g. a low cardinality date
column) were improved by removing BT_COMPRESS_THRESHOLD, but we can
debug that when the time comes.
Now that the extra FPWs are proven to be a bug, I agree that giving up on
deduplication early is not necessary.
My previous considerations were based on the idea that deduplication
always adds considerable overhead,
which is not true after recent optimizations.

So I propose to develop this idea. The question is how to choose the threshold.
I wouldn't like to introduce new user settings. Any ideas?

I think that there should be a target fill factor that sometimes makes
deduplication leave a small amount of free space. Maybe that means
that the last posting list on the page is made a bit smaller than the
other ones. It should be "goal orientated".

The loop within _bt_dedup_one_page() is very confusing in both v13 and
v14 -- I couldn't figure out why the accounting worked like this:

+ /*
+  * Project size of new posting list that would result from merging
+  * current tup with pending posting list (could just be prev item
+  * that's "pending").
+  *
+  * This accounting looks odd, but it's correct because ...
+  */
+ projpostingsz = MAXALIGN(IndexTupleSize(dedupState->itupprev) +
+                          (dedupState->ntuples + itup_ntuples + 1) *
+                          sizeof(ItemPointerData));

Why the "+1" here?
I'll look at it.
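For readers following the accounting question, here is a small sketch of how a projected posting list size grows with the number of heap TIDs. It does not answer the "+1" question and is not the patch's code: it assumes 8-byte maximum alignment and a 6-byte heap TID, and it ignores any padding between the key and the TID array. The names toy_maxalign() and projected_posting_size() are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAXALIGN_OF 8   /* assumed maximum alignment (typical on 64-bit) */
#define TOY_TID_SIZE    6   /* assumed sizeof(ItemPointerData) */

/* Round len up to the next multiple of TOY_MAXALIGN_OF. */
static size_t
toy_maxalign(size_t len)
{
    return (len + (TOY_MAXALIGN_OF - 1)) & ~((size_t) (TOY_MAXALIGN_OF - 1));
}

/*
 * base_size: size of the tuple header plus key, without any TID array.
 * ntids: number of heap TIDs the merged posting list would hold.
 */
static size_t
projected_posting_size(size_t base_size, int ntids)
{
    assert(ntids > 0);
    return toy_maxalign(base_size + (size_t) ntids * TOY_TID_SIZE);
}
```

Under these assumptions a 16-byte base with 3 TIDs and with 4 TIDs both project to 40 bytes, while 5 TIDs project to 48. An off-by-one in the TID count therefore changes the projected size only some of the time, which makes that kind of accounting slip easy to miss in testing.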
I have significantly refactored the _bt_dedup_one_page() loop in a way
that seems like a big improvement. It allowed me to remove all of the
small palloc() calls inside the loop, apart from the
BTreeFormPostingTuple() palloc()s. It's also a lot faster -- it seems
to have shaved about 2 seconds off the "land" unlogged table test,
which was originally about 1 minute 2 seconds with v13's 0001-* patch
(and without v13's 0002-* patch).

It seems like it can easily be integrated with the approach to WAL
logging taken in v14, so everything can be integrated soon. I'll work
on that.
A new version is attached.
It is v14 (with the PageSetLSN fix) merged with v13.
I also fixed a bug in btree_xlog_dedup() that was previously masked by FPWs.
v15 passes make installcheck.
I haven't tested it with the land test yet. Will do it later this week.
--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachments:
v15-0001-Add-deduplication-to-nbtree.patch (text/x-patch)
diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index 05e7d67..83519cb 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -145,6 +145,7 @@ static void bt_tuple_present_callback(Relation index, HeapTuple htup,
bool tupleIsAlive, void *checkstate);
static IndexTuple bt_normalize_tuple(BtreeCheckState *state,
IndexTuple itup);
+static inline IndexTuple bt_posting_logical_tuple(IndexTuple itup, int n);
static bool bt_rootdescend(BtreeCheckState *state, IndexTuple itup);
static inline bool offset_is_negative_infinity(BTPageOpaque opaque,
OffsetNumber offset);
@@ -419,12 +420,13 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
/*
* Size Bloom filter based on estimated number of tuples in index,
* while conservatively assuming that each block must contain at least
- * MaxIndexTuplesPerPage / 5 non-pivot tuples. (Non-leaf pages cannot
- * contain non-pivot tuples. That's okay because they generally make
- * up no more than about 1% of all pages in the index.)
+ * MaxPostingIndexTuplesPerPage / 3 "logical" tuples. heapallindexed
+ * verification fingerprints posting list heap TIDs as plain non-pivot
+ * tuples, complete with index keys. This allows its heap scan to
+ * behave as if posting lists do not exist.
*/
total_pages = RelationGetNumberOfBlocks(rel);
- total_elems = Max(total_pages * (MaxIndexTuplesPerPage / 5),
+ total_elems = Max(total_pages * (MaxPostingIndexTuplesPerPage / 3),
(int64) state->rel->rd_rel->reltuples);
/* Random seed relies on backend srandom() call to avoid repetition */
seed = random();
@@ -924,6 +926,7 @@ bt_target_page_check(BtreeCheckState *state)
size_t tupsize;
BTScanInsert skey;
bool lowersizelimit;
+ ItemPointer scantid;
CHECK_FOR_INTERRUPTS();
@@ -994,29 +997,73 @@ bt_target_page_check(BtreeCheckState *state)
/*
* Readonly callers may optionally verify that non-pivot tuples can
- * each be found by an independent search that starts from the root
+ * each be found by an independent search that starts from the root.
+ * Note that we deliberately don't do individual searches for each
+ * "logical" posting list tuple, since the posting list itself is
+ * validated by other checks.
*/
if (state->rootdescend && P_ISLEAF(topaque) &&
!bt_rootdescend(state, itup))
{
char *itid,
*htid;
+ ItemPointer tid = BTreeTupleGetHeapTID(itup);
itid = psprintf("(%u,%u)", state->targetblock, offset);
htid = psprintf("(%u,%u)",
- ItemPointerGetBlockNumber(&(itup->t_tid)),
- ItemPointerGetOffsetNumber(&(itup->t_tid)));
+ ItemPointerGetBlockNumber(tid),
+ ItemPointerGetOffsetNumber(tid));
ereport(ERROR,
(errcode(ERRCODE_INDEX_CORRUPTED),
errmsg("could not find tuple using search from root page in index \"%s\"",
RelationGetRelationName(state->rel)),
- errdetail_internal("Index tid=%s points to heap tid=%s page lsn=%X/%X.",
+ errdetail_internal("Index tid=%s min heap tid=%s page lsn=%X/%X.",
itid, htid,
(uint32) (state->targetlsn >> 32),
(uint32) state->targetlsn)));
}
+ /*
+ * If tuple is actually a posting list, make sure posting list TIDs
+ * are in order.
+ */
+ if (BTreeTupleIsPosting(itup))
+ {
+ ItemPointerData last;
+ ItemPointer current;
+
+ ItemPointerCopy(BTreeTupleGetHeapTID(itup), &last);
+
+ for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
+ {
+
+ current = BTreeTupleGetPostingN(itup, i);
+
+ if (ItemPointerCompare(current, &last) <= 0)
+ {
+ char *itid,
+ *htid;
+
+ itid = psprintf("(%u,%u)", state->targetblock, offset);
+ htid = psprintf("(%u,%u)",
+ ItemPointerGetBlockNumberNoCheck(current),
+ ItemPointerGetOffsetNumberNoCheck(current));
+
+ ereport(ERROR,
+ (errcode(ERRCODE_INDEX_CORRUPTED),
+ errmsg("posting list heap TIDs out of order in index \"%s\"",
+ RelationGetRelationName(state->rel)),
+ errdetail_internal("Index tid=%s min heap tid=%s page lsn=%X/%X.",
+ itid, htid,
+ (uint32) (state->targetlsn >> 32),
+ (uint32) state->targetlsn)));
+ }
+
+ ItemPointerCopy(current, &last);
+ }
+ }
+
/* Build insertion scankey for current page offset */
skey = bt_mkscankey_pivotsearch(state->rel, itup);
@@ -1074,12 +1121,32 @@ bt_target_page_check(BtreeCheckState *state)
{
IndexTuple norm;
- norm = bt_normalize_tuple(state, itup);
- bloom_add_element(state->filter, (unsigned char *) norm,
- IndexTupleSize(norm));
- /* Be tidy */
- if (norm != itup)
- pfree(norm);
+ if (BTreeTupleIsPosting(itup))
+ {
+ /* Fingerprint all elements as distinct "logical" tuples */
+ for (int i = 0; i < BTreeTupleGetNPosting(itup); i++)
+ {
+ IndexTuple logtuple;
+
+ logtuple = bt_posting_logical_tuple(itup, i);
+ norm = bt_normalize_tuple(state, logtuple);
+ bloom_add_element(state->filter, (unsigned char *) norm,
+ IndexTupleSize(norm));
+ /* Be tidy */
+ if (norm != logtuple)
+ pfree(norm);
+ pfree(logtuple);
+ }
+ }
+ else
+ {
+ norm = bt_normalize_tuple(state, itup);
+ bloom_add_element(state->filter, (unsigned char *) norm,
+ IndexTupleSize(norm));
+ /* Be tidy */
+ if (norm != itup)
+ pfree(norm);
+ }
}
/*
@@ -1087,7 +1154,8 @@ bt_target_page_check(BtreeCheckState *state)
*
* If there is a high key (if this is not the rightmost page on its
* entire level), check that high key actually is upper bound on all
- * page items.
+ * page items. If this is a posting list tuple, we'll need to set
+ * scantid to be highest TID in posting list.
*
* We prefer to check all items against high key rather than checking
* just the last and trusting that the operator class obeys the
@@ -1127,6 +1195,9 @@ bt_target_page_check(BtreeCheckState *state)
* tuple. (See also: "Notes About Data Representation" in the nbtree
* README.)
*/
+ scantid = skey->scantid;
+ if (state->heapkeyspace && !BTreeTupleIsPivot(itup))
+ skey->scantid = BTreeTupleGetMaxTID(itup);
if (!P_RIGHTMOST(topaque) &&
!(P_ISLEAF(topaque) ? invariant_leq_offset(state, skey, P_HIKEY) :
invariant_l_offset(state, skey, P_HIKEY)))
@@ -1150,6 +1221,7 @@ bt_target_page_check(BtreeCheckState *state)
(uint32) (state->targetlsn >> 32),
(uint32) state->targetlsn)));
}
+ skey->scantid = scantid;
/*
* * Item order check *
@@ -1164,11 +1236,13 @@ bt_target_page_check(BtreeCheckState *state)
*htid,
*nitid,
*nhtid;
+ ItemPointer tid;
itid = psprintf("(%u,%u)", state->targetblock, offset);
+ tid = BTreeTupleGetHeapTID(itup);
htid = psprintf("(%u,%u)",
- ItemPointerGetBlockNumberNoCheck(&(itup->t_tid)),
- ItemPointerGetOffsetNumberNoCheck(&(itup->t_tid)));
+ ItemPointerGetBlockNumberNoCheck(tid),
+ ItemPointerGetOffsetNumberNoCheck(tid));
nitid = psprintf("(%u,%u)", state->targetblock,
OffsetNumberNext(offset));
@@ -1177,9 +1251,11 @@ bt_target_page_check(BtreeCheckState *state)
state->target,
OffsetNumberNext(offset));
itup = (IndexTuple) PageGetItem(state->target, itemid);
+
+ tid = BTreeTupleGetHeapTID(itup);
nhtid = psprintf("(%u,%u)",
- ItemPointerGetBlockNumberNoCheck(&(itup->t_tid)),
- ItemPointerGetOffsetNumberNoCheck(&(itup->t_tid)));
+ ItemPointerGetBlockNumberNoCheck(tid),
+ ItemPointerGetOffsetNumberNoCheck(tid));
ereport(ERROR,
(errcode(ERRCODE_INDEX_CORRUPTED),
@@ -1189,10 +1265,10 @@ bt_target_page_check(BtreeCheckState *state)
"higher index tid=%s (points to %s tid=%s) "
"page lsn=%X/%X.",
itid,
- P_ISLEAF(topaque) ? "heap" : "index",
+ P_ISLEAF(topaque) ? "min heap" : "index",
htid,
nitid,
- P_ISLEAF(topaque) ? "heap" : "index",
+ P_ISLEAF(topaque) ? "min heap" : "index",
nhtid,
(uint32) (state->targetlsn >> 32),
(uint32) state->targetlsn)));
@@ -1953,10 +2029,10 @@ bt_tuple_present_callback(Relation index, HeapTuple htup, Datum *values,
* verification. In particular, it won't try to normalize opclass-equal
* datums with potentially distinct representations (e.g., btree/numeric_ops
* index datums will not get their display scale normalized-away here).
- * Normalization may need to be expanded to handle more cases in the future,
- * though. For example, it's possible that non-pivot tuples could in the
- * future have alternative logically equivalent representations due to using
- * the INDEX_ALT_TID_MASK bit to implement intelligent deduplication.
+ * Caller does normalization for non-pivot tuples that have a posting list,
+ * since dummy CREATE INDEX callback code generates new tuples with the same
+ * normalized representation. Deduplication is performed opportunistically,
+ * and in general there is no guarantee about how or when it will be applied.
*/
static IndexTuple
bt_normalize_tuple(BtreeCheckState *state, IndexTuple itup)
@@ -1969,6 +2045,9 @@ bt_normalize_tuple(BtreeCheckState *state, IndexTuple itup)
IndexTuple reformed;
int i;
+ /* Caller should only pass "logical" non-pivot tuples here */
+ Assert(!BTreeTupleIsPosting(itup) && !BTreeTupleIsPivot(itup));
+
/* Easy case: It's immediately clear that tuple has no varlena datums */
if (!IndexTupleHasVarwidths(itup))
return itup;
@@ -2032,6 +2111,30 @@ bt_normalize_tuple(BtreeCheckState *state, IndexTuple itup)
}
/*
+ * Produce palloc()'d "logical" tuple for nth posting list entry.
+ *
+ * In general, deduplication is not supposed to change the logical contents of
+ * an index. Multiple logical index tuples are folded together into one
+ * physical posting list index tuple when convenient.
+ *
+ * heapallindexed verification must normalize-away this variation in
+ * representation by converting posting list tuples into two or more "logical"
+ * tuples. Each logical tuple must be fingerprinted separately -- there must
+ * be one logical tuple for each corresponding Bloom filter probe during the
+ * heap scan.
+ *
+ * Note: Caller needs to call bt_normalize_tuple() with returned tuple.
+ */
+static inline IndexTuple
+bt_posting_logical_tuple(IndexTuple itup, int n)
+{
+ Assert(BTreeTupleIsPosting(itup));
+
+ /* Returns non-posting-list tuple */
+ return BTreeFormPostingTuple(itup, BTreeTupleGetPostingN(itup, n), 1);
+}
+
+/*
* Search for itup in index, starting from fast root page. itup must be a
* non-pivot tuple. This is only supported with heapkeyspace indexes, since
* we rely on having fully unique keys to find a match with only a single
@@ -2087,6 +2190,7 @@ bt_rootdescend(BtreeCheckState *state, IndexTuple itup)
insertstate.itup = itup;
insertstate.itemsz = MAXALIGN(IndexTupleSize(itup));
insertstate.itup_key = key;
+ insertstate.in_posting_offset = 0;
insertstate.bounds_valid = false;
insertstate.buf = lbuf;
@@ -2094,7 +2198,9 @@ bt_rootdescend(BtreeCheckState *state, IndexTuple itup)
offnum = _bt_binsrch_insert(state->rel, &insertstate);
/* Compare first >= matching item on leaf page, if any */
page = BufferGetPage(lbuf);
+ /* Should match on first heap TID when tuple has a posting list */
if (offnum <= PageGetMaxOffsetNumber(page) &&
+ insertstate.in_posting_offset <= 0 &&
_bt_compare(state->rel, key, page, offnum) == 0)
exists = true;
_bt_relbuf(state->rel, lbuf);
@@ -2560,14 +2666,18 @@ static inline ItemPointer
BTreeTupleGetHeapTIDCareful(BtreeCheckState *state, IndexTuple itup,
bool nonpivot)
{
- ItemPointer result = BTreeTupleGetHeapTID(itup);
+ ItemPointer result;
BlockNumber targetblock = state->targetblock;
- if (result == NULL && nonpivot)
+ /* Shouldn't be called with !heapkeyspace index */
+ Assert(state->heapkeyspace);
+ if (BTreeTupleIsPivot(itup) == nonpivot)
ereport(ERROR,
(errcode(ERRCODE_INDEX_CORRUPTED),
errmsg("block %u or its right sibling block or child block in index \"%s\" contains non-pivot tuple that lacks a heap TID",
targetblock, RelationGetRelationName(state->rel))));
+ result = BTreeTupleGetHeapTID(itup);
+
return result;
}
diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index 6db203e..54cb9db 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -432,7 +432,10 @@ because we allow LP_DEAD to be set with only a share lock (it's exactly
like a hint bit for a heap tuple), but physically removing tuples requires
exclusive lock. In the current code we try to remove LP_DEAD tuples when
we are otherwise faced with having to split a page to do an insertion (and
-hence have exclusive lock on it already).
+hence have exclusive lock on it already). Deduplication can also prevent
+a page split, but removing LP_DEAD tuples is the preferred approach.
+(Note that posting list tuples can only have their LP_DEAD bit set when
+every "logical" tuple represented within the posting list is known dead.)
This leaves the index in a state where it has no entry for a dead tuple
that still exists in the heap. This is not a problem for the current
@@ -710,6 +713,75 @@ the fallback strategy assumes that duplicates are mostly inserted in
ascending heap TID order. The page is split in a way that leaves the left
half of the page mostly full, and the right half of the page mostly empty.
+Notes about deduplication
+-------------------------
+
+We deduplicate non-pivot tuples in non-unique indexes to reduce storage
+overhead, and to avoid or at least delay page splits. Deduplication alters
+the physical representation of tuples without changing the logical contents
+of the index, and without adding overhead to read queries. Non-pivot
+tuples are folded together into a single physical tuple with a posting list
+(a simple array of heap TIDs with the standard item pointer format).
+Deduplication is always applied lazily, at the point where it would
+otherwise be necessary to perform a page split. It occurs only when
+LP_DEAD items have been removed, as our last line of defense against
+splitting a leaf page. We can set the LP_DEAD bit with posting list
+tuples, though only when all table tuples are known dead. (Bitmap scans
+cannot perform LP_DEAD bit setting, and are the common case with indexes
+that contain lots of duplicates, so this downside is considered
+acceptable.)
+
+Large groups of logical duplicates tend to appear together on the same leaf
+page due to the special duplicate logic used when choosing a split point.
+This facilitates lazy/dynamic deduplication. Deduplication can reliably
+deduplicate a large localized group of duplicates before it can span
+multiple leaf pages. Posting list tuples are subject to the same 1/3 of a
+page restriction as any other tuple.
+
+Lazy deduplication allows the page space accounting used during page splits
+to have absolutely minimal special case logic for posting lists. A posting
+list can be thought of as extra payload that suffix truncation will
+reliably truncate away as needed during page splits, just like non-key
+columns from an INCLUDE index tuple. An incoming tuple (which might cause
+a page split) can always be thought of as a non-posting-list tuple that
+must be inserted alongside existing items, without needing to consider
+deduplication. Most of the time, that's what actually happens: incoming
+tuples are either not duplicates, or are duplicates with a heap TID that
+doesn't overlap with any existing posting list tuple. When the incoming
+tuple really does overlap with an existing posting list, a posting list
+split is performed. Posting list splits work in a way that more or less
+preserves the illusion that all incoming tuples do not need to be merged
+with any existing posting list tuple.
+
+Posting list splits work by "overriding" the details of the incoming tuple.
+The heap TID of the incoming tuple is altered to make it match the
+rightmost heap TID from the existing/originally overlapping posting list.
+The offset number that the new/incoming tuple is to be inserted at is
+incremented so that it will be inserted to the right of the existing
+posting list. The insertion (or page split) operation that completes the
+insert does one extra step: an in-place update of the posting list. The
+update changes the posting list such that the "true" heap TID from the
+original incoming tuple is now contained in the posting list. We make
+space in the posting list by removing the heap TID that became the new
+item. The size of the posting list won't change, and so the page split
+space accounting does not need to care about posting lists. Also, overall
+space utilization is improved by keeping existing posting lists large.
+
+The representation of posting lists is identical to the posting lists used
+by GIN, so it would be straightforward to apply GIN's varbyte encoding
+compression scheme to individual posting lists. Posting list compression
+would break the assumptions made by posting list splits about page space
+accounting, though, so it's not clear how compression could be integrated
+with nbtree. Besides, posting list compression does not offer a compelling
+trade-off for nbtree, since in general nbtree is optimized for consistent
+performance with many concurrent readers and writers. A major goal of
+nbtree's lazy approach to deduplication is to limit the performance impact
+of deduplication with random updates. Even concurrent append-only inserts
+of the same key value will tend to have inserts of individual index tuples
+in an order that doesn't quite match heap TID order. In general, delaying
+deduplication avoids many unnecessary posting list splits, and minimizes
+page level fragmentation.
+
Notes About Data Representation
-------------------------------
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index b84bf1c..4257406 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -47,21 +47,26 @@ static void _bt_insertonpg(Relation rel, BTScanInsert itup_key,
BTStack stack,
IndexTuple itup,
OffsetNumber newitemoff,
+ int in_posting_offset,
bool split_only_page);
static Buffer _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf,
Buffer cbuf, OffsetNumber newitemoff, Size newitemsz,
- IndexTuple newitem);
+ IndexTuple newitem, IndexTuple original_newitem, IndexTuple nposting,
+ OffsetNumber in_posting_offset);
static void _bt_insert_parent(Relation rel, Buffer buf, Buffer rbuf,
BTStack stack, bool is_root, bool is_only);
static bool _bt_pgaddtup(Page page, Size itemsize, IndexTuple itup,
OffsetNumber itup_off);
static void _bt_vacuum_one_page(Relation rel, Buffer buffer, Relation heapRel);
+static void _bt_dedup_one_page(Relation rel, Buffer buffer, Relation heapRel,
+ Size newitemsz);
/*
* _bt_doinsert() -- Handle insertion of a single index tuple in the tree.
*
* This routine is called by the public interface routine, btinsert.
- * By here, itup is filled in, including the TID.
+ * By here, itup is filled in, including the TID. Caller should be
+ * prepared for us to scribble on 'itup'.
*
* If checkUnique is UNIQUE_CHECK_NO or UNIQUE_CHECK_PARTIAL, this
* will allow duplicates. Otherwise (UNIQUE_CHECK_YES or
@@ -123,6 +128,7 @@ _bt_doinsert(Relation rel, IndexTuple itup,
/* PageAddItem will MAXALIGN(), but be consistent */
insertstate.itemsz = MAXALIGN(IndexTupleSize(itup));
insertstate.itup_key = itup_key;
+ insertstate.in_posting_offset = 0;
insertstate.bounds_valid = false;
insertstate.buf = InvalidBuffer;
@@ -300,7 +306,7 @@ top:
newitemoff = _bt_findinsertloc(rel, &insertstate, checkingunique,
stack, heapRel);
_bt_insertonpg(rel, itup_key, insertstate.buf, InvalidBuffer, stack,
- itup, newitemoff, false);
+ itup, newitemoff, insertstate.in_posting_offset, false);
}
else
{
@@ -435,6 +441,7 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
/* okay, we gotta fetch the heap tuple ... */
curitup = (IndexTuple) PageGetItem(page, curitemid);
+ Assert(!BTreeTupleIsPosting(curitup));
htid = curitup->t_tid;
/*
@@ -689,6 +696,7 @@ _bt_findinsertloc(Relation rel,
BTScanInsert itup_key = insertstate->itup_key;
Page page = BufferGetPage(insertstate->buf);
BTPageOpaque lpageop;
+ OffsetNumber location;
lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -751,13 +759,23 @@ _bt_findinsertloc(Relation rel,
/*
* If the target page is full, see if we can obtain enough space by
- * erasing LP_DEAD items
+ * erasing LP_DEAD items. If that doesn't work out, and if the index
+ * isn't a unique index, try deduplication.
*/
- if (PageGetFreeSpace(page) < insertstate->itemsz &&
- P_HAS_GARBAGE(lpageop))
+ if (PageGetFreeSpace(page) < insertstate->itemsz)
{
- _bt_vacuum_one_page(rel, insertstate->buf, heapRel);
- insertstate->bounds_valid = false;
+ if (P_HAS_GARBAGE(lpageop))
+ {
+ _bt_vacuum_one_page(rel, insertstate->buf, heapRel);
+ insertstate->bounds_valid = false;
+ }
+
+ if (!checkingunique && PageGetFreeSpace(page) < insertstate->itemsz)
+ {
+ _bt_dedup_one_page(rel, insertstate->buf, heapRel,
+ insertstate->itemsz);
+ insertstate->bounds_valid = false; /* paranoia */
+ }
}
}
else
@@ -839,7 +857,31 @@ _bt_findinsertloc(Relation rel,
Assert(P_RIGHTMOST(lpageop) ||
_bt_compare(rel, itup_key, page, P_HIKEY) <= 0);
- return _bt_binsrch_insert(rel, insertstate);
+ location = _bt_binsrch_insert(rel, insertstate);
+
+ /*
+ * Insertion is not prepared for the case where an LP_DEAD posting list
+ * tuple must be split. In the unlikely event that this happens, call
+ * _bt_dedup_one_page() to force it to kill all LP_DEAD items.
+ */
+ if (unlikely(insertstate->in_posting_offset == -1))
+ {
+ _bt_dedup_one_page(rel, insertstate->buf, heapRel, 0);
+ Assert(!P_HAS_GARBAGE(lpageop));
+
+ /* Must reset insertstate ahead of new _bt_binsrch_insert() call */
+ insertstate->bounds_valid = false;
+ insertstate->in_posting_offset = 0;
+ location = _bt_binsrch_insert(rel, insertstate);
+
+ /*
+ * Might still have to split some other posting list now, but that
+ * should never be LP_DEAD
+ */
+ Assert(insertstate->in_posting_offset >= 0);
+ }
+
+ return location;
}
/*
@@ -900,15 +942,65 @@ _bt_stepright(Relation rel, BTInsertState insertstate, BTStack stack)
insertstate->bounds_valid = false;
}
+/*
+ * If the new tuple 'itup' is a duplicate with a heap TID that falls inside
+ * the range of an existing posting list tuple 'oposting', generate new
+ * posting tuple to replace original one and update new tuple so that
+ * its heap TID contains the rightmost heap TID of original posting tuple.
+ */
+IndexTuple
+_bt_form_newposting(IndexTuple itup, IndexTuple oposting,
+ OffsetNumber in_posting_offset)
+{
+ int nipd;
+ char *replacepos;
+ char *rightpos;
+ Size nbytes;
+ IndexTuple nposting;
+
+ Assert(BTreeTupleIsPosting(oposting));
+ nipd = BTreeTupleGetNPosting(oposting);
+ Assert(in_posting_offset < nipd);
+
+ nposting = CopyIndexTuple(oposting);
+ replacepos = (char *) BTreeTupleGetPostingN(nposting, in_posting_offset);
+ rightpos = replacepos + sizeof(ItemPointerData);
+ nbytes = (nipd - in_posting_offset - 1) * sizeof(ItemPointerData);
+
+ /*
+ * Move item pointers in posting list to make a gap for the new item's
+ * heap TID (shift TIDs one place to the right, losing original
+ * rightmost TID).
+ */
+ memmove(rightpos, replacepos, nbytes);
+
+ /*
+ * Fill the gap with the TID of the new item.
+ */
+ ItemPointerCopy(&itup->t_tid, (ItemPointer) replacepos);
+
+ /*
+ * Copy original (not new original) posting list's last TID into new
+ * item
+ */
+ ItemPointerCopy(BTreeTupleGetPostingN(oposting, nipd - 1), &itup->t_tid);
+ Assert(ItemPointerCompare(BTreeTupleGetMaxTID(nposting),
+ BTreeTupleGetHeapTID(itup)) < 0);
+
+ return nposting;
+}
+
/*----------
* _bt_insertonpg() -- Insert a tuple on a particular page in the index.
*
* This recursive procedure does the following things:
*
+ * + if necessary, splits an existing posting list on page.
+ * This is only needed when 'in_posting_offset' is non-zero.
* + if necessary, splits the target page, using 'itup_key' for
* suffix truncation on leaf pages (caller passes NULL for
* non-leaf pages).
- * + inserts the tuple.
+ * + inserts the new tuple (could be from split posting list).
* + if the page was split, pops the parent stack, and finds the
* right place to insert the new child pointer (by walking
* right using information stored in the parent stack).
@@ -918,7 +1010,8 @@ _bt_stepright(Relation rel, BTInsertState insertstate, BTStack stack)
*
* On entry, we must have the correct buffer in which to do the
* insertion, and the buffer must be pinned and write-locked. On return,
- * we will have dropped both the pin and the lock on the buffer.
+ * we will have dropped both the pin and the lock on the buffer. Caller
+ * should be prepared for us to scribble on 'itup'.
*
* This routine only performs retail tuple insertions. 'itup' should
* always be either a non-highkey leaf item, or a downlink (new high
@@ -936,11 +1029,15 @@ _bt_insertonpg(Relation rel,
BTStack stack,
IndexTuple itup,
OffsetNumber newitemoff,
+ int in_posting_offset,
bool split_only_page)
{
Page page;
BTPageOpaque lpageop;
Size itemsz;
+ IndexTuple nposting = NULL;
+ IndexTuple oposting;
+ IndexTuple original_itup = NULL;
page = BufferGetPage(buf);
lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -954,6 +1051,8 @@ _bt_insertonpg(Relation rel,
Assert(P_ISLEAF(lpageop) ||
BTreeTupleGetNAtts(itup, rel) <=
IndexRelationGetNumberOfKeyAttributes(rel));
+ /* retail insertions of posting list tuples are disallowed */
+ Assert(!BTreeTupleIsPosting(itup));
/* The caller should've finished any incomplete splits already. */
if (P_INCOMPLETE_SPLIT(lpageop))
@@ -965,6 +1064,47 @@ _bt_insertonpg(Relation rel,
* need to be consistent */
/*
+ * Do we need to split an existing posting list item?
+ */
+ if (in_posting_offset != 0)
+ {
+ ItemId itemid = PageGetItemId(page, newitemoff);
+
+ /*
+ * The new tuple is a duplicate with a heap TID that falls inside the
+ * range of an existing posting list tuple, so split posting list.
+ *
+ * Posting list splits always replace some existing TID in the posting
+ * list with the new item's heap TID (based on a posting list offset
+ * from caller) by removing rightmost heap TID from posting list. The
+ * new item's heap TID is swapped with that rightmost heap TID, almost
+ * as if the tuple inserted never overlapped with a posting list in
+ * the first place. This allows the insertion and page split code to
+ * have minimal special case handling of posting lists.
+ *
+ * The only extra handling required is to overwrite the original
+ * posting list with nposting, which is guaranteed to be the same size
+ * as the original, keeping the page space accounting simple. This
+ * takes place in either the page insert or page split critical
+ * section.
+ */
+ Assert(P_ISLEAF(lpageop));
+ Assert(!ItemIdIsDead(itemid));
+ Assert(in_posting_offset > 0);
+ oposting = (IndexTuple) PageGetItem(page, itemid);
+
+ /* Save a copy of itup with its original TID, to be written to the WAL record */
+ original_itup = CopyIndexTuple(itup);
+
+ nposting = _bt_form_newposting(itup, oposting, in_posting_offset);
+
+ Assert(BTreeTupleGetNPosting(nposting) == BTreeTupleGetNPosting(oposting));
+
+ /* Alter new item offset, since effective new item changed */
+ newitemoff = OffsetNumberNext(newitemoff);
+ }
+
+ /*
* Do we need to split the page to fit the item on it?
*
* Note: PageGetFreeSpace() subtracts sizeof(ItemIdData) from its result,
@@ -996,7 +1136,8 @@ _bt_insertonpg(Relation rel,
BlockNumberIsValid(RelationGetTargetBlock(rel))));
/* split the buffer into left and right halves */
- rbuf = _bt_split(rel, itup_key, buf, cbuf, newitemoff, itemsz, itup);
+ rbuf = _bt_split(rel, itup_key, buf, cbuf, newitemoff, itemsz, itup,
+ original_itup, nposting, in_posting_offset);
PredicateLockPageSplit(rel,
BufferGetBlockNumber(buf),
BufferGetBlockNumber(rbuf));
@@ -1075,6 +1216,18 @@ _bt_insertonpg(Relation rel,
elog(PANIC, "failed to add new item to block %u in index \"%s\"",
itup_blkno, RelationGetRelationName(rel));
+ if (nposting)
+ {
+ /*
+ * Handle a posting list split by performing an in-place update of
+ * the existing posting list
+ */
+ Assert(P_ISLEAF(lpageop));
+ Assert(MAXALIGN(IndexTupleSize(oposting)) ==
+ MAXALIGN(IndexTupleSize(nposting)));
+ memcpy(oposting, nposting, MAXALIGN(IndexTupleSize(nposting)));
+ }
+
MarkBufferDirty(buf);
if (BufferIsValid(metabuf))
@@ -1116,6 +1269,7 @@ _bt_insertonpg(Relation rel,
XLogRecPtr recptr;
xlrec.offnum = itup_off;
+ xlrec.in_posting_offset = in_posting_offset;
XLogBeginInsert();
XLogRegisterData((char *) &xlrec, SizeOfBtreeInsert);
@@ -1152,7 +1306,10 @@ _bt_insertonpg(Relation rel,
}
XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
- XLogRegisterBufData(0, (char *) itup, IndexTupleSize(itup));
+ if (original_itup)
+ XLogRegisterBufData(0, (char *) original_itup, IndexTupleSize(original_itup));
+ else
+ XLogRegisterBufData(0, (char *) itup, IndexTupleSize(itup));
recptr = XLogInsert(RM_BTREE_ID, xlinfo);
@@ -1194,6 +1351,13 @@ _bt_insertonpg(Relation rel,
_bt_getrootheight(rel) >= BTREE_FASTPATH_MIN_LEVEL)
RelationSetTargetBlock(rel, cachedBlock);
}
+
+ /* be tidy */
+ if (nposting)
+ pfree(nposting);
+ if (original_itup)
+ pfree(original_itup);
+
}
/*
@@ -1211,10 +1375,17 @@ _bt_insertonpg(Relation rel,
*
* Returns the new right sibling of buf, pinned and write-locked.
* The pin and lock on buf are maintained.
+ *
+ * nposting is a replacement posting for the posting list at the
+ * offset immediately before the new item's offset. This is needed
+ * when caller performed "posting list split", and corresponds to the
+ * same step for retail insertions that don't split the page.
*/
static Buffer
_bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
- OffsetNumber newitemoff, Size newitemsz, IndexTuple newitem)
+ OffsetNumber newitemoff, Size newitemsz, IndexTuple newitem,
+ IndexTuple original_newitem,
+ IndexTuple nposting, OffsetNumber in_posting_offset)
{
Buffer rbuf;
Page origpage;
@@ -1236,6 +1407,7 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
OffsetNumber firstright;
OffsetNumber maxoff;
OffsetNumber i;
+ OffsetNumber replacepostingoff = InvalidOffsetNumber;
bool newitemonleft,
isleaf;
IndexTuple lefthikey;
@@ -1243,6 +1415,13 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
int indnkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
/*
+ * Determine offset number of posting list that will be updated in place
+ * as part of split that follows a posting list split
+ */
+ if (nposting != NULL)
+ replacepostingoff = OffsetNumberPrev(newitemoff);
+
+ /*
* origpage is the original page to be split. leftpage is a temporary
* buffer that receives the left-sibling data, which will be copied back
* into origpage on success. rightpage is the new page that will receive
@@ -1273,6 +1452,13 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
* newitemoff == firstright. In all other cases it's clear which side of
* the split every tuple goes on from context. newitemonleft is usually
* (but not always) redundant information.
+ *
+ * Note: In theory, the split point choice logic should operate against a
+ * version of the page that already replaced the posting list at offset
+ * replacepostingoff with nposting where applicable. We don't bother with
+ * that, though. Both versions of the posting list must be the same size
+ * and have the same key values, so this omission can't affect the split
+ * point chosen in practice.
*/
firstright = _bt_findsplitloc(rel, origpage, newitemoff, newitemsz,
newitem, &newitemonleft);
@@ -1340,6 +1526,9 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
itemid = PageGetItemId(origpage, firstright);
itemsz = ItemIdGetLength(itemid);
item = (IndexTuple) PageGetItem(origpage, itemid);
+ /* Behave as if origpage posting list has already been swapped */
+ if (firstright == replacepostingoff)
+ item = nposting;
}
/*
@@ -1373,6 +1562,9 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
Assert(lastleftoff >= P_FIRSTDATAKEY(oopaque));
itemid = PageGetItemId(origpage, lastleftoff);
lastleft = (IndexTuple) PageGetItem(origpage, itemid);
+ /* Behave as if origpage posting list has already been swapped */
+ if (lastleftoff == replacepostingoff)
+ lastleft = nposting;
}
Assert(lastleft != item);
@@ -1480,8 +1672,23 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
itemsz = ItemIdGetLength(itemid);
item = (IndexTuple) PageGetItem(origpage, itemid);
+ /*
+ * did caller pass new replacement posting list tuple due to posting
+ * list split?
+ */
+ if (i == replacepostingoff)
+ {
+ /*
+ * swap origpage posting list with post-posting-list-split version
+ * from caller
+ */
+ Assert(isleaf);
+ Assert(itemsz == MAXALIGN(IndexTupleSize(nposting)));
+ item = nposting;
+ }
+
/* does new item belong before this one? */
- if (i == newitemoff)
+ else if (i == newitemoff)
{
if (newitemonleft)
{
@@ -1653,6 +1860,17 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
xlrec.firstright = firstright;
xlrec.newitemoff = newitemoff;
+ /*
+ * If the replacement posting list tuple was placed on the right page,
+ * we don't need to WAL-log it explicitly, because it's included with
+ * all the other items on the right page. Otherwise, save
+ * in_posting_offset and newitem so that replay can construct the
+ * replacement tuple.
+ */
+ xlrec.in_posting_offset = InvalidOffsetNumber;
+ if (replacepostingoff < firstright)
+ xlrec.in_posting_offset = in_posting_offset;
+
XLogBeginInsert();
XLogRegisterData((char *) &xlrec, SizeOfBtreeSplit);
@@ -1672,9 +1890,23 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
* is not stored if XLogInsert decides it needs a full-page image of
* the left page. We store the offset anyway, though, to support
* archive compression of these records.
+ *
+ * Also log the original newitem when a posting list split took
+ * place, since replay needs it to construct the new posting list.
*/
- if (newitemonleft)
- XLogRegisterBufData(0, (char *) newitem, MAXALIGN(newitemsz));
+ if (newitemonleft || xlrec.in_posting_offset)
+ {
+ if (xlrec.in_posting_offset)
+ {
+ Assert(original_newitem != NULL);
+ Assert(ItemPointerCompare(&original_newitem->t_tid, &newitem->t_tid) != 0);
+
+ XLogRegisterBufData(0, (char *) original_newitem,
+ MAXALIGN(IndexTupleSize(original_newitem)));
+ }
+ else
+ XLogRegisterBufData(0, (char *) newitem, MAXALIGN(newitemsz));
+ }
/* Log the left page's new high key */
itemid = PageGetItemId(origpage, P_HIKEY);
@@ -1834,7 +2066,7 @@ _bt_insert_parent(Relation rel,
/* Recursively insert into the parent */
_bt_insertonpg(rel, NULL, pbuf, buf, stack->bts_parent,
- new_item, stack->bts_offset + 1,
+ new_item, stack->bts_offset + 1, 0,
is_only);
/* be tidy */
@@ -2304,6 +2536,415 @@ _bt_vacuum_one_page(Relation rel, Buffer buffer, Relation heapRel)
* Note: if we didn't find any LP_DEAD items, then the page's
* BTP_HAS_GARBAGE hint bit is falsely set. We do not bother expending a
* separate write to clear it, however. We will clear it when we split
- * the page.
+ * the page (or when deduplication runs).
+ */
+}
+
+/*
+ * Try to deduplicate items to free some space. If we don't proceed with
+ * deduplication, buffer will contain old state of the page.
+ *
+ * 'itemsz' is the size of the inserter caller's incoming/new tuple, not
+ * including line pointer overhead. This is the amount of space we'll need to
+ * free in order to let caller avoid splitting the page.
+ *
+ * This function should be called after LP_DEAD items were removed by
+ * _bt_vacuum_one_page() to prevent a page split. (It's possible that we'll
+ * have to kill additional LP_DEAD items, but that should be rare.)
+ */
+static void
+_bt_dedup_one_page(Relation rel, Buffer buffer, Relation heapRel,
+ Size newitemsz)
+{
+ OffsetNumber offnum,
+ minoff,
+ maxoff;
+ Page page = BufferGetPage(buffer);
+ Page newpage;
+ BTPageOpaque oopaque,
+ nopaque;
+ bool deduplicate = false;
+ BTDedupState *dedupState = NULL;
+ int natts = IndexRelationGetNumberOfAttributes(rel);
+ OffsetNumber deletable[MaxOffsetNumber];
+ int ndeletable = 0;
+ Size pagesaving = 0;
+
+ /*
+ * Don't use deduplication for indexes with INCLUDEd columns and unique
+ * indexes
+ */
+ deduplicate = (IndexRelationGetNumberOfKeyAttributes(rel) ==
+ IndexRelationGetNumberOfAttributes(rel) &&
+ !rel->rd_index->indisunique);
+ if (!deduplicate)
+ return;
+
+ oopaque = (BTPageOpaque) PageGetSpecialPointer(page);
+ /* init deduplication state needed to build posting tuples */
+ dedupState = (BTDedupState *) palloc0(sizeof(BTDedupState));
+ dedupState->ipd = NULL;
+ dedupState->ntuples = 0;
+ dedupState->alltupsize = 0;
+ dedupState->itupprev = NULL;
+ dedupState->maxitemsize = BTMaxItemSize(page);
+ dedupState->maxpostingsize = 0;
+
+ minoff = P_FIRSTDATAKEY(oopaque);
+ maxoff = PageGetMaxOffsetNumber(page);
+
+ /*
+ * Delete dead tuples, if any. We cannot simply skip them in the loop
+ * below, because we must generate a special WAL record containing
+ * such tuples so that latestRemovedXid can be computed on a standby
+ * server later.
+ *
+ * This should not affect performance, since it only can happen in a rare
+ * situation when BTP_HAS_GARBAGE flag was not set and _bt_vacuum_one_page
+ * was not called, or _bt_vacuum_one_page didn't remove all dead items.
+ */
+ for (offnum = minoff;
+ offnum <= maxoff;
+ offnum = OffsetNumberNext(offnum))
+ {
+ ItemId itemid = PageGetItemId(page, offnum);
+
+ if (ItemIdIsDead(itemid))
+ deletable[ndeletable++] = offnum;
+ }
+
+ if (ndeletable > 0)
+ {
+ /*
+ * Skip deduplication in rare cases where there were LP_DEAD items
+ * encountered here when that frees sufficient space for caller to
+ * avoid a page split
+ */
+ _bt_delitems_delete(rel, buffer, deletable, ndeletable, heapRel);
+ if (PageGetFreeSpace(page) >= newitemsz)
+ {
+ pfree(dedupState);
+ return;
+ }
+
+ /* Continue with deduplication */
+ minoff = P_FIRSTDATAKEY(oopaque);
+ maxoff = PageGetMaxOffsetNumber(page);
+ }
+
+ /*
+ * Scan over all items to see which ones can be deduplicated
+ */
+ newpage = PageGetTempPageCopySpecial(page);
+ nopaque = (BTPageOpaque) PageGetSpecialPointer(newpage);
+
+ /*
+ * Copy the original page's LSN into newpage, which will become the
+ * updated version of the page. We need this because XLogInsert will
+ * examine the LSN and possibly dump it in a page image.
+ */
+ PageSetLSN(newpage, PageGetLSN(page));
+
+ /* Make sure that new page won't have garbage flag set */
+ nopaque->btpo_flags &= ~BTP_HAS_GARBAGE;
+
+ /* Copy High Key if any */
+ if (!P_RIGHTMOST(oopaque))
+ {
+ ItemId hitemid = PageGetItemId(page, P_HIKEY);
+ Size hitemsz = ItemIdGetLength(hitemid);
+ IndexTuple hitem = (IndexTuple) PageGetItem(page, hitemid);
+
+ if (PageAddItem(newpage, (Item) hitem, hitemsz, P_HIKEY,
+ false, false) == InvalidOffsetNumber)
+ elog(ERROR, "failed to add highkey during deduplication");
+ }
+
+ /* Passed-in newitemsz is MAXALIGNED but does not include line pointer */
+ newitemsz += sizeof(ItemIdData);
+
+ /*
+ * Iterate over tuples on the page, try to deduplicate them into posting
+ * lists and insert into new page.
+ */
+ for (offnum = minoff;
+ offnum <= maxoff;
+ offnum = OffsetNumberNext(offnum))
+ {
+ ItemId itemid = PageGetItemId(page, offnum);
+ IndexTuple itup = (IndexTuple) PageGetItem(page, itemid);
+
+ Assert(!ItemIdIsDead(itemid));
+
+ if (dedupState->itupprev == NULL)
+ {
+ /* Just set up base/first item in first iteration */
+ Assert(offnum == minoff);
+ dedupState->itupprev = CopyIndexTuple(itup);
+ dedupState->itupprev_off = offnum;
+ continue;
+ }
+
+ if (deduplicate &&
+ _bt_keep_natts_fast(rel, dedupState->itupprev, itup) > natts)
+ {
+ int itup_ntuples;
+ Size projpostingsz;
+
+ /*
+ * Tuples are equal.
+ *
+ * If posting list does not exceed tuple size limit then append
+ * the tuple to the pending posting list. Otherwise, insert it on
+ * page and continue with this tuple as new pending posting list.
+ */
+ itup_ntuples = BTreeTupleIsPosting(itup) ?
+ BTreeTupleGetNPosting(itup) : 1;
+
+ /*
+ * Project size of new posting list that would result from merging
+ * current tup with pending posting list (could just be prev item
+ * that's "pending").
+ *
+ * This accounting looks odd, but it's correct because ...
+ */
+ projpostingsz = MAXALIGN(IndexTupleSize(dedupState->itupprev) +
+ (dedupState->ntuples + itup_ntuples + 1) *
+ sizeof(ItemPointerData));
+
+ if (projpostingsz <= dedupState->maxitemsize)
+ _bt_stash_item_tid(dedupState, itup, offnum);
+ else
+ pagesaving += _bt_dedup_insert(newpage, dedupState);
+ }
+ else
+ {
+ /*
+ * Tuples are not equal, or we're done deduplicating items on this
+ * page.
+ *
+ * Insert pending posting list on page. This could just be a
+ * regular tuple.
+ */
+ pagesaving += _bt_dedup_insert(newpage, dedupState);
+ }
+
+ /*
+ * When we have deduplicated enough to avoid page split, don't bother
+ * deduplicating any more items.
+ *
+ * FIXME: If rewriting the page and doing the WAL logging were
+ * incremental, we could actually break out of the loop and save real
+ * work. As things stand this is a loss for performance, but it
+ * barely affects space utilization. (The number of blocks are the
+ * same as before, except for rounding effects. The minimum number of
+ * items on each page for each index "increases" when this is enabled,
+ * however.)
+ */
+ if (pagesaving >= newitemsz)
+ deduplicate = false;
+
+ pfree(dedupState->itupprev);
+ dedupState->itupprev = CopyIndexTuple(itup);
+ dedupState->itupprev_off = offnum;
+
+ Assert(IndexTupleSize(dedupState->itupprev) <= dedupState->maxitemsize);
+ }
+
+ /* Handle the last item */
+ pagesaving += _bt_dedup_insert(newpage, dedupState);
+
+ /*
+ * If no items suitable for deduplication were found, newpage must be
+ * exactly the same as the original page, so just return from function.
+ */
+ if (dedupState->n_intervals == 0)
+ {
+ pfree(dedupState);
+ return;
+ }
+
+ START_CRIT_SECTION();
+
+ PageRestoreTempPage(newpage, page);
+ MarkBufferDirty(buffer);
+
+ /* Log deduplicated items */
+ if (RelationNeedsWAL(rel))
+ {
+ XLogRecPtr recptr;
+ xl_btree_dedup xlrec_dedup;
+
+ xlrec_dedup.n_intervals = dedupState->n_intervals;
+
+ XLogBeginInsert();
+ XLogRegisterBuffer(0, buffer, REGBUF_STANDARD);
+ XLogRegisterData((char *) &xlrec_dedup, SizeOfBtreeDedup);
+
+ /* only save non-empty part of the array */
+ if (dedupState->n_intervals > 0)
+ XLogRegisterData((char *) dedupState->dedup_intervals,
+ dedupState->n_intervals * sizeof(dedupInterval));
+
+ recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_DEDUP_PAGE);
+
+ PageSetLSN(page, recptr);
+ }
+
+ END_CRIT_SECTION();
+
+ /* be tidy */
+ pfree(dedupState);
+}
+
+/*
+ * Save item pointer(s) of itup to the posting list in dedupState.
+ *
+ * 'itup' is current tuple on page, which comes immediately after equal
+ * 'itupprev' tuple stashed in dedup state at the point we're called.
+ *
+ * Helper function for _bt_load() and _bt_dedup_one_page(), called when it
+ * becomes clear that pending itupprev item will be part of a new/pending
+ * posting list, or when a pending/new posting list will contain a new heap
+ * TID from itup.
+ *
+ * Note: caller is responsible for the BTMaxItemSize() check.
+ */
+void
+_bt_stash_item_tid(BTDedupState *dedupState, IndexTuple itup,
+ OffsetNumber itup_offnum)
+{
+ int nposting = 0;
+
+ if (dedupState->ntuples == 0)
+ {
+ dedupState->ipd = palloc0(dedupState->maxitemsize);
+ dedupState->alltupsize =
+ MAXALIGN(IndexTupleSize(dedupState->itupprev)) +
+ sizeof(ItemIdData);
+
+ /*
+ * itupprev hasn't had its posting list TIDs copied into ipd yet (must
+ * have been first on page and/or in new posting list?). Do so now.
+ *
+ * This is delayed because it wasn't initially clear whether or not
+ * itupprev would be merged with the next tuple, or stay as-is. By
+ * now caller compared it against itup and found that it was equal, so
+ * we can go ahead and add its TIDs.
+ */
+ if (!BTreeTupleIsPosting(dedupState->itupprev))
+ {
+ memcpy(dedupState->ipd, dedupState->itupprev,
+ sizeof(ItemPointerData));
+ dedupState->ntuples++;
+ }
+ else
+ {
+ /* if itupprev is posting, add all its TIDs to the posting list */
+ nposting = BTreeTupleGetNPosting(dedupState->itupprev);
+ memcpy(dedupState->ipd,
+ BTreeTupleGetPosting(dedupState->itupprev),
+ sizeof(ItemPointerData) * nposting);
+ dedupState->ntuples += nposting;
+ }
+
+ /* Save info about deduplicated items for future xlog record */
+ dedupState->n_intervals++;
+ /* Save offnum of the first item belonging to the group */
+ dedupState->dedup_intervals[dedupState->n_intervals - 1].from = dedupState->itupprev_off;
+ /*
+ * Update the number of deduplicated items, belonging to this group.
+ * Count each item just once, no matter if it was posting tuple or not
+ */
+ dedupState->dedup_intervals[dedupState->n_intervals - 1].ntups++;
+ }
+
+ /*
+ * Add current tup to ipd for pending posting list for new version of
+ * page.
+ */
+ if (!BTreeTupleIsPosting(itup))
+ {
+ memcpy(dedupState->ipd + dedupState->ntuples, itup,
+ sizeof(ItemPointerData));
+ dedupState->ntuples++;
+ }
+ else
+ {
+ /*
+ * if tuple is posting, add all its TIDs to the pending list that will
+ * become new posting list later on
+ */
+ nposting = BTreeTupleGetNPosting(itup);
+ memcpy(dedupState->ipd + dedupState->ntuples,
+ BTreeTupleGetPosting(itup),
+ sizeof(ItemPointerData) * nposting);
+ dedupState->ntuples += nposting;
+ }
+
+ dedupState->alltupsize +=
+ MAXALIGN(IndexTupleSize(itup)) + sizeof(ItemIdData);
+
+ /*
+ * Update the number of deduplicated items, belonging to this group.
+ * Count each item just once, no matter if it was posting tuple or not
*/
+ dedupState->dedup_intervals[dedupState->n_intervals - 1].ntups++;
+
+ /* TODO just a debug message. delete it in final version of the patch */
+ if (itup_offnum != InvalidOffsetNumber)
+ elog(DEBUG4, "_bt_stash_item_tid. N %d : from %u ntups %u",
+ dedupState->n_intervals,
+ dedupState->dedup_intervals[dedupState->n_intervals - 1].from,
+ dedupState->dedup_intervals[dedupState->n_intervals - 1].ntups);
+}
+
+/*
+ * Add new posting tuple item to the page based on itupprev and saved list of
+ * heap TIDs.
+ */
+Size
+_bt_dedup_insert(Page page, BTDedupState *dedupState)
+{
+ IndexTuple itup;
+ Size spacesaving = 0;
+
+ if (dedupState->ntuples == 0)
+ {
+ /*
+ * Use original itupprev, which may or may not be a posting list
+ * already from some earlier dedup attempt
+ */
+ itup = dedupState->itupprev;
+ }
+ else
+ {
+ IndexTuple postingtuple;
+
+ /* form a tuple with a posting list */
+ postingtuple = BTreeFormPostingTuple(dedupState->itupprev,
+ dedupState->ipd,
+ dedupState->ntuples);
+
+ spacesaving = dedupState->alltupsize -
+ (MAXALIGN(IndexTupleSize(postingtuple)) + sizeof(ItemIdData));
+ Assert(spacesaving > 0 && spacesaving < BLCKSZ);
+
+ itup = postingtuple;
+ pfree(dedupState->ipd);
+ }
+
+ Assert(IndexTupleSize(dedupState->itupprev) <= dedupState->maxitemsize);
+ /* Add the new item into the page */
+ if (PageAddItem(page, (Item) itup, IndexTupleSize(itup),
+ OffsetNumberNext(PageGetMaxOffsetNumber(page)),
+ false, false) == InvalidOffsetNumber)
+ elog(ERROR, "deduplication failed to add tuple to page");
+
+ if (dedupState->ntuples > 0)
+ pfree(itup);
+ dedupState->ntuples = 0;
+ dedupState->alltupsize = 0;
+
+ return spacesaving;
}
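The "posting list split" described in the _bt_insertonpg() comment earlier in this file's diff can be illustrated outside the tree. What follows is only a sketch under simplifying assumptions, not the patch's _bt_form_newposting(): heap TIDs are modelled as plain integers (tid_t is a hypothetical stand-in for ItemPointerData), and posting_split() is a made-up name. It shows why the rewritten posting list keeps its size: the new TID takes its sorted position, and the old rightmost TID is handed back to become the TID of the tuple that actually gets inserted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for ItemPointerData: a TID reduced to one integer. */
typedef uint64_t tid_t;

/*
 * Simplified posting list split (illustration only).
 *
 * posting            sorted array of ntids TIDs, rewritten in place
 * newtid             in: TID of the incoming tuple;
 *                    out: the old rightmost TID of the posting list
 * in_posting_offset  index at which *newtid belongs (0 < offset < ntids)
 *
 * The array keeps exactly ntids entries, mirroring the "same size as
 * the original" guarantee described in the patch comment.
 */
static void
posting_split(tid_t *posting, size_t ntids, tid_t *newtid,
              size_t in_posting_offset)
{
    tid_t rightmost = posting[ntids - 1];

    assert(in_posting_offset > 0 && in_posting_offset < ntids);
    assert(posting[in_posting_offset - 1] < *newtid);
    assert(*newtid < posting[in_posting_offset]);

    /* Shift the tail right by one slot; the old rightmost TID falls off. */
    memmove(&posting[in_posting_offset + 1],
            &posting[in_posting_offset],
            (ntids - in_posting_offset - 1) * sizeof(tid_t));

    /* The new TID takes its sorted position inside the posting list. */
    posting[in_posting_offset] = *newtid;

    /* The incoming tuple now carries the displaced rightmost TID. */
    *newtid = rightmost;
}
```

After the swap, the incoming tuple sorts immediately after the posting list, which matches the step in the patch that advances newitemoff with OffsetNumberNext().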
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 268f869..5314bbe 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -24,6 +24,7 @@
#include "access/nbtree.h"
#include "access/nbtxlog.h"
+#include "access/tableam.h"
#include "access/transam.h"
#include "access/xlog.h"
#include "access/xloginsert.h"
@@ -42,6 +43,11 @@ static bool _bt_lock_branch_parent(Relation rel, BlockNumber child,
BlockNumber *target, BlockNumber *rightsib);
static void _bt_log_reuse_page(Relation rel, BlockNumber blkno,
TransactionId latestRemovedXid);
+static TransactionId _bt_compute_xid_horizon_for_tuples(Relation rel,
+ Relation heapRel,
+ Buffer buf,
+ OffsetNumber *itemnos,
+ int nitems);
/*
* _bt_initmetapage() -- Fill a page buffer with a correct metapage image
@@ -983,14 +989,52 @@ _bt_page_recyclable(Page page)
void
_bt_delitems_vacuum(Relation rel, Buffer buf,
OffsetNumber *itemnos, int nitems,
+ OffsetNumber *remainingoffset,
+ IndexTuple *remaining, int nremaining,
BlockNumber lastBlockVacuumed)
{
Page page = BufferGetPage(buf);
BTPageOpaque opaque;
+ Size itemsz;
+ Size remaining_sz = 0;
+ char *remaining_buf = NULL;
+
+ /* XLOG stuff: pack the remaining posting tuples into one buffer */
+ if (nremaining && RelationNeedsWAL(rel))
+ {
+ Size offset = 0;
+
+ for (int i = 0; i < nremaining; i++)
+ remaining_sz += MAXALIGN(IndexTupleSize(remaining[i]));
+
+ remaining_buf = palloc0(remaining_sz);
+ for (int i = 0; i < nremaining; i++)
+ {
+ itemsz = IndexTupleSize(remaining[i]);
+ memcpy(remaining_buf + offset, (char *) remaining[i], itemsz);
+ offset += MAXALIGN(itemsz);
+ }
+ Assert(offset == remaining_sz);
+ }
/* No ereport(ERROR) until changes are logged */
START_CRIT_SECTION();
+ /* Handle posting tuples here */
+ for (int i = 0; i < nremaining; i++)
+ {
+ /* First, delete the old tuple. */
+ PageIndexTupleDelete(page, remainingoffset[i]);
+
+ itemsz = IndexTupleSize(remaining[i]);
+ itemsz = MAXALIGN(itemsz);
+
+ /* Add tuple with remaining ItemPointers to the page. */
+ if (PageAddItem(page, (Item) remaining[i], itemsz, remainingoffset[i],
+ false, false) == InvalidOffsetNumber)
+ elog(ERROR, "failed to rewrite posting list item in index while doing vacuum");
+ }
+
/* Fix the page */
if (nitems > 0)
PageIndexMultiDelete(page, itemnos, nitems);
@@ -1020,6 +1064,8 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
xl_btree_vacuum xlrec_vacuum;
xlrec_vacuum.lastBlockVacuumed = lastBlockVacuumed;
+ xlrec_vacuum.nremaining = nremaining;
+ xlrec_vacuum.ndeleted = nitems;
XLogBeginInsert();
XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
@@ -1033,6 +1079,19 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
if (nitems > 0)
XLogRegisterBufData(0, (char *) itemnos, nitems * sizeof(OffsetNumber));
+ /*
+ * Save the offsets and the remaining tuples themselves. They must
+ * be restored in the correct order: first the remaining tuples,
+ * and only after that the other deleted items.
+ */
+ if (nremaining > 0)
+ {
+ Assert(remaining_buf != NULL);
+ XLogRegisterBufData(0, (char *) remainingoffset,
+ nremaining * sizeof(OffsetNumber));
+ XLogRegisterBufData(0, remaining_buf, remaining_sz);
+ }
+
recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_VACUUM);
PageSetLSN(page, recptr);
@@ -1042,6 +1101,91 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
}
/*
+ * Get the latestRemovedXid from the table entries pointed at by the index
+ * tuples being deleted.
+ *
+ * This is a version of index_compute_xid_horizon_for_tuples() specialized to
+ * nbtree, which can handle posting lists.
+ */
+static TransactionId
+_bt_compute_xid_horizon_for_tuples(Relation rel, Relation heapRel,
+ Buffer buf, OffsetNumber *itemnos,
+ int nitems)
+{
+ ItemPointerData *ttids;
+ TransactionId latestRemovedXid = InvalidTransactionId;
+ Page page = BufferGetPage(buf);
+ int arraynitems;
+ int finalnitems;
+
+ /*
+ * Initial size of array can fit everything when it turns out that there
+ * are no posting lists
+ */
+ arraynitems = nitems;
+ ttids = (ItemPointerData *) palloc(sizeof(ItemPointerData) * arraynitems);
+
+ finalnitems = 0;
+ /* identify what the index tuples about to be deleted point to */
+ for (int i = 0; i < nitems; i++)
+ {
+ ItemId itemid;
+ IndexTuple itup;
+
+ itemid = PageGetItemId(page, itemnos[i]);
+ itup = (IndexTuple) PageGetItem(page, itemid);
+
+ Assert(ItemIdIsDead(itemid));
+
+ if (!BTreeTupleIsPosting(itup))
+ {
+ /* Make sure that we have space for additional heap TID */
+ if (finalnitems + 1 > arraynitems)
+ {
+ arraynitems = arraynitems * 2;
+ ttids = (ItemPointerData *)
+ repalloc(ttids, sizeof(ItemPointerData) * arraynitems);
+ }
+
+ Assert(ItemPointerIsValid(&itup->t_tid));
+ ItemPointerCopy(&itup->t_tid, &ttids[finalnitems]);
+ finalnitems++;
+ }
+ else
+ {
+ int nposting = BTreeTupleGetNPosting(itup);
+
+ /* Make sure that we have space for additional heap TIDs */
+ if (finalnitems + nposting > arraynitems)
+ {
+ arraynitems = Max(arraynitems * 2, finalnitems + nposting);
+ ttids = (ItemPointerData *)
+ repalloc(ttids, sizeof(ItemPointerData) * arraynitems);
+ }
+
+ for (int j = 0; j < nposting; j++)
+ {
+ ItemPointer htid = BTreeTupleGetPostingN(itup, j);
+
+ Assert(ItemPointerIsValid(htid));
+ ItemPointerCopy(htid, &ttids[finalnitems]);
+ finalnitems++;
+ }
+ }
+ }
+
+ Assert(finalnitems >= nitems);
+
+ /* determine the actual xid horizon */
+ latestRemovedXid =
+ table_compute_xid_horizon_for_tuples(heapRel, ttids, finalnitems);
+
+ pfree(ttids);
+
+ return latestRemovedXid;
+}
+
+/*
* Delete item(s) from a btree page during single-page cleanup.
*
* As above, must only be used on leaf pages.
@@ -1067,8 +1211,8 @@ _bt_delitems_delete(Relation rel, Buffer buf,
if (XLogStandbyInfoActive() && RelationNeedsWAL(rel))
latestRemovedXid =
- index_compute_xid_horizon_for_tuples(rel, heapRel, buf,
- itemnos, nitems);
+ _bt_compute_xid_horizon_for_tuples(rel, heapRel, buf,
+ itemnos, nitems);
/* No ereport(ERROR) until changes are logged */
START_CRIT_SECTION();
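The TID-gathering loop in _bt_compute_xid_horizon_for_tuples() above flattens plain tuples and posting lists into one array, growing it as needed. Here is a standalone sketch of that growth policy, assuming nothing from PostgreSQL: tid_t, simple_tuple and collect_tids() are hypothetical names, and malloc/realloc stand in for palloc/repalloc.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for ItemPointerData. */
typedef uint64_t tid_t;

/* Hypothetical simplified index tuple: ntids == 1 means "not a posting list". */
typedef struct
{
    const tid_t *tids;
    int          ntids;
} simple_tuple;

/*
 * Flatten the TIDs of all tuples into one malloc'd array.
 *
 * The array starts with room for one TID per tuple, which is enough
 * when there are no posting lists. When a posting list does not fit,
 * the capacity becomes Max(2 * capacity, needed), as in the patch.
 * Returns NULL on allocation failure; the caller frees the result.
 */
static tid_t *
collect_tids(const simple_tuple *tuples, int ntuples, int *nout)
{
    int    cap = ntuples > 0 ? ntuples : 1;
    int    n = 0;
    tid_t *out = malloc(sizeof(tid_t) * cap);

    if (out == NULL)
        return NULL;

    for (int i = 0; i < ntuples; i++)
    {
        int need = n + tuples[i].ntids;

        if (need > cap)
        {
            int    newcap = (cap * 2 > need) ? cap * 2 : need;
            tid_t *tmp = realloc(out, sizeof(tid_t) * newcap);

            if (tmp == NULL)
            {
                free(out);
                return NULL;
            }
            out = tmp;
            cap = newcap;
        }

        for (int j = 0; j < tuples[i].ntids; j++)
            out[n++] = tuples[i].tids[j];
    }

    *nout = n;
    return out;
}
```

Doubling keeps the number of reallocations logarithmic in the final size, while the Max() term covers a single posting list that is larger than the doubled capacity.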
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 4cfd528..6759531 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -97,6 +97,8 @@ static void btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
BTCycleId cycleid, TransactionId *oldestBtpoXact);
static void btvacuumpage(BTVacState *vstate, BlockNumber blkno,
BlockNumber orig_blkno);
+static ItemPointer btreevacuumPosting(BTVacState *vstate, IndexTuple itup,
+ int *nremaining);
/*
@@ -263,8 +265,8 @@ btgettuple(IndexScanDesc scan, ScanDirection dir)
*/
if (so->killedItems == NULL)
so->killedItems = (int *)
- palloc(MaxIndexTuplesPerPage * sizeof(int));
- if (so->numKilled < MaxIndexTuplesPerPage)
+ palloc(MaxPostingIndexTuplesPerPage * sizeof(int));
+ if (so->numKilled < MaxPostingIndexTuplesPerPage)
so->killedItems[so->numKilled++] = so->currPos.itemIndex;
}
@@ -1069,7 +1071,8 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
RBM_NORMAL, info->strategy);
LockBufferForCleanup(buf);
_bt_checkpage(rel, buf);
- _bt_delitems_vacuum(rel, buf, NULL, 0, vstate.lastBlockVacuumed);
+ _bt_delitems_vacuum(rel, buf, NULL, 0, NULL, NULL, 0,
+ vstate.lastBlockVacuumed);
_bt_relbuf(rel, buf);
}
@@ -1193,6 +1196,9 @@ restart:
OffsetNumber offnum,
minoff,
maxoff;
+ IndexTuple remaining[MaxOffsetNumber];
+ OffsetNumber remainingoffset[MaxOffsetNumber];
+ int nremaining;
/*
* Trade in the initial read lock for a super-exclusive write lock on
@@ -1229,6 +1235,7 @@ restart:
* callback function.
*/
ndeletable = 0;
+ nremaining = 0;
minoff = P_FIRSTDATAKEY(opaque);
maxoff = PageGetMaxOffsetNumber(page);
if (callback)
@@ -1242,31 +1249,79 @@ restart:
itup = (IndexTuple) PageGetItem(page,
PageGetItemId(page, offnum));
- htup = &(itup->t_tid);
- /*
- * During Hot Standby we currently assume that
- * XLOG_BTREE_VACUUM records do not produce conflicts. That is
- * only true as long as the callback function depends only
- * upon whether the index tuple refers to heap tuples removed
- * in the initial heap scan. When vacuum starts it derives a
- * value of OldestXmin. Backends taking later snapshots could
- * have a RecentGlobalXmin with a later xid than the vacuum's
- * OldestXmin, so it is possible that row versions deleted
- * after OldestXmin could be marked as killed by other
- * backends. The callback function *could* look at the index
- * tuple state in isolation and decide to delete the index
- * tuple, though currently it does not. If it ever did, we
- * would need to reconsider whether XLOG_BTREE_VACUUM records
- * should cause conflicts. If they did cause conflicts they
- * would be fairly harsh conflicts, since we haven't yet
- * worked out a way to pass a useful value for
- * latestRemovedXid on the XLOG_BTREE_VACUUM records. This
- * applies to *any* type of index that marks index tuples as
- * killed.
- */
- if (callback(htup, callback_state))
- deletable[ndeletable++] = offnum;
+ if (BTreeTupleIsPosting(itup))
+ {
+ int nnewipd = 0;
+ ItemPointer newipd = NULL;
+
+ newipd = btreevacuumPosting(vstate, itup, &nnewipd);
+
+ if (nnewipd == 0)
+ {
+ /*
+ * All TIDs in the posting list must be deleted, so we
+ * can delete the whole tuple in the regular way.
+ */
+ deletable[ndeletable++] = offnum;
+ }
+ else if (nnewipd == BTreeTupleGetNPosting(itup))
+ {
+ /*
+ * All TIDs from posting tuple must remain. Do
+ * nothing, just cleanup.
+ */
+ pfree(newipd);
+ }
+ else if (nnewipd < BTreeTupleGetNPosting(itup))
+ {
+ /* Some TIDs from posting tuple must remain. */
+ Assert(nnewipd > 0);
+ Assert(newipd != NULL);
+
+ /*
+ * Form new tuple that contains only remaining TIDs.
+ * Remember this tuple and the offset of the old tuple
+ * to update it in place.
+ */
+ remainingoffset[nremaining] = offnum;
+ remaining[nremaining] =
+ BTreeFormPostingTuple(itup, newipd, nnewipd);
+ nremaining++;
+ pfree(newipd);
+
+ Assert(IndexTupleSize(itup) <= BTMaxItemSize(page));
+ }
+ }
+ else
+ {
+ htup = &(itup->t_tid);
+
+ /*
+ * During Hot Standby we currently assume that
+ * XLOG_BTREE_VACUUM records do not produce conflicts.
+ * That is only true as long as the callback function
+ * depends only upon whether the index tuple refers to
+ * heap tuples removed in the initial heap scan. When
+ * vacuum starts it derives a value of OldestXmin.
+ * Backends taking later snapshots could have a
+ * RecentGlobalXmin with a later xid than the vacuum's
+ * OldestXmin, so it is possible that row versions deleted
+ * after OldestXmin could be marked as killed by other
+ * backends. The callback function *could* look at the
+ * index tuple state in isolation and decide to delete the
+ * index tuple, though currently it does not. If it ever
+ * did, we would need to reconsider whether
+ * XLOG_BTREE_VACUUM records should cause conflicts. If
+ * they did cause conflicts they would be fairly harsh
+ * conflicts, since we haven't yet worked out a way to
+ * pass a useful value for latestRemovedXid on the
+ * XLOG_BTREE_VACUUM records. This applies to *any* type
+ * of index that marks index tuples as killed.
+ */
+ if (callback(htup, callback_state))
+ deletable[ndeletable++] = offnum;
+ }
}
}
@@ -1274,7 +1329,7 @@ restart:
* Apply any needed deletes. We issue just one _bt_delitems_vacuum()
* call per page, so as to minimize WAL traffic.
*/
- if (ndeletable > 0)
+ if (ndeletable > 0 || nremaining > 0)
{
/*
* Notice that the issued XLOG_BTREE_VACUUM WAL record includes
@@ -1291,6 +1346,7 @@ restart:
* that.
*/
_bt_delitems_vacuum(rel, buf, deletable, ndeletable,
+ remainingoffset, remaining, nremaining,
vstate->lastBlockVacuumed);
/*
@@ -1376,6 +1432,41 @@ restart:
}
/*
+ * btreevacuumPosting() -- vacuums a posting tuple.
+ *
+ * Returns new palloc'd posting list with remaining items.
+ * Posting list size is returned via nremaining.
+ *
+ * If all items are dead,
+ * nremaining is 0 and resulting posting list is NULL.
+ */
+static ItemPointer
+btreevacuumPosting(BTVacState *vstate, IndexTuple itup, int *nremaining)
+{
+ int remaining = 0;
+ int nitem = BTreeTupleGetNPosting(itup);
+ ItemPointer tmpitems = NULL,
+ items = BTreeTupleGetPosting(itup);
+
+ /*
+ * Check each TID in the posting list; save live TIDs into tmpitems.
+ */
+ for (int i = 0; i < nitem; i++)
+ {
+ if (vstate->callback(items + i, vstate->callback_state))
+ continue;
+
+ if (tmpitems == NULL)
+ tmpitems = palloc(sizeof(ItemPointerData) * nitem);
+
+ tmpitems[remaining++] = items[i];
+ }
+
+ *nremaining = remaining;
+ return tmpitems;
+}
+
+/*
* btcanreturn() -- Check whether btree indexes support index-only scans.
*
* btrees always do, so this is trivial.
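Reviewer note on btreevacuumPosting() above: the routine filters a posting list through the bulk-delete callback, and the caller then distinguishes three outcomes (no TIDs left, all TIDs left, some TIDs left). What follows is a minimal standalone sketch of that filtering step, not the patch's code. DemoTid, DemoDeadCallback, demo_vacuum_posting() and demo_block_is_dead() are hypothetical names invented for illustration, and malloc() stands in for palloc().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for ItemPointerData: a (block, offset) heap address. */
typedef struct DemoTid
{
    uint32_t block;
    uint16_t offset;
} DemoTid;

/* Hypothetical stand-in for the bulk-delete callback: true means "dead". */
typedef bool (*DemoDeadCallback) (const DemoTid *tid, void *state);

/*
 * Copy the live TIDs of a posting list into a newly allocated array.
 *
 * Returns NULL and sets *nremaining to 0 when every TID is dead. The
 * output array is allocated lazily, on the first live TID, and is sized
 * for the worst case (all TIDs live).
 */
DemoTid *
demo_vacuum_posting(const DemoTid *items, int nitems,
                    DemoDeadCallback is_dead, void *state, int *nremaining)
{
    DemoTid *live = NULL;
    int nlive = 0;

    assert(nitems > 0);

    for (int i = 0; i < nitems; i++)
    {
        if (is_dead(&items[i], state))
            continue;           /* drop dead TID */

        if (live == NULL)
        {
            live = malloc(sizeof(DemoTid) * (size_t) nitems);
            if (live == NULL)
                abort();        /* out of memory: give up in this sketch */
        }

        live[nlive++] = items[i];
    }

    *nremaining = nlive;
    return live;
}

/* Example callback: every TID on the block number stored in *state is dead. */
bool
demo_block_is_dead(const DemoTid *tid, void *state)
{
    return tid->block == *(const uint32_t *) state;
}
```

A caller in this sketch would delete the whole index tuple when *nremaining is 0, leave it alone when *nremaining equals nitems, and otherwise rebuild a smaller posting tuple from the returned array, which mirrors the three branches in the vacuum hunk above.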
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 8e51246..821e808 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -26,10 +26,18 @@
static void _bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp);
static OffsetNumber _bt_binsrch(Relation rel, BTScanInsert key, Buffer buf);
+static int _bt_binsrch_posting(BTScanInsert key, Page page,
+ OffsetNumber offnum);
static bool _bt_readpage(IndexScanDesc scan, ScanDirection dir,
OffsetNumber offnum);
static void _bt_saveitem(BTScanOpaque so, int itemIndex,
OffsetNumber offnum, IndexTuple itup);
+static void _bt_setuppostingitems(BTScanOpaque so, int itemIndex,
+ OffsetNumber offnum, ItemPointer heapTid,
+ IndexTuple itup);
+static inline void _bt_savepostingitem(BTScanOpaque so, int itemIndex,
+ OffsetNumber offnum,
+ ItemPointer heapTid);
static bool _bt_steppage(IndexScanDesc scan, ScanDirection dir);
static bool _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir);
static bool _bt_parallel_readpage(IndexScanDesc scan, BlockNumber blkno,
@@ -434,7 +442,10 @@ _bt_binsrch(Relation rel,
* low) makes bounds invalid.
*
* Caller is responsible for invalidating bounds when it modifies the page
- * before calling here a second time.
+ * before calling here a second time, and for dealing with posting list
+ * tuple matches (callers can use insertstate's in_posting_offset field to
+ * determine which existing heap TID will need to be replaced by their
+ * scantid/new heap TID).
*/
OffsetNumber
_bt_binsrch_insert(Relation rel, BTInsertState insertstate)
@@ -453,6 +464,7 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
Assert(P_ISLEAF(opaque));
Assert(!key->nextkey);
+ Assert(insertstate->in_posting_offset == 0);
if (!insertstate->bounds_valid)
{
@@ -509,6 +521,17 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
if (result != 0)
stricthigh = high;
}
+
+ /*
+ * If tuple at offset located by binary search is a posting list whose
+ * TID range overlaps with caller's scantid, perform posting list
+ * binary search to set in_posting_offset for caller. Caller must
+ * split the posting list when in_posting_offset is set. This should
+ * happen infrequently.
+ */
+ if (unlikely(result == 0 && key->scantid != NULL))
+ insertstate->in_posting_offset =
+ _bt_binsrch_posting(key, page, mid);
}
/*
@@ -529,6 +552,68 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
}
/*----------
+ * _bt_binsrch_posting() -- posting list binary search.
+ *
+ * Returns offset into posting list where caller's scantid belongs.
+ *----------
+ */
+static int
+_bt_binsrch_posting(BTScanInsert key, Page page, OffsetNumber offnum)
+{
+ IndexTuple itup;
+ ItemId itemid;
+ int low,
+ high,
+ mid,
+ res;
+
+ /*
+ * If this isn't a posting tuple, then the index must be corrupt (if it is
+ * an ordinary non-pivot tuple then there must be an existing tuple with a
+ * heap TID that equals inserter's new heap TID/scantid). Defensively
+ * check that tuple is a posting list tuple whose posting list range
+ * includes caller's scantid.
+ *
+ * (This is also needed because contrib/amcheck's rootdescend option needs
+ * to be able to relocate a non-pivot tuple using _bt_binsrch_insert().)
+ */
+ Assert(P_ISLEAF((BTPageOpaque) PageGetSpecialPointer(page)));
+ Assert(!key->nextkey);
+ Assert(key->scantid != NULL);
+ itemid = PageGetItemId(page, offnum);
+ itup = (IndexTuple) PageGetItem(page, itemid);
+ if (!BTreeTupleIsPosting(itup))
+ return 0;
+
+ /*
+ * In the unlikely event that posting list tuple has LP_DEAD bit set,
+ * signal to caller that it should kill the item and restart its binary
+ * search.
+ */
+ if (ItemIdIsDead(itemid))
+ return -1;
+
+ /* "high" is past end of posting list for loop invariant */
+ low = 0;
+ high = BTreeTupleGetNPosting(itup);
+ Assert(high >= 2);
+
+ while (high > low)
+ {
+ mid = low + ((high - low) / 2);
+ res = ItemPointerCompare(key->scantid,
+ BTreeTupleGetPostingN(itup, mid));
+
+ if (res > 0)
+ low = mid + 1;
+ else
+ high = mid;
+ }
+
+ return low;
+}
+
+/*----------
* _bt_compare() -- Compare insertion-type scankey to tuple on a page.
*
* page/offnum: location of btree item to be compared to.
@@ -537,9 +622,18 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
* <0 if scankey < tuple at offnum;
* 0 if scankey == tuple at offnum;
* >0 if scankey > tuple at offnum.
- * NULLs in the keys are treated as sortable values. Therefore
- * "equality" does not necessarily mean that the item should be
- * returned to the caller as a matching key!
+ *
+ * NULLs in the keys are treated as sortable values. Therefore
+ * "equality" does not necessarily mean that the item should be returned
+ * to the caller as a matching key. Similarly, an insertion scankey
+ * with its scantid set is treated as equal to a posting tuple whose TID
+ * range contains that scantid. There generally won't be a
+ * matching TID in the posting tuple, which the caller must handle
+ * itself (e.g., by splitting the posting list tuple).
+ *
+ * It is generally guaranteed that any possible scankey with scantid set
+ * will have zero or one tuples in the index that are considered equal
+ * here.
*
* CRUCIAL NOTE: on a non-leaf page, the first data key is assumed to be
* "minus infinity": this routine will always claim it is less than the
@@ -563,6 +657,7 @@ _bt_compare(Relation rel,
ScanKey scankey;
int ncmpkey;
int ntupatts;
+ int32 result;
Assert(_bt_check_natts(rel, key->heapkeyspace, page, offnum));
Assert(key->keysz <= IndexRelationGetNumberOfKeyAttributes(rel));
@@ -597,7 +692,6 @@ _bt_compare(Relation rel,
{
Datum datum;
bool isNull;
- int32 result;
datum = index_getattr(itup, scankey->sk_attno, itupdesc, &isNull);
@@ -713,8 +807,24 @@ _bt_compare(Relation rel,
if (heapTid == NULL)
return 1;
+ /*
+ * scankey must be treated as equal to a posting list tuple if its scantid
+ * value falls within the range of the posting list. In all other cases
+ * there can only be a single heap TID value, which is compared directly
+ * as a simple scalar value.
+ */
Assert(ntupatts >= IndexRelationGetNumberOfKeyAttributes(rel));
- return ItemPointerCompare(key->scantid, heapTid);
+ result = ItemPointerCompare(key->scantid, heapTid);
+ if (!BTreeTupleIsPosting(itup) || result <= 0)
+ return result;
+ else
+ {
+ result = ItemPointerCompare(key->scantid, BTreeTupleGetMaxTID(itup));
+ if (result > 0)
+ return 1;
+ }
+
+ return 0;
}
/*
@@ -1451,6 +1561,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
/* initialize tuple workspace to empty */
so->currPos.nextTupleOffset = 0;
+ so->currPos.postingTupleOffset = 0;
/*
* Now that the current page has been made consistent, the macro should be
@@ -1485,8 +1596,29 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
if (_bt_checkkeys(scan, itup, indnatts, dir, &continuescan))
{
/* tuple passes all scan key conditions, so remember it */
- _bt_saveitem(so, itemIndex, offnum, itup);
- itemIndex++;
+ if (!BTreeTupleIsPosting(itup))
+ {
+ _bt_saveitem(so, itemIndex, offnum, itup);
+ itemIndex++;
+ }
+ else
+ {
+ /*
+ * Set up state to return posting list, and save first
+ * "logical" tuple
+ */
+ _bt_setuppostingitems(so, itemIndex, offnum,
+ BTreeTupleGetPostingN(itup, 0),
+ itup);
+ itemIndex++;
+ /* Save additional posting list "logical" tuples */
+ for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
+ {
+ _bt_savepostingitem(so, itemIndex, offnum,
+ BTreeTupleGetPostingN(itup, i));
+ itemIndex++;
+ }
+ }
}
/* When !continuescan, there can't be any more matches, so stop */
if (!continuescan)
@@ -1519,7 +1651,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
if (!continuescan)
so->currPos.moreRight = false;
- Assert(itemIndex <= MaxIndexTuplesPerPage);
+ Assert(itemIndex <= MaxPostingIndexTuplesPerPage);
so->currPos.firstItem = 0;
so->currPos.lastItem = itemIndex - 1;
so->currPos.itemIndex = 0;
@@ -1527,7 +1659,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
else
{
/* load items[] in descending order */
- itemIndex = MaxIndexTuplesPerPage;
+ itemIndex = MaxPostingIndexTuplesPerPage;
offnum = Min(offnum, maxoff);
@@ -1569,8 +1701,36 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
if (passes_quals && tuple_alive)
{
/* tuple passes all scan key conditions, so remember it */
- itemIndex--;
- _bt_saveitem(so, itemIndex, offnum, itup);
+ if (!BTreeTupleIsPosting(itup))
+ {
+ itemIndex--;
+ _bt_saveitem(so, itemIndex, offnum, itup);
+ }
+ else
+ {
+ int i = BTreeTupleGetNPosting(itup) - 1;
+
+ /*
+ * Set up state to return posting list, and save last
+ * "logical" tuple from posting list (since it's the first
+ * that will be returned to scan).
+ */
+ itemIndex--;
+ _bt_setuppostingitems(so, itemIndex, offnum,
+ BTreeTupleGetPostingN(itup, i--),
+ itup);
+
+ /*
+ * Return posting list "logical" tuples -- do this in
+ * descending order, to match overall scan order
+ */
+ for (; i >= 0; i--)
+ {
+ itemIndex--;
+ _bt_savepostingitem(so, itemIndex, offnum,
+ BTreeTupleGetPostingN(itup, i));
+ }
+ }
}
if (!continuescan)
{
@@ -1584,8 +1744,8 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
Assert(itemIndex >= 0);
so->currPos.firstItem = itemIndex;
- so->currPos.lastItem = MaxIndexTuplesPerPage - 1;
- so->currPos.itemIndex = MaxIndexTuplesPerPage - 1;
+ so->currPos.lastItem = MaxPostingIndexTuplesPerPage - 1;
+ so->currPos.itemIndex = MaxPostingIndexTuplesPerPage - 1;
}
return (so->currPos.firstItem <= so->currPos.lastItem);
@@ -1598,6 +1758,8 @@ _bt_saveitem(BTScanOpaque so, int itemIndex,
{
BTScanPosItem *currItem = &so->currPos.items[itemIndex];
+ Assert(!BTreeTupleIsPosting(itup));
+
currItem->heapTid = itup->t_tid;
currItem->indexOffset = offnum;
if (so->currTuples)
@@ -1611,6 +1773,59 @@ _bt_saveitem(BTScanOpaque so, int itemIndex,
}
/*
+ * Set up state to save posting items from a single posting list tuple, and
+ * save the logical tuple that will be returned to the scan first.
+ *
+ * Saves an index item into so->currPos.items[itemIndex] for that first
+ * logical tuple. The second and subsequent heap TIDs of the posting list
+ * should be saved by calling _bt_savepostingitem().
+ */
+static void
+_bt_setuppostingitems(BTScanOpaque so, int itemIndex, OffsetNumber offnum,
+ ItemPointer heapTid, IndexTuple itup)
+{
+ BTScanPosItem *currItem = &so->currPos.items[itemIndex];
+
+ currItem->heapTid = *heapTid;
+ currItem->indexOffset = offnum;
+
+ if (so->currTuples)
+ {
+ /* Save a truncated version of the IndexTuple */
+ Size itupsz = BTreeTupleGetPostingOffset(itup);
+
+ itupsz = MAXALIGN(itupsz);
+ currItem->tupleOffset = so->currPos.nextTupleOffset;
+ memcpy(so->currTuples + so->currPos.nextTupleOffset, itup, itupsz);
+ so->currPos.nextTupleOffset += itupsz;
+ so->currPos.postingTupleOffset = currItem->tupleOffset;
+ }
+}
+
+/*
+ * Save an index item into so->currPos.items[itemIndex] for posting tuple.
+ *
+ * Assumes that _bt_setuppostingitems() has already been called for current
+ * posting list tuple.
+ */
+static inline void
+_bt_savepostingitem(BTScanOpaque so, int itemIndex, OffsetNumber offnum,
+ ItemPointer heapTid)
+{
+ BTScanPosItem *currItem = &so->currPos.items[itemIndex];
+
+ currItem->heapTid = *heapTid;
+ currItem->indexOffset = offnum;
+
+ /*
+ * Have index-only scans return the same truncated IndexTuple for
+ * every logical tuple that originates from the same posting list
+ */
+ if (so->currTuples)
+ currItem->tupleOffset = so->currPos.postingTupleOffset;
+}
+
+/*
* _bt_steppage() -- Step to next page containing valid data for scan
*
* On entry, if so->currPos.buf is valid the buffer is pinned but not locked;
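Reviewer note on the nbtsearch.c changes above: _bt_compare() now reports "equal" when scantid falls inside a posting tuple's TID range, and _bt_binsrch_posting() then finds the slot inside the posting list where scantid belongs. The standalone sketch below restates those two steps on a plain sorted array. It is an illustration under assumptions, not the patch's code: DemoTid and the demo_* functions are hypothetical names, and the LP_DEAD and non-posting-tuple special cases are left out.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for ItemPointerData: a (block, offset) heap address. */
typedef struct DemoTid
{
    uint32_t block;
    uint16_t offset;
} DemoTid;

/* Three-way comparison of two TIDs; returns -1, 0 or 1. */
int
demo_tid_compare(const DemoTid *a, const DemoTid *b)
{
    if (a->block != b->block)
        return (a->block < b->block) ? -1 : 1;
    if (a->offset != b->offset)
        return (a->offset < b->offset) ? -1 : 1;
    return 0;
}

/*
 * Compare scantid against a whole posting list (sorted ascending).
 *
 * Returns <0 if scantid sorts before the first TID, >0 if it sorts after
 * the last TID, and 0 if it falls anywhere within [first, last].
 */
int
demo_compare_to_posting(const DemoTid *scantid,
                        const DemoTid *posting, int nposting)
{
    int result;

    assert(nposting >= 2);

    result = demo_tid_compare(scantid, &posting[0]);
    if (result <= 0)
        return result;
    if (demo_tid_compare(scantid, &posting[nposting - 1]) > 0)
        return 1;
    return 0;
}

/*
 * Lower-bound binary search: returns the first index whose TID is >=
 * scantid, i.e. the offset at which scantid would be inserted.
 */
int
demo_binsrch_posting(const DemoTid *scantid,
                     const DemoTid *posting, int nposting)
{
    int low = 0;
    int high = nposting;    /* one past the end, for the loop invariant */

    while (high > low)
    {
        int mid = low + (high - low) / 2;

        if (demo_tid_compare(scantid, &posting[mid]) > 0)
            low = mid + 1;
        else
            high = mid;
    }

    return low;
}
```

In this sketch a result of 0 from demo_compare_to_posting() only says that the range contains scantid. The TID itself is usually absent, which is why the inserter in the patch has to split the posting list at the returned offset.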
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index ab19692..b51365a 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -288,6 +288,8 @@ static void _bt_sortaddtup(Page page, Size itemsize,
static void _bt_buildadd(BTWriteState *wstate, BTPageState *state,
IndexTuple itup);
static void _bt_uppershutdown(BTWriteState *wstate, BTPageState *state);
+static void _bt_buildadd_posting(BTWriteState *wstate, BTPageState *state,
+ BTDedupState *dedupState);
static void _bt_load(BTWriteState *wstate,
BTSpool *btspool, BTSpool *btspool2);
static void _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent,
@@ -830,6 +832,8 @@ _bt_sortaddtup(Page page,
* the high key is to be truncated, offset 1 is deleted, and we insert
* the truncated high key at offset 1.
*
+ * Note that itup may be a posting list tuple.
+ *
* 'last' pointer indicates the last offset added to the page.
*----------
*/
@@ -963,6 +967,11 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
* Overwrite the old item with new truncated high key directly.
* oitup is already located at the physical beginning of tuple
* space, so this should directly reuse the existing tuple space.
+ *
+ * If oitup (the first right tuple) is a posting tuple,
+ * _bt_truncate also truncates away its posting list. This
+ * applies only to leaf pages, since internal pages never
+ * contain posting tuples.
*/
ii = PageGetItemId(opage, OffsetNumberPrev(last_off));
lastleft = (IndexTuple) PageGetItem(opage, ii);
@@ -1002,6 +1011,7 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
* the minimum key for the new page.
*/
state->btps_minkey = CopyIndexTuple(oitup);
+ Assert(BTreeTupleIsPivot(state->btps_minkey));
/*
* Set the sibling links for both pages.
@@ -1043,6 +1053,7 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
Assert(state->btps_minkey == NULL);
state->btps_minkey = CopyIndexTuple(itup);
/* _bt_sortaddtup() will perform full truncation later */
+ BTreeTupleClearBtIsPosting(state->btps_minkey);
BTreeTupleSetNAtts(state->btps_minkey, 0);
}
@@ -1128,6 +1139,40 @@ _bt_uppershutdown(BTWriteState *wstate, BTPageState *state)
}
/*
+ * Add new tuple (posting or non-posting) to the page while building index.
+ */
+static void
+_bt_buildadd_posting(BTWriteState *wstate, BTPageState *state,
+ BTDedupState *dedupState)
+{
+ IndexTuple to_insert;
+
+ /* Return if there is no tuple to insert */
+ if (state == NULL)
+ return;
+
+ if (dedupState->ntuples == 0)
+ to_insert = dedupState->itupprev;
+ else
+ {
+ IndexTuple postingtuple;
+
+ /* form a tuple with a posting list */
+ postingtuple = BTreeFormPostingTuple(dedupState->itupprev,
+ dedupState->ipd,
+ dedupState->ntuples);
+ to_insert = postingtuple;
+ pfree(dedupState->ipd);
+ }
+
+ _bt_buildadd(wstate, state, to_insert);
+
+ if (dedupState->ntuples > 0)
+ pfree(to_insert);
+ dedupState->ntuples = 0;
+}
+
+/*
* Read tuples in correct sort order from tuplesort, and load them into
* btree leaves.
*/
@@ -1141,9 +1186,20 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
bool load1;
TupleDesc tupdes = RelationGetDescr(wstate->index);
int i,
- keysz = IndexRelationGetNumberOfKeyAttributes(wstate->index);
+ keysz = IndexRelationGetNumberOfKeyAttributes(wstate->index),
+ natts = IndexRelationGetNumberOfAttributes(wstate->index);
SortSupport sortKeys;
int64 tuples_done = 0;
+ bool deduplicate = false;
+ BTDedupState *dedupState = NULL;
+
+ /*
+ * Don't use deduplication for indexes with INCLUDEd columns or for
+ * unique indexes
+ */
+ deduplicate = (IndexRelationGetNumberOfKeyAttributes(wstate->index) ==
+ IndexRelationGetNumberOfAttributes(wstate->index) &&
+ !wstate->index->rd_index->indisunique);
if (merge)
{
@@ -1257,19 +1313,89 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
}
else
{
- /* merge is unnecessary */
- while ((itup = tuplesort_getindextuple(btspool->sortstate,
- true)) != NULL)
+ if (!deduplicate)
{
- /* When we see first tuple, create first index page */
- if (state == NULL)
- state = _bt_pagestate(wstate, 0);
+ /* merge is unnecessary */
+ while ((itup = tuplesort_getindextuple(btspool->sortstate,
+ true)) != NULL)
+ {
+ /* When we see first tuple, create first index page */
+ if (state == NULL)
+ state = _bt_pagestate(wstate, 0);
- _bt_buildadd(wstate, state, itup);
+ _bt_buildadd(wstate, state, itup);
- /* Report progress */
- pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
- ++tuples_done);
+ /* Report progress */
+ pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
+ ++tuples_done);
+ }
+ }
+ else
+ {
+ /* init deduplication state needed to build posting tuples */
+ dedupState = (BTDedupState *) palloc0(sizeof(BTDedupState));
+ dedupState->ipd = NULL;
+ dedupState->ntuples = 0;
+ dedupState->alltupsize = 0;
+ dedupState->itupprev = NULL;
+ dedupState->maxitemsize = 0;
+ dedupState->maxpostingsize = 0;
+
+ while ((itup = tuplesort_getindextuple(btspool->sortstate,
+ true)) != NULL)
+ {
+ /* When we see first tuple, create first index page */
+ if (state == NULL)
+ {
+ state = _bt_pagestate(wstate, 0);
+ dedupState->maxitemsize = BTMaxItemSize(state->btps_page);
+ }
+
+ if (dedupState->itupprev != NULL)
+ {
+ int n_equal_atts = _bt_keep_natts_fast(wstate->index,
+ dedupState->itupprev, itup);
+
+ if (n_equal_atts > natts)
+ {
+ /*
+ * Tuples are equal. Create or update the posting list.
+ *
+ * If the posting list would become too big, insert it
+ * on the page and continue.
+ */
+ if ((dedupState->ntuples + 1) * sizeof(ItemPointerData) <
+ dedupState->maxpostingsize)
+ _bt_stash_item_tid(dedupState, itup, InvalidOffsetNumber);
+ else
+ _bt_buildadd_posting(wstate, state, dedupState);
+ }
+ else
+ {
+ /*
+ * Tuples are not equal. Insert itupprev into index.
+ * Save current tuple for the next iteration.
+ */
+ _bt_buildadd_posting(wstate, state, dedupState);
+ }
+ }
+
+ /*
+ * Save the tuple to compare it with the next one and maybe
+ * merge them into a posting tuple.
+ */
+ if (dedupState->itupprev)
+ pfree(dedupState->itupprev);
+ dedupState->itupprev = CopyIndexTuple(itup);
+
+ /* compute max size of posting list */
+ dedupState->maxpostingsize = dedupState->maxitemsize -
+ IndexInfoFindDataOffset(dedupState->itupprev->t_info) -
+ MAXALIGN(IndexTupleSize(dedupState->itupprev));
+ }
+
+ /* Handle the last item */
+ _bt_buildadd_posting(wstate, state, dedupState);
}
}
diff --git a/src/backend/access/nbtree/nbtsplitloc.c b/src/backend/access/nbtree/nbtsplitloc.c
index 1c1029b..54cecc8 100644
--- a/src/backend/access/nbtree/nbtsplitloc.c
+++ b/src/backend/access/nbtree/nbtsplitloc.c
@@ -183,6 +183,9 @@ _bt_findsplitloc(Relation rel,
state.minfirstrightsz = SIZE_MAX;
state.newitemoff = newitemoff;
+ /* newitem cannot be a posting list item */
+ Assert(!BTreeTupleIsPosting(newitem));
+
/*
* maxsplits should never exceed maxoff because there will be at most as
* many candidate split points as there are points _between_ tuples, once
@@ -459,17 +462,52 @@ _bt_recsplitloc(FindSplitData *state,
int16 leftfree,
rightfree;
Size firstrightitemsz;
+ Size postingsubhikey = 0;
bool newitemisfirstonright;
/* Is the new item going to be the first item on the right page? */
newitemisfirstonright = (firstoldonright == state->newitemoff
&& !newitemonleft);
+ /*
+ * FIXME: Accessing every single tuple like this adds cycles to cases that
+ * cannot possibly benefit (i.e. cases where we know that there cannot be
+ * posting lists). Maybe we should add a way to not bother when we are
+ * certain that this is the case.
+ *
+ * We could either have _bt_split() pass us a flag, or invent a page flag
+ * that indicates that the page might have posting lists, as an
+ * optimization. There is no shortage of btpo_flags bits for stuff like
+ * this.
+ */
if (newitemisfirstonright)
+ {
firstrightitemsz = state->newitemsz;
+
+ /* Calculate posting list overhead, if any */
+ if (state->is_leaf && BTreeTupleIsPosting(state->newitem))
+ postingsubhikey = IndexTupleSize(state->newitem) -
+ BTreeTupleGetPostingOffset(state->newitem);
+ }
else
+ {
firstrightitemsz = firstoldonrightsz;
+ /* Calculate posting list overhead, if any */
+ if (state->is_leaf)
+ {
+ ItemId itemid;
+ IndexTuple newhighkey;
+
+ itemid = PageGetItemId(state->page, firstoldonright);
+ newhighkey = (IndexTuple) PageGetItem(state->page, itemid);
+
+ if (BTreeTupleIsPosting(newhighkey))
+ postingsubhikey = IndexTupleSize(newhighkey) -
+ BTreeTupleGetPostingOffset(newhighkey);
+ }
+ }
+
/* Account for all the old tuples */
leftfree = state->leftspace - olddataitemstoleft;
rightfree = state->rightspace -
@@ -492,9 +530,13 @@ _bt_recsplitloc(FindSplitData *state,
* adding a heap TID to the left half's new high key when splitting at the
* leaf level. In practice the new high key will often be smaller and
* will rarely be larger, but conservatively assume the worst case.
+ * Truncation always truncates away any posting list that appears in the
+ * first right tuple, though, so it's safe to subtract that overhead
+ * (while still conservatively assuming that truncation might have to add
+ * back a single heap TID using the pivot tuple heap TID representation).
*/
if (state->is_leaf)
- leftfree -= (int16) (firstrightitemsz +
+ leftfree -= (int16) ((firstrightitemsz - postingsubhikey) +
MAXALIGN(sizeof(ItemPointerData)));
else
leftfree -= (int16) firstrightitemsz;
@@ -691,7 +733,8 @@ _bt_afternewitemoff(FindSplitData *state, OffsetNumber maxoff,
itemid = PageGetItemId(state->page, OffsetNumberPrev(state->newitemoff));
tup = (IndexTuple) PageGetItem(state->page, itemid);
/* Do cheaper test first */
- if (!_bt_adjacenthtid(&tup->t_tid, &state->newitem->t_tid))
+ if (BTreeTupleIsPosting(tup) ||
+ !_bt_adjacenthtid(&tup->t_tid, &state->newitem->t_tid))
return false;
/* Check same conditions as rightmost item case, too */
keepnatts = _bt_keep_natts_fast(state->rel, tup, state->newitem);
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index bc855dd..f7575ed 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -97,8 +97,6 @@ _bt_mkscankey(Relation rel, IndexTuple itup)
indoption = rel->rd_indoption;
tupnatts = itup ? BTreeTupleGetNAtts(itup, rel) : 0;
- Assert(tupnatts <= IndexRelationGetNumberOfAttributes(rel));
-
/*
* We'll execute search using scan key constructed on key columns.
* Truncated attributes and non-key attributes are omitted from the final
@@ -110,9 +108,20 @@ _bt_mkscankey(Relation rel, IndexTuple itup)
key->anynullkeys = false; /* initial assumption */
key->nextkey = false;
key->pivotsearch = false;
+ key->scantid = NULL;
key->keysz = Min(indnkeyatts, tupnatts);
- key->scantid = key->heapkeyspace && itup ?
- BTreeTupleGetHeapTID(itup) : NULL;
+
+ Assert(tupnatts <= IndexRelationGetNumberOfAttributes(rel));
+ Assert(!itup || !BTreeTupleIsPosting(itup) || key->heapkeyspace);
+
+ /*
+ * When caller passes a tuple with a heap TID, use it to set scantid. Note
+ * that this handles posting list tuples by setting scantid to the lowest
+ * heap TID in the posting list.
+ */
+ if (itup && key->heapkeyspace)
+ key->scantid = BTreeTupleGetHeapTID(itup);
+
skey = key->scankeys;
for (i = 0; i < indnkeyatts; i++)
{
@@ -1386,6 +1395,7 @@ _bt_checkkeys(IndexScanDesc scan, IndexTuple tuple, int tupnatts,
* attribute passes the qual.
*/
Assert(ScanDirectionIsForward(dir));
+ Assert(BTreeTupleIsPivot(tuple));
continue;
}
@@ -1547,6 +1557,7 @@ _bt_check_rowcompare(ScanKey skey, IndexTuple tuple, int tupnatts,
* attribute passes the qual.
*/
Assert(ScanDirectionIsForward(dir));
+ Assert(BTreeTupleIsPivot(tuple));
cmpresult = 0;
if (subkey->sk_flags & SK_ROW_END)
break;
@@ -1786,10 +1797,35 @@ _bt_killitems(IndexScanDesc scan)
{
ItemId iid = PageGetItemId(page, offnum);
IndexTuple ituple = (IndexTuple) PageGetItem(page, iid);
+ bool killtuple = false;
+
+ if (BTreeTupleIsPosting(ituple))
+ {
+ int pi = i + 1;
+ int nposting = BTreeTupleGetNPosting(ituple);
+ int j;
+
+ for (j = 0; j < nposting; j++)
+ {
+ ItemPointer item = BTreeTupleGetPostingN(ituple, j);
- if (ItemPointerEquals(&ituple->t_tid, &kitem->heapTid))
+ if (!ItemPointerEquals(item, &kitem->heapTid))
+ break; /* out of posting list loop */
+
+ /* Read-ahead to later kitems */
+ if (pi < numKilled)
+ kitem = &so->currPos.items[so->killedItems[pi++]];
+ }
+
+ if (j == nposting)
+ killtuple = true;
+ }
+ else if (ItemPointerEquals(&ituple->t_tid, &kitem->heapTid))
+ killtuple = true;
+
+ if (killtuple)
{
- /* found the item */
+ /* found the item/all posting list items */
ItemIdMarkDead(iid);
killedsomething = true;
break; /* out of inner search loop */
@@ -2140,6 +2176,24 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
pivot = index_truncate_tuple(itupdesc, firstright, keepnatts);
+ if (BTreeTupleIsPosting(firstright))
+ {
+ BTreeTupleClearBtIsPosting(pivot);
+ BTreeTupleSetNAtts(pivot, keepnatts);
+ if (keepnatts == natts)
+ {
+ /*
+ * index_truncate_tuple() just returned a copy of the
+ * original, so make sure that the size of the new pivot tuple
+ * doesn't have posting list overhead
+ */
+ pivot->t_info &= ~INDEX_SIZE_MASK;
+ pivot->t_info |= MAXALIGN(BTreeTupleGetPostingOffset(firstright));
+ }
+ }
+
+ Assert(!BTreeTupleIsPosting(pivot));
+
/*
* If there is a distinguishing key attribute within new pivot tuple,
* there is no need to add an explicit heap TID attribute
@@ -2156,6 +2210,8 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
* attribute to the new pivot tuple.
*/
Assert(natts != nkeyatts);
+ Assert(!BTreeTupleIsPosting(lastleft) &&
+ !BTreeTupleIsPosting(firstright));
newsize = IndexTupleSize(pivot) + MAXALIGN(sizeof(ItemPointerData));
tidpivot = palloc0(newsize);
memcpy(tidpivot, pivot, IndexTupleSize(pivot));
@@ -2163,6 +2219,24 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
pfree(pivot);
pivot = tidpivot;
}
+ else if (BTreeTupleIsPosting(firstright))
+ {
+ /*
+ * No truncation was possible, since key attributes are all equal. We
+ * can always truncate away a posting list, though.
+ *
+ * It's necessary to add a heap TID attribute to the new pivot tuple.
+ */
+ newsize = MAXALIGN(BTreeTupleGetPostingOffset(firstright)) +
+ MAXALIGN(sizeof(ItemPointerData));
+ pivot = palloc0(newsize);
+ memcpy(pivot, firstright, BTreeTupleGetPostingOffset(firstright));
+
+ pivot->t_info &= ~INDEX_SIZE_MASK;
+ pivot->t_info |= newsize;
+ BTreeTupleClearBtIsPosting(pivot);
+ BTreeTupleSetAltHeapTID(pivot);
+ }
else
{
/*
@@ -2170,7 +2244,8 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
* It's necessary to add a heap TID attribute to the new pivot tuple.
*/
Assert(natts == nkeyatts);
- newsize = IndexTupleSize(firstright) + MAXALIGN(sizeof(ItemPointerData));
+ newsize = MAXALIGN(IndexTupleSize(firstright)) +
+ MAXALIGN(sizeof(ItemPointerData));
pivot = palloc0(newsize);
memcpy(pivot, firstright, IndexTupleSize(firstright));
}
@@ -2188,6 +2263,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
* nbtree (e.g., there is no pg_attribute entry).
*/
Assert(itup_key->heapkeyspace);
+ Assert(!BTreeTupleIsPosting(pivot));
pivot->t_info &= ~INDEX_SIZE_MASK;
pivot->t_info |= newsize;
@@ -2200,7 +2276,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
*/
pivotheaptid = (ItemPointer) ((char *) pivot + newsize -
sizeof(ItemPointerData));
- ItemPointerCopy(&lastleft->t_tid, pivotheaptid);
+ ItemPointerCopy(BTreeTupleGetMaxTID(lastleft), pivotheaptid);
/*
* Lehman and Yao require that the downlink to the right page, which is to
@@ -2211,9 +2287,12 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
* tiebreaker.
*/
#ifndef DEBUG_NO_TRUNCATE
- Assert(ItemPointerCompare(&lastleft->t_tid, &firstright->t_tid) < 0);
- Assert(ItemPointerCompare(pivotheaptid, &lastleft->t_tid) >= 0);
- Assert(ItemPointerCompare(pivotheaptid, &firstright->t_tid) < 0);
+ Assert(ItemPointerCompare(BTreeTupleGetMaxTID(lastleft),
+ BTreeTupleGetHeapTID(firstright)) < 0);
+ Assert(ItemPointerCompare(pivotheaptid,
+ BTreeTupleGetHeapTID(lastleft)) >= 0);
+ Assert(ItemPointerCompare(pivotheaptid,
+ BTreeTupleGetHeapTID(firstright)) < 0);
#else
/*
@@ -2226,7 +2305,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
* attribute values along with lastleft's heap TID value when lastleft's
* TID happens to be greater than firstright's TID.
*/
- ItemPointerCopy(&firstright->t_tid, pivotheaptid);
+ ItemPointerCopy(BTreeTupleGetHeapTID(firstright), pivotheaptid);
/*
* Pivot heap TID should never be fully equal to firstright. Note that
@@ -2235,7 +2314,8 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
*/
ItemPointerSetOffsetNumber(pivotheaptid,
OffsetNumberPrev(ItemPointerGetOffsetNumber(pivotheaptid)));
- Assert(ItemPointerCompare(pivotheaptid, &firstright->t_tid) < 0);
+ Assert(ItemPointerCompare(pivotheaptid,
+ BTreeTupleGetHeapTID(firstright)) < 0);
#endif
BTreeTupleSetNAtts(pivot, nkeyatts);
@@ -2316,15 +2396,25 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
* The approach taken here usually provides the same answer as _bt_keep_natts
* will (for the same pair of tuples from a heapkeyspace index), since the
* majority of btree opclasses can never indicate that two datums are equal
- * unless they're bitwise equal (once detoasted). Similarly, result may
- * differ from the _bt_keep_natts result when either tuple has TOASTed datums,
- * though this is barely possible in practice.
+ * unless they're bitwise equal after detoasting.
*
* These issues must be acceptable to callers, typically because they're only
* concerned about making suffix truncation as effective as possible without
* leaving excessive amounts of free space on either side of page split.
* Callers can rely on the fact that attributes considered equal here are
* definitely also equal according to _bt_keep_natts.
+ *
+ * When an index only uses opclasses where equality is "precise", this
+ * function is guaranteed to give the same result as _bt_keep_natts(). This
+ * makes it safe to use this function to determine whether or not two tuples
+ * can be folded together into a single posting tuple. Posting list
+ * deduplication cannot be used with nondeterministic collations for this
+ * reason.
+ *
+ * FIXME: Actually invent the needed "equality-is-precise" opclass
+ * infrastructure. See dedicated -hackers thread:
+ *
+ * https://postgr.es/m/CAH2-Wzn3Ee49Gmxb7V1VJ3-AC8fWn-Fr8pfWQebHe8rYRxt5OQ@mail.gmail.com
*/
int
_bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
@@ -2349,8 +2439,38 @@ _bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
if (isNull1 != isNull2)
break;
+ /*
+ * XXX: The ideal outcome from the point of view of the posting list
+ * patch is that the definition of an opclass with "precise equality"
+ * becomes: "equality operator function must give exactly the same
+ * answer as datum_image_eq() would, provided that we aren't using a
+ * nondeterministic collation". (Nondeterministic collations are
+ * clearly not compatible with deduplication.)
+ *
+ * This will be a lot faster than actually using the authoritative
+ * insertion scankey in some cases. This approach also seems more
+ * elegant, since suffix truncation gets to follow exactly the same
+ * definition of "equal" as posting list deduplication -- there is a
+ * subtle interplay between deduplication and suffix truncation, and
+ * it would be nice to know for sure that they have exactly the same
+ * idea about what equality is.
+ *
+ * This ideal outcome still avoids problems with TOAST. We cannot
+ * repeat bugs like the amcheck bug that was fixed in bugfix commit
+ * eba775345d23d2c999bbb412ae658b6dab36e3e8. datum_image_eq()
+ * considers binary equality, though only _after_ each datum is
+ * decompressed.
+ *
+ * If this ideal solution isn't possible, then we can fall back on
+ * defining "precise equality" as: "type's output function must
+ * produce identical textual output for any two datums that compare
+ * equal when using a safe/equality-is-precise operator class (unless
+ * using a nondeterministic collation)". That would mean that we'd
+ * have to make deduplication call _bt_keep_natts() instead (or some
+ * other function that uses authoritative insertion scankey).
+ */
if (!isNull1 &&
- !datumIsEqual(datum1, datum2, att->attbyval, att->attlen))
+ !datum_image_eq(datum1, datum2, att->attbyval, att->attlen))
break;
keepnatts++;
@@ -2402,22 +2522,30 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
tupnatts = BTreeTupleGetNAtts(itup, rel);
+ /* !heapkeyspace indexes do not support deduplication */
+ if (!heapkeyspace && BTreeTupleIsPosting(itup))
+ return false;
+
+ /* INCLUDE indexes do not support deduplication */
+ if (natts != nkeyatts && BTreeTupleIsPosting(itup))
+ return false;
+
if (P_ISLEAF(opaque))
{
if (offnum >= P_FIRSTDATAKEY(opaque))
{
/*
- * Non-pivot tuples currently never use alternative heap TID
- * representation -- even those within heapkeyspace indexes
+ * Non-pivot tuple should never be explicitly marked as a pivot
+ * tuple
*/
- if ((itup->t_info & INDEX_ALT_TID_MASK) != 0)
+ if (BTreeTupleIsPivot(itup))
return false;
/*
* Leaf tuples that are not the page high key (non-pivot tuples)
* should never be truncated. (Note that tupnatts must have been
- * inferred, rather than coming from an explicit on-disk
- * representation.)
+ * inferred, even with a posting list tuple, because only pivot
+ * tuples store tupnatts directly.)
*/
return tupnatts == natts;
}
@@ -2461,12 +2589,12 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
* non-zero, or when there is no explicit representation and the
* tuple is evidently not a pre-pg_upgrade tuple.
*
- * Prior to v11, downlinks always had P_HIKEY as their offset. Use
- * that to decide if the tuple is a pre-v11 tuple.
+ * Prior to v11, downlinks always had P_HIKEY as their offset.
+ * Accept that as an alternative indication of a valid
+ * !heapkeyspace negative infinity tuple.
*/
return tupnatts == 0 ||
- ((itup->t_info & INDEX_ALT_TID_MASK) == 0 &&
- ItemPointerGetOffsetNumber(&(itup->t_tid)) == P_HIKEY);
+ ItemPointerGetOffsetNumber(&(itup->t_tid)) == P_HIKEY;
}
else
{
@@ -2492,7 +2620,11 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
* heapkeyspace index pivot tuples, regardless of whether or not there are
* non-key attributes.
*/
- if ((itup->t_info & INDEX_ALT_TID_MASK) == 0)
+ if (!BTreeTupleIsPivot(itup))
+ return false;
+
+ /* Pivot tuple should not use posting list representation (redundant) */
+ if (BTreeTupleIsPosting(itup))
return false;
/*
@@ -2562,11 +2694,74 @@ _bt_check_third_page(Relation rel, Relation heap, bool needheaptidspace,
BTMaxItemSizeNoHeapTid(page),
RelationGetRelationName(rel)),
errdetail("Index row references tuple (%u,%u) in relation \"%s\".",
- ItemPointerGetBlockNumber(&newtup->t_tid),
- ItemPointerGetOffsetNumber(&newtup->t_tid),
+ ItemPointerGetBlockNumber(BTreeTupleGetHeapTID(newtup)),
+ ItemPointerGetOffsetNumber(BTreeTupleGetHeapTID(newtup)),
RelationGetRelationName(heap)),
errhint("Values larger than 1/3 of a buffer page cannot be indexed.\n"
"Consider a function index of an MD5 hash of the value, "
"or use full text indexing."),
errtableconstraint(heap, RelationGetRelationName(rel))));
}
+
+/*
+ * Given a basic tuple that contains key datum and posting list,
+ * build a posting tuple.
+ *
+ * Basic tuple can be a posting tuple, but we only use key part of it,
+ * all ItemPointers must be passed via ipd.
+ *
+ * If nipd == 1, fall back to building a non-posting tuple.
+ * This is necessary to avoid storage overhead after a posting tuple is vacuumed.
+ */
+IndexTuple
+BTreeFormPostingTuple(IndexTuple tuple, ItemPointerData *ipd, int nipd)
+{
+ uint32 keysize,
+ newsize = 0;
+ IndexTuple itup;
+
+ /* We only need key part of the tuple */
+ if (BTreeTupleIsPosting(tuple))
+ keysize = BTreeTupleGetPostingOffset(tuple);
+ else
+ keysize = IndexTupleSize(tuple);
+
+ Assert(nipd > 0);
+
+ /* Add space needed for posting list */
+ if (nipd > 1)
+ newsize = SHORTALIGN(keysize) + sizeof(ItemPointerData) * nipd;
+ else
+ newsize = keysize;
+
+ newsize = MAXALIGN(newsize);
+ itup = palloc0(newsize);
+ memcpy(itup, tuple, keysize);
+ itup->t_info &= ~INDEX_SIZE_MASK;
+ itup->t_info |= newsize;
+
+ if (nipd > 1)
+ {
+ /* Form posting tuple, fill posting fields */
+
+ /* Set meta info about the posting list */
+ itup->t_info |= INDEX_ALT_TID_MASK;
+ BTreeSetPostingMeta(itup, nipd, SHORTALIGN(keysize));
+
+ /* sort the list to preserve TID order invariant */
+ qsort((void *) ipd, nipd, sizeof(ItemPointerData),
+ (int (*) (const void *, const void *)) ItemPointerCompare);
+
+ /* Copy posting list into the posting tuple */
+ memcpy(BTreeTupleGetPosting(itup), ipd,
+ sizeof(ItemPointerData) * nipd);
+ }
+ else
+ {
+ /* To finish building of a non-posting tuple, copy TID from ipd */
+ itup->t_info &= ~INDEX_ALT_TID_MASK;
+ ItemPointerCopy(ipd, &itup->t_tid);
+ }
+
+ return itup;
+}
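The size arithmetic in BTreeFormPostingTuple() above can be checked in isolation. What follows is only a minimal standalone sketch, not code from the patch: the sketch_* names are hypothetical, and it assumes an 8-byte maximum alignment and a 6-byte ItemPointerData, which is typical but platform dependent.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed values, not taken from the patch: typical 64-bit build. */
#define SKETCH_MAXIMUM_ALIGNOF 8	/* stands in for MAXALIGN's boundary */
#define SKETCH_TID_SIZE 6			/* stands in for sizeof(ItemPointerData) */

/* Round len up to the next multiple of boundary (a power of two). */
static inline size_t
sketch_align(size_t len, size_t boundary)
{
	return (len + boundary - 1) & ~(boundary - 1);
}

/* Offset of the posting list from the start of the tuple (SHORTALIGN). */
static inline size_t
sketch_posting_offset(size_t keysize)
{
	return sketch_align(keysize, 2);
}

/*
 * Total tuple size for a key part of keysize bytes and nipd heap TIDs.
 * With a single TID the result is a plain non-posting tuple, so no
 * posting list space is added.
 */
static inline size_t
sketch_posting_tuple_size(size_t keysize, int nipd)
{
	size_t		newsize;

	assert(nipd > 0);

	if (nipd > 1)
		newsize = sketch_posting_offset(keysize) +
			SKETCH_TID_SIZE * (size_t) nipd;
	else
		newsize = keysize;

	return sketch_align(newsize, SKETCH_MAXIMUM_ALIGNOF);
}
```

Under these assumptions, a 16-byte key part with three TIDs needs 16 + 18 = 34 bytes, which rounds up to 40. With one TID the tuple stays at 16 bytes.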
diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index dd5315c..5eace6e 100644
--- a/src/backend/access/nbtree/nbtxlog.c
+++ b/src/backend/access/nbtree/nbtxlog.c
@@ -181,9 +181,35 @@ btree_xlog_insert(bool isleaf, bool ismeta, XLogReaderState *record)
page = BufferGetPage(buffer);
- if (PageAddItem(page, (Item) datapos, datalen, xlrec->offnum,
- false, false) == InvalidOffsetNumber)
- elog(PANIC, "btree_xlog_insert: failed to add item");
+ if (xlrec->in_posting_offset != InvalidOffsetNumber)
+ {
+ /* oposting must be at offset before new item */
+ ItemId itemid = PageGetItemId(page, OffsetNumberPrev(xlrec->offnum));
+ IndexTuple oposting = (IndexTuple) PageGetItem(page, itemid);
+ IndexTuple newitem = (IndexTuple) datapos;
+ IndexTuple nposting;
+
+ nposting = _bt_form_newposting(newitem, oposting,
+ xlrec->in_posting_offset);
+ Assert(isleaf);
+
+ Assert(MAXALIGN(IndexTupleSize(oposting)) ==
+ MAXALIGN(IndexTupleSize(nposting)));
+
+ /* replace existing posting */
+ memcpy(oposting, nposting, MAXALIGN(IndexTupleSize(nposting)));
+
+ /* insert new item */
+ if (PageAddItem(page, (Item) newitem, MAXALIGN(IndexTupleSize(newitem)),
+ xlrec->offnum, false, false) == InvalidOffsetNumber)
+ elog(PANIC, "btree_xlog_insert: failed to add item");
+ }
+ else
+ {
+ if (PageAddItem(page, (Item) datapos, datalen, xlrec->offnum,
+ false, false) == InvalidOffsetNumber)
+ elog(PANIC, "btree_xlog_insert: failed to add item");
+ }
PageSetLSN(page, lsn);
MarkBufferDirty(buffer);
@@ -265,20 +291,45 @@ btree_xlog_split(bool onleft, XLogReaderState *record)
BTPageOpaque lopaque = (BTPageOpaque) PageGetSpecialPointer(lpage);
OffsetNumber off;
IndexTuple newitem = NULL,
- left_hikey = NULL;
+ left_hikey = NULL,
+ nposting = NULL;
Size newitemsz = 0,
left_hikeysz = 0;
Page newlpage;
- OffsetNumber leftoff;
+ OffsetNumber leftoff,
+ replacepostingoff = InvalidOffsetNumber;
datapos = XLogRecGetBlockData(record, 0, &datalen);
- if (onleft)
+ if (onleft || xlrec->in_posting_offset)
{
newitem = (IndexTuple) datapos;
newitemsz = MAXALIGN(IndexTupleSize(newitem));
datapos += newitemsz;
datalen -= newitemsz;
+
+ /*
+ * Repeat logic implemented in _bt_insertonpg():
+ *
+ * If the new tuple is a duplicate with a heap TID that falls
+ * inside the range of an existing posting list tuple,
+ * generate a new posting tuple to replace the original one
+ * and update the new tuple so that its heap TID contains
+ * the rightmost heap TID of the original posting tuple.
+ */
+ if (xlrec->in_posting_offset != 0)
+ {
+ ItemId itemid = PageGetItemId(lpage, OffsetNumberPrev(xlrec->newitemoff));
+ IndexTuple oposting = (IndexTuple) PageGetItem(lpage, itemid);
+
+ nposting = _bt_form_newposting(newitem, oposting,
+ xlrec->in_posting_offset);
+
+ /* Alter new item offset, since effective new item changed */
+ replacepostingoff = OffsetNumberPrev(xlrec->newitemoff);
+
+ Assert(BTreeTupleGetNPosting(nposting) == BTreeTupleGetNPosting(oposting));
+ }
}
/* Extract left hikey and its size (assuming 16-bit alignment) */
@@ -304,6 +355,15 @@ btree_xlog_split(bool onleft, XLogReaderState *record)
Size itemsz;
IndexTuple item;
+ if (off == replacepostingoff)
+ {
+ if (PageAddItem(newlpage, (Item) nposting, MAXALIGN(IndexTupleSize(nposting)),
+ leftoff, false, false) == InvalidOffsetNumber)
+ elog(ERROR, "failed to add new item to left page after split");
+ leftoff = OffsetNumberNext(leftoff);
+ continue;
+ }
+
/* add the new item if it was inserted on left page */
if (onleft && off == xlrec->newitemoff)
{
@@ -380,14 +440,147 @@ btree_xlog_split(bool onleft, XLogReaderState *record)
}
static void
+btree_xlog_dedup(XLogReaderState *record)
+{
+ XLogRecPtr lsn = record->EndRecPtr;
+ Buffer buf;
+ Page newpage;
+ xl_btree_dedup *xlrec = (xl_btree_dedup *) XLogRecGetData(record);
+
+ if (XLogReadBufferForRedo(record, 0, &buf) == BLK_NEEDS_REDO)
+ {
+ /*
+ * Initialize a temporary empty page and copy all the items
+ * to that in item number order.
+ */
+ Page page = (Page) BufferGetPage(buf);
+ BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+ BTPageOpaque nopaque;
+ OffsetNumber offnum, minoff, maxoff;
+ BTDedupState *dedupState = NULL;
+ char *data = ((char *) xlrec + SizeOfBtreeDedup);
+ dedupInterval dedup_intervals[MaxOffsetNumber];
+ int nth_interval = 0;
+ OffsetNumber n_dedup_tups = 0;
+
+ dedupState = (BTDedupState *) palloc0(sizeof(BTDedupState));
+ dedupState->ipd = NULL;
+ dedupState->ntuples = 0;
+ dedupState->itupprev = NULL;
+ dedupState->maxitemsize = BTMaxItemSize(page);
+ dedupState->maxpostingsize = 0;
+
+ memcpy(dedup_intervals, data,
+ xlrec->n_intervals*sizeof(dedupInterval));
+
+ /* Scan over all items to see which ones can be deduplicated */
+ minoff = P_FIRSTDATAKEY(opaque);
+ maxoff = PageGetMaxOffsetNumber(page);
+ newpage = PageGetTempPageCopySpecial(page);
+ nopaque = (BTPageOpaque) PageGetSpecialPointer(newpage);
+
+ /* Make sure that new page won't have garbage flag set */
+ nopaque->btpo_flags &= ~BTP_HAS_GARBAGE;
+
+ /* Copy High Key if any */
+ if (!P_RIGHTMOST(opaque))
+ {
+ ItemId itemid = PageGetItemId(page, P_HIKEY);
+ Size itemsz = ItemIdGetLength(itemid);
+ IndexTuple item = (IndexTuple) PageGetItem(page, itemid);
+
+ if (PageAddItem(newpage, (Item) item, itemsz, P_HIKEY,
+ false, false) == InvalidOffsetNumber)
+ elog(ERROR, "failed to add highkey during deduplication");
+ }
+
+ /*
+ * Iterate over tuples on the page to deduplicate them into posting
+ * lists and insert into new page
+ */
+ for (offnum = minoff;
+ offnum <= maxoff;
+ offnum = OffsetNumberNext(offnum))
+ {
+ ItemId itemId = PageGetItemId(page, offnum);
+ IndexTuple itup = (IndexTuple) PageGetItem(page, itemId);
+
+ elog(DEBUG4, "btree_xlog_dedup. offnum %u, nth_interval %u, from %u ntups %u",
+ offnum,
+ nth_interval,
+ dedup_intervals[nth_interval].from,
+ dedup_intervals[nth_interval].ntups);
+
+ if (dedupState->itupprev == NULL)
+ {
+ /* Just set up base/first item in first iteration */
+ Assert(offnum == minoff);
+ dedupState->itupprev = CopyIndexTuple(itup);
+ dedupState->itupprev_off = offnum;
+ continue;
+ }
+
+ /*
+ * Instead of comparing the tuples' keys, which may be costly, use
+ * information from the xlog record. If the current tuple belongs to a
+ * group of deduplicated items, repeat the logic of _bt_dedup_one_page
+ * and stash it to form a posting list afterwards.
+ */
+ if (nth_interval < xlrec->n_intervals &&
+ dedupState->itupprev_off >= dedup_intervals[nth_interval].from
+ && n_dedup_tups < dedup_intervals[nth_interval].ntups)
+ {
+ _bt_stash_item_tid(dedupState, itup, InvalidOffsetNumber);
+
+ elog(DEBUG4, "btree_xlog_dedup. stash offnum %u, nth_interval %u, from %u ntups %u",
+ offnum,
+ nth_interval,
+ dedup_intervals[nth_interval].from,
+ dedup_intervals[nth_interval].ntups);
+
+ /* count first tuple in the group */
+ if (dedupState->itupprev_off == dedup_intervals[nth_interval].from)
+ n_dedup_tups++;
+
+ /* count added tuple */
+ n_dedup_tups++;
+ }
+ else
+ {
+ _bt_dedup_insert(newpage, dedupState);
+
+ /* reset state */
+ if (n_dedup_tups > 0)
+ nth_interval++;
+ n_dedup_tups = 0;
+ }
+
+ pfree(dedupState->itupprev);
+ dedupState->itupprev = CopyIndexTuple(itup);
+ dedupState->itupprev_off = offnum;
+ }
+
+ /* Handle the last item */
+ _bt_dedup_insert(newpage, dedupState);
+
+ PageRestoreTempPage(newpage, page);
+ PageSetLSN(page, lsn);
+ MarkBufferDirty(buf);
+ }
+
+ if (BufferIsValid(buf))
+ UnlockReleaseBuffer(buf);
+}
+
+static void
btree_xlog_vacuum(XLogReaderState *record)
{
XLogRecPtr lsn = record->EndRecPtr;
Buffer buffer;
Page page;
BTPageOpaque opaque;
-#ifdef UNUSED
xl_btree_vacuum *xlrec = (xl_btree_vacuum *) XLogRecGetData(record);
+#ifdef UNUSED
/*
* This section of code is thought to be no longer needed, after analysis
@@ -478,14 +671,34 @@ btree_xlog_vacuum(XLogReaderState *record)
if (len > 0)
{
- OffsetNumber *unused;
- OffsetNumber *unend;
+ if (xlrec->nremaining)
+ {
+ OffsetNumber *remainingoffset;
+ IndexTuple remaining;
+ Size itemsz;
+
+ remainingoffset = (OffsetNumber *)
+ (ptr + xlrec->ndeleted * sizeof(OffsetNumber));
+ remaining = (IndexTuple) ((char *) remainingoffset +
+ xlrec->nremaining * sizeof(OffsetNumber));
+
+ /* Handle posting tuples */
+ for (int i = 0; i < xlrec->nremaining; i++)
+ {
+ PageIndexTupleDelete(page, remainingoffset[i]);
- unused = (OffsetNumber *) ptr;
- unend = (OffsetNumber *) ((char *) ptr + len);
+ itemsz = MAXALIGN(IndexTupleSize(remaining));
- if ((unend - unused) > 0)
- PageIndexMultiDelete(page, unused, unend - unused);
+ if (PageAddItem(page, (Item) remaining, itemsz, remainingoffset[i],
+ false, false) == InvalidOffsetNumber)
+ elog(PANIC, "btree_xlog_vacuum: failed to add remaining item");
+
+ remaining = (IndexTuple) ((char *) remaining + itemsz);
+ }
+ }
+
+ if (xlrec->ndeleted)
+ PageIndexMultiDelete(page, (OffsetNumber *) ptr, xlrec->ndeleted);
}
/*
@@ -838,6 +1051,9 @@ btree_redo(XLogReaderState *record)
case XLOG_BTREE_SPLIT_R:
btree_xlog_split(false, record);
break;
+ case XLOG_BTREE_DEDUP_PAGE:
+ btree_xlog_dedup(record);
+ break;
case XLOG_BTREE_VACUUM:
btree_xlog_vacuum(record);
break;
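To make the intent of the dedup_intervals array used by btree_xlog_dedup() easier to review, here is a simplified model of replaying (from, ntups) intervals without comparing keys. It is a sketch under stated assumptions, not the patch's loop: SketchInterval and sketch_replay_intervals() are hypothetical names, and the intervals are assumed to be sorted by offset and non-overlapping.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for dedupInterval: ntups items starting at 'from'. */
typedef struct SketchInterval
{
	unsigned short from;
	unsigned short ntups;
} SketchInterval;

/*
 * For each page offset in [minoff, maxoff], record which output tuple the
 * item ends up in.  Items covered by one interval share one output tuple;
 * every other item keeps a tuple of its own.  Returns the number of output
 * tuples.  group_of must have room for (maxoff - minoff + 1) entries.
 */
static inline int
sketch_replay_intervals(int minoff, int maxoff,
						const SketchInterval *iv, int niv,
						int *group_of)
{
	int			nout = 0;
	int			cur = 0;

	for (int off = minoff; off <= maxoff; off++)
	{
		bool		inside;

		/* Skip intervals that end before this offset */
		while (cur < niv && off >= iv[cur].from + iv[cur].ntups)
			cur++;

		inside = (cur < niv && off >= iv[cur].from);

		if (inside && off > iv[cur].from)
			group_of[off - minoff] = nout - 1;	/* continue current group */
		else
			group_of[off - minoff] = nout++;	/* start a new output tuple */
	}

	return nout;
}
```

In this model, six items with a single interval {from = 2, ntups = 3} collapse to four output tuples, since three items are merged into one.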
diff --git a/src/backend/access/rmgrdesc/nbtdesc.c b/src/backend/access/rmgrdesc/nbtdesc.c
index 4ee6d04..7351cad 100644
--- a/src/backend/access/rmgrdesc/nbtdesc.c
+++ b/src/backend/access/rmgrdesc/nbtdesc.c
@@ -30,7 +30,8 @@ btree_desc(StringInfo buf, XLogReaderState *record)
{
xl_btree_insert *xlrec = (xl_btree_insert *) rec;
- appendStringInfo(buf, "off %u", xlrec->offnum);
+ appendStringInfo(buf, "off %u; in_posting_offset %u",
+ xlrec->offnum, xlrec->in_posting_offset);
break;
}
case XLOG_BTREE_SPLIT_L:
@@ -38,16 +39,29 @@ btree_desc(StringInfo buf, XLogReaderState *record)
{
xl_btree_split *xlrec = (xl_btree_split *) rec;
- appendStringInfo(buf, "level %u, firstright %d, newitemoff %d",
- xlrec->level, xlrec->firstright, xlrec->newitemoff);
+ appendStringInfo(buf, "level %u, firstright %d, newitemoff %d, in_posting_offset %d",
+ xlrec->level,
+ xlrec->firstright,
+ xlrec->newitemoff,
+ xlrec->in_posting_offset);
+ break;
+ }
+ case XLOG_BTREE_DEDUP_PAGE:
+ {
+ xl_btree_dedup *xlrec = (xl_btree_dedup *) rec;
+
+ appendStringInfo(buf, "items were deduplicated to %d items",
+ xlrec->n_intervals);
break;
}
case XLOG_BTREE_VACUUM:
{
xl_btree_vacuum *xlrec = (xl_btree_vacuum *) rec;
- appendStringInfo(buf, "lastBlockVacuumed %u",
- xlrec->lastBlockVacuumed);
+ appendStringInfo(buf, "lastBlockVacuumed %u; nremaining %u; ndeleted %u",
+ xlrec->lastBlockVacuumed,
+ xlrec->nremaining,
+ xlrec->ndeleted);
break;
}
case XLOG_BTREE_DELETE:
@@ -131,6 +145,9 @@ btree_identify(uint8 info)
case XLOG_BTREE_SPLIT_R:
id = "SPLIT_R";
break;
+ case XLOG_BTREE_DEDUP_PAGE:
+ id = "DEDUPLICATE";
+ break;
case XLOG_BTREE_VACUUM:
id = "VACUUM";
break;
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 4a80e84..adf52c9 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -234,8 +234,7 @@ typedef struct BTMetaPageData
* t_tid | t_info | key values | INCLUDE columns, if any
*
* t_tid points to the heap TID, which is a tiebreaker key column as of
- * BTREE_VERSION 4. Currently, the INDEX_ALT_TID_MASK status bit is never
- * set for non-pivot tuples.
+ * BTREE_VERSION 4.
*
* All other types of index tuples ("pivot" tuples) only have key columns,
* since pivot tuples only exist to represent how the key space is
@@ -252,6 +251,38 @@ typedef struct BTMetaPageData
* omitted rather than truncated, since its representation is different to
* the non-pivot representation.)
*
+ * Non-pivot posting tuple format:
+ * t_tid | t_info | key values | INCLUDE columns, if any | posting_list[]
+ *
+ * In order to store duplicate keys more efficiently, we use a special tuple
+ * format: posting tuples. posting_list is an array of ItemPointerData.
+ *
+ * Deduplication never applies to unique indexes or indexes with INCLUDEd
+ * columns.
+ *
+ * To distinguish posting tuples we use the INDEX_ALT_TID_MASK flag in t_info
+ * and the BT_IS_POSTING flag in t_tid.
+ * These flags redefine the content of the posting tuple's t_tid:
+ * - t_tid.ip_blkid contains the offset of the posting list.
+ * - t_tid offset field contains the number of posting items this tuple contains.
+ *
+ * The 12 least significant offset bits from t_tid are used to represent
+ * the number of posting items in posting tuples, leaving 4 status
+ * bits (BT_RESERVED_OFFSET_MASK bits), 3 of which are reserved for
+ * future use.
+ * BT_N_POSTING_OFFSET_MASK is large enough to store any number of posting
+ * items, which is constrained by BTMaxItemSize.
+
+ * If a page contains so many duplicates that they do not fit into one posting
+ * tuple (bounded by BTMaxItemSize and ), the page may contain several posting
+ * tuples with the same key.
+ * A page can also contain both posting and non-posting tuples with the same key.
+ * Currently, posting tuples always contain at least two TIDs in the posting
+ * list.
+ *
+ * Posting tuples always have the same number of attributes as the index has
+ * generally.
+ *
* Pivot tuple format:
*
* t_tid | t_info | key values | [heap TID]
@@ -281,23 +312,146 @@ typedef struct BTMetaPageData
* bits (BT_RESERVED_OFFSET_MASK bits), 3 of which that are reserved for
* future use. BT_N_KEYS_OFFSET_MASK should be large enough to store any
* number of columns/attributes <= INDEX_MAX_KEYS.
+ * BT_IS_POSTING bit must be unset for pivot tuples, since we use it
+ * to distinguish posting tuples from pivot tuples.
*
* Note well: The macros that deal with the number of attributes in tuples
- * assume that a tuple with INDEX_ALT_TID_MASK set must be a pivot tuple,
- * and that a tuple without INDEX_ALT_TID_MASK set must be a non-pivot
- * tuple (or must have the same number of attributes as the index has
- * generally in the case of !heapkeyspace indexes). They will need to be
- * updated if non-pivot tuples ever get taught to use INDEX_ALT_TID_MASK
- * for something else.
+ * assume that a tuple with INDEX_ALT_TID_MASK set must be a pivot tuple or
+ * non-pivot posting tuple, and that a tuple without INDEX_ALT_TID_MASK set
+ * must be a non-pivot tuple (or must have the same number of attributes as
+ * the index has generally in the case of !heapkeyspace indexes).
*/
#define INDEX_ALT_TID_MASK INDEX_AM_RESERVED_BIT
/* Item pointer offset bits */
#define BT_RESERVED_OFFSET_MASK 0xF000
#define BT_N_KEYS_OFFSET_MASK 0x0FFF
+#define BT_N_POSTING_OFFSET_MASK 0x0FFF
#define BT_HEAP_TID_ATTR 0x1000
+#define BT_IS_POSTING 0x2000
+
+/*
+ * MaxPostingIndexTuplesPerPage is an upper bound on the number of tuples
+ * that can fit on one btree leaf page.
+ *
+ * Btree leaf pages may contain posting tuples, which store duplicates
+ * more efficiently, so MaxPostingIndexTuplesPerPage is larger than
+ * MaxIndexTuplesPerPage.
+ *
+ * Each leaf page must contain at least three items, so estimate it as
+ * if we have three posting tuples with minimal size keys.
+ */
+#define MaxPostingIndexTuplesPerPage \
+ ((int) ((BLCKSZ - SizeOfPageHeaderData - \
+ 3*((MAXALIGN(sizeof(IndexTupleData) + 1) + sizeof(ItemIdData))) )) / \
+ (sizeof(ItemPointerData)))
+
+/*
+ * Helper for BTDedupState.
+ * Each entry represents a group of 'ntups' consecutive items starting on
+ * 'from' offset that were deduplicated into a single posting tuple.
+ */
+typedef struct dedupInterval
+{
+ OffsetNumber from;
+ OffsetNumber ntups;
+} dedupInterval;
+
+/*
+ * Btree-private state needed to build posting tuples.
+ * ipd is a posting list - an array of ItemPointerData.
+ *
+ * Iterating over tuples during index build or applying deduplication to a
+ * single page, we remember a tuple in itupprev, then compare the next one
+ * with it. If tuples are equal, save their TIDs in the posting list.
+ * ntuples contains the size of the posting list.
+ *
+ * Use maxitemsize and maxpostingsize to ensure that resulting posting tuple
+ * will satisfy BTMaxItemSize.
+ */
+typedef struct BTDedupState
+{
+ Size maxitemsize;
+ Size maxpostingsize;
+ IndexTuple itupprev;
+
+ /*
+ * array with info about deduplicated items on the page.
+ *
+ * It contains one entry for each group of consecutive items that
+ * were deduplicated into a single posting tuple.
+ *
+ * This array is saved in the xlog record, which allows deduplication
+ * to be replayed faster, without actually comparing the tuples' keys.
+ */
+ dedupInterval dedup_intervals[MaxOffsetNumber];
+ /* current number of items in dedup_intervals array */
+ int n_intervals;
+ /* temp state variable to keep a 'possible' start of dedup interval */
+ OffsetNumber itupprev_off;
+
+ int ntuples;
+ Size alltupsize;
+ ItemPointerData *ipd;
+} BTDedupState;
+
+/*
+ * N.B.: BTreeTupleIsPivot() should only be used in code that deals with
+ * heapkeyspace indexes specifically. BTreeTupleIsPosting() works with all
+ * nbtree indexes, though.
+ */
+#define BTreeTupleIsPivot(itup) \
+ ( \
+ ((itup)->t_info & INDEX_ALT_TID_MASK && \
+ ((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) == 0))\
+ )
+#define BTreeTupleIsPosting(itup) \
+ ( \
+ ((itup)->t_info & INDEX_ALT_TID_MASK && \
+ ((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) != 0))\
+ )
-/* Get/set downlink block number */
+#define BTreeTupleClearBtIsPosting(itup) \
+ do { \
+ ItemPointerSetOffsetNumber(&(itup)->t_tid, \
+ ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & ~BT_IS_POSTING); \
+ } while(0)
+
+#define BTreeTupleGetNPosting(itup) \
+ ( \
+ AssertMacro(BTreeTupleIsPosting(itup)), \
+ ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_N_POSTING_OFFSET_MASK \
+ )
+#define BTreeTupleSetNPosting(itup, n) \
+ do { \
+ ItemPointerSetOffsetNumber(&(itup)->t_tid, (n) & BT_N_POSTING_OFFSET_MASK); \
+ Assert((itup)->t_info & INDEX_ALT_TID_MASK); \
+ Assert(!((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) != 0)); \
+ ItemPointerSetOffsetNumber(&(itup)->t_tid, \
+ ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) | BT_IS_POSTING); \
+ } while(0)
+
+/*
+ * If tuple is posting, t_tid.ip_blkid contains offset of the posting list
+ */
+#define BTreeTupleGetPostingOffset(itup) \
+ ( \
+ AssertMacro(BTreeTupleIsPosting(itup)), \
+ ItemPointerGetBlockNumberNoCheck(&((itup)->t_tid)) \
+ )
+#define BTreeSetPostingMeta(itup, nposting, off) \
+ do { \
+ BTreeTupleSetNPosting(itup, nposting); \
+ Assert(BTreeTupleIsPosting(itup)); \
+ ItemPointerSetBlockNumber(&((itup)->t_tid), (off)); \
+ } while(0)
+
+#define BTreeTupleGetPosting(itup) \
+ (ItemPointer) ((char*) (itup) + BTreeTupleGetPostingOffset(itup))
+#define BTreeTupleGetPostingN(itup,n) \
+ (BTreeTupleGetPosting(itup) + (n))
+
+/* Get/set downlink block number */
#define BTreeInnerTupleGetDownLink(itup) \
ItemPointerGetBlockNumberNoCheck(&((itup)->t_tid))
#define BTreeInnerTupleSetDownLink(itup, blkno) \
@@ -326,40 +480,73 @@ typedef struct BTMetaPageData
*/
#define BTreeTupleGetNAtts(itup, rel) \
( \
- (itup)->t_info & INDEX_ALT_TID_MASK ? \
+ ((itup)->t_info & INDEX_ALT_TID_MASK && \
+ ((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) == 0)) ? \
( \
ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_N_KEYS_OFFSET_MASK \
) \
: \
IndexRelationGetNumberOfAttributes(rel) \
)
-#define BTreeTupleSetNAtts(itup, n) \
- do { \
- (itup)->t_info |= INDEX_ALT_TID_MASK; \
- ItemPointerSetOffsetNumber(&(itup)->t_tid, (n) & BT_N_KEYS_OFFSET_MASK); \
- } while(0)
+
+static inline void
+BTreeTupleSetNAtts(IndexTuple itup, int n)
+{
+ Assert(!BTreeTupleIsPosting(itup));
+ itup->t_info |= INDEX_ALT_TID_MASK;
+ ItemPointerSetOffsetNumber(&itup->t_tid, n & BT_N_KEYS_OFFSET_MASK);
+}
/*
- * Get tiebreaker heap TID attribute, if any. Macro works with both pivot
- * and non-pivot tuples, despite differences in how heap TID is represented.
+ * Get tiebreaker heap TID attribute, if any. Works with both pivot and
+ * non-pivot tuples, despite differences in how heap TID is represented.
+ *
+ * This returns the first/lowest heap TID in the case of a posting list tuple.
*/
-#define BTreeTupleGetHeapTID(itup) \
- ( \
- (itup)->t_info & INDEX_ALT_TID_MASK && \
- (ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_HEAP_TID_ATTR) != 0 ? \
- ( \
- (ItemPointer) (((char *) (itup) + IndexTupleSize(itup)) - \
- sizeof(ItemPointerData)) \
- ) \
- : (itup)->t_info & INDEX_ALT_TID_MASK ? NULL : (ItemPointer) &((itup)->t_tid) \
- )
+static inline ItemPointer
+BTreeTupleGetHeapTID(IndexTuple itup)
+{
+ if (BTreeTupleIsPivot(itup))
+ {
+ /* Pivot tuple heap TID representation? */
+ if ((ItemPointerGetOffsetNumberNoCheck(&itup->t_tid) &
+ BT_HEAP_TID_ATTR) != 0)
+ return (ItemPointer) ((char *) itup + IndexTupleSize(itup) -
+ sizeof(ItemPointerData));
+
+ /* Heap TID attribute was truncated */
+ return NULL;
+ }
+ else if (BTreeTupleIsPosting(itup))
+ return BTreeTupleGetPosting(itup);
+
+ return &(itup->t_tid);
+}
+
+/*
+ * Get maximum heap TID attribute, which could be the only TID in the case of
+ * a non-pivot tuple that does not have a posting list. Works with
+ * non-pivot tuples only.
+ */
+static inline ItemPointer
+BTreeTupleGetMaxTID(IndexTuple itup)
+{
+ Assert(!BTreeTupleIsPivot(itup));
+
+ if (BTreeTupleIsPosting(itup))
+ return (ItemPointer) (BTreeTupleGetPosting(itup) +
+ (BTreeTupleGetNPosting(itup) - 1));
+
+ return &(itup->t_tid);
+}
+
/*
* Set the heap TID attribute for a tuple that uses the INDEX_ALT_TID_MASK
- * representation (currently limited to pivot tuples)
+ * representation
*/
#define BTreeTupleSetAltHeapTID(itup) \
do { \
- Assert((itup)->t_info & INDEX_ALT_TID_MASK); \
+ Assert(BTreeTupleIsPivot(itup)); \
ItemPointerSetOffsetNumber(&(itup)->t_tid, \
ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) | BT_HEAP_TID_ATTR); \
} while(0)
@@ -500,6 +687,13 @@ typedef struct BTInsertStateData
Buffer buf;
/*
+ * If _bt_binsrch_insert() found the insert location inside an existing
+ * posting list, save the position inside the list. This will be -1 in rare
+ * cases where the overlapping posting list is LP_DEAD.
+ */
+ int in_posting_offset;
+
+ /*
* Cache of bounds within the current buffer. Only used for insertions
* where _bt_check_unique is called. See _bt_binsrch_insert and
* _bt_findinsertloc for details.
@@ -534,7 +728,9 @@ typedef BTInsertStateData *BTInsertState;
* If we are doing an index-only scan, we save the entire IndexTuple for each
* matched item, otherwise only its heap TID and offset. The IndexTuples go
* into a separate workspace array; each BTScanPosItem stores its tuple's
- * offset within that array.
+ * offset within that array. Posting list tuples store a version of the
+ * tuple that does not include the posting list, allowing the same key to be
+ * returned for each logical tuple associated with the posting list.
*/
typedef struct BTScanPosItem /* what we remember about each match */
@@ -563,9 +759,13 @@ typedef struct BTScanPosData
/*
* If we are doing an index-only scan, nextTupleOffset is the first free
- * location in the associated tuple storage workspace.
+ * location in the associated tuple storage workspace. Posting list
+ * tuples need postingTupleOffset to store the current location of the
+ * tuple that is returned multiple times (once per heap TID in posting
+ * list).
*/
int nextTupleOffset;
+ int postingTupleOffset;
/*
* The items array is always ordered in index order (ie, increasing
@@ -578,7 +778,7 @@ typedef struct BTScanPosData
int lastItem; /* last valid index in items[] */
int itemIndex; /* current index in items[] */
- BTScanPosItem items[MaxIndexTuplesPerPage]; /* MUST BE LAST */
+ BTScanPosItem items[MaxPostingIndexTuplesPerPage]; /* MUST BE LAST */
} BTScanPosData;
typedef BTScanPosData *BTScanPos;
@@ -730,9 +930,13 @@ extern void _bt_parallel_advance_array_keys(IndexScanDesc scan);
*/
extern bool _bt_doinsert(Relation rel, IndexTuple itup,
IndexUniqueCheck checkUnique, Relation heapRel);
-extern void _bt_finish_split(Relation rel, Buffer bbuf, BTStack stack);
extern Buffer _bt_getstackbuf(Relation rel, BTStack stack, BlockNumber child);
-
+extern void _bt_finish_split(Relation rel, Buffer bbuf, BTStack stack);
+extern IndexTuple _bt_form_newposting(IndexTuple itup, IndexTuple oposting,
+ OffsetNumber in_posting_offset);
+extern Size _bt_dedup_insert(Page page, BTDedupState *dedupState);
+extern void _bt_stash_item_tid(BTDedupState *dedupState, IndexTuple itup,
+ OffsetNumber itup_offnum);
/*
* prototypes for functions in nbtsplitloc.c
*/
@@ -762,6 +966,8 @@ extern void _bt_delitems_delete(Relation rel, Buffer buf,
OffsetNumber *itemnos, int nitems, Relation heapRel);
extern void _bt_delitems_vacuum(Relation rel, Buffer buf,
OffsetNumber *itemnos, int nitems,
+ OffsetNumber *remainingoffset,
+ IndexTuple *remaining, int nremaining,
BlockNumber lastBlockVacuumed);
extern int _bt_pagedel(Relation rel, Buffer buf);
@@ -812,6 +1018,8 @@ extern bool _bt_check_natts(Relation rel, bool heapkeyspace, Page page,
OffsetNumber offnum);
extern void _bt_check_third_page(Relation rel, Relation heap,
bool needheaptidspace, Page page, IndexTuple newtup);
+extern IndexTuple BTreeFormPostingTuple(IndexTuple tuple, ItemPointerData *ipd,
+ int nipd);
/*
* prototypes for functions in nbtvalidate.c
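As a reading aid for the nbtree.h changes above, the following sketch models how the 16-bit t_tid offset field is interpreted once INDEX_ALT_TID_MASK is set. The mask values are the ones in the patch; the sketch_* functions are hypothetical and work on a bare integer, whereas the real macros operate on an IndexTuple.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Values copied from the patch's nbtree.h definitions */
#define SKETCH_N_POSTING_OFFSET_MASK	0x0FFF
#define SKETCH_HEAP_TID_ATTR			0x1000
#define SKETCH_IS_POSTING				0x2000

/* Build the offset field of a posting tuple holding n heap TIDs. */
static inline uint16_t
sketch_encode_nposting(unsigned n)
{
	assert(n <= SKETCH_N_POSTING_OFFSET_MASK);
	return (uint16_t) ((n & SKETCH_N_POSTING_OFFSET_MASK) | SKETCH_IS_POSTING);
}

/* alt_tid models "t_info & INDEX_ALT_TID_MASK" being set. */
static inline bool
sketch_is_posting(bool alt_tid, uint16_t offset_field)
{
	return alt_tid && (offset_field & SKETCH_IS_POSTING) != 0;
}

static inline bool
sketch_is_pivot(bool alt_tid, uint16_t offset_field)
{
	return alt_tid && (offset_field & SKETCH_IS_POSTING) == 0;
}

/* Number of heap TIDs stored in a posting tuple's posting list. */
static inline unsigned
sketch_get_nposting(uint16_t offset_field)
{
	return offset_field & SKETCH_N_POSTING_OFFSET_MASK;
}
```

A tuple without the alternative-TID flag is neither pivot nor posting in this model, which matches the plain non-pivot case where t_tid is an ordinary heap TID.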
diff --git a/src/include/access/nbtxlog.h b/src/include/access/nbtxlog.h
index 91b9ee0..7d41adc 100644
--- a/src/include/access/nbtxlog.h
+++ b/src/include/access/nbtxlog.h
@@ -28,7 +28,8 @@
#define XLOG_BTREE_INSERT_META 0x20 /* same, plus update metapage */
#define XLOG_BTREE_SPLIT_L 0x30 /* add index tuple with split */
#define XLOG_BTREE_SPLIT_R 0x40 /* as above, new item on right */
-/* 0x50 and 0x60 are unused */
+#define XLOG_BTREE_DEDUP_PAGE 0x50 /* compactify tuples on the page */
+/* 0x60 is unused */
#define XLOG_BTREE_DELETE 0x70 /* delete leaf index tuples for a page */
#define XLOG_BTREE_UNLINK_PAGE 0x80 /* delete a half-dead page */
#define XLOG_BTREE_UNLINK_PAGE_META 0x90 /* same, and update metapage */
@@ -61,16 +62,21 @@ typedef struct xl_btree_metadata
* This data record is used for INSERT_LEAF, INSERT_UPPER, INSERT_META.
* Note that INSERT_META implies it's not a leaf page.
*
- * Backup Blk 0: original page (data contains the inserted tuple)
+ * Backup Blk 0: original page (data contains the inserted tuple);
+ * if in_posting_offset is valid, this is an insertion
+ * into existing posting tuple at offnum.
+ * redo must repeat logic of bt_insertonpg().
* Backup Blk 1: child's left sibling, if INSERT_UPPER or INSERT_META
* Backup Blk 2: xl_btree_metadata, if INSERT_META
+ *
*/
typedef struct xl_btree_insert
{
OffsetNumber offnum;
+ OffsetNumber in_posting_offset;
} xl_btree_insert;
-#define SizeOfBtreeInsert (offsetof(xl_btree_insert, offnum) + sizeof(OffsetNumber))
+#define SizeOfBtreeInsert (offsetof(xl_btree_insert, in_posting_offset) + sizeof(OffsetNumber))
/*
* On insert with split, we save all the items going into the right sibling
@@ -95,6 +101,11 @@ typedef struct xl_btree_insert
* An IndexTuple representing the high key of the left page must follow with
* either variant.
*
+ * In case, split included insertion into the middle of the posting tuple, and
+ * thus required posting tuple replacement, it also contains 'in_posting_offset',
+ * that is used to form replacing tuple and repeat bt_insertonpg() logic.
+ * It is added to xlog only if replacing item remains on the left page.
+ *
* Backup Blk 1: new right page
*
* The right page's data portion contains the right page's tuples in the form
@@ -112,9 +123,26 @@ typedef struct xl_btree_split
uint32 level; /* tree level of page being split */
OffsetNumber firstright; /* first item moved to right page */
OffsetNumber newitemoff; /* new item's offset (useful for _L variant) */
+ OffsetNumber in_posting_offset; /* offset inside posting tuple */
} xl_btree_split;
-#define SizeOfBtreeSplit (offsetof(xl_btree_split, newitemoff) + sizeof(OffsetNumber))
+#define SizeOfBtreeSplit (offsetof(xl_btree_split, in_posting_offset) + sizeof(OffsetNumber))
+
+/*
+ * When page is deduplicated, consecutive groups of tuples with equal keys
+ * are compactified into posting tuples.
+ * The WAL record keeps number of resulting posting tuples - n_intervals
+ * followed by array of dedupInterval structures, that hold information
+ * needed to replay page deduplication without extra comparisons of tuples keys.
+ */
+typedef struct xl_btree_dedup
+{
+ int n_intervals;
+
+ /* TARGET DEDUP INTERVALS FOLLOW AT THE END */
+} xl_btree_dedup;
+#define SizeOfBtreeDedup (sizeof(int))
+
/*
* This is what we need to know about delete of individual leaf index tuples.
@@ -172,10 +200,19 @@ typedef struct xl_btree_vacuum
{
BlockNumber lastBlockVacuumed;
- /* TARGET OFFSET NUMBERS FOLLOW */
+ /*
+ * This field helps us to find beginning of the remaining tuples from
+ * postings which follow array of offset numbers.
+ */
+ uint32 nremaining;
+ uint32 ndeleted;
+
+ /* REMAINING OFFSET NUMBERS FOLLOW (nremaining values) */
+ /* REMAINING TUPLES TO INSERT FOLLOW (if nremaining > 0) */
+ /* TARGET OFFSET NUMBERS FOLLOW (if any) */
} xl_btree_vacuum;
-#define SizeOfBtreeVacuum (offsetof(xl_btree_vacuum, lastBlockVacuumed) + sizeof(BlockNumber))
+#define SizeOfBtreeVacuum (offsetof(xl_btree_vacuum, ndeleted) + sizeof(BlockNumber))
/*
* This is what we need to know about marking an empty branch for deletion.
diff --git a/src/tools/valgrind.supp b/src/tools/valgrind.supp
index ec47a22..71a03e3 100644
--- a/src/tools/valgrind.supp
+++ b/src/tools/valgrind.supp
@@ -212,3 +212,24 @@
Memcheck:Cond
fun:PyObject_Realloc
}
+
+# Temporarily work around bug in datum_image_eq's handling of the cstring
+# (typLen == -2) case. datumIsEqual() is not affected, but also doesn't handle
+# TOAST'ed values correctly.
+#
+# FIXME: Remove both suppressions when bug is fixed on master branch
+{
+ temporary_workaround_1
+ Memcheck:Addr1
+ fun:bcmp
+ fun:datum_image_eq
+ fun:_bt_keep_natts_fast
+}
+
+{
+ temporary_workaround_8
+ Memcheck:Addr8
+ fun:bcmp
+ fun:datum_image_eq
+ fun:_bt_keep_natts_fast
+}
On Tue, Sep 17, 2019 at 9:43 AM Anastasia Lubennikova
<a.lubennikova@postgrespro.ru> wrote:
3. Third, there is incremental writing of the page itself -- avoiding
using a temp buffer. Not sure where I stand on this.
I think it's a good idea. memmove must be much faster than copying
items tuple by tuple.
I'll send next patch by the end of the week.
I think that the biggest problem is that we copy all of the tuples,
including existing posting list tuples that can't be merged with
anything. Even if you assume that we'll never finish early (e.g. by
using logic like the "if (pagesaving >= newitemsz) deduplicate =
false;" thing), this can still unnecessarily slow down deduplication.
Very often, _bt_dedup_one_page() is called when 1/2 - 2/3 of the
space on the page is already used by a small number of very large
posting list tuples.
The loop within _bt_dedup_one_page() is very confusing in both v13 and
v14 -- I couldn't figure out why the accounting worked like this:
I'll look at it.
I'm currently working on merging my refactored version of
_bt_dedup_one_page() with your v15 WAL-logging. This is a bit tricky.
(I have finished merging the other WAL-logging stuff, though -- that
was easy.)
The general idea is that the loop in _bt_dedup_one_page() now
explicitly operates with a "base" tuple, rather than *always* saving
the prev tuple from the last loop iteration. We always have a "pending
posting list", which won't be written-out as a posting list if it
isn't possible to merge at least one existing page item. The "base"
tuple doesn't change. "pagesaving" space accounting works in a way
that doesn't care about whether or not the base tuple was already a
posting list -- it saves the size of the IndexTuple without any
existing posting list size, and calculates the contribution to the
total size of the new posting list separately (heap TIDs from the
original base tuple and subsequent tuples are counted together).
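To make the accounting described above concrete, here is a minimal standalone sketch. It is not code from the patch: SketchItem, the sketch_* functions, the 6-byte TID size and the 8-byte alignment macro are all assumptions made for illustration, and it ignores line pointers and the 1/3-of-a-page tuple size limit. The leftmost tuple of each group of equal keys acts as the base, its size is counted without any existing posting list, and the heap TIDs of the base tuple and of every merged tuple are counted together.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_TID_SIZE 6   /* assumed bytes per heap TID */
/* Assumed stand-in for MAXALIGN(): round up to a multiple of 8 */
#define SKETCH_ALIGN(sz) (((size_t) (sz) + 7) & ~(size_t) 7)

/* Hypothetical, simplified view of one leaf page item */
typedef struct SketchItem
{
    int     key;        /* stand-in for the index key */
    size_t  keysize;    /* header plus key bytes, without any posting list */
    int     ntids;      /* 1 for a plain tuple, >1 for a posting list tuple */
} SketchItem;

/* Size of a tuple holding ntids heap TIDs for the given key size */
static size_t
sketch_tuple_size(size_t keysize, int ntids)
{
    if (ntids == 1)
        return SKETCH_ALIGN(keysize);   /* single TID lives in the header */
    return SKETCH_ALIGN(SKETCH_ALIGN(keysize) +
                        (size_t) ntids * SKETCH_TID_SIZE);
}

/*
 * Walk the items in page order and return the bytes that merging groups of
 * equal keys would save.  The leftmost item of each group is the base.
 */
static size_t
sketch_dedup_saving(const SketchItem *items, int nitems)
{
    size_t  pagesaving = 0;
    int     i = 0;

    assert(items != NULL || nitems == 0);

    while (i < nitems)
    {
        const SketchItem *base = &items[i];
        size_t  oldsize = sketch_tuple_size(base->keysize, base->ntids);
        int     ntids = base->ntids;
        int     nmerged = 1;

        /* Add every following item with the same key to the pending list */
        for (i++; i < nitems && items[i].key == base->key; i++)
        {
            oldsize += sketch_tuple_size(items[i].keysize, items[i].ntids);
            ntids += items[i].ntids;
            nmerged++;
        }

        /* Only count a pending posting list that merged at least one item */
        if (nmerged > 1)
        {
            size_t  newsize = sketch_tuple_size(base->keysize, ntids);

            if (newsize < oldsize)
                pagesaving += oldsize - newsize;
        }
    }

    return pagesaving;
}
```

With two plain tuples and one three-TID posting list tuple for the same key (key size 16), the three items occupy 72 bytes before merging and 48 bytes as a single five-TID posting list, so the sketch reports a saving of 24 bytes. A key that appears only once contributes nothing.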
This has a number of advantages:
* The loop is a lot clearer now, and seems to have slightly better
space utilization because of improved accounting (with or without the
"if (pagesaving >= newitemsz) deduplicate = false;" thing).
* I think that we're going to need to be disciplined about which tuple
is the "base" tuple for correctness reasons -- we should always use
the leftmost existing tuple to form a new posting list tuple. I am
concerned about rare cases where we deduplicate tuples that are equal
according to _bt_keep_natts_fast()/datum_image_eq() that nonetheless
have different sizes (and are not bitwise equal). There are rare cases
involving TOAST compression where that is just about possible (see the
temp comments I added to _bt_keep_natts_fast() in the patch).
* It's clearly faster, because there is far less palloc() overhead --
the "land" unlogged table test completes in about 95.5% of the time
taken by v15 (I disabled "if (pagesaving >= newitemsz) deduplicate =
false;" for both versions here, to keep it simple and fair).
This also suggests that making _bt_dedup_one_page() do raw page adds
and page deletes to the page in shared_buffers (i.e. don't use a temp
buffer page) could pay off. As I went into at the start of this
e-mail, unnecessarily doing expensive things like copying large
posting lists around is a real concern. Even if it isn't truly useful
for _bt_dedup_one_page() to operate in a very incremental fashion,
incrementalism is probably still a good thing to aim for -- it seems
to make deduplication faster in all cases.
--
Peter Geoghegan
On Wed, Sep 18, 2019 at 10:43 AM Peter Geoghegan <pg@bowt.ie> wrote:
This also suggests that making _bt_dedup_one_page() do raw page adds
and page deletes to the page in shared_buffers (i.e. don't use a temp
buffer page) could pay off. As I went into at the start of this
e-mail, unnecessarily doing expensive things like copying large
posting lists around is a real concern. Even if it isn't truly useful
for _bt_dedup_one_page() to operate in a very incremental fashion,
incrementalism is probably still a good thing to aim for -- it seems
to make deduplication faster in all cases.
I think that I forgot to mention that I am concerned that the
kill_prior_tuple/LP_DEAD optimization could be applied less often
because _bt_dedup_one_page() operates too aggressively. That is a big
part of my general concern.
Maybe I'm wrong about this -- who knows? I definitely think that
LP_DEAD setting by _bt_check_unique() is generally a lot more
important than LP_DEAD setting by the kill_prior_tuple optimization,
and the patch won't affect unique indexes. Only very serious
benchmarking can give us a clear answer, though.
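For reference, the README text in the attached patch says that a posting list tuple can only have its LP_DEAD bit set when every "logical" tuple it represents is known dead. A minimal sketch of that rule follows. The function name and the boolean array are hypothetical and are not nbtree code; this only illustrates why a page with more posting lists might have fewer items eligible for LP_DEAD setting.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical helper: tid_is_dead[i] says whether the i-th heap TID of an
 * index item is known dead.  A plain tuple has ntids == 1.
 */
static bool
sketch_can_set_lp_dead(const bool *tid_is_dead, int ntids)
{
    assert(ntids > 0);

    for (int i = 0; i < ntids; i++)
    {
        /* One live TID is enough to keep the whole item alive */
        if (!tid_is_dead[i])
            return false;
    }

    return true;
}
```

A plain tuple whose single heap TID is dead qualifies immediately, while a posting list with one live TID among many dead ones does not.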
--
Peter Geoghegan
On Wed, Sep 18, 2019 at 10:43 AM Peter Geoghegan <pg@bowt.ie> wrote:
I'm currently working on merging my refactored version of
_bt_dedup_one_page() with your v15 WAL-logging. This is a bit tricky.
(I have finished merging the other WAL-logging stuff, though -- that
was easy.)
I attach version 16. This revision merges your recent work on WAL
logging with my recent work on simplifying _bt_dedup_one_page(). See
my e-mail from earlier today for details.
Hopefully this will be a bit easier to work with when you go to make
_bt_dedup_one_page() do raw PageIndexMultiDelete() + PageAddItem()
calls against the page contained in a buffer directly (rather than
using a temp version of the page in local memory in the style of
_bt_split()). I find the loop within _bt_dedup_one_page() much easier
to follow now.
While I'm looking forward to seeing the
PageIndexMultiDelete()/PageAddItem() approach that you come up with,
the basic design of _bt_dedup_one_page() seems to be in much better
shape today than it was a few weeks ago. I am going to spend the next
few days teaching _bt_dedup_one_page() about space utilization. I'll
probably make it respect a fillfactor-style target. I've noticed that
it is often too aggressive about filling a page, though less often it
actually shows the opposite problem: it fails to use more than about
2/3 of the page for the same value, again and again (must be something
to do with the exact width of the tuples). In general,
_bt_dedup_one_page() should know a few things about what nbtsplitloc.c
will do when the page is very likely to be split soon.
I'll also spend some more time working on the opclass infrastructure
that we need to disable deduplication with datatypes where it is
unsafe [1].
Other changes:
* qsort() is no longer used by BTreeFormPostingTuple() in v16 -- we
can easily make sorting the array of heap TIDs the caller's responsibility.
Since the heap TID column is sorted in ascending order among
duplicates on a page, and since TIDs within individual posting lists
are also sorted in ascending order, there is no need to resort. I
added a new assertion to BTreeFormPostingTuple() that verifies that
its caller actually gets it right.
* The new nbtpage.c/VACUUM code has been tweaked to minimize the
changes required against master. Nothing significant, though.
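The reasoning in the first item above (no qsort() needed) can be sketched as follows. This is a standalone illustration under stated assumptions, not the assertion added to BTreeFormPostingTuple(): SketchTid and the sketch_* functions are hypothetical names. If each source run of heap TIDs is ascending, and the runs are appended in page order, the result is already ascending and a simple check is enough.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a heap TID: (block number, offset number) */
typedef struct SketchTid
{
    uint32_t    block;
    uint16_t    offset;
} SketchTid;

/* Compare by block first, then by offset */
static int
sketch_tid_compare(const SketchTid *a, const SketchTid *b)
{
    if (a->block != b->block)
        return (a->block < b->block) ? -1 : 1;
    if (a->offset != b->offset)
        return (a->offset < b->offset) ? -1 : 1;
    return 0;
}

/* True when the TIDs are strictly ascending, with no duplicates */
static bool
sketch_tids_ascending(const SketchTid *tids, int ntids)
{
    for (int i = 1; i < ntids; i++)
    {
        if (sketch_tid_compare(&tids[i - 1], &tids[i]) >= 0)
            return false;
    }
    return true;
}

/* Append src to dst without sorting; caller guarantees dst has room */
static int
sketch_append_tids(SketchTid *dst, int ndst, const SketchTid *src, int nsrc)
{
    for (int i = 0; i < nsrc; i++)
        dst[ndst + i] = src[i];
    return ndst + nsrc;
}
```

A caller would append the TIDs of each duplicate in page order and then assert sketch_tids_ascending() on the result, which costs one linear pass instead of a sort.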
It was easier to refactor the _bt_dedup_one_page() stuff by
temporarily making nbtsort.c not use it. I didn't want to delay
getting v16 to you, so I didn't take the time to fix-up nbtsort.c to
use the new stuff. It's actually using its own old copy of stuff that
it should get from nbtinsert.c in v16 -- it calls
_bt_dedup_item_tid_sort(), not the new _bt_dedup_save_htid() function.
I'll update it soon, though.
[1]: /messages/by-id/CAH2-Wzn3Ee49Gmxb7V1VJ3-AC8fWn-Fr8pfWQebHe8rYRxt5OQ@mail.gmail.com
--
Peter Geoghegan
Attachments:
v16-0001-Add-deduplication-to-nbtree.patch (application/octet-stream)
From 45931efca014c9550d06a208574d9e508c85800b Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Thu, 29 Aug 2019 14:35:35 -0700
Subject: [PATCH v16 1/2] Add deduplication to nbtree.
---
contrib/amcheck/verify_nbtree.c | 164 +++++-
src/backend/access/index/genam.c | 4 +
src/backend/access/nbtree/README | 74 ++-
src/backend/access/nbtree/nbtinsert.c | 741 +++++++++++++++++++++++-
src/backend/access/nbtree/nbtpage.c | 148 ++++-
src/backend/access/nbtree/nbtree.c | 128 +++-
src/backend/access/nbtree/nbtsearch.c | 243 +++++++-
src/backend/access/nbtree/nbtsort.c | 231 +++++++-
src/backend/access/nbtree/nbtsplitloc.c | 47 +-
src/backend/access/nbtree/nbtutils.c | 264 ++++++++-
src/backend/access/nbtree/nbtxlog.c | 249 +++++++-
src/backend/access/rmgrdesc/nbtdesc.c | 26 +-
src/include/access/nbtree.h | 281 ++++++++-
src/include/access/nbtxlog.h | 55 +-
src/tools/valgrind.supp | 21 +
15 files changed, 2505 insertions(+), 171 deletions(-)
diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index 05e7d678ed..83519cb7cf 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -145,6 +145,7 @@ static void bt_tuple_present_callback(Relation index, HeapTuple htup,
bool tupleIsAlive, void *checkstate);
static IndexTuple bt_normalize_tuple(BtreeCheckState *state,
IndexTuple itup);
+static inline IndexTuple bt_posting_logical_tuple(IndexTuple itup, int n);
static bool bt_rootdescend(BtreeCheckState *state, IndexTuple itup);
static inline bool offset_is_negative_infinity(BTPageOpaque opaque,
OffsetNumber offset);
@@ -419,12 +420,13 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
/*
* Size Bloom filter based on estimated number of tuples in index,
* while conservatively assuming that each block must contain at least
- * MaxIndexTuplesPerPage / 5 non-pivot tuples. (Non-leaf pages cannot
- * contain non-pivot tuples. That's okay because they generally make
- * up no more than about 1% of all pages in the index.)
+ * MaxPostingIndexTuplesPerPage / 3 "logical" tuples. heapallindexed
+ * verification fingerprints posting list heap TIDs as plain non-pivot
+ * tuples, complete with index keys. This allows its heap scan to
+ * behave as if posting lists do not exist.
*/
total_pages = RelationGetNumberOfBlocks(rel);
- total_elems = Max(total_pages * (MaxIndexTuplesPerPage / 5),
+ total_elems = Max(total_pages * (MaxPostingIndexTuplesPerPage / 3),
(int64) state->rel->rd_rel->reltuples);
/* Random seed relies on backend srandom() call to avoid repetition */
seed = random();
@@ -924,6 +926,7 @@ bt_target_page_check(BtreeCheckState *state)
size_t tupsize;
BTScanInsert skey;
bool lowersizelimit;
+ ItemPointer scantid;
CHECK_FOR_INTERRUPTS();
@@ -994,29 +997,73 @@ bt_target_page_check(BtreeCheckState *state)
/*
* Readonly callers may optionally verify that non-pivot tuples can
- * each be found by an independent search that starts from the root
+ * each be found by an independent search that starts from the root.
+ * Note that we deliberately don't do individual searches for each
+ * "logical" posting list tuple, since the posting list itself is
+ * validated by other checks.
*/
if (state->rootdescend && P_ISLEAF(topaque) &&
!bt_rootdescend(state, itup))
{
char *itid,
*htid;
+ ItemPointer tid = BTreeTupleGetHeapTID(itup);
itid = psprintf("(%u,%u)", state->targetblock, offset);
htid = psprintf("(%u,%u)",
- ItemPointerGetBlockNumber(&(itup->t_tid)),
- ItemPointerGetOffsetNumber(&(itup->t_tid)));
+ ItemPointerGetBlockNumber(tid),
+ ItemPointerGetOffsetNumber(tid));
ereport(ERROR,
(errcode(ERRCODE_INDEX_CORRUPTED),
errmsg("could not find tuple using search from root page in index \"%s\"",
RelationGetRelationName(state->rel)),
- errdetail_internal("Index tid=%s points to heap tid=%s page lsn=%X/%X.",
+ errdetail_internal("Index tid=%s min heap tid=%s page lsn=%X/%X.",
itid, htid,
(uint32) (state->targetlsn >> 32),
(uint32) state->targetlsn)));
}
+ /*
+ * If tuple is actually a posting list, make sure posting list TIDs
+ * are in order.
+ */
+ if (BTreeTupleIsPosting(itup))
+ {
+ ItemPointerData last;
+ ItemPointer current;
+
+ ItemPointerCopy(BTreeTupleGetHeapTID(itup), &last);
+
+ for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
+ {
+
+ current = BTreeTupleGetPostingN(itup, i);
+
+ if (ItemPointerCompare(current, &last) <= 0)
+ {
+ char *itid,
+ *htid;
+
+ itid = psprintf("(%u,%u)", state->targetblock, offset);
+ htid = psprintf("(%u,%u)",
+ ItemPointerGetBlockNumberNoCheck(current),
+ ItemPointerGetOffsetNumberNoCheck(current));
+
+ ereport(ERROR,
+ (errcode(ERRCODE_INDEX_CORRUPTED),
+ errmsg("posting list heap TIDs out of order in index \"%s\"",
+ RelationGetRelationName(state->rel)),
+ errdetail_internal("Index tid=%s min heap tid=%s page lsn=%X/%X.",
+ itid, htid,
+ (uint32) (state->targetlsn >> 32),
+ (uint32) state->targetlsn)));
+ }
+
+ ItemPointerCopy(current, &last);
+ }
+ }
+
/* Build insertion scankey for current page offset */
skey = bt_mkscankey_pivotsearch(state->rel, itup);
@@ -1074,12 +1121,32 @@ bt_target_page_check(BtreeCheckState *state)
{
IndexTuple norm;
- norm = bt_normalize_tuple(state, itup);
- bloom_add_element(state->filter, (unsigned char *) norm,
- IndexTupleSize(norm));
- /* Be tidy */
- if (norm != itup)
- pfree(norm);
+ if (BTreeTupleIsPosting(itup))
+ {
+ /* Fingerprint all elements as distinct "logical" tuples */
+ for (int i = 0; i < BTreeTupleGetNPosting(itup); i++)
+ {
+ IndexTuple logtuple;
+
+ logtuple = bt_posting_logical_tuple(itup, i);
+ norm = bt_normalize_tuple(state, logtuple);
+ bloom_add_element(state->filter, (unsigned char *) norm,
+ IndexTupleSize(norm));
+ /* Be tidy */
+ if (norm != logtuple)
+ pfree(norm);
+ pfree(logtuple);
+ }
+ }
+ else
+ {
+ norm = bt_normalize_tuple(state, itup);
+ bloom_add_element(state->filter, (unsigned char *) norm,
+ IndexTupleSize(norm));
+ /* Be tidy */
+ if (norm != itup)
+ pfree(norm);
+ }
}
/*
@@ -1087,7 +1154,8 @@ bt_target_page_check(BtreeCheckState *state)
*
* If there is a high key (if this is not the rightmost page on its
* entire level), check that high key actually is upper bound on all
- * page items.
+ * page items. If this is a posting list tuple, we'll need to set
+ * scantid to be highest TID in posting list.
*
* We prefer to check all items against high key rather than checking
* just the last and trusting that the operator class obeys the
@@ -1127,6 +1195,9 @@ bt_target_page_check(BtreeCheckState *state)
* tuple. (See also: "Notes About Data Representation" in the nbtree
* README.)
*/
+ scantid = skey->scantid;
+ if (state->heapkeyspace && !BTreeTupleIsPivot(itup))
+ skey->scantid = BTreeTupleGetMaxTID(itup);
if (!P_RIGHTMOST(topaque) &&
!(P_ISLEAF(topaque) ? invariant_leq_offset(state, skey, P_HIKEY) :
invariant_l_offset(state, skey, P_HIKEY)))
@@ -1150,6 +1221,7 @@ bt_target_page_check(BtreeCheckState *state)
(uint32) (state->targetlsn >> 32),
(uint32) state->targetlsn)));
}
+ skey->scantid = scantid;
/*
* * Item order check *
@@ -1164,11 +1236,13 @@ bt_target_page_check(BtreeCheckState *state)
*htid,
*nitid,
*nhtid;
+ ItemPointer tid;
itid = psprintf("(%u,%u)", state->targetblock, offset);
+ tid = BTreeTupleGetHeapTID(itup);
htid = psprintf("(%u,%u)",
- ItemPointerGetBlockNumberNoCheck(&(itup->t_tid)),
- ItemPointerGetOffsetNumberNoCheck(&(itup->t_tid)));
+ ItemPointerGetBlockNumberNoCheck(tid),
+ ItemPointerGetOffsetNumberNoCheck(tid));
nitid = psprintf("(%u,%u)", state->targetblock,
OffsetNumberNext(offset));
@@ -1177,9 +1251,11 @@ bt_target_page_check(BtreeCheckState *state)
state->target,
OffsetNumberNext(offset));
itup = (IndexTuple) PageGetItem(state->target, itemid);
+
+ tid = BTreeTupleGetHeapTID(itup);
nhtid = psprintf("(%u,%u)",
- ItemPointerGetBlockNumberNoCheck(&(itup->t_tid)),
- ItemPointerGetOffsetNumberNoCheck(&(itup->t_tid)));
+ ItemPointerGetBlockNumberNoCheck(tid),
+ ItemPointerGetOffsetNumberNoCheck(tid));
ereport(ERROR,
(errcode(ERRCODE_INDEX_CORRUPTED),
@@ -1189,10 +1265,10 @@ bt_target_page_check(BtreeCheckState *state)
"higher index tid=%s (points to %s tid=%s) "
"page lsn=%X/%X.",
itid,
- P_ISLEAF(topaque) ? "heap" : "index",
+ P_ISLEAF(topaque) ? "min heap" : "index",
htid,
nitid,
- P_ISLEAF(topaque) ? "heap" : "index",
+ P_ISLEAF(topaque) ? "min heap" : "index",
nhtid,
(uint32) (state->targetlsn >> 32),
(uint32) state->targetlsn)));
@@ -1953,10 +2029,10 @@ bt_tuple_present_callback(Relation index, HeapTuple htup, Datum *values,
* verification. In particular, it won't try to normalize opclass-equal
* datums with potentially distinct representations (e.g., btree/numeric_ops
* index datums will not get their display scale normalized-away here).
- * Normalization may need to be expanded to handle more cases in the future,
- * though. For example, it's possible that non-pivot tuples could in the
- * future have alternative logically equivalent representations due to using
- * the INDEX_ALT_TID_MASK bit to implement intelligent deduplication.
+ * Caller does normalization for non-pivot tuples that have a posting list,
+ * since dummy CREATE INDEX callback code generates new tuples with the same
+ * normalized representation. Deduplication is performed opportunistically,
+ * and in general there is no guarantee about how or when it will be applied.
*/
static IndexTuple
bt_normalize_tuple(BtreeCheckState *state, IndexTuple itup)
@@ -1969,6 +2045,9 @@ bt_normalize_tuple(BtreeCheckState *state, IndexTuple itup)
IndexTuple reformed;
int i;
+ /* Caller should only pass "logical" non-pivot tuples here */
+ Assert(!BTreeTupleIsPosting(itup) && !BTreeTupleIsPivot(itup));
+
/* Easy case: It's immediately clear that tuple has no varlena datums */
if (!IndexTupleHasVarwidths(itup))
return itup;
@@ -2031,6 +2110,30 @@ bt_normalize_tuple(BtreeCheckState *state, IndexTuple itup)
return reformed;
}
+/*
+ * Produce palloc()'d "logical" tuple for nth posting list entry.
+ *
+ * In general, deduplication is not supposed to change the logical contents of
+ * an index. Multiple logical index tuples are folded together into one
+ * physical posting list index tuple when convenient.
+ *
+ * heapallindexed verification must normalize-away this variation in
+ * representation by converting posting list tuples into two or more "logical"
+ * tuples. Each logical tuple must be fingerprinted separately -- there must
+ * be one logical tuple for each corresponding Bloom filter probe during the
+ * heap scan.
+ *
+ * Note: Caller needs to call bt_normalize_tuple() with returned tuple.
+ */
+static inline IndexTuple
+bt_posting_logical_tuple(IndexTuple itup, int n)
+{
+ Assert(BTreeTupleIsPosting(itup));
+
+ /* Returns non-posting-list tuple */
+ return BTreeFormPostingTuple(itup, BTreeTupleGetPostingN(itup, n), 1);
+}
+
/*
* Search for itup in index, starting from fast root page. itup must be a
* non-pivot tuple. This is only supported with heapkeyspace indexes, since
@@ -2087,6 +2190,7 @@ bt_rootdescend(BtreeCheckState *state, IndexTuple itup)
insertstate.itup = itup;
insertstate.itemsz = MAXALIGN(IndexTupleSize(itup));
insertstate.itup_key = key;
+ insertstate.in_posting_offset = 0;
insertstate.bounds_valid = false;
insertstate.buf = lbuf;
@@ -2094,7 +2198,9 @@ bt_rootdescend(BtreeCheckState *state, IndexTuple itup)
offnum = _bt_binsrch_insert(state->rel, &insertstate);
/* Compare first >= matching item on leaf page, if any */
page = BufferGetPage(lbuf);
+ /* Should match on first heap TID when tuple has a posting list */
if (offnum <= PageGetMaxOffsetNumber(page) &&
+ insertstate.in_posting_offset <= 0 &&
_bt_compare(state->rel, key, page, offnum) == 0)
exists = true;
_bt_relbuf(state->rel, lbuf);
@@ -2560,14 +2666,18 @@ static inline ItemPointer
BTreeTupleGetHeapTIDCareful(BtreeCheckState *state, IndexTuple itup,
bool nonpivot)
{
- ItemPointer result = BTreeTupleGetHeapTID(itup);
+ ItemPointer result;
BlockNumber targetblock = state->targetblock;
- if (result == NULL && nonpivot)
+ /* Shouldn't be called with !heapkeyspace index */
+ Assert(state->heapkeyspace);
+ if (BTreeTupleIsPivot(itup) == nonpivot)
ereport(ERROR,
(errcode(ERRCODE_INDEX_CORRUPTED),
errmsg("block %u or its right sibling block or child block in index \"%s\" contains non-pivot tuple that lacks a heap TID",
targetblock, RelationGetRelationName(state->rel))));
+ result = BTreeTupleGetHeapTID(itup);
+
return result;
}
diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
index 2599b5d342..6e1dc596e1 100644
--- a/src/backend/access/index/genam.c
+++ b/src/backend/access/index/genam.c
@@ -276,6 +276,10 @@ BuildIndexValueDescription(Relation indexRelation,
/*
* Get the latestRemovedXid from the table entries pointed at by the index
* tuples being deleted.
+ *
+ * Note: index access methods that don't consistently use the standard
+ * IndexTuple + heap TID item pointer representation will need to provide
+ * their own version of this function.
*/
TransactionId
index_compute_xid_horizon_for_tuples(Relation irel,
diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index 6db203e75c..54cb9db49d 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -432,7 +432,10 @@ because we allow LP_DEAD to be set with only a share lock (it's exactly
like a hint bit for a heap tuple), but physically removing tuples requires
exclusive lock. In the current code we try to remove LP_DEAD tuples when
we are otherwise faced with having to split a page to do an insertion (and
-hence have exclusive lock on it already).
+hence have exclusive lock on it already). Deduplication can also prevent
+a page split, but removing LP_DEAD tuples is the preferred approach.
+(Note that posting list tuples can only have their LP_DEAD bit set when
+every "logical" tuple represented within the posting list is known dead.)
This leaves the index in a state where it has no entry for a dead tuple
that still exists in the heap. This is not a problem for the current
@@ -710,6 +713,75 @@ the fallback strategy assumes that duplicates are mostly inserted in
ascending heap TID order. The page is split in a way that leaves the left
half of the page mostly full, and the right half of the page mostly empty.
+Notes about deduplication
+-------------------------
+
+We deduplicate non-pivot tuples in non-unique indexes to reduce storage
+overhead, and to avoid or at least delay page splits. Deduplication alters
+the physical representation of tuples without changing the logical contents
+of the index, and without adding overhead to read queries. Non-pivot
+tuples are folded together into a single physical tuple with a posting list
+(a simple array of heap TIDs with the standard item pointer format).
+Deduplication is always applied lazily, at the point where it would
+otherwise be necessary to perform a page split. It occurs only when
+LP_DEAD items have been removed, as our last line of defense against
+splitting a leaf page. We can set the LP_DEAD bit with posting list
+tuples, though only when all table tuples are known dead. (Bitmap scans
+cannot perform LP_DEAD bit setting, and are the common case with indexes
+that contain lots of duplicates, so this downside is considered
+acceptable.)
+
+Large groups of logical duplicates tend to appear together on the same leaf
+page due to the special duplicate logic used when choosing a split point.
+This facilitates lazy/dynamic deduplication. Deduplication can reliably
+deduplicate a large localized group of duplicates before it can span
+multiple leaf pages. Posting list tuples are subject to the same 1/3 of a
+page restriction as any other tuple.
+
+Lazy deduplication allows the page space accounting used during page splits
+to have absolutely minimal special case logic for posting lists. A posting
+list can be thought of as extra payload that suffix truncation will
+reliably truncate away as needed during page splits, just like non-key
+columns from an INCLUDE index tuple. An incoming tuple (which might cause
+a page split) can always be thought of as a non-posting-list tuple that
+must be inserted alongside existing items, without needing to consider
+deduplication. Most of the time, that's what actually happens: incoming
+tuples are either not duplicates, or are duplicates with a heap TID that
+doesn't overlap with any existing posting list tuple. When the incoming
+tuple really does overlap with an existing posting list, a posting list
+split is performed. Posting list splits work in a way that more or less
+preserves the illusion that all incoming tuples do not need to be merged
+with any existing posting list tuple.
+
+Posting list splits work by "overriding" the details of the incoming tuple.
+The heap TID of the incoming tuple is altered to make it match the
+rightmost heap TID from the existing/originally overlapping posting list.
+The offset number that the new/incoming tuple is to be inserted at is
+incremented so that it will be inserted to the right of the existing
+posting list. The insertion (or page split) operation that completes the
+insert does one extra step: an in-place update of the posting list. The
+update changes the posting list such that the "true" heap TID from the
+original incoming tuple is now contained in the posting list. We make
+space in the posting list by removing the heap TID that became the new
+item. The size of the posting list won't change, and so the page split
+space accounting does not need to care about posting lists. Also, overall
+space utilization is improved by keeping existing posting lists large.
+
+The representation of posting lists is identical to the posting lists used
+by GIN, so it would be straightforward to apply GIN's varbyte encoding
+compression scheme to individual posting lists. Posting list compression
+would break the assumptions made by posting list splits about page space
+accounting, though, so it's not clear how compression could be integrated
+with nbtree. Besides, posting list compression does not offer a compelling
+trade-off for nbtree, since in general nbtree is optimized for consistent
+performance with many concurrent readers and writers. A major goal of
+nbtree's lazy approach to deduplication is to limit the performance impact
+of deduplication with random updates. Even concurrent append-only inserts
+of the same key value will tend to have inserts of individual index tuples
+in an order that doesn't quite match heap TID order. In general, delaying
+deduplication avoids many unnecessary posting list splits, and minimizes
+page level fragmentation.
+
Notes About Data Representation
-------------------------------
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index b84bf1c3df..710c8d5cd5 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -47,21 +47,26 @@ static void _bt_insertonpg(Relation rel, BTScanInsert itup_key,
BTStack stack,
IndexTuple itup,
OffsetNumber newitemoff,
+ int in_posting_offset,
bool split_only_page);
static Buffer _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf,
Buffer cbuf, OffsetNumber newitemoff, Size newitemsz,
- IndexTuple newitem);
+ IndexTuple newitem, IndexTuple original_newitem,
+ IndexTuple nposting, OffsetNumber in_posting_offset);
static void _bt_insert_parent(Relation rel, Buffer buf, Buffer rbuf,
BTStack stack, bool is_root, bool is_only);
static bool _bt_pgaddtup(Page page, Size itemsize, IndexTuple itup,
OffsetNumber itup_off);
static void _bt_vacuum_one_page(Relation rel, Buffer buffer, Relation heapRel);
+static void _bt_dedup_one_page(Relation rel, Buffer buffer, Relation heapRel,
+ Size newitemsz);
/*
* _bt_doinsert() -- Handle insertion of a single index tuple in the tree.
*
* This routine is called by the public interface routine, btinsert.
- * By here, itup is filled in, including the TID.
+ * By here, itup is filled in, including the TID. Caller should be
+ * prepared for us to scribble on 'itup'.
*
* If checkUnique is UNIQUE_CHECK_NO or UNIQUE_CHECK_PARTIAL, this
* will allow duplicates. Otherwise (UNIQUE_CHECK_YES or
@@ -123,6 +128,7 @@ _bt_doinsert(Relation rel, IndexTuple itup,
/* PageAddItem will MAXALIGN(), but be consistent */
insertstate.itemsz = MAXALIGN(IndexTupleSize(itup));
insertstate.itup_key = itup_key;
+ insertstate.in_posting_offset = 0;
insertstate.bounds_valid = false;
insertstate.buf = InvalidBuffer;
@@ -300,7 +306,7 @@ top:
newitemoff = _bt_findinsertloc(rel, &insertstate, checkingunique,
stack, heapRel);
_bt_insertonpg(rel, itup_key, insertstate.buf, InvalidBuffer, stack,
- itup, newitemoff, false);
+ itup, newitemoff, insertstate.in_posting_offset, false);
}
else
{
@@ -435,6 +441,7 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
/* okay, we gotta fetch the heap tuple ... */
curitup = (IndexTuple) PageGetItem(page, curitemid);
+ Assert(!BTreeTupleIsPosting(curitup));
htid = curitup->t_tid;
/*
@@ -689,6 +696,7 @@ _bt_findinsertloc(Relation rel,
BTScanInsert itup_key = insertstate->itup_key;
Page page = BufferGetPage(insertstate->buf);
BTPageOpaque lpageop;
+ OffsetNumber location;
lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -751,13 +759,23 @@ _bt_findinsertloc(Relation rel,
/*
* If the target page is full, see if we can obtain enough space by
- * erasing LP_DEAD items
+ * erasing LP_DEAD items. If that doesn't work out, and if the index
+ * isn't a unique index, try deduplication.
*/
- if (PageGetFreeSpace(page) < insertstate->itemsz &&
- P_HAS_GARBAGE(lpageop))
+ if (PageGetFreeSpace(page) < insertstate->itemsz)
{
- _bt_vacuum_one_page(rel, insertstate->buf, heapRel);
- insertstate->bounds_valid = false;
+ if (P_HAS_GARBAGE(lpageop))
+ {
+ _bt_vacuum_one_page(rel, insertstate->buf, heapRel);
+ insertstate->bounds_valid = false;
+ }
+
+ if (!checkingunique && PageGetFreeSpace(page) < insertstate->itemsz)
+ {
+ _bt_dedup_one_page(rel, insertstate->buf, heapRel,
+ insertstate->itemsz);
+ insertstate->bounds_valid = false; /* paranoia */
+ }
}
}
else
@@ -839,7 +857,31 @@ _bt_findinsertloc(Relation rel,
Assert(P_RIGHTMOST(lpageop) ||
_bt_compare(rel, itup_key, page, P_HIKEY) <= 0);
- return _bt_binsrch_insert(rel, insertstate);
+ location = _bt_binsrch_insert(rel, insertstate);
+
+ /*
+ * Insertion is not prepared for the case where an LP_DEAD posting list
+ * tuple must be split. In the unlikely event that this happens, call
+ * _bt_dedup_one_page() to force it to kill all LP_DEAD items.
+ */
+ if (unlikely(insertstate->in_posting_offset == -1))
+ {
+ _bt_dedup_one_page(rel, insertstate->buf, heapRel, 0);
+ Assert(!P_HAS_GARBAGE(lpageop));
+
+ /* Must reset insertstate ahead of new _bt_binsrch_insert() call */
+ insertstate->bounds_valid = false;
+ insertstate->in_posting_offset = 0;
+ location = _bt_binsrch_insert(rel, insertstate);
+
+ /*
+ * Might still have to split some other posting list now, but that
+ * should never be LP_DEAD
+ */
+ Assert(insertstate->in_posting_offset >= 0);
+ }
+
+ return location;
}
/*
@@ -900,15 +942,74 @@ _bt_stepright(Relation rel, BTInsertState insertstate, BTStack stack)
insertstate->bounds_valid = false;
}
+/*
+ * Form a new posting list during a posting split.
+ *
+ * If caller determines that its new tuple 'newitem' is a duplicate with a heap
+ * TID that falls inside the range of an existing posting list tuple
+ * 'oposting', it must generate a new posting tuple to replace the original.
+ * It must also change newitem to have the heap TID of the rightmost TID in
+ * the original posting list.
+ *
+ * Note that the WAL-logging considerations for posting list splits are
+ * complicated by the need to WAL-log the original newitem passed here instead
+ * of the effective/final newitem actually inserted on the page. This routine
+ * is also used during recovery, so that we never have to WAL-log the posting
+ * list returned here, which is often much larger than the typical newitem.
+ */
+IndexTuple
+_bt_posting_split(IndexTuple newitem, IndexTuple oposting,
+ OffsetNumber in_posting_offset)
+{
+ int nhtids;
+ char *replacepos;
+ char *rightpos;
+ Size nbytes;
+ IndexTuple nposting;
+
+ Assert(BTreeTupleIsPosting(oposting));
+ nhtids = BTreeTupleGetNPosting(oposting);
+ Assert(in_posting_offset < nhtids);
+
+ nposting = CopyIndexTuple(oposting);
+ replacepos = (char *) BTreeTupleGetPostingN(nposting, in_posting_offset);
+ rightpos = replacepos + sizeof(ItemPointerData);
+ nbytes = (nhtids - in_posting_offset - 1) * sizeof(ItemPointerData);
+
+ /*
+ * Move item pointers in posting list to make a gap for the new item's
+ * heap TID (shift TIDs one place to the right, losing original rightmost
+ * TID).
+ */
+ memmove(rightpos, replacepos, nbytes);
+
+ /*
+ * Fill the gap with the TID of the new item.
+ */
+ ItemPointerCopy(&newitem->t_tid, (ItemPointer) replacepos);
+
+ /*
+ * Copy the original posting list's last TID (taken from oposting, not from
+ * nposting) into new item
+ */
+ ItemPointerCopy(BTreeTupleGetPostingN(oposting, nhtids - 1),
+ &newitem->t_tid);
+ Assert(ItemPointerCompare(BTreeTupleGetMaxTID(nposting),
+ BTreeTupleGetHeapTID(newitem)) < 0);
+
+ return nposting;
+}
+
/*----------
* _bt_insertonpg() -- Insert a tuple on a particular page in the index.
*
* This recursive procedure does the following things:
*
+ * + if necessary, splits an existing posting list on page.
+ * This is only needed when 'in_posting_offset' is non-zero.
* + if necessary, splits the target page, using 'itup_key' for
* suffix truncation on leaf pages (caller passes NULL for
* non-leaf pages).
- * + inserts the tuple.
+ * + inserts the new tuple (could be from split posting list).
* + if the page was split, pops the parent stack, and finds the
* right place to insert the new child pointer (by walking
* right using information stored in the parent stack).
@@ -918,7 +1019,8 @@ _bt_stepright(Relation rel, BTInsertState insertstate, BTStack stack)
*
* On entry, we must have the correct buffer in which to do the
* insertion, and the buffer must be pinned and write-locked. On return,
- * we will have dropped both the pin and the lock on the buffer.
+ * we will have dropped both the pin and the lock on the buffer. Caller
+ * should be prepared for us to scribble on 'itup'.
*
* This routine only performs retail tuple insertions. 'itup' should
* always be either a non-highkey leaf item, or a downlink (new high
@@ -936,11 +1038,15 @@ _bt_insertonpg(Relation rel,
BTStack stack,
IndexTuple itup,
OffsetNumber newitemoff,
+ int in_posting_offset,
bool split_only_page)
{
Page page;
BTPageOpaque lpageop;
Size itemsz;
+ IndexTuple nposting = NULL;
+ IndexTuple oposting;
+ IndexTuple original_itup = NULL;
page = BufferGetPage(buf);
lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -954,6 +1060,8 @@ _bt_insertonpg(Relation rel,
Assert(P_ISLEAF(lpageop) ||
BTreeTupleGetNAtts(itup, rel) <=
IndexRelationGetNumberOfKeyAttributes(rel));
+ /* retail insertions of posting list tuples are disallowed */
+ Assert(!BTreeTupleIsPosting(itup));
/* The caller should've finished any incomplete splits already. */
if (P_INCOMPLETE_SPLIT(lpageop))
@@ -964,6 +1072,46 @@ _bt_insertonpg(Relation rel,
itemsz = MAXALIGN(itemsz); /* be safe, PageAddItem will do this but we
* need to be consistent */
+ /*
+ * Do we need to split an existing posting list item?
+ */
+ if (in_posting_offset != 0)
+ {
+ ItemId itemid = PageGetItemId(page, newitemoff);
+
+ /*
+ * The new tuple is a duplicate with a heap TID that falls inside the
+ * range of an existing posting list tuple, so split posting list.
+ *
+ * Posting list splits always replace some existing TID in the posting
+ * list with the new item's heap TID (based on a posting list offset
+ * from caller) by removing rightmost heap TID from posting list. The
+ * new item's heap TID is swapped with that rightmost heap TID, almost
+ * as if the tuple inserted never overlapped with a posting list in
+ * the first place. This allows the insertion and page split code to
+ * have minimal special case handling of posting lists.
+ *
+ * The only extra handling required is to overwrite the original
+ * posting list with nposting, which is guaranteed to be the same size
+ * as the original, keeping the page space accounting simple. This
+ * takes place in either the page insert or page split critical
+ * section.
+ */
+ Assert(P_ISLEAF(lpageop));
+ Assert(!ItemIdIsDead(itemid));
+ Assert(in_posting_offset > 0);
+ oposting = (IndexTuple) PageGetItem(page, itemid);
+
+ /* save a copy of itup with unchanged TID to write it into xlog record */
+ original_itup = CopyIndexTuple(itup);
+ nposting = _bt_posting_split(itup, oposting, in_posting_offset);
+
+ Assert(BTreeTupleGetNPosting(nposting) ==
+ BTreeTupleGetNPosting(oposting));
+ /* Alter new item offset, since effective new item changed */
+ newitemoff = OffsetNumberNext(newitemoff);
+ }
+
/*
* Do we need to split the page to fit the item on it?
*
@@ -996,7 +1144,8 @@ _bt_insertonpg(Relation rel,
BlockNumberIsValid(RelationGetTargetBlock(rel))));
/* split the buffer into left and right halves */
- rbuf = _bt_split(rel, itup_key, buf, cbuf, newitemoff, itemsz, itup);
+ rbuf = _bt_split(rel, itup_key, buf, cbuf, newitemoff, itemsz, itup,
+ original_itup, nposting, in_posting_offset);
PredicateLockPageSplit(rel,
BufferGetBlockNumber(buf),
BufferGetBlockNumber(rbuf));
@@ -1075,6 +1224,18 @@ _bt_insertonpg(Relation rel,
elog(PANIC, "failed to add new item to block %u in index \"%s\"",
itup_blkno, RelationGetRelationName(rel));
+ if (nposting)
+ {
+ /*
+ * Posting list split requires an in-place update of the existing
+ * posting list
+ */
+ Assert(P_ISLEAF(lpageop));
+ Assert(MAXALIGN(IndexTupleSize(oposting)) ==
+ MAXALIGN(IndexTupleSize(nposting)));
+ memcpy(oposting, nposting, MAXALIGN(IndexTupleSize(nposting)));
+ }
+
MarkBufferDirty(buf);
if (BufferIsValid(metabuf))
@@ -1116,6 +1277,7 @@ _bt_insertonpg(Relation rel,
XLogRecPtr recptr;
xlrec.offnum = itup_off;
+ xlrec.in_posting_offset = in_posting_offset;
XLogBeginInsert();
XLogRegisterData((char *) &xlrec, SizeOfBtreeInsert);
@@ -1152,7 +1314,19 @@ _bt_insertonpg(Relation rel,
}
XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
- XLogRegisterBufData(0, (char *) itup, IndexTupleSize(itup));
+
+ /*
+ * We always write newitem to the page, but when there is an
+ * original newitem due to a posting list split then we log the
+ * original item instead. REDO routine must reconstruct the final
+ * newitem at the same time it reconstructs nposting.
+ */
+ if (!original_itup)
+ XLogRegisterBufData(0, (char *) itup,
+ IndexTupleSize(itup));
+ else
+ XLogRegisterBufData(0, (char *) original_itup,
+ IndexTupleSize(original_itup));
recptr = XLogInsert(RM_BTREE_ID, xlinfo);
@@ -1194,6 +1368,13 @@ _bt_insertonpg(Relation rel,
_bt_getrootheight(rel) >= BTREE_FASTPATH_MIN_LEVEL)
RelationSetTargetBlock(rel, cachedBlock);
}
+
+ /* be tidy */
+ if (nposting)
+ pfree(nposting);
+ if (original_itup)
+ pfree(original_itup);
+
}
/*
@@ -1211,10 +1392,19 @@ _bt_insertonpg(Relation rel,
*
* Returns the new right sibling of buf, pinned and write-locked.
* The pin and lock on buf are maintained.
+ *
+ * original_newitem, nposting, and in_posting_offset are needed for
+ * posting list splits that happen to result in a page split.
+ * nposting is a replacement tuple for the posting list tuple at the
+ * offset immediately before the new item's offset. This is needed
+ * when caller performed "posting list split", and corresponds to the
+ * same step for retail insertions that don't split the page.
*/
static Buffer
_bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
- OffsetNumber newitemoff, Size newitemsz, IndexTuple newitem)
+ OffsetNumber newitemoff, Size newitemsz, IndexTuple newitem,
+ IndexTuple original_newitem, IndexTuple nposting,
+ OffsetNumber in_posting_offset)
{
Buffer rbuf;
Page origpage;
@@ -1236,12 +1426,20 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
OffsetNumber firstright;
OffsetNumber maxoff;
OffsetNumber i;
+ OffsetNumber replacepostingoff = InvalidOffsetNumber;
bool newitemonleft,
isleaf;
IndexTuple lefthikey;
int indnatts = IndexRelationGetNumberOfAttributes(rel);
int indnkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
+ /*
+ * Determine offset number of posting list that will be updated in place
+ * as part of split that follows a posting list split
+ */
+ if (nposting != NULL)
+ replacepostingoff = OffsetNumberPrev(newitemoff);
+
/*
* origpage is the original page to be split. leftpage is a temporary
* buffer that receives the left-sibling data, which will be copied back
@@ -1273,6 +1471,13 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
* newitemoff == firstright. In all other cases it's clear which side of
* the split every tuple goes on from context. newitemonleft is usually
* (but not always) redundant information.
+ *
+ * Note: In theory, the split point choice logic should operate against a
+ * version of the page that already replaced the posting list at offset
+ * replacepostingoff with nposting where applicable. We don't bother with
+ * that, though. Both versions of the posting list must be the same size
+ * and have the same key values, so this omission can't affect the split
+ * point chosen in practice.
*/
firstright = _bt_findsplitloc(rel, origpage, newitemoff, newitemsz,
newitem, &newitemonleft);
@@ -1340,6 +1545,9 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
itemid = PageGetItemId(origpage, firstright);
itemsz = ItemIdGetLength(itemid);
item = (IndexTuple) PageGetItem(origpage, itemid);
+ /* Behave as if origpage posting list has already been swapped */
+ if (firstright == replacepostingoff)
+ item = nposting;
}
/*
@@ -1373,6 +1581,9 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
Assert(lastleftoff >= P_FIRSTDATAKEY(oopaque));
itemid = PageGetItemId(origpage, lastleftoff);
lastleft = (IndexTuple) PageGetItem(origpage, itemid);
+ /* Behave as if origpage posting list has already been swapped */
+ if (lastleftoff == replacepostingoff)
+ lastleft = nposting;
}
Assert(lastleft != item);
@@ -1480,8 +1691,23 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
itemsz = ItemIdGetLength(itemid);
item = (IndexTuple) PageGetItem(origpage, itemid);
+ /*
+ * did caller pass new replacement posting list tuple due to posting
+ * list split?
+ */
+ if (i == replacepostingoff)
+ {
+ /*
+ * swap origpage posting list with post-posting-list-split version
+ * from caller
+ */
+ Assert(isleaf);
+ Assert(itemsz == MAXALIGN(IndexTupleSize(nposting)));
+ item = nposting;
+ }
+
/* does new item belong before this one? */
- if (i == newitemoff)
+ else if (i == newitemoff)
{
if (newitemonleft)
{
@@ -1653,6 +1879,29 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
xlrec.firstright = firstright;
xlrec.newitemoff = newitemoff;
+ /*
+ * If the replacement posting list (and final newitem) go on the right
+ * page then we don't need to explicitly WAL log it for the same
+ * reason we don't log any kind of newitem when it goes on the right
+ * page: it's included with all the other items on the right page
+ * already.
+ *
+ * Otherwise, we set in_posting_offset in WAL record, and explicitly
+ * log the original newitem (not the effective newitem). This allows
+ * REDO to reconstruct nposting by following essentially the same
+ * procedure as our caller used.
+ *
+ * Note: It's possible that our split point makes the posting list
+ * lastleft, and the rewritten newitem firstright. That's okay, since
+ * we'll log the original newitem either way. (Only the _final_
+ * version of newitem is available to REDO as the first data item from
+ * left page in this case, so explicitly logging the original newitem
+ * only occurs when strictly necessary.)
+ */
+ xlrec.in_posting_offset = InvalidOffsetNumber;
+ if (replacepostingoff < firstright)
+ xlrec.in_posting_offset = in_posting_offset;
+
XLogBeginInsert();
XLogRegisterData((char *) &xlrec, SizeOfBtreeSplit);
@@ -1673,8 +1922,29 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
* the left page. We store the offset anyway, though, to support
* archive compression of these records.
*/
- if (newitemonleft)
- XLogRegisterBufData(0, (char *) newitem, MAXALIGN(newitemsz));
+ if (newitemonleft || xlrec.in_posting_offset != InvalidOffsetNumber)
+ {
+ if (xlrec.in_posting_offset == InvalidOffsetNumber)
+ {
+ /* simple, common case -- must WAL-log ordinary newitem */
+ Assert(newitemonleft);
+ Assert(nposting == NULL);
+ XLogRegisterBufData(0, (char *) newitem, MAXALIGN(newitemsz));
+ }
+ else
+ {
+ /*
+ * REDO must reconstruct effective/final new item from
+ * original newitem, while updating existing posting list
+ * tuple that was split in place. Log the original new item
+ * instead of the final new item.
+ */
+ Assert(ItemPointerCompare(&original_newitem->t_tid,
+ &newitem->t_tid) != 0);
+ XLogRegisterBufData(0, (char *) original_newitem,
+ MAXALIGN(IndexTupleSize(original_newitem)));
+ }
+ }
/* Log the left page's new high key */
itemid = PageGetItemId(origpage, P_HIKEY);
@@ -1834,7 +2104,7 @@ _bt_insert_parent(Relation rel,
/* Recursively insert into the parent */
_bt_insertonpg(rel, NULL, pbuf, buf, stack->bts_parent,
- new_item, stack->bts_offset + 1,
+ new_item, stack->bts_offset + 1, 0,
is_only);
/* be tidy */
@@ -2304,6 +2574,439 @@ _bt_vacuum_one_page(Relation rel, Buffer buffer, Relation heapRel)
* Note: if we didn't find any LP_DEAD items, then the page's
* BTP_HAS_GARBAGE hint bit is falsely set. We do not bother expending a
* separate write to clear it, however. We will clear it when we split
- * the page.
+ * the page (or when deduplication runs).
*/
}
+
+/*
+ * Try to deduplicate items to free some space. If we don't proceed with
+ * deduplication, buffer will contain old state of the page.
+ *
+ * 'newitemsz' is the size of the inserter caller's incoming/new tuple, not
+ * including line pointer overhead. This is the amount of space we'll need to
+ * free in order to let caller avoid splitting the page.
+ *
+ * This function should be called after LP_DEAD items were removed by
+ * _bt_vacuum_one_page() to prevent a page split. (It's possible that we'll
+ * have to kill additional LP_DEAD items, but that should be rare.)
+ */
+static void
+_bt_dedup_one_page(Relation rel, Buffer buffer, Relation heapRel,
+ Size newitemsz)
+{
+ OffsetNumber offnum,
+ minoff,
+ maxoff;
+ Page page = BufferGetPage(buffer);
+ Page newpage;
+ BTPageOpaque oopaque,
+ nopaque;
+ bool deduplicate = false;
+ BTDedupState *state = NULL;
+ int natts = IndexRelationGetNumberOfAttributes(rel);
+ OffsetNumber deletable[MaxIndexTuplesPerPage];
+ int ndeletable = 0;
+ Size pagesaving = 0;
+
+ /*
+ * Don't use deduplication for indexes with INCLUDEd columns and unique
+ * indexes
+ */
+ deduplicate = (IndexRelationGetNumberOfKeyAttributes(rel) ==
+ IndexRelationGetNumberOfAttributes(rel) &&
+ !rel->rd_index->indisunique);
+ if (!deduplicate)
+ return;
+
+ oopaque = (BTPageOpaque) PageGetSpecialPointer(page);
+ /* init deduplication state needed to build posting tuples */
+ state = (BTDedupState *) palloc(sizeof(BTDedupState));
+
+ /* Convenience variables concerning generic limits */
+ state->maxitemsize = BTMaxItemSize(page);
+ state->maxpostingsize = 0;
+ /* Metadata about current pending posting list */
+ state->htids = NULL;
+ state->nhtids = 0;
+ state->nitems = 0;
+ state->alltupsize = 0;
+ /* Metadata about base tuple of current pending posting list */
+ state->base = NULL;
+ state->base_off = InvalidOffsetNumber;
+ state->basetupsize = 0;
+ /* Finally, n_intervals should be initialized to zero */
+ state->n_intervals = 0;
+
+ minoff = P_FIRSTDATAKEY(oopaque);
+ maxoff = PageGetMaxOffsetNumber(page);
+
+ /*
+ * Delete dead tuples if any. We cannot simply skip them in the loop
+ * below, because it's necessary to generate a special WAL record
+ * containing such tuples to compute latestRemovedXid on a standby server
+ * later.
+ *
+ * This should not affect performance, since it only can happen in a rare
+ * situation when BTP_HAS_GARBAGE flag was not set and _bt_vacuum_one_page
+ * was not called, or _bt_vacuum_one_page didn't remove all dead items.
+ */
+ for (offnum = minoff;
+ offnum <= maxoff;
+ offnum = OffsetNumberNext(offnum))
+ {
+ ItemId itemid = PageGetItemId(page, offnum);
+
+ if (ItemIdIsDead(itemid))
+ deletable[ndeletable++] = offnum;
+ }
+
+ if (ndeletable > 0)
+ {
+ /*
+ * Skip deduplication in rare cases where there were LP_DEAD items
+ * encountered here when that frees sufficient space for caller to
+ * avoid a page split
+ */
+ _bt_delitems_delete(rel, buffer, deletable, ndeletable, heapRel);
+ if (PageGetFreeSpace(page) >= newitemsz)
+ {
+ pfree(state);
+ return;
+ }
+
+ /* Continue with deduplication */
+ minoff = P_FIRSTDATAKEY(oopaque);
+ maxoff = PageGetMaxOffsetNumber(page);
+ }
+
+ /*
+ * Scan over all items to see which ones can be deduplicated
+ */
+ newpage = PageGetTempPageCopySpecial(page);
+ nopaque = (BTPageOpaque) PageGetSpecialPointer(newpage);
+
+ /*
+ * Copy the original page's LSN into newpage, which will become the
+ * updated version of the page. We need this because XLogInsert will
+ * examine the LSN and possibly dump it in a page image.
+ */
+ PageSetLSN(newpage, PageGetLSN(page));
+
+ /* Make sure that new page won't have garbage flag set */
+ nopaque->btpo_flags &= ~BTP_HAS_GARBAGE;
+
+ /* Copy High Key if any */
+ if (!P_RIGHTMOST(oopaque))
+ {
+ ItemId hitemid = PageGetItemId(page, P_HIKEY);
+ Size hitemsz = ItemIdGetLength(hitemid);
+ IndexTuple hitem = (IndexTuple) PageGetItem(page, hitemid);
+
+ if (PageAddItem(newpage, (Item) hitem, hitemsz, P_HIKEY,
+ false, false) == InvalidOffsetNumber)
+ elog(ERROR, "failed to add highkey during deduplication");
+ }
+
+ /* Passed-in newitemsz is MAXALIGNED but does not include line pointer */
+ newitemsz += sizeof(ItemIdData);
+ /* Conservatively size array */
+ state->htids = palloc(state->maxitemsize);
+
+ /*
+ * Iterate over tuples on the page, try to deduplicate them into posting
+ * lists and insert into new page.
+ */
+ for (offnum = minoff;
+ offnum <= maxoff;
+ offnum = OffsetNumberNext(offnum))
+ {
+ ItemId itemid = PageGetItemId(page, offnum);
+ IndexTuple itup = (IndexTuple) PageGetItem(page, itemid);
+
+ Assert(!ItemIdIsDead(itemid));
+
+ if (offnum == minoff)
+ {
+ /*
+ * No previous/base tuple for first data item -- use first data
+ * item as base tuple of first pending posting list
+ */
+ _bt_dedup_start_pending(state, itup, offnum);
+ }
+ else if (deduplicate &&
+ _bt_keep_natts_fast(rel, state->base, itup) > natts &&
+ _bt_dedup_save_htid(state, itup))
+ {
+ /*
+ * Tuple is equal to base tuple of pending posting list, and
+ * merging itup into pending posting list won't exceed the
+ * BTMaxItemSize() limit. Heap TID(s) for itup have been saved in
+ * state. The next iteration will also end up here if it's
+ * possible to merge the next tuple into the same pending posting
+ * list.
+ */
+ }
+ else
+ {
+ /*
+ * Tuple is not equal to pending posting list tuple, or
+ * BTMaxItemSize() limit was reached
+ */
+ pagesaving += _bt_dedup_finish_pending(newpage, state);
+
+ /*
+ * When we have deduplicated enough to avoid page split, don't
+ * bother merging together existing tuples to create new posting
+ * lists.
+ *
+ * Note: We deliberately add as many heap TIDs as possible to a
+ * pending posting list by performing this check at this point
+ * (just before a new pending posting list is created). It would
+ * be possible to make the final new posting list for each
+ * successful page deduplication operation as small as possible
+ * while still avoiding a page split for caller. We don't want to
+ * repeatedly merge posting lists around the same range of heap
+ * TIDs, though.
+ *
+ * (Besides, the total number of new posting lists created is the
+ * cost that this check is supposed to minimize -- there is no
+ * great reason to be concerned about the absolute number of
+ * existing tuples that can be killed/replaced.)
+ */
+#if 0
+ /* Actually, don't do that */
+ /* TODO: Make a final decision on this */
+ if (pagesaving >= newitemsz)
+ deduplicate = false;
+#endif
+
+ /* itup starts new pending posting list */
+ _bt_dedup_start_pending(state, itup, offnum);
+ }
+ }
+
+ /* Handle the last item */
+ pagesaving += _bt_dedup_finish_pending(newpage, state);
+
+ /*
+ * If no items suitable for deduplication were found, newpage must be
+ * exactly the same as the original page, so just return from function.
+ */
+ if (state->n_intervals == 0)
+ {
+ pfree(newpage);
+ pfree(state->htids);
+ pfree(state);
+ return;
+ }
+
+ START_CRIT_SECTION();
+
+ PageRestoreTempPage(newpage, page);
+ MarkBufferDirty(buffer);
+
+ /* Log deduplicated items */
+ if (RelationNeedsWAL(rel))
+ {
+ XLogRecPtr recptr;
+ xl_btree_dedup xlrec_dedup;
+
+ xlrec_dedup.n_intervals = state->n_intervals;
+
+ XLogBeginInsert();
+ XLogRegisterBuffer(0, buffer, REGBUF_STANDARD);
+ XLogRegisterData((char *) &xlrec_dedup, SizeOfBtreeDedup);
+
+ /* only save non-empty part of the array */
+ if (state->n_intervals > 0)
+ XLogRegisterData((char *) state->dedup_intervals,
+ state->n_intervals * sizeof(dedupInterval));
+
+ recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_DEDUP_PAGE);
+
+ PageSetLSN(page, recptr);
+ }
+
+ END_CRIT_SECTION();
+
+ /* be tidy */
+ pfree(state->htids);
+ pfree(state);
+}
+
+/*
+ * Create a new pending posting list tuple based on caller's tuple.
+ *
+ * Every tuple on the page either becomes the base tuple for a posting list or
+ * gets merged with pending posting list at least once. It remains to be seen
+ * whether or not it will actually be possible to merge together subsequent
+ * tuples on the page with this one, though.
+ *
+ * Exported for use by recovery.
+ */
+void
+_bt_dedup_start_pending(BTDedupState *state, IndexTuple base,
+ OffsetNumber base_off)
+{
+ Assert(state->nhtids == 0);
+ Assert(state->nitems == 0);
+
+ /*
+ * Copy heap TIDs from new base tuple for new candidate posting list into
+ * htids array. Assume that we'll eventually create a new posting tuple by
+ * merging later tuples with this existing one, though we may not.
+ */
+ if (!BTreeTupleIsPosting(base))
+ {
+ memcpy(state->htids, &base->t_tid, sizeof(ItemPointerData));
+ state->nhtids = 1;
+ /* Save size of tuple without any posting list */
+ state->basetupsize = IndexTupleSize(base);
+ }
+ else
+ {
+ int nposting;
+
+ nposting = BTreeTupleGetNPosting(base);
+ memcpy(state->htids, BTreeTupleGetPosting(base),
+ sizeof(ItemPointerData) * nposting);
+ state->nhtids = nposting;
+ /* Save size of tuple without any posting list */
+ state->basetupsize = BTreeTupleGetPostingOffset(base);
+ }
+
+ /*
+ * Save new base tuple itself -- it'll be needed if we actually create a
+ * new posting list from new pending posting list.
+ *
+ * Must maintain size of all tuples (including line pointer overhead) to
+ * calculate space savings on page within _bt_dedup_finish_pending().
+ * Also, save number of base tuple logical tuples so that we can save
+ * cycles in the common case where an existing posting list can't or won't
+ * be merged with other tuples on the page.
+ */
+ state->nitems = 1;
+ state->base = base;
+ state->base_off = base_off;
+ state->alltupsize = MAXALIGN(IndexTupleSize(base)) + sizeof(ItemIdData);
+ /* Also save base_off in pending state for interval */
+ state->dedup_intervals[state->n_intervals].from = state->base_off;
+}
+
+/*
+ * Add new posting tuple item to the page based on base and the saved list of
+ * heap TIDs.
+ *
+ * Returns space saving from deduplicating to make a new posting list tuple.
+ * Note that this includes line pointer overhead. This is zero in the case
+ * where no deduplication was possible.
+ *
+ * Exported for use by recovery.
+ */
+Size
+_bt_dedup_finish_pending(Page page, BTDedupState *state)
+{
+ IndexTuple final;
+ Size finalsz;
+ OffsetNumber finaloff;
+ Size spacesaving;
+
+ Assert(state->nhtids > 0);
+ Assert(state->nitems >= 1);
+ Assert(state->nitems <= state->nhtids);
+ Assert(state->dedup_intervals[state->n_intervals].from == state->base_off);
+
+ if (state->nitems == 1)
+ {
+ /* Use original, unchanged base tuple */
+ final = state->base;
+ spacesaving = 0;
+ finalsz = IndexTupleSize(final);
+
+ /* Do not increment n_intervals -- skip WAL logging */
+ }
+ else
+ {
+ /* Form a tuple with a posting list */
+ final = BTreeFormPostingTuple(state->base, state->htids,
+ state->nhtids);
+ finalsz = IndexTupleSize(final);
+ spacesaving = state->alltupsize - (finalsz + sizeof(ItemIdData));
+ /* Must have saved some space */
+ Assert(spacesaving > 0 && spacesaving < BLCKSZ);
+
+ /* Save final number of items for posting list */
+ state->dedup_intervals[state->n_intervals].nitems = state->nitems;
+
+ /* Advance to next candidate */
+ state->n_intervals++;
+ }
+
+
+ finaloff = OffsetNumberNext(PageGetMaxOffsetNumber(page));
+
+ Assert(finalsz == MAXALIGN(IndexTupleSize(final)));
+ Assert(finalsz <= state->maxitemsize);
+ if (PageAddItem(page, (Item) final, finalsz, finaloff, false,
+ false) == InvalidOffsetNumber)
+ elog(ERROR, "deduplication failed to add tuple to page");
+
+ if (final != state->base)
+ pfree(final);
+
+ /* Reset state for next pending posting list */
+ state->nhtids = 0;
+ state->nitems = 0;
+ state->alltupsize = 0;
+
+ return spacesaving;
+}
+
+/*
+ * If it's not possible to merge itup with pending posting list, returns
+ * false; caller should finish the pending posting list, and start a new one
+ * with itup as its base tuple. Otherwise, saves itup's heap TID(s) to local
+ * state, guaranteeing that at least that many heap TIDs can be merged
+ * together later on, when the current pending posting list is finished.
+ */
+bool
+_bt_dedup_save_htid(BTDedupState *state, IndexTuple itup)
+{
+ int nhtids;
+ ItemPointer htids;
+ Size mergedtupsz;
+
+ if (!BTreeTupleIsPosting(itup))
+ {
+ nhtids = 1;
+ htids = &itup->t_tid;
+ }
+ else
+ {
+ nhtids = BTreeTupleGetNPosting(itup);
+ htids = BTreeTupleGetPosting(itup);
+ }
+
+ /*
+ * Don't append (have caller finish pending posting list as-is) if
+ * appending heap TID(s) from itup would put us over limit
+ */
+ mergedtupsz = MAXALIGN(state->basetupsize +
+ (state->nhtids + nhtids) *
+ sizeof(ItemPointerData));
+
+ if (mergedtupsz > state->maxitemsize)
+ return false;
+
+ /*
+ * Save heap TIDs to pending posting list tuple -- itup can be merged into
+ * pending posting list
+ */
+ state->nitems++;
+ memcpy(state->htids + state->nhtids, htids,
+ sizeof(ItemPointerData) * nhtids);
+ state->nhtids += nhtids;
+ state->alltupsize += MAXALIGN(IndexTupleSize(itup)) + sizeof(ItemIdData);
+
+ return true;
+}
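Reviewer's aside on the three dedup helpers above, before the nbtpage.c changes: the space accounting can be followed with a small standalone model. This is a sketch under stated assumptions, not the patch's code. All `sketch_`/`SKETCH_` names are hypothetical, only plain (non-posting) input tuples are modelled, and the constants assume 8-byte maximum alignment, 6-byte heap TIDs and 4-byte line pointers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumed platform constants (see lead-in) */
#define SKETCH_ALIGN		8
#define SKETCH_TID_SIZE		6	/* size of one heap TID */
#define SKETCH_LP_SIZE		4	/* size of one line pointer */
#define SKETCH_MAXALIGN(len) \
	(((size_t) (len) + (SKETCH_ALIGN - 1)) & ~((size_t) (SKETCH_ALIGN - 1)))

/* Hypothetical, stripped-down counterpart of BTDedupState */
typedef struct SketchDedupState
{
	size_t		maxitemsize;	/* upper bound on a merged tuple's size */
	size_t		basetupsize;	/* size of the base tuple */
	int			nhtids;			/* heap TIDs in the pending posting list */
	int			nitems;			/* existing tuples merged so far */
	size_t		alltupsize;		/* space used by those tuples, with LPs */
} SketchDedupState;

/* Begin a pending posting list with a plain tuple of 'tupsize' bytes */
static void
sketch_start_pending(SketchDedupState *state, size_t tupsize)
{
	state->basetupsize = tupsize;
	state->nhtids = 1;
	state->nitems = 1;
	state->alltupsize = SKETCH_MAXALIGN(tupsize) + SKETCH_LP_SIZE;
}

/* Try to merge one more plain tuple; false means "finish and restart" */
static bool
sketch_save_htid(SketchDedupState *state, size_t tupsize)
{
	size_t		mergedtupsz;

	mergedtupsz = SKETCH_MAXALIGN(state->basetupsize +
								  (state->nhtids + 1) * SKETCH_TID_SIZE);
	if (mergedtupsz > state->maxitemsize)
		return false;

	state->nitems++;
	state->nhtids++;
	state->alltupsize += SKETCH_MAXALIGN(tupsize) + SKETCH_LP_SIZE;
	return true;
}

/* Finish the pending list; returns bytes saved on the page (0 if unmerged) */
static size_t
sketch_finish_pending(SketchDedupState *state)
{
	size_t		spacesaving = 0;

	if (state->nitems > 1)
	{
		size_t		finalsz = SKETCH_MAXALIGN(state->basetupsize +
											  state->nhtids * SKETCH_TID_SIZE);

		spacesaving = state->alltupsize - (finalsz + SKETCH_LP_SIZE);
	}

	/* Reset state for the next pending posting list */
	state->nhtids = 0;
	state->nitems = 0;
	state->alltupsize = 0;
	return spacesaving;
}
```

Under these assumptions, three 16-byte duplicates occupy 3 * (16 + 4) = 60 bytes. Merged, they need MAXALIGN(16 + 3 * 6) + 4 = 44 bytes, so the pass frees 16 bytes. A lone tuple is copied unchanged and frees nothing.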
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 268f869a36..648825e895 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -24,6 +24,7 @@
#include "access/nbtree.h"
#include "access/nbtxlog.h"
+#include "access/tableam.h"
#include "access/transam.h"
#include "access/xlog.h"
#include "access/xloginsert.h"
@@ -42,6 +43,11 @@ static bool _bt_lock_branch_parent(Relation rel, BlockNumber child,
BlockNumber *target, BlockNumber *rightsib);
static void _bt_log_reuse_page(Relation rel, BlockNumber blkno,
TransactionId latestRemovedXid);
+static TransactionId _bt_compute_xid_horizon_for_tuples(Relation rel,
+ Relation heapRel,
+ Buffer buf,
+ OffsetNumber *itemnos,
+ int nitems);
/*
* _bt_initmetapage() -- Fill a page buffer with a correct metapage image
@@ -983,14 +989,52 @@ _bt_page_recyclable(Page page)
void
_bt_delitems_vacuum(Relation rel, Buffer buf,
OffsetNumber *itemnos, int nitems,
+ OffsetNumber *updateitemnos,
+ IndexTuple *updated, int nupdatable,
BlockNumber lastBlockVacuumed)
{
Page page = BufferGetPage(buf);
BTPageOpaque opaque;
+ Size itemsz;
+ Size updated_sz = 0;
+ char *updated_buf = NULL;
+
+ /* XLOG stuff, buffer for updated tuples */
+ if (nupdatable > 0 && RelationNeedsWAL(rel))
+ {
+ Size offset = 0;
+
+ for (int i = 0; i < nupdatable; i++)
+ updated_sz += MAXALIGN(IndexTupleSize(updated[i]));
+
+ updated_buf = palloc0(updated_sz);
+ for (int i = 0; i < nupdatable; i++)
+ {
+ itemsz = IndexTupleSize(updated[i]);
+ memcpy(updated_buf + offset, (char *) updated[i], itemsz);
+ offset += MAXALIGN(itemsz);
+ }
+ Assert(offset == updated_sz);
+ }
/* No ereport(ERROR) until changes are logged */
START_CRIT_SECTION();
+ /* Handle posting tuples here */
+ for (int i = 0; i < nupdatable; i++)
+ {
+ /* First, delete the old tuple. */
+ PageIndexTupleDelete(page, updateitemnos[i]);
+
+ itemsz = IndexTupleSize(updated[i]);
+ itemsz = MAXALIGN(itemsz);
+
+ /* Add tuple with updated ItemPointers to the page. */
+ if (PageAddItem(page, (Item) updated[i], itemsz, updateitemnos[i],
+ false, false) == InvalidOffsetNumber)
+ elog(ERROR, "failed to rewrite posting list item in index while doing vacuum");
+ }
+
/* Fix the page */
if (nitems > 0)
PageIndexMultiDelete(page, itemnos, nitems);
@@ -1020,6 +1064,8 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
xl_btree_vacuum xlrec_vacuum;
xlrec_vacuum.lastBlockVacuumed = lastBlockVacuumed;
+ xlrec_vacuum.nupdated = nupdatable;
+ xlrec_vacuum.ndeleted = nitems;
XLogBeginInsert();
XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
@@ -1033,6 +1079,19 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
if (nitems > 0)
XLogRegisterBufData(0, (char *) itemnos, nitems * sizeof(OffsetNumber));
+ /*
+ * Here we should save offnums and updated tuples themselves. It's
+ * important to restore them in the correct order: first we must
+ * handle updated tuples, and only after that other deleted items.
+ */
+ if (nupdatable > 0)
+ {
+ Assert(updated_buf != NULL);
+ XLogRegisterBufData(0, (char *) updateitemnos,
+ nupdatable * sizeof(OffsetNumber));
+ XLogRegisterBufData(0, updated_buf, updated_sz);
+ }
+
recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_VACUUM);
PageSetLSN(page, recptr);
@@ -1041,6 +1100,91 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
END_CRIT_SECTION();
}
+/*
+ * Get the latestRemovedXid from the table entries pointed at by the index
+ * tuples being deleted.
+ *
+ * This is a version of index_compute_xid_horizon_for_tuples() specialized to
+ * nbtree, which can handle posting lists.
+ */
+static TransactionId
+_bt_compute_xid_horizon_for_tuples(Relation rel, Relation heapRel,
+ Buffer buf, OffsetNumber *itemnos,
+ int nitems)
+{
+ ItemPointerData *ttids;
+ TransactionId latestRemovedXid = InvalidTransactionId;
+ Page page = BufferGetPage(buf);
+ int arraynitems;
+ int finalnitems;
+
+ /*
+ * Initial size of array can fit everything when it turns out that there
+ * are no posting lists
+ */
+ arraynitems = nitems;
+ ttids = (ItemPointerData *) palloc(sizeof(ItemPointerData) * arraynitems);
+
+ finalnitems = 0;
+ /* identify what the index tuples about to be deleted point to */
+ for (int i = 0; i < nitems; i++)
+ {
+ ItemId itemid;
+ IndexTuple itup;
+
+ itemid = PageGetItemId(page, itemnos[i]);
+ itup = (IndexTuple) PageGetItem(page, itemid);
+
+ Assert(ItemIdIsDead(itemid));
+
+ if (!BTreeTupleIsPosting(itup))
+ {
+ /* Make sure that we have space for additional heap TID */
+ if (finalnitems + 1 > arraynitems)
+ {
+ arraynitems = arraynitems * 2;
+ ttids = (ItemPointerData *)
+ repalloc(ttids, sizeof(ItemPointerData) * arraynitems);
+ }
+
+ Assert(ItemPointerIsValid(&itup->t_tid));
+ ItemPointerCopy(&itup->t_tid, &ttids[finalnitems]);
+ finalnitems++;
+ }
+ else
+ {
+ int nposting = BTreeTupleGetNPosting(itup);
+
+ /* Make sure that we have space for additional heap TIDs */
+ if (finalnitems + nposting > arraynitems)
+ {
+ arraynitems = Max(arraynitems * 2, finalnitems + nposting);
+ ttids = (ItemPointerData *)
+ repalloc(ttids, sizeof(ItemPointerData) * arraynitems);
+ }
+
+ for (int j = 0; j < nposting; j++)
+ {
+ ItemPointer htid = BTreeTupleGetPostingN(itup, j);
+
+ Assert(ItemPointerIsValid(htid));
+ ItemPointerCopy(htid, &ttids[finalnitems]);
+ finalnitems++;
+ }
+ }
+ }
+
+ Assert(finalnitems >= nitems);
+
+ /* determine the actual xid horizon */
+ latestRemovedXid =
+ table_compute_xid_horizon_for_tuples(heapRel, ttids, finalnitems);
+
+ pfree(ttids);
+
+ return latestRemovedXid;
+}
+
/*
* Delete item(s) from a btree page during single-page cleanup.
*
@@ -1067,8 +1211,8 @@ _bt_delitems_delete(Relation rel, Buffer buf,
if (XLogStandbyInfoActive() && RelationNeedsWAL(rel))
latestRemovedXid =
- index_compute_xid_horizon_for_tuples(rel, heapRel, buf,
- itemnos, nitems);
+ _bt_compute_xid_horizon_for_tuples(rel, heapRel, buf,
+ itemnos, nitems);
/* No ereport(ERROR) until changes are logged */
START_CRIT_SECTION();
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 4cfd5289ad..b03bf67c26 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -97,6 +97,8 @@ static void btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
BTCycleId cycleid, TransactionId *oldestBtpoXact);
static void btvacuumpage(BTVacState *vstate, BlockNumber blkno,
BlockNumber orig_blkno);
+static ItemPointer btreepostingremains(BTVacState *vstate, IndexTuple itup,
+ int *nremaining);
/*
@@ -263,8 +265,8 @@ btgettuple(IndexScanDesc scan, ScanDirection dir)
*/
if (so->killedItems == NULL)
so->killedItems = (int *)
- palloc(MaxIndexTuplesPerPage * sizeof(int));
- if (so->numKilled < MaxIndexTuplesPerPage)
+ palloc(MaxPostingIndexTuplesPerPage * sizeof(int));
+ if (so->numKilled < MaxPostingIndexTuplesPerPage)
so->killedItems[so->numKilled++] = so->currPos.itemIndex;
}
@@ -1069,7 +1071,8 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
RBM_NORMAL, info->strategy);
LockBufferForCleanup(buf);
_bt_checkpage(rel, buf);
- _bt_delitems_vacuum(rel, buf, NULL, 0, vstate.lastBlockVacuumed);
+ _bt_delitems_vacuum(rel, buf, NULL, 0, NULL, NULL, 0,
+ vstate.lastBlockVacuumed);
_bt_relbuf(rel, buf);
}
@@ -1188,8 +1191,15 @@ restart:
}
else if (P_ISLEAF(opaque))
{
+ /* Deletable item state */
OffsetNumber deletable[MaxOffsetNumber];
int ndeletable;
+
+ /* Updatable item state (for posting lists) */
+ IndexTuple updated[MaxOffsetNumber];
+ OffsetNumber updatable[MaxOffsetNumber];
+ int nupdatable;
+
OffsetNumber offnum,
minoff,
maxoff;
@@ -1229,6 +1239,7 @@ restart:
* callback function.
*/
ndeletable = 0;
+ nupdatable = 0;
minoff = P_FIRSTDATAKEY(opaque);
maxoff = PageGetMaxOffsetNumber(page);
if (callback)
@@ -1238,11 +1249,9 @@ restart:
offnum = OffsetNumberNext(offnum))
{
IndexTuple itup;
- ItemPointer htup;
itup = (IndexTuple) PageGetItem(page,
PageGetItemId(page, offnum));
- htup = &(itup->t_tid);
/*
* During Hot Standby we currently assume that
@@ -1265,8 +1274,71 @@ restart:
* applies to *any* type of index that marks index tuples as
* killed.
*/
- if (callback(htup, callback_state))
- deletable[ndeletable++] = offnum;
+ if (!BTreeTupleIsPosting(itup))
+ {
+ /* Regular tuple, standard heap TID representation */
+ ItemPointer htid = &(itup->t_tid);
+
+ if (callback(htid, callback_state))
+ deletable[ndeletable++] = offnum;
+ }
+ else
+ {
+ ItemPointer newhtids;
+ int nremaining = 0;
+
+ /*
+ * Posting list tuple, a physical tuple that represents
+ * two or more logical tuples.
+ *
+ * Have to consider the need to VACUUM away the "logical"
+ * tuples contained in posting list tuple
+ */
+ newhtids = btreepostingremains(vstate, itup, &nremaining);
+ if (nremaining == 0)
+ {
+ /*
+ * All TIDs/logical tuples from the posting list must
+ * be deleted, we can delete whole physical tuple as
+ * if it wasn't a posting list tuple.
+ */
+ deletable[ndeletable++] = offnum;
+ Assert(newhtids == NULL);
+ }
+ else if (nremaining < BTreeTupleGetNPosting(itup))
+ {
+ IndexTuple updatedtuple;
+
+ /*
+ * A subset of the logical tuples/TIDs must remain.
+ * Perform an update (page delete + page add item) to
+ * delete some but not all logical tuples in the
+ * posting list.
+ *
+ * Form new tuple that contains only remaining TIDs.
+ * Remember this tuple and the offset of the old tuple
+ * to update it in place.
+ *
+ * Note that the new tuple won't be a posting list
+ * tuple when only one logical tuple remains (i.e.,
+ * when all of the others are being killed).
+ */
+ updatedtuple = BTreeFormPostingTuple(itup, newhtids,
+ nremaining);
+ updated[nupdatable] = updatedtuple;
+ updatable[nupdatable++] = offnum;
+ pfree(newhtids);
+ }
+ else
+ {
+ /*
+ * All TIDs/logical tuples from the posting tuple
+ * remain, so no update or delete is required.
+ */
+ Assert(nremaining == BTreeTupleGetNPosting(itup));
+ pfree(newhtids);
+ }
+ }
}
}
@@ -1274,7 +1346,7 @@ restart:
* Apply any needed deletes. We issue just one _bt_delitems_vacuum()
* call per page, so as to minimize WAL traffic.
*/
- if (ndeletable > 0)
+ if (ndeletable > 0 || nupdatable > 0)
{
/*
* Notice that the issued XLOG_BTREE_VACUUM WAL record includes
@@ -1290,7 +1362,8 @@ restart:
* doesn't seem worth the amount of bookkeeping it'd take to avoid
* that.
*/
- _bt_delitems_vacuum(rel, buf, deletable, ndeletable,
+ _bt_delitems_vacuum(rel, buf, deletable, ndeletable, updatable,
+ updated, nupdatable,
vstate->lastBlockVacuumed);
/*
@@ -1375,6 +1448,43 @@ restart:
}
}
+/*
+ * btreepostingremains() -- determines which logical tuples must remain when
+ * VACUUMing a posting list tuple.
+ *
+ * Returns new palloc'd array of item pointers needed to build replacement
+ * posting list. The array's size is returned by setting *nremaining.
+ *
+ * If all items are dead, returns NULL.
+ */
+static ItemPointer
+btreepostingremains(BTVacState *vstate, IndexTuple itup, int *nremaining)
+{
+ int remaining = 0;
+ int nitem = BTreeTupleGetNPosting(itup);
+ ItemPointer tmpitems = NULL,
+ items = BTreeTupleGetPosting(itup);
+
+ Assert(BTreeTupleIsPosting(itup));
+
+ /*
+ * Check each tuple in the posting list, save live tuples into tmpitems
+ */
+ for (int i = 0; i < nitem; i++)
+ {
+ if (vstate->callback(items + i, vstate->callback_state))
+ continue;
+
+ if (tmpitems == NULL)
+ tmpitems = palloc(sizeof(ItemPointerData) * nitem);
+
+ tmpitems[remaining++] = items[i];
+ }
+
+ *nremaining = remaining;
+ return tmpitems;
+}
+
/*
* btcanreturn() -- Check whether btree indexes support index-only scans.
*
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 8e512461a0..af5e136af7 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -26,10 +26,18 @@
static void _bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp);
static OffsetNumber _bt_binsrch(Relation rel, BTScanInsert key, Buffer buf);
+static int _bt_binsrch_posting(BTScanInsert key, Page page,
+ OffsetNumber offnum);
static bool _bt_readpage(IndexScanDesc scan, ScanDirection dir,
OffsetNumber offnum);
static void _bt_saveitem(BTScanOpaque so, int itemIndex,
OffsetNumber offnum, IndexTuple itup);
+static void _bt_setuppostingitems(BTScanOpaque so, int itemIndex,
+ OffsetNumber offnum, ItemPointer heapTid,
+ IndexTuple itup);
+static inline void _bt_savepostingitem(BTScanOpaque so, int itemIndex,
+ OffsetNumber offnum,
+ ItemPointer heapTid);
static bool _bt_steppage(IndexScanDesc scan, ScanDirection dir);
static bool _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir);
static bool _bt_parallel_readpage(IndexScanDesc scan, BlockNumber blkno,
@@ -434,7 +442,10 @@ _bt_binsrch(Relation rel,
* low) makes bounds invalid.
*
* Caller is responsible for invalidating bounds when it modifies the page
- * before calling here a second time.
+ * before calling here a second time, and for dealing with posting list
+ * tuple matches (callers can use insertstate's in_posting_offset field to
+ * determine which existing heap TID will need to be replaced by their
+ * scantid/new heap TID).
*/
OffsetNumber
_bt_binsrch_insert(Relation rel, BTInsertState insertstate)
@@ -453,6 +464,7 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
Assert(P_ISLEAF(opaque));
Assert(!key->nextkey);
+ Assert(insertstate->in_posting_offset == 0);
if (!insertstate->bounds_valid)
{
@@ -509,6 +521,17 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
if (result != 0)
stricthigh = high;
}
+
+ /*
+ * If tuple at offset located by binary search is a posting list whose
+ * TID range overlaps with caller's scantid, perform posting list
+ * binary search to set in_posting_offset for caller. Caller must
+ * split the posting list when in_posting_offset is set. This should
+ * happen infrequently.
+ */
+ if (unlikely(result == 0 && key->scantid != NULL))
+ insertstate->in_posting_offset =
+ _bt_binsrch_posting(key, page, mid);
}
/*
@@ -528,6 +551,68 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
return low;
}
+/*----------
+ * _bt_binsrch_posting() -- posting list binary search.
+ *
+ * Returns offset into posting list where caller's scantid belongs.
+ *----------
+ */
+static int
+_bt_binsrch_posting(BTScanInsert key, Page page, OffsetNumber offnum)
+{
+ IndexTuple itup;
+ ItemId itemid;
+ int low,
+ high,
+ mid,
+ res;
+
+ /*
+ * If this isn't a posting tuple, then the index must be corrupt (if it is
+ * an ordinary non-pivot tuple then there must be an existing tuple with a
+ * heap TID that equals inserter's new heap TID/scantid). Defensively
+ * check that tuple is a posting list tuple whose posting list range
+ * includes caller's scantid.
+ *
+ * (This is also needed because contrib/amcheck's rootdescend option needs
+ * to be able to relocate a non-pivot tuple using _bt_binsrch_insert().)
+ */
+ Assert(P_ISLEAF((BTPageOpaque) PageGetSpecialPointer(page)));
+ Assert(!key->nextkey);
+ Assert(key->scantid != NULL);
+ itemid = PageGetItemId(page, offnum);
+ itup = (IndexTuple) PageGetItem(page, itemid);
+ if (!BTreeTupleIsPosting(itup))
+ return 0;
+
+ /*
+ * In the unlikely event that posting list tuple has LP_DEAD bit set,
+ * signal to caller that it should kill the item and restart its binary
+ * search.
+ */
+ if (ItemIdIsDead(itemid))
+ return -1;
+
+ /* "high" is past end of posting list for loop invariant */
+ low = 0;
+ high = BTreeTupleGetNPosting(itup);
+ Assert(high >= 2);
+
+ while (high > low)
+ {
+ mid = low + ((high - low) / 2);
+ res = ItemPointerCompare(key->scantid,
+ BTreeTupleGetPostingN(itup, mid));
+
+ if (res >= 1)
+ low = mid + 1;
+ else
+ high = mid;
+ }
+
+ return low;
+}
+
/*----------
* _bt_compare() -- Compare insertion-type scankey to tuple on a page.
*
@@ -537,9 +622,18 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
* <0 if scankey < tuple at offnum;
* 0 if scankey == tuple at offnum;
* >0 if scankey > tuple at offnum.
- * NULLs in the keys are treated as sortable values. Therefore
- * "equality" does not necessarily mean that the item should be
- * returned to the caller as a matching key!
+ *
+ * NULLs in the keys are treated as sortable values. Therefore
+ * "equality" does not necessarily mean that the item should be returned
+ * to the caller as a matching key. Similarly, an insertion scankey
+ * with its scantid set is treated as equal to a posting tuple whose TID
+ * range overlaps with its scantid. There generally won't be a
+ * matching TID in the posting tuple, which caller must handle
+ * themselves (e.g., by splitting the posting list tuple).
+ *
+ * It is generally guaranteed that any possible scankey with scantid set
+ * will have zero or one tuples in the index that are considered equal
+ * here.
*
* CRUCIAL NOTE: on a non-leaf page, the first data key is assumed to be
* "minus infinity": this routine will always claim it is less than the
@@ -563,6 +657,7 @@ _bt_compare(Relation rel,
ScanKey scankey;
int ncmpkey;
int ntupatts;
+ int32 result;
Assert(_bt_check_natts(rel, key->heapkeyspace, page, offnum));
Assert(key->keysz <= IndexRelationGetNumberOfKeyAttributes(rel));
@@ -597,7 +692,6 @@ _bt_compare(Relation rel,
{
Datum datum;
bool isNull;
- int32 result;
datum = index_getattr(itup, scankey->sk_attno, itupdesc, &isNull);
@@ -713,8 +807,24 @@ _bt_compare(Relation rel,
if (heapTid == NULL)
return 1;
+ /*
+ * scankey must be treated as equal to a posting list tuple if its scantid
+ * value falls within the range of the posting list. In all other cases
+ * there can only be a single heap TID value, which is compared directly
+ * as a simple scalar value.
+ */
Assert(ntupatts >= IndexRelationGetNumberOfKeyAttributes(rel));
- return ItemPointerCompare(key->scantid, heapTid);
+ result = ItemPointerCompare(key->scantid, heapTid);
+ if (!BTreeTupleIsPosting(itup) || result <= 0)
+ return result;
+ else
+ {
+ result = ItemPointerCompare(key->scantid, BTreeTupleGetMaxTID(itup));
+ if (result > 0)
+ return 1;
+ }
+
+ return 0;
}
/*
@@ -1451,6 +1561,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
/* initialize tuple workspace to empty */
so->currPos.nextTupleOffset = 0;
+ so->currPos.postingTupleOffset = 0;
/*
* Now that the current page has been made consistent, the macro should be
@@ -1485,8 +1596,29 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
if (_bt_checkkeys(scan, itup, indnatts, dir, &continuescan))
{
/* tuple passes all scan key conditions, so remember it */
- _bt_saveitem(so, itemIndex, offnum, itup);
- itemIndex++;
+ if (!BTreeTupleIsPosting(itup))
+ {
+ _bt_saveitem(so, itemIndex, offnum, itup);
+ itemIndex++;
+ }
+ else
+ {
+ /*
+ * Setup state to return posting list, and save first
+ * "logical" tuple
+ */
+ _bt_setuppostingitems(so, itemIndex, offnum,
+ BTreeTupleGetPostingN(itup, 0),
+ itup);
+ itemIndex++;
+ /* Save additional posting list "logical" tuples */
+ for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
+ {
+ _bt_savepostingitem(so, itemIndex, offnum,
+ BTreeTupleGetPostingN(itup, i));
+ itemIndex++;
+ }
+ }
}
/* When !continuescan, there can't be any more matches, so stop */
if (!continuescan)
@@ -1519,7 +1651,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
if (!continuescan)
so->currPos.moreRight = false;
- Assert(itemIndex <= MaxIndexTuplesPerPage);
+ Assert(itemIndex <= MaxPostingIndexTuplesPerPage);
so->currPos.firstItem = 0;
so->currPos.lastItem = itemIndex - 1;
so->currPos.itemIndex = 0;
@@ -1527,7 +1659,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
else
{
/* load items[] in descending order */
- itemIndex = MaxIndexTuplesPerPage;
+ itemIndex = MaxPostingIndexTuplesPerPage;
offnum = Min(offnum, maxoff);
@@ -1569,8 +1701,36 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
if (passes_quals && tuple_alive)
{
/* tuple passes all scan key conditions, so remember it */
- itemIndex--;
- _bt_saveitem(so, itemIndex, offnum, itup);
+ if (!BTreeTupleIsPosting(itup))
+ {
+ itemIndex--;
+ _bt_saveitem(so, itemIndex, offnum, itup);
+ }
+ else
+ {
+ int i = BTreeTupleGetNPosting(itup) - 1;
+
+ /*
+ * Setup state to return posting list, and save last
+ * "logical" tuple from posting list (since it's the first
+ * that will be returned to scan).
+ */
+ itemIndex--;
+ _bt_setuppostingitems(so, itemIndex, offnum,
+ BTreeTupleGetPostingN(itup, i--),
+ itup);
+
+ /*
+ * Return posting list "logical" tuples -- do this in
+ * descending order, to match overall scan order
+ */
+ for (; i >= 0; i--)
+ {
+ itemIndex--;
+ _bt_savepostingitem(so, itemIndex, offnum,
+ BTreeTupleGetPostingN(itup, i));
+ }
+ }
}
if (!continuescan)
{
@@ -1584,8 +1744,8 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
Assert(itemIndex >= 0);
so->currPos.firstItem = itemIndex;
- so->currPos.lastItem = MaxIndexTuplesPerPage - 1;
- so->currPos.itemIndex = MaxIndexTuplesPerPage - 1;
+ so->currPos.lastItem = MaxPostingIndexTuplesPerPage - 1;
+ so->currPos.itemIndex = MaxPostingIndexTuplesPerPage - 1;
}
return (so->currPos.firstItem <= so->currPos.lastItem);
@@ -1598,6 +1758,8 @@ _bt_saveitem(BTScanOpaque so, int itemIndex,
{
BTScanPosItem *currItem = &so->currPos.items[itemIndex];
+ Assert(!BTreeTupleIsPosting(itup));
+
currItem->heapTid = itup->t_tid;
currItem->indexOffset = offnum;
if (so->currTuples)
@@ -1610,6 +1772,59 @@ _bt_saveitem(BTScanOpaque so, int itemIndex,
}
}
+/*
+ * Setup state to save posting items from a single posting list tuple. Saves
+ * the logical tuple that will be returned to scan first in passing.
+ *
+ * Saves an index item into so->currPos.items[itemIndex] for logical tuple
+ * that is returned to scan first. Second or subsequent heap TID for posting
+ * list should be saved by calling _bt_savepostingitem().
+ */
+static void
+_bt_setuppostingitems(BTScanOpaque so, int itemIndex, OffsetNumber offnum,
+ ItemPointer heapTid, IndexTuple itup)
+{
+ BTScanPosItem *currItem = &so->currPos.items[itemIndex];
+
+ currItem->heapTid = *heapTid;
+ currItem->indexOffset = offnum;
+
+ if (so->currTuples)
+ {
+ /* Save a truncated version of the IndexTuple */
+ Size itupsz = BTreeTupleGetPostingOffset(itup);
+
+ itupsz = MAXALIGN(itupsz);
+ currItem->tupleOffset = so->currPos.nextTupleOffset;
+ memcpy(so->currTuples + so->currPos.nextTupleOffset, itup, itupsz);
+ so->currPos.nextTupleOffset += itupsz;
+ so->currPos.postingTupleOffset = currItem->tupleOffset;
+ }
+}
+
+/*
+ * Save an index item into so->currPos.items[itemIndex] for posting tuple.
+ *
+ * Assumes that _bt_setuppostingitems() has already been called for current
+ * posting list tuple.
+ */
+static inline void
+_bt_savepostingitem(BTScanOpaque so, int itemIndex, OffsetNumber offnum,
+ ItemPointer heapTid)
+{
+ BTScanPosItem *currItem = &so->currPos.items[itemIndex];
+
+ currItem->heapTid = *heapTid;
+ currItem->indexOffset = offnum;
+
+ /*
+ * Have index-only scans return the same truncated IndexTuple for every
+ * logical tuple that originates from the same posting list
+ */
+ if (so->currTuples)
+ currItem->tupleOffset = so->currPos.postingTupleOffset;
+}
+
/*
* _bt_steppage() -- Step to next page containing valid data for scan
*
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index ab19692006..480a7824d4 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -285,6 +285,8 @@ static BTPageState *_bt_pagestate(BTWriteState *wstate, uint32 level);
static void _bt_slideleft(Page page);
static void _bt_sortaddtup(Page page, Size itemsize,
IndexTuple itup, OffsetNumber itup_off);
+static void _bt_sortdedup(BTWriteState *wstate, BTPageState *state,
+ BTDedupState *dedupState);
static void _bt_buildadd(BTWriteState *wstate, BTPageState *state,
IndexTuple itup);
static void _bt_uppershutdown(BTWriteState *wstate, BTPageState *state);
@@ -301,6 +303,7 @@ static void _bt_parallel_scan_and_sort(BTSpool *btspool, BTSpool *btspool2,
BTShared *btshared, Sharedsort *sharedsort,
Sharedsort *sharedsort2, int sortmem,
bool progress);
+static void _bt_dedup_item_tid_sort(BTDedupState *dedupState, IndexTuple itup);
/*
@@ -798,6 +801,43 @@ _bt_sortaddtup(Page page,
elog(ERROR, "failed to add item to the index page");
}
+/*
+ * Add new tuple (posting or non-posting) to the page being built.
+ *
+ * This is almost like nbtinsert.c's _bt_dedup(), but it avoids incremental
+ * space accounting, and adds a new tuple using nbtsort.c facilities.
+ */
+static void
+_bt_sortdedup(BTWriteState *wstate, BTPageState *state,
+ BTDedupState *dedupState)
+{
+ IndexTuple to_insert;
+
+ /* Return, if there is no tuple to insert */
+ if (state == NULL)
+ return;
+
+ if (dedupState->nhtids == 0)
+ to_insert = dedupState->base;
+ else
+ {
+ IndexTuple postingtuple;
+
+ /* form a tuple with a posting list */
+ postingtuple = BTreeFormPostingTuple(dedupState->base,
+ dedupState->htids,
+ dedupState->nhtids);
+ to_insert = postingtuple;
+ pfree(dedupState->htids);
+ }
+
+ _bt_buildadd(wstate, state, to_insert);
+
+ if (dedupState->nhtids > 0)
+ pfree(to_insert);
+ dedupState->nhtids = 0;
+}
+
/*----------
* Add an item to a disk page from the sort output.
*
@@ -830,6 +870,8 @@ _bt_sortaddtup(Page page,
* the high key is to be truncated, offset 1 is deleted, and we insert
* the truncated high key at offset 1.
*
+ * Note that itup may be a posting list tuple.
+ *
* 'last' pointer indicates the last offset added to the page.
*----------
*/
@@ -963,6 +1005,11 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
* Overwrite the old item with new truncated high key directly.
* oitup is already located at the physical beginning of tuple
* space, so this should directly reuse the existing tuple space.
+ *
+ * If lastleft tuple was a posting tuple, we'll truncate its
+ * posting list in _bt_truncate as well. Note that this is
+ * applicable only to leaf pages, since internal pages never
+ * contain posting tuples.
*/
ii = PageGetItemId(opage, OffsetNumberPrev(last_off));
lastleft = (IndexTuple) PageGetItem(opage, ii);
@@ -1002,6 +1049,7 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
* the minimum key for the new page.
*/
state->btps_minkey = CopyIndexTuple(oitup);
+ Assert(BTreeTupleIsPivot(state->btps_minkey));
/*
* Set the sibling links for both pages.
@@ -1043,6 +1091,7 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
Assert(state->btps_minkey == NULL);
state->btps_minkey = CopyIndexTuple(itup);
/* _bt_sortaddtup() will perform full truncation later */
+ BTreeTupleClearBtIsPosting(state->btps_minkey);
BTreeTupleSetNAtts(state->btps_minkey, 0);
}
@@ -1141,9 +1190,20 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
bool load1;
TupleDesc tupdes = RelationGetDescr(wstate->index);
int i,
- keysz = IndexRelationGetNumberOfKeyAttributes(wstate->index);
+ keysz = IndexRelationGetNumberOfKeyAttributes(wstate->index),
+ natts = IndexRelationGetNumberOfAttributes(wstate->index);
SortSupport sortKeys;
int64 tuples_done = 0;
+ bool deduplicate = false;
+ BTDedupState *dedupState = NULL;
+
+ /*
+ * Don't use deduplication for indexes with INCLUDEd columns and unique
+ * indexes
+ */
+ deduplicate = (IndexRelationGetNumberOfKeyAttributes(wstate->index) ==
+ IndexRelationGetNumberOfAttributes(wstate->index) &&
+ !wstate->index->rd_index->indisunique);
if (merge)
{
@@ -1257,19 +1317,99 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
}
else
{
- /* merge is unnecessary */
- while ((itup = tuplesort_getindextuple(btspool->sortstate,
- true)) != NULL)
+ if (!deduplicate)
{
- /* When we see first tuple, create first index page */
- if (state == NULL)
- state = _bt_pagestate(wstate, 0);
+ /* merge is unnecessary */
+ while ((itup = tuplesort_getindextuple(btspool->sortstate,
+ true)) != NULL)
+ {
+ /* When we see first tuple, create first index page */
+ if (state == NULL)
+ state = _bt_pagestate(wstate, 0);
- _bt_buildadd(wstate, state, itup);
+ _bt_buildadd(wstate, state, itup);
- /* Report progress */
- pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
- ++tuples_done);
+ /* Report progress */
+ pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
+ ++tuples_done);
+ }
+ }
+ else
+ {
+ /* init deduplication state needed to build posting tuples */
+ dedupState = (BTDedupState *) palloc(sizeof(BTDedupState));
+
+ /* Convenience variables concerning generic limits */
+ dedupState->maxitemsize = 0;
+ dedupState->maxpostingsize = 0;
+ /* Metadata about current pending posting list */
+ dedupState->htids = NULL;
+ dedupState->nhtids = 0;
+ dedupState->nitems = 0;
+ dedupState->alltupsize = 0;
+ /* Metadata about base tuple of current pending posting list */
+ dedupState->base = NULL;
+ dedupState->base_off = InvalidOffsetNumber;
+ dedupState->basetupsize = 0;
+ /* Finally, n_intervals should be initialized to zero */
+ dedupState->n_intervals = 0;
+
+ while ((itup = tuplesort_getindextuple(btspool->sortstate,
+ true)) != NULL)
+ {
+ /* When we see first tuple, create first index page */
+ if (state == NULL)
+ {
+ state = _bt_pagestate(wstate, 0);
+ dedupState->maxitemsize = BTMaxItemSize(state->btps_page);
+ }
+
+ if (dedupState->base != NULL)
+ {
+ int n_equal_atts = _bt_keep_natts_fast(wstate->index,
+ dedupState->base, itup);
+
+ if (n_equal_atts > natts)
+ {
+ /*
+ * Tuples are equal. Create or update posting.
+ *
+ * If the posting list would become too big, insert it
+ * on the page and continue.
+ */
+ if ((dedupState->nhtids + 1) *
+ sizeof(ItemPointerData) <
+ dedupState->maxpostingsize)
+ _bt_dedup_item_tid_sort(dedupState, itup);
+ else
+ _bt_sortdedup(wstate, state, dedupState);
+ }
+ else
+ {
+ /*
+ * Tuples are not equal. Insert base into index. Save
+ * current tuple for the next iteration.
+ */
+ _bt_sortdedup(wstate, state, dedupState);
+ }
+ }
+
+ /*
+ * Save the tuple to compare it with the next one and maybe
+ * unite them into a posting tuple.
+ */
+ if (dedupState->base)
+ pfree(dedupState->base);
+ dedupState->base = CopyIndexTuple(itup);
+
+ /* compute max size of posting list */
+ dedupState->maxpostingsize = dedupState->maxitemsize -
+ IndexInfoFindDataOffset(dedupState->base->t_info) -
+ MAXALIGN(IndexTupleSize(dedupState->base));
+ }
+
+ /* Handle the last item */
+ _bt_sortdedup(wstate, state, dedupState);
}
}
@@ -1798,3 +1938,72 @@ _bt_parallel_scan_and_sort(BTSpool *btspool, BTSpool *btspool2,
if (btspool2)
tuplesort_end(btspool2->sortstate);
}
+
+/*
+ * FIXME: Merge this with _bt_dedup_item_tid(), which still has global
+ * linkage.
+ */
+static void
+_bt_dedup_item_tid_sort(BTDedupState *dedupState, IndexTuple itup)
+{
+ int nposting = 0;
+
+ if (dedupState->nhtids == 0)
+ {
+ dedupState->htids = palloc0(dedupState->maxitemsize);
+ dedupState->alltupsize =
+ MAXALIGN(IndexTupleSize(dedupState->base)) +
+ sizeof(ItemIdData);
+
+ /*
+ * base hasn't had its posting list TIDs copied into htids yet (must
+ * have been first on page and/or in new posting list?). Do so now.
+ *
+ * This is delayed because it wasn't initially clear whether or not
+ * base would be merged with the next tuple, or stay as-is. By now
+ * caller compared it against itup and found that it was equal, so we
+ * can go ahead and add its TIDs.
+ */
+ if (!BTreeTupleIsPosting(dedupState->base))
+ {
+ memcpy(dedupState->htids, &dedupState->base->t_tid,
+ sizeof(ItemPointerData));
+ dedupState->nhtids++;
+ }
+ else
+ {
+ /* if base is posting, add all its TIDs to the posting list */
+ nposting = BTreeTupleGetNPosting(dedupState->base);
+ memcpy(dedupState->htids,
+ BTreeTupleGetPosting(dedupState->base),
+ sizeof(ItemPointerData) * nposting);
+ dedupState->nhtids += nposting;
+ }
+ }
+
+ /*
+ * Add current tup to htids for pending posting list for new version of
+ * page.
+ */
+ if (!BTreeTupleIsPosting(itup))
+ {
+ memcpy(dedupState->htids + dedupState->nhtids, &itup->t_tid,
+ sizeof(ItemPointerData));
+ dedupState->nhtids++;
+ }
+ else
+ {
+ /*
+ * if tuple is posting, add all its TIDs to the pending list that will
+ * become new posting list later on
+ */
+ nposting = BTreeTupleGetNPosting(itup);
+ memcpy(dedupState->htids + dedupState->nhtids,
+ BTreeTupleGetPosting(itup),
+ sizeof(ItemPointerData) * nposting);
+ dedupState->nhtids += nposting;
+ }
+
+ dedupState->alltupsize +=
+ MAXALIGN(IndexTupleSize(itup)) + sizeof(ItemIdData);
+}
diff --git a/src/backend/access/nbtree/nbtsplitloc.c b/src/backend/access/nbtree/nbtsplitloc.c
index 1c1029b6c4..54cecc85c5 100644
--- a/src/backend/access/nbtree/nbtsplitloc.c
+++ b/src/backend/access/nbtree/nbtsplitloc.c
@@ -183,6 +183,9 @@ _bt_findsplitloc(Relation rel,
state.minfirstrightsz = SIZE_MAX;
state.newitemoff = newitemoff;
+ /* newitem cannot be a posting list item */
+ Assert(!BTreeTupleIsPosting(newitem));
+
/*
* maxsplits should never exceed maxoff because there will be at most as
* many candidate split points as there are points _between_ tuples, once
@@ -459,17 +462,52 @@ _bt_recsplitloc(FindSplitData *state,
int16 leftfree,
rightfree;
Size firstrightitemsz;
+ Size postingsubhikey = 0;
bool newitemisfirstonright;
/* Is the new item going to be the first item on the right page? */
newitemisfirstonright = (firstoldonright == state->newitemoff
&& !newitemonleft);
+ /*
+ * FIXME: Accessing every single tuple like this adds cycles to cases that
+ * cannot possibly benefit (i.e. cases where we know that there cannot be
+ * posting lists). Maybe we should add a way to not bother when we are
+ * certain that this is the case.
+ *
+ * We could either have _bt_split() pass us a flag, or invent a page flag
+ * that indicates that the page might have posting lists, as an
+ * optimization. There is no shortage of btpo_flags bits for stuff like
+ * this.
+ */
if (newitemisfirstonright)
+ {
firstrightitemsz = state->newitemsz;
+
+ /* Calculate posting list overhead, if any */
+ if (state->is_leaf && BTreeTupleIsPosting(state->newitem))
+ postingsubhikey = IndexTupleSize(state->newitem) -
+ BTreeTupleGetPostingOffset(state->newitem);
+ }
else
+ {
firstrightitemsz = firstoldonrightsz;
+ /* Calculate posting list overhead, if any */
+ if (state->is_leaf)
+ {
+ ItemId itemid;
+ IndexTuple newhighkey;
+
+ itemid = PageGetItemId(state->page, firstoldonright);
+ newhighkey = (IndexTuple) PageGetItem(state->page, itemid);
+
+ if (BTreeTupleIsPosting(newhighkey))
+ postingsubhikey = IndexTupleSize(newhighkey) -
+ BTreeTupleGetPostingOffset(newhighkey);
+ }
+ }
+
/* Account for all the old tuples */
leftfree = state->leftspace - olddataitemstoleft;
rightfree = state->rightspace -
@@ -492,9 +530,13 @@ _bt_recsplitloc(FindSplitData *state,
* adding a heap TID to the left half's new high key when splitting at the
* leaf level. In practice the new high key will often be smaller and
* will rarely be larger, but conservatively assume the worst case.
+ * Truncation always truncates away any posting list that appears in the
+ * first right tuple, though, so it's safe to subtract that overhead
+ * (while still conservatively assuming that truncation might have to add
+ * back a single heap TID using the pivot tuple heap TID representation).
*/
if (state->is_leaf)
- leftfree -= (int16) (firstrightitemsz +
+ leftfree -= (int16) ((firstrightitemsz - postingsubhikey) +
MAXALIGN(sizeof(ItemPointerData)));
else
leftfree -= (int16) firstrightitemsz;
@@ -691,7 +733,8 @@ _bt_afternewitemoff(FindSplitData *state, OffsetNumber maxoff,
itemid = PageGetItemId(state->page, OffsetNumberPrev(state->newitemoff));
tup = (IndexTuple) PageGetItem(state->page, itemid);
/* Do cheaper test first */
- if (!_bt_adjacenthtid(&tup->t_tid, &state->newitem->t_tid))
+ if (BTreeTupleIsPosting(tup) ||
+ !_bt_adjacenthtid(&tup->t_tid, &state->newitem->t_tid))
return false;
/* Check same conditions as rightmost item case, too */
keepnatts = _bt_keep_natts_fast(state->rel, tup, state->newitem);
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index bc855dd25d..7460bf264d 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -97,8 +97,6 @@ _bt_mkscankey(Relation rel, IndexTuple itup)
indoption = rel->rd_indoption;
tupnatts = itup ? BTreeTupleGetNAtts(itup, rel) : 0;
- Assert(tupnatts <= IndexRelationGetNumberOfAttributes(rel));
-
/*
* We'll execute search using scan key constructed on key columns.
* Truncated attributes and non-key attributes are omitted from the final
@@ -110,9 +108,20 @@ _bt_mkscankey(Relation rel, IndexTuple itup)
key->anynullkeys = false; /* initial assumption */
key->nextkey = false;
key->pivotsearch = false;
+ key->scantid = NULL;
key->keysz = Min(indnkeyatts, tupnatts);
- key->scantid = key->heapkeyspace && itup ?
- BTreeTupleGetHeapTID(itup) : NULL;
+
+ Assert(tupnatts <= IndexRelationGetNumberOfAttributes(rel));
+ Assert(!itup || !BTreeTupleIsPosting(itup) || key->heapkeyspace);
+
+ /*
+ * When caller passes a tuple with a heap TID, use it to set scantid. Note
+ * that this handles posting list tuples by setting scantid to the lowest
+ * heap TID in the posting list.
+ */
+ if (itup && key->heapkeyspace)
+ key->scantid = BTreeTupleGetHeapTID(itup);
+
skey = key->scankeys;
for (i = 0; i < indnkeyatts; i++)
{
@@ -1386,6 +1395,7 @@ _bt_checkkeys(IndexScanDesc scan, IndexTuple tuple, int tupnatts,
* attribute passes the qual.
*/
Assert(ScanDirectionIsForward(dir));
+ Assert(BTreeTupleIsPivot(tuple));
continue;
}
@@ -1547,6 +1557,7 @@ _bt_check_rowcompare(ScanKey skey, IndexTuple tuple, int tupnatts,
* attribute passes the qual.
*/
Assert(ScanDirectionIsForward(dir));
+ Assert(BTreeTupleIsPivot(tuple));
cmpresult = 0;
if (subkey->sk_flags & SK_ROW_END)
break;
@@ -1786,10 +1797,35 @@ _bt_killitems(IndexScanDesc scan)
{
ItemId iid = PageGetItemId(page, offnum);
IndexTuple ituple = (IndexTuple) PageGetItem(page, iid);
+ bool killtuple = false;
- if (ItemPointerEquals(&ituple->t_tid, &kitem->heapTid))
+ if (BTreeTupleIsPosting(ituple))
{
- /* found the item */
+ int pi = i + 1;
+ int nposting = BTreeTupleGetNPosting(ituple);
+ int j;
+
+ for (j = 0; j < nposting; j++)
+ {
+ ItemPointer item = BTreeTupleGetPostingN(ituple, j);
+
+ if (!ItemPointerEquals(item, &kitem->heapTid))
+ break; /* out of posting list loop */
+
+ /* Read-ahead to later kitems */
+ if (pi < numKilled)
+ kitem = &so->currPos.items[so->killedItems[pi++]];
+ }
+
+ if (j == nposting)
+ killtuple = true;
+ }
+ else if (ItemPointerEquals(&ituple->t_tid, &kitem->heapTid))
+ killtuple = true;
+
+ if (killtuple)
+ {
+ /* found the item/all posting list items */
ItemIdMarkDead(iid);
killedsomething = true;
break; /* out of inner search loop */
@@ -2140,6 +2176,24 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
pivot = index_truncate_tuple(itupdesc, firstright, keepnatts);
+ if (BTreeTupleIsPosting(firstright))
+ {
+ BTreeTupleClearBtIsPosting(pivot);
+ BTreeTupleSetNAtts(pivot, keepnatts);
+ if (keepnatts == natts)
+ {
+ /*
+ * index_truncate_tuple() just returned a copy of the
+ * original, so make sure that the size of the new pivot tuple
+ * doesn't have posting list overhead
+ */
+ pivot->t_info &= ~INDEX_SIZE_MASK;
+ pivot->t_info |= MAXALIGN(BTreeTupleGetPostingOffset(firstright));
+ }
+ }
+
+ Assert(!BTreeTupleIsPosting(pivot));
+
/*
* If there is a distinguishing key attribute within new pivot tuple,
* there is no need to add an explicit heap TID attribute
@@ -2156,6 +2210,8 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
* attribute to the new pivot tuple.
*/
Assert(natts != nkeyatts);
+ Assert(!BTreeTupleIsPosting(lastleft) &&
+ !BTreeTupleIsPosting(firstright));
newsize = IndexTupleSize(pivot) + MAXALIGN(sizeof(ItemPointerData));
tidpivot = palloc0(newsize);
memcpy(tidpivot, pivot, IndexTupleSize(pivot));
@@ -2163,6 +2219,24 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
pfree(pivot);
pivot = tidpivot;
}
+ else if (BTreeTupleIsPosting(firstright))
+ {
+ /*
+ * No truncation was possible, since key attributes are all equal. We
+ * can always truncate away a posting list, though.
+ *
+ * It's necessary to add a heap TID attribute to the new pivot tuple.
+ */
+ newsize = MAXALIGN(BTreeTupleGetPostingOffset(firstright)) +
+ MAXALIGN(sizeof(ItemPointerData));
+ pivot = palloc0(newsize);
+ memcpy(pivot, firstright, BTreeTupleGetPostingOffset(firstright));
+
+ pivot->t_info &= ~INDEX_SIZE_MASK;
+ pivot->t_info |= newsize;
+ BTreeTupleClearBtIsPosting(pivot);
+ BTreeTupleSetAltHeapTID(pivot);
+ }
else
{
/*
@@ -2170,7 +2244,8 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
* It's necessary to add a heap TID attribute to the new pivot tuple.
*/
Assert(natts == nkeyatts);
- newsize = IndexTupleSize(firstright) + MAXALIGN(sizeof(ItemPointerData));
+ newsize = MAXALIGN(IndexTupleSize(firstright)) +
+ MAXALIGN(sizeof(ItemPointerData));
pivot = palloc0(newsize);
memcpy(pivot, firstright, IndexTupleSize(firstright));
}
@@ -2188,6 +2263,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
* nbtree (e.g., there is no pg_attribute entry).
*/
Assert(itup_key->heapkeyspace);
+ Assert(!BTreeTupleIsPosting(pivot));
pivot->t_info &= ~INDEX_SIZE_MASK;
pivot->t_info |= newsize;
@@ -2200,7 +2276,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
*/
pivotheaptid = (ItemPointer) ((char *) pivot + newsize -
sizeof(ItemPointerData));
- ItemPointerCopy(&lastleft->t_tid, pivotheaptid);
+ ItemPointerCopy(BTreeTupleGetMaxTID(lastleft), pivotheaptid);
/*
* Lehman and Yao require that the downlink to the right page, which is to
@@ -2211,9 +2287,12 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
* tiebreaker.
*/
#ifndef DEBUG_NO_TRUNCATE
- Assert(ItemPointerCompare(&lastleft->t_tid, &firstright->t_tid) < 0);
- Assert(ItemPointerCompare(pivotheaptid, &lastleft->t_tid) >= 0);
- Assert(ItemPointerCompare(pivotheaptid, &firstright->t_tid) < 0);
+ Assert(ItemPointerCompare(BTreeTupleGetMaxTID(lastleft),
+ BTreeTupleGetHeapTID(firstright)) < 0);
+ Assert(ItemPointerCompare(pivotheaptid,
+ BTreeTupleGetHeapTID(lastleft)) >= 0);
+ Assert(ItemPointerCompare(pivotheaptid,
+ BTreeTupleGetHeapTID(firstright)) < 0);
#else
/*
@@ -2226,7 +2305,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
* attribute values along with lastleft's heap TID value when lastleft's
* TID happens to be greater than firstright's TID.
*/
- ItemPointerCopy(&firstright->t_tid, pivotheaptid);
+ ItemPointerCopy(BTreeTupleGetHeapTID(firstright), pivotheaptid);
/*
* Pivot heap TID should never be fully equal to firstright. Note that
@@ -2235,7 +2314,8 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
*/
ItemPointerSetOffsetNumber(pivotheaptid,
OffsetNumberPrev(ItemPointerGetOffsetNumber(pivotheaptid)));
- Assert(ItemPointerCompare(pivotheaptid, &firstright->t_tid) < 0);
+ Assert(ItemPointerCompare(pivotheaptid,
+ BTreeTupleGetHeapTID(firstright)) < 0);
#endif
BTreeTupleSetNAtts(pivot, nkeyatts);
@@ -2316,15 +2396,25 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
* The approach taken here usually provides the same answer as _bt_keep_natts
* will (for the same pair of tuples from a heapkeyspace index), since the
* majority of btree opclasses can never indicate that two datums are equal
- * unless they're bitwise equal (once detoasted). Similarly, result may
- * differ from the _bt_keep_natts result when either tuple has TOASTed datums,
- * though this is barely possible in practice.
+ * unless they're bitwise equal after detoasting.
*
* These issues must be acceptable to callers, typically because they're only
* concerned about making suffix truncation as effective as possible without
* leaving excessive amounts of free space on either side of page split.
* Callers can rely on the fact that attributes considered equal here are
* definitely also equal according to _bt_keep_natts.
+ *
+ * When an index only uses opclasses where equality is "precise", this
+ * function is guaranteed to give the same result as _bt_keep_natts(). This
+ * makes it safe to use this function to determine whether or not two tuples
+ * can be folded together into a single posting tuple. Posting list
+ * deduplication cannot be used with nondeterministic collations for this
+ * reason.
+ *
+ * FIXME: Actually invent the needed "equality-is-precise" opclass
+ * infrastructure. See dedicated -hackers thread:
+ *
+ * https://postgr.es/m/CAH2-Wzn3Ee49Gmxb7V1VJ3-AC8fWn-Fr8pfWQebHe8rYRxt5OQ@mail.gmail.com
*/
int
_bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
@@ -2349,8 +2439,38 @@ _bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
if (isNull1 != isNull2)
break;
+ /*
+ * XXX: The ideal outcome from the point of view of the posting list
+ * patch is that the definition of an opclass with "precise equality"
+ * becomes: "equality operator function must give exactly the same
+ * answer as datum_image_eq() would, provided that we aren't using a
+ * nondeterministic collation". (Nondeterministic collations are
+ * clearly not compatible with deduplication.)
+ *
+ * This will be a lot faster than actually using the authoritative
+ * insertion scankey in some cases. This approach also seems more
+ * elegant, since suffix truncation gets to follow exactly the same
+ * definition of "equal" as posting list deduplication -- there is a
+ * subtle interplay between deduplication and suffix truncation, and
+ * it would be nice to know for sure that they have exactly the same
+ * idea about what equality is.
+ *
+ * This ideal outcome still avoids problems with TOAST. We cannot
+ * repeat bugs like the amcheck bug that was fixed in bugfix commit
+ * eba775345d23d2c999bbb412ae658b6dab36e3e8. datum_image_eq()
+ * considers binary equality, though only _after_ each datum is
+ * decompressed.
+ *
+ * If this ideal solution isn't possible, then we can fall back on
+ * defining "precise equality" as: "type's output function must
+ * produce identical textual output for any two datums that compare
+ * equal when using a safe/equality-is-precise operator class (unless
+ * using a nondeterministic collation)". That would mean that we'd
+ * have to make deduplication call _bt_keep_natts() instead (or some
+ * other function that uses authoritative insertion scankey).
+ */
if (!isNull1 &&
- !datumIsEqual(datum1, datum2, att->attbyval, att->attlen))
+ !datum_image_eq(datum1, datum2, att->attbyval, att->attlen))
break;
keepnatts++;
@@ -2402,22 +2522,30 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
tupnatts = BTreeTupleGetNAtts(itup, rel);
+ /* !heapkeyspace indexes do not support deduplication */
+ if (!heapkeyspace && BTreeTupleIsPosting(itup))
+ return false;
+
+ /* INCLUDE indexes do not support deduplication */
+ if (natts != nkeyatts && BTreeTupleIsPosting(itup))
+ return false;
+
if (P_ISLEAF(opaque))
{
if (offnum >= P_FIRSTDATAKEY(opaque))
{
/*
- * Non-pivot tuples currently never use alternative heap TID
- * representation -- even those within heapkeyspace indexes
+ * Non-pivot tuple should never be explicitly marked as a pivot
+ * tuple
*/
- if ((itup->t_info & INDEX_ALT_TID_MASK) != 0)
+ if (BTreeTupleIsPivot(itup))
return false;
/*
* Leaf tuples that are not the page high key (non-pivot tuples)
* should never be truncated. (Note that tupnatts must have been
- * inferred, rather than coming from an explicit on-disk
- * representation.)
+ * inferred, even with a posting list tuple, because only pivot
+ * tuples store tupnatts directly.)
*/
return tupnatts == natts;
}
@@ -2461,12 +2589,12 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
* non-zero, or when there is no explicit representation and the
* tuple is evidently not a pre-pg_upgrade tuple.
*
- * Prior to v11, downlinks always had P_HIKEY as their offset. Use
- * that to decide if the tuple is a pre-v11 tuple.
+ * Prior to v11, downlinks always had P_HIKEY as their offset.
+ * Accept that as an alternative indication of a valid
+ * !heapkeyspace negative infinity tuple.
*/
return tupnatts == 0 ||
- ((itup->t_info & INDEX_ALT_TID_MASK) == 0 &&
- ItemPointerGetOffsetNumber(&(itup->t_tid)) == P_HIKEY);
+ ItemPointerGetOffsetNumber(&(itup->t_tid)) == P_HIKEY;
}
else
{
@@ -2492,7 +2620,11 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
* heapkeyspace index pivot tuples, regardless of whether or not there are
* non-key attributes.
*/
- if ((itup->t_info & INDEX_ALT_TID_MASK) == 0)
+ if (!BTreeTupleIsPivot(itup))
+ return false;
+
+ /* Pivot tuple should not use posting list representation (redundant) */
+ if (BTreeTupleIsPosting(itup))
return false;
/*
@@ -2562,11 +2694,85 @@ _bt_check_third_page(Relation rel, Relation heap, bool needheaptidspace,
BTMaxItemSizeNoHeapTid(page),
RelationGetRelationName(rel)),
errdetail("Index row references tuple (%u,%u) in relation \"%s\".",
- ItemPointerGetBlockNumber(&newtup->t_tid),
- ItemPointerGetOffsetNumber(&newtup->t_tid),
+ ItemPointerGetBlockNumber(BTreeTupleGetHeapTID(newtup)),
+ ItemPointerGetOffsetNumber(BTreeTupleGetHeapTID(newtup)),
RelationGetRelationName(heap)),
errhint("Values larger than 1/3 of a buffer page cannot be indexed.\n"
"Consider a function index of an MD5 hash of the value, "
"or use full text indexing."),
errtableconstraint(heap, RelationGetRelationName(rel))));
}
+
+/*
+ * Given a base tuple that supplies the key datums, and a caller-supplied
+ * "htids" array (sorted in ascending order), build a posting tuple.
+ *
+ * The base tuple may itself be a posting tuple, but only its key part is
+ * used; all ItemPointers must be passed via htids.
+ *
+ * If nhtids == 1, build a plain non-posting tuple instead. This avoids
+ * posting list storage overhead once vacuum leaves only a single TID.
+ */
+IndexTuple
+BTreeFormPostingTuple(IndexTuple tuple, ItemPointer htids, int nhtids)
+{
+ uint32 keysize,
+ newsize = 0;
+ IndexTuple itup;
+
+ /* We only need key part of the tuple */
+ if (BTreeTupleIsPosting(tuple))
+ keysize = BTreeTupleGetPostingOffset(tuple);
+ else
+ keysize = IndexTupleSize(tuple);
+
+ Assert(nhtids > 0);
+
+ /* Add space needed for posting list */
+ if (nhtids > 1)
+ newsize = SHORTALIGN(keysize) + sizeof(ItemPointerData) * nhtids;
+ else
+ newsize = keysize;
+
+ newsize = MAXALIGN(newsize);
+ itup = palloc0(newsize);
+ memcpy(itup, tuple, keysize);
+ itup->t_info &= ~INDEX_SIZE_MASK;
+ itup->t_info |= newsize;
+
+ if (nhtids > 1)
+ {
+ /* Form posting tuple, fill posting fields */
+
+ itup->t_info |= INDEX_ALT_TID_MASK;
+ BTreeSetPostingMeta(itup, nhtids, SHORTALIGN(keysize));
+ /* Copy posting list into the posting tuple */
+ memcpy(BTreeTupleGetPosting(itup), htids,
+ sizeof(ItemPointerData) * nhtids);
+
+#ifdef USE_ASSERT_CHECKING
+ {
+ /* Assert that htid array is sorted and has unique TIDs */
+ ItemPointerData last;
+ ItemPointer current;
+
+ ItemPointerCopy(BTreeTupleGetHeapTID(itup), &last);
+
+ for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
+ {
+ current = BTreeTupleGetPostingN(itup, i);
+ Assert(ItemPointerCompare(current, &last) > 0);
+ ItemPointerCopy(current, &last);
+ }
+ }
+#endif
+ }
+ else
+ {
+ /* To finish building a non-posting tuple, copy the TID from htids */
+ itup->t_info &= ~INDEX_ALT_TID_MASK;
+ ItemPointerCopy(htids, &itup->t_tid);
+ }
+
+ return itup;
+}
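
To make the size arithmetic in BTreeFormPostingTuple() easier to check, here is a minimal standalone sketch of the same calculation. It is only an illustration under stated assumptions: 8-byte MAXALIGN, 2-byte SHORTALIGN, and a 6-byte heap TID. The SKETCH_* macros and sketch_posting_tuple_size() are hypothetical names, not part of the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Assumptions for illustration only; real values come from the build */
#define SKETCH_MAXALIGN(len)   (((size_t) (len) + 7) & ~(size_t) 7)
#define SKETCH_SHORTALIGN(len) (((size_t) (len) + 1) & ~(size_t) 1)
#define SKETCH_TID_SIZE        ((size_t) 6)

/*
 * Size of the tuple built from a key part of keysize bytes and nhtids heap
 * TIDs.  With a single TID no posting list is stored at all, so the result
 * is just the aligned key part (the TID lives in the tuple header).
 */
size_t
sketch_posting_tuple_size(size_t keysize, int nhtids)
{
    size_t      newsize;

    assert(nhtids > 0);

    if (nhtids > 1)
        newsize = SKETCH_SHORTALIGN(keysize) + SKETCH_TID_SIZE * (size_t) nhtids;
    else
        newsize = keysize;

    return SKETCH_MAXALIGN(newsize);
}
```

Under these assumptions, a 16-byte key part with 10 TIDs needs 80 bytes as one posting tuple, where 10 separate 16-byte tuples would need 160 bytes before counting line pointers.
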
diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index dd5315c1aa..ae786404ba 100644
--- a/src/backend/access/nbtree/nbtxlog.c
+++ b/src/backend/access/nbtree/nbtxlog.c
@@ -181,9 +181,39 @@ btree_xlog_insert(bool isleaf, bool ismeta, XLogReaderState *record)
page = BufferGetPage(buffer);
- if (PageAddItem(page, (Item) datapos, datalen, xlrec->offnum,
- false, false) == InvalidOffsetNumber)
- elog(PANIC, "btree_xlog_insert: failed to add item");
+ if (xlrec->in_posting_offset == InvalidOffsetNumber)
+ {
+ /* Simple retail insertion */
+ if (PageAddItem(page, (Item) datapos, datalen, xlrec->offnum,
+ false, false) == InvalidOffsetNumber)
+ elog(PANIC, "btree_xlog_insert: failed to add item");
+ }
+ else
+ {
+ /* posting list split (of posting list just before new item) */
+ ItemId itemid = PageGetItemId(page, OffsetNumberPrev(xlrec->offnum));
+ IndexTuple oposting = (IndexTuple) PageGetItem(page, itemid);
+ IndexTuple newitem = (IndexTuple) datapos;
+ IndexTuple nposting;
+
+ /*
+ * Reconstruct nposting from original newitem, and make original
+ * newitem into final newitem
+ */
+ nposting = _bt_posting_split(newitem, oposting,
+ xlrec->in_posting_offset);
+ Assert(isleaf);
+ Assert(MAXALIGN(IndexTupleSize(oposting)) ==
+ MAXALIGN(IndexTupleSize(nposting)));
+
+ /* Replace existing/original posting list */
+ memcpy(oposting, nposting, MAXALIGN(IndexTupleSize(nposting)));
+
+ /* insert new item */
+ if (PageAddItem(page, (Item) newitem, MAXALIGN(IndexTupleSize(newitem)),
+ xlrec->offnum, false, false) == InvalidOffsetNumber)
+ elog(PANIC, "btree_xlog_insert: failed to add item");
+ }
PageSetLSN(page, lsn);
MarkBufferDirty(buffer);
@@ -265,20 +295,45 @@ btree_xlog_split(bool onleft, XLogReaderState *record)
BTPageOpaque lopaque = (BTPageOpaque) PageGetSpecialPointer(lpage);
OffsetNumber off;
IndexTuple newitem = NULL,
- left_hikey = NULL;
+ left_hikey = NULL,
+ nposting = NULL;
Size newitemsz = 0,
left_hikeysz = 0;
Page newlpage;
- OffsetNumber leftoff;
+ OffsetNumber leftoff,
+ replacepostingoff = InvalidOffsetNumber;
datapos = XLogRecGetBlockData(record, 0, &datalen);
- if (onleft)
+ if (onleft || xlrec->in_posting_offset)
{
newitem = (IndexTuple) datapos;
newitemsz = MAXALIGN(IndexTupleSize(newitem));
datapos += newitemsz;
datalen -= newitemsz;
+
+ /*
+ * Repeat logic implemented in _bt_insertonpg():
+ *
+ * If the new tuple is a duplicate with a heap TID that falls
+ * inside the range of an existing posting list tuple, generate
+ * new posting tuple to replace original, and update new tuple
+ * from WAL record so that it becomes the "final" newitem inserted
+ * originally.
+ */
+ if (xlrec->in_posting_offset != 0)
+ {
+ ItemId itemid = PageGetItemId(lpage, OffsetNumberPrev(xlrec->newitemoff));
+ IndexTuple oposting = (IndexTuple) PageGetItem(lpage, itemid);
+
+ nposting = _bt_posting_split(newitem, oposting,
+ xlrec->in_posting_offset);
+
+ /* Split posting list must be at offset before new item's */
+ replacepostingoff = OffsetNumberPrev(xlrec->newitemoff);
+
+ Assert(BTreeTupleGetNPosting(nposting) == BTreeTupleGetNPosting(oposting));
+ }
}
/* Extract left hikey and its size (assuming 16-bit alignment) */
@@ -304,6 +359,16 @@ btree_xlog_split(bool onleft, XLogReaderState *record)
Size itemsz;
IndexTuple item;
+ if (off == replacepostingoff)
+ {
+ if (PageAddItem(newlpage, (Item) nposting,
+ MAXALIGN(IndexTupleSize(nposting)), leftoff,
+ false, false) == InvalidOffsetNumber)
+ elog(ERROR, "failed to add new item to left page after split");
+ leftoff = OffsetNumberNext(leftoff);
+ continue;
+ }
+
/* add the new item if it was inserted on left page */
if (onleft && off == xlrec->newitemoff)
{
@@ -379,6 +444,141 @@ btree_xlog_split(bool onleft, XLogReaderState *record)
}
}
+static void
+btree_xlog_dedup(XLogReaderState *record)
+{
+ XLogRecPtr lsn = record->EndRecPtr;
+ Buffer buf;
+ Page newpage;
+ xl_btree_dedup *xlrec = (xl_btree_dedup *) XLogRecGetData(record);
+
+ if (XLogReadBufferForRedo(record, 0, &buf) == BLK_NEEDS_REDO)
+ {
+ /*
+ * Initialize a temporary empty page and copy all the items to that in
+ * item number order.
+ */
+ Page page = (Page) BufferGetPage(buf);
+ BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+ BTPageOpaque nopaque;
+ OffsetNumber offnum,
+ minoff,
+ maxoff;
+ BTDedupState *state = NULL;
+ char *data = ((char *) xlrec + SizeOfBtreeDedup);
+ dedupInterval dedup_intervals[MaxIndexTuplesPerPage];
+ int nth_interval = 0;
+ OffsetNumber interval_nitems = 0;
+
+ state = (BTDedupState *) palloc(sizeof(BTDedupState));
+
+ /* Convenience variables concerning generic limits */
+ state->maxitemsize = BTMaxItemSize(page);
+ state->maxpostingsize = 0;
+ /* Metadata about current pending posting list */
+ state->htids = NULL;
+ state->nhtids = 0;
+ state->nitems = 0;
+ state->alltupsize = 0;
+ /* Metadata about base tuple of current pending posting list */
+ state->base = NULL;
+ state->base_off = InvalidOffsetNumber;
+ state->basetupsize = 0;
+ state->n_intervals = 0;
+
+ memcpy(dedup_intervals, data,
+ xlrec->n_intervals * sizeof(dedupInterval));
+
+ /* Scan over all items to see which ones can be deduplicated */
+ minoff = P_FIRSTDATAKEY(opaque);
+ maxoff = PageGetMaxOffsetNumber(page);
+ newpage = PageGetTempPageCopySpecial(page);
+ nopaque = (BTPageOpaque) PageGetSpecialPointer(newpage);
+
+ /* Make sure that new page won't have garbage flag set */
+ nopaque->btpo_flags &= ~BTP_HAS_GARBAGE;
+
+ /* Copy High Key if any */
+ if (!P_RIGHTMOST(opaque))
+ {
+ ItemId itemid = PageGetItemId(page, P_HIKEY);
+ Size itemsz = ItemIdGetLength(itemid);
+ IndexTuple item = (IndexTuple) PageGetItem(page, itemid);
+
+ if (PageAddItem(newpage, (Item) item, itemsz, P_HIKEY,
+ false, false) == InvalidOffsetNumber)
+ elog(ERROR, "failed to add highkey during deduplication");
+ }
+
+ /* Conservatively size array */
+ state->htids = palloc(state->maxitemsize);
+
+ /*
+ * Iterate over tuples on the page to deduplicate them into posting
+ * lists and insert into new page
+ */
+ for (offnum = minoff;
+ offnum <= maxoff;
+ offnum = OffsetNumberNext(offnum))
+ {
+ ItemId itemid = PageGetItemId(page, offnum);
+ IndexTuple itup = (IndexTuple) PageGetItem(page, itemid);
+
+ Assert(!ItemIdIsDead(itemid));
+
+ if (offnum == minoff)
+ {
+ /*
+ * No previous/base tuple for first data item -- use first
+ * data item as base tuple of first pending posting list
+ */
+ _bt_dedup_start_pending(state, itup, offnum);
+ interval_nitems++;
+ }
+ else if (nth_interval < xlrec->n_intervals &&
+ state->base_off >= dedup_intervals[nth_interval].from &&
+ interval_nitems < dedup_intervals[nth_interval].nitems)
+ {
+ /*
+ * Item is a part of pending posting list that will be formed
+ * using base tuple
+ */
+ if (!_bt_dedup_save_htid(state, itup))
+ elog(ERROR, "could not add heap tid to pending posting list");
+
+ interval_nitems++;
+ }
+ else
+ {
+ /*
+ * Tuple was not equal to pending posting list tuple on
+ * primary, or BTMaxItemSize() limit was reached on primary
+ */
+ _bt_dedup_finish_pending(newpage, state);
+
+ /* reset state */
+ if (interval_nitems > 1)
+ nth_interval++;
+ interval_nitems = 0;
+
+ /* itup starts new pending posting list */
+ _bt_dedup_start_pending(state, itup, offnum);
+ interval_nitems++;
+ }
+ }
+
+ /* Handle the last item */
+ _bt_dedup_finish_pending(newpage, state);
+
+ PageRestoreTempPage(newpage, page);
+ PageSetLSN(page, lsn);
+ MarkBufferDirty(buf);
+ }
+
+ if (BufferIsValid(buf))
+ UnlockReleaseBuffer(buf);
+}
+
static void
btree_xlog_vacuum(XLogReaderState *record)
{
@@ -386,8 +586,8 @@ btree_xlog_vacuum(XLogReaderState *record)
Buffer buffer;
Page page;
BTPageOpaque opaque;
-#ifdef UNUSED
xl_btree_vacuum *xlrec = (xl_btree_vacuum *) XLogRecGetData(record);
+#ifdef UNUSED
/*
* This section of code is thought to be no longer needed, after analysis
@@ -478,14 +678,34 @@ btree_xlog_vacuum(XLogReaderState *record)
if (len > 0)
{
- OffsetNumber *unused;
- OffsetNumber *unend;
+ if (xlrec->nupdated > 0)
+ {
+ OffsetNumber *updatedoffsets;
+ IndexTuple updated;
+ Size itemsz;
- unused = (OffsetNumber *) ptr;
- unend = (OffsetNumber *) ((char *) ptr + len);
+ updatedoffsets = (OffsetNumber *)
+ (ptr + xlrec->ndeleted * sizeof(OffsetNumber));
+ updated = (IndexTuple) ((char *) updatedoffsets +
+ xlrec->nupdated * sizeof(OffsetNumber));
- if ((unend - unused) > 0)
- PageIndexMultiDelete(page, unused, unend - unused);
+ /* Handle posting tuples */
+ for (int i = 0; i < xlrec->nupdated; i++)
+ {
+ PageIndexTupleDelete(page, updatedoffsets[i]);
+
+ itemsz = MAXALIGN(IndexTupleSize(updated));
+
+ if (PageAddItem(page, (Item) updated, itemsz, updatedoffsets[i],
+ false, false) == InvalidOffsetNumber)
+ elog(PANIC, "btree_xlog_vacuum: failed to add updated posting list item");
+
+ updated = (IndexTuple) ((char *) updated + itemsz);
+ }
+ }
+
+ if (xlrec->ndeleted)
+ PageIndexMultiDelete(page, (OffsetNumber *) ptr, xlrec->ndeleted);
}
/*
@@ -838,6 +1058,9 @@ btree_redo(XLogReaderState *record)
case XLOG_BTREE_SPLIT_R:
btree_xlog_split(false, record);
break;
+ case XLOG_BTREE_DEDUP_PAGE:
+ btree_xlog_dedup(record);
+ break;
case XLOG_BTREE_VACUUM:
btree_xlog_vacuum(record);
break;
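
For reviewers of the btree_xlog_vacuum() change above: the redo code reads the block data as an array of deleted offsets, then an array of updated offsets, then the updated posting tuples back to back. The following is a minimal sketch of where each region starts, assuming a 16-bit offset number type. SketchOffsetNumber, SketchVacuumLayout and sketch_vacuum_layout() are hypothetical names used only here.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: page offset numbers are 16-bit unsigned integers */
typedef uint16_t SketchOffsetNumber;

/* Byte positions of each region, relative to the start of the block data */
typedef struct SketchVacuumLayout
{
    size_t      deleted_offsets_at;     /* ndeleted offset numbers */
    size_t      updated_offsets_at;     /* nupdated offset numbers */
    size_t      updated_tuples_at;      /* replacement posting tuples */
} SketchVacuumLayout;

SketchVacuumLayout
sketch_vacuum_layout(uint16_t ndeleted, uint16_t nupdated)
{
    SketchVacuumLayout layout;

    layout.deleted_offsets_at = 0;
    layout.updated_offsets_at =
        (size_t) ndeleted * sizeof(SketchOffsetNumber);
    layout.updated_tuples_at = layout.updated_offsets_at +
        (size_t) nupdated * sizeof(SketchOffsetNumber);

    return layout;
}
```

With 3 deleted and 2 updated items, the tuples would start 10 bytes into the block data in this sketch. Whether that start position needs extra alignment is not something the sketch tries to answer.
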
diff --git a/src/backend/access/rmgrdesc/nbtdesc.c b/src/backend/access/rmgrdesc/nbtdesc.c
index 4ee6d04a68..022cf091b1 100644
--- a/src/backend/access/rmgrdesc/nbtdesc.c
+++ b/src/backend/access/rmgrdesc/nbtdesc.c
@@ -30,7 +30,8 @@ btree_desc(StringInfo buf, XLogReaderState *record)
{
xl_btree_insert *xlrec = (xl_btree_insert *) rec;
- appendStringInfo(buf, "off %u", xlrec->offnum);
+ appendStringInfo(buf, "off %u; in_posting_offset %u",
+ xlrec->offnum, xlrec->in_posting_offset);
break;
}
case XLOG_BTREE_SPLIT_L:
@@ -38,16 +39,28 @@ btree_desc(StringInfo buf, XLogReaderState *record)
{
xl_btree_split *xlrec = (xl_btree_split *) rec;
- appendStringInfo(buf, "level %u, firstright %d, newitemoff %d",
- xlrec->level, xlrec->firstright, xlrec->newitemoff);
+ appendStringInfo(buf, "level %u, firstright %d, newitemoff %d, in_posting_offset %d",
+ xlrec->level,
+ xlrec->firstright,
+ xlrec->newitemoff,
+ xlrec->in_posting_offset);
+ break;
+ }
+ case XLOG_BTREE_DEDUP_PAGE:
+ {
+ xl_btree_dedup *xlrec = (xl_btree_dedup *) rec;
+
+ appendStringInfo(buf, "n_intervals %d", xlrec->n_intervals);
break;
}
case XLOG_BTREE_VACUUM:
{
xl_btree_vacuum *xlrec = (xl_btree_vacuum *) rec;
- appendStringInfo(buf, "lastBlockVacuumed %u",
- xlrec->lastBlockVacuumed);
+ appendStringInfo(buf, "lastBlockVacuumed %u; nupdated %u; ndeleted %u",
+ xlrec->lastBlockVacuumed,
+ xlrec->nupdated,
+ xlrec->ndeleted);
break;
}
case XLOG_BTREE_DELETE:
@@ -131,6 +144,9 @@ btree_identify(uint8 info)
case XLOG_BTREE_SPLIT_R:
id = "SPLIT_R";
break;
+ case XLOG_BTREE_DEDUP_PAGE:
+ id = "DEDUPLICATE";
+ break;
case XLOG_BTREE_VACUUM:
id = "VACUUM";
break;
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 4a80e84aa7..d0346c06c8 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -234,8 +234,7 @@ typedef struct BTMetaPageData
* t_tid | t_info | key values | INCLUDE columns, if any
*
* t_tid points to the heap TID, which is a tiebreaker key column as of
- * BTREE_VERSION 4. Currently, the INDEX_ALT_TID_MASK status bit is never
- * set for non-pivot tuples.
+ * BTREE_VERSION 4.
*
* All other types of index tuples ("pivot" tuples) only have key columns,
* since pivot tuples only exist to represent how the key space is
@@ -252,6 +251,38 @@ typedef struct BTMetaPageData
* omitted rather than truncated, since its representation is different to
* the non-pivot representation.)
*
+ * Non-pivot posting tuple format:
+ * t_tid | t_info | key values | INCLUDE columns, if any | posting_list[]
+ *
+ * To store duplicate keys more efficiently, we use a special tuple format:
+ * posting tuples. posting_list is an array of ItemPointerData.
+ *
+ * Deduplication never applies to unique indexes or indexes with INCLUDEd
+ * columns.
+ *
+ * To distinguish posting tuples we use the INDEX_ALT_TID_MASK flag in
+ * t_info and the BT_IS_POSTING flag in t_tid.
+ * These flags redefine the content of the posting tuple's t_tid:
+ * - t_tid.ip_blkid contains the offset of the posting list.
+ * - t_tid offset field contains the number of posting items in the tuple.
+ *
+ * The 12 least significant offset bits from t_tid are used to represent
+ * the number of posting items in posting tuples, leaving 4 status
+ * bits (BT_RESERVED_OFFSET_MASK bits), 3 of which are reserved for
+ * future use.
+ * BT_N_POSTING_OFFSET_MASK is large enough to store any number of posting
+ * items, which is constrained by BTMaxItemSize.
+ *
+ * If a page contains so many duplicates that they do not fit into one
+ * posting tuple (bounded by BTMaxItemSize and ), the page may contain several
+ * posting tuples with the same key.
+ * A page can also contain both posting and non-posting tuples with the same key.
+ * Currently, posting tuples always contain at least two TIDs in the posting
+ * list.
+ *
+ * Posting tuples always have the same number of attributes as the index has
+ * generally.
+ *
* Pivot tuple format:
*
* t_tid | t_info | key values | [heap TID]
@@ -281,23 +312,153 @@ typedef struct BTMetaPageData
* bits (BT_RESERVED_OFFSET_MASK bits), 3 of which that are reserved for
* future use. BT_N_KEYS_OFFSET_MASK should be large enough to store any
* number of columns/attributes <= INDEX_MAX_KEYS.
+ * BT_IS_POSTING bit must be unset for pivot tuples, since we use it
+ * to distinguish posting tuples from pivot tuples.
*
* Note well: The macros that deal with the number of attributes in tuples
- * assume that a tuple with INDEX_ALT_TID_MASK set must be a pivot tuple,
- * and that a tuple without INDEX_ALT_TID_MASK set must be a non-pivot
- * tuple (or must have the same number of attributes as the index has
- * generally in the case of !heapkeyspace indexes). They will need to be
- * updated if non-pivot tuples ever get taught to use INDEX_ALT_TID_MASK
- * for something else.
+ * assume that a tuple with INDEX_ALT_TID_MASK set must be a pivot tuple or
+ * non-pivot posting tuple, and that a tuple without INDEX_ALT_TID_MASK set
+ * must be a non-pivot tuple (or must have the same number of attributes as
+ * the index has generally in the case of !heapkeyspace indexes).
*/
#define INDEX_ALT_TID_MASK INDEX_AM_RESERVED_BIT
/* Item pointer offset bits */
#define BT_RESERVED_OFFSET_MASK 0xF000
#define BT_N_KEYS_OFFSET_MASK 0x0FFF
+#define BT_N_POSTING_OFFSET_MASK 0x0FFF
#define BT_HEAP_TID_ATTR 0x1000
+#define BT_IS_POSTING 0x2000
-/* Get/set downlink block number */
+/*
+ * MaxPostingIndexTuplesPerPage is an upper bound on the number of tuples
+ * that can fit on one btree leaf page.
+ *
+ * Btree leaf pages may contain posting tuples, which store duplicates
+ * in a more effective way, so MaxPostingIndexTuplesPerPage is larger than
+ * MaxIndexTuplesPerPage.
+ *
+ * Each leaf page must contain at least three items, so estimate it as
+ * if we have three posting tuples with minimal size keys.
+ */
+#define MaxPostingIndexTuplesPerPage \
+ ((int) ((BLCKSZ - SizeOfPageHeaderData - \
+ 3*((MAXALIGN(sizeof(IndexTupleData) + 1) + sizeof(ItemIdData))) )) / \
+ (sizeof(ItemPointerData)))
+
+/*
+ * State used to represent a pending posting list during deduplication.
+ *
+ * Each entry represents a group of consecutive items from the page, starting
+ * from page offset number 'from', which is the offset number of the "base"
+ * tuple on the page undergoing deduplication. 'nitems' is the total number
+ * of items from the page that will be merged to make a new posting tuple.
+ * (Note: nitems means the number of line pointer items -- the tuples in
+ * question may already be posting list tuples or regular tuples.)
+ */
+typedef struct dedupInterval
+{
+ OffsetNumber from;
+ OffsetNumber nitems;
+} dedupInterval;
+
+/*
+ * Btree-private state needed to build posting tuples. htids is an array of
+ * ItemPointers for pending posting list.
+ *
+ * Iterating over tuples during index build or applying deduplication to a
+ * single page, we remember a "base" tuple, then compare the next one with it.
+ * If tuples are equal, save their TIDs in the posting list.
+ */
+typedef struct BTDedupState
+{
+ /* Convenience variables concerning generic limits */
+ Size maxitemsize; /* BTMaxItemSize() limit for page */
+ Size maxpostingsize;
+
+ /* Metadata about current pending posting list */
+ ItemPointer htids; /* Heap TIDs in pending posting list */
+ int nhtids; /* # valid heap TIDs in htids array */
+ int nitems; /* See dedupInterval definition */
+ Size alltupsize; /* Includes line pointer overhead */
+
+ /* Metadata about base tuple of current pending posting list */
+ IndexTuple base; /* Use to form new posting list */
+ OffsetNumber base_off; /* original page offset of base */
+ Size basetupsize; /* Excludes line pointer overhead */
+
+
+ /*
+ * array with info about deduplicated items on the page. Current array
+ * size is n_intervals.
+ *
+ * It contains one entry for each group of consecutive items that were
+ * deduplicated into a single posting tuple.
+ *
+ * This array is saved in the xlog record, which allows REDO to replay
+ * deduplication faster, without actually comparing tuple keys.
+ */
+ int n_intervals;
+ dedupInterval dedup_intervals[MaxIndexTuplesPerPage];
+} BTDedupState;
+
+/*
+ * N.B.: BTreeTupleIsPivot() should only be used in code that deals with
+ * heapkeyspace indexes specifically. BTreeTupleIsPosting() works with all
+ * nbtree indexes, though.
+ */
+#define BTreeTupleIsPivot(itup) \
+ ( \
+ ((itup)->t_info & INDEX_ALT_TID_MASK && \
+ ((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) == 0))\
+ )
+#define BTreeTupleIsPosting(itup) \
+ ( \
+ ((itup)->t_info & INDEX_ALT_TID_MASK && \
+ ((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) != 0))\
+ )
+
+#define BTreeTupleClearBtIsPosting(itup) \
+ do { \
+ ItemPointerSetOffsetNumber(&(itup)->t_tid, \
+ ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & ~BT_IS_POSTING); \
+ } while(0)
+
+#define BTreeTupleGetNPosting(itup) \
+ ( \
+ AssertMacro(BTreeTupleIsPosting(itup)), \
+ ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_N_POSTING_OFFSET_MASK \
+ )
+#define BTreeTupleSetNPosting(itup, n) \
+ do { \
+ ItemPointerSetOffsetNumber(&(itup)->t_tid, (n) & BT_N_POSTING_OFFSET_MASK); \
+ Assert((itup)->t_info & INDEX_ALT_TID_MASK); \
+ Assert(!((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) != 0)); \
+ ItemPointerSetOffsetNumber(&(itup)->t_tid, \
+ ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) | BT_IS_POSTING); \
+ } while(0)
+
+/*
+ * If tuple is posting, t_tid.ip_blkid contains offset of the posting list
+ */
+#define BTreeTupleGetPostingOffset(itup) \
+ ( \
+ AssertMacro(BTreeTupleIsPosting(itup)), \
+ ItemPointerGetBlockNumberNoCheck(&((itup)->t_tid)) \
+ )
+#define BTreeSetPostingMeta(itup, nposting, off) \
+ do { \
+ BTreeTupleSetNPosting(itup, nposting); \
+ Assert(BTreeTupleIsPosting(itup)); \
+ ItemPointerSetBlockNumber(&((itup)->t_tid), (off)); \
+ } while(0)
+
+#define BTreeTupleGetPosting(itup) \
+ (ItemPointer) ((char*) (itup) + BTreeTupleGetPostingOffset(itup))
+#define BTreeTupleGetPostingN(itup,n) \
+ (BTreeTupleGetPosting(itup) + (n))
+
+/* Get/set downlink block number */
#define BTreeInnerTupleGetDownLink(itup) \
ItemPointerGetBlockNumberNoCheck(&((itup)->t_tid))
#define BTreeInnerTupleSetDownLink(itup, blkno) \
@@ -326,40 +487,73 @@ typedef struct BTMetaPageData
*/
#define BTreeTupleGetNAtts(itup, rel) \
( \
- (itup)->t_info & INDEX_ALT_TID_MASK ? \
+ ((itup)->t_info & INDEX_ALT_TID_MASK && \
+ ((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) == 0)) ? \
( \
ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_N_KEYS_OFFSET_MASK \
) \
: \
IndexRelationGetNumberOfAttributes(rel) \
)
-#define BTreeTupleSetNAtts(itup, n) \
- do { \
- (itup)->t_info |= INDEX_ALT_TID_MASK; \
- ItemPointerSetOffsetNumber(&(itup)->t_tid, (n) & BT_N_KEYS_OFFSET_MASK); \
- } while(0)
+
+static inline void
+BTreeTupleSetNAtts(IndexTuple itup, int n)
+{
+ Assert(!BTreeTupleIsPosting(itup));
+ itup->t_info |= INDEX_ALT_TID_MASK;
+ ItemPointerSetOffsetNumber(&itup->t_tid, n & BT_N_KEYS_OFFSET_MASK);
+}
/*
- * Get tiebreaker heap TID attribute, if any. Macro works with both pivot
- * and non-pivot tuples, despite differences in how heap TID is represented.
+ * Get tiebreaker heap TID attribute, if any. Works with both pivot and
+ * non-pivot tuples, despite differences in how heap TID is represented.
+ *
+ * This returns the first/lowest heap TID in the case of a posting list tuple.
*/
-#define BTreeTupleGetHeapTID(itup) \
- ( \
- (itup)->t_info & INDEX_ALT_TID_MASK && \
- (ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_HEAP_TID_ATTR) != 0 ? \
- ( \
- (ItemPointer) (((char *) (itup) + IndexTupleSize(itup)) - \
- sizeof(ItemPointerData)) \
- ) \
- : (itup)->t_info & INDEX_ALT_TID_MASK ? NULL : (ItemPointer) &((itup)->t_tid) \
- )
+static inline ItemPointer
+BTreeTupleGetHeapTID(IndexTuple itup)
+{
+ if (BTreeTupleIsPivot(itup))
+ {
+ /* Pivot tuple heap TID representation? */
+ if ((ItemPointerGetOffsetNumberNoCheck(&itup->t_tid) &
+ BT_HEAP_TID_ATTR) != 0)
+ return (ItemPointer) ((char *) itup + IndexTupleSize(itup) -
+ sizeof(ItemPointerData));
+
+ /* Heap TID attribute was truncated */
+ return NULL;
+ }
+ else if (BTreeTupleIsPosting(itup))
+ return BTreeTupleGetPosting(itup);
+
+ return &(itup->t_tid);
+}
+
+/*
+ * Get maximum heap TID attribute, which could be the only TID in the case of
+ * a non-pivot tuple that does not have a posting list. Works with
+ * non-pivot tuples only.
+ */
+static inline ItemPointer
+BTreeTupleGetMaxTID(IndexTuple itup)
+{
+ Assert(!BTreeTupleIsPivot(itup));
+
+ if (BTreeTupleIsPosting(itup))
+ return (ItemPointer) (BTreeTupleGetPosting(itup) +
+ (BTreeTupleGetNPosting(itup) - 1));
+
+ return &(itup->t_tid);
+}
+
/*
* Set the heap TID attribute for a tuple that uses the INDEX_ALT_TID_MASK
- * representation (currently limited to pivot tuples)
+ * representation
*/
#define BTreeTupleSetAltHeapTID(itup) \
do { \
- Assert((itup)->t_info & INDEX_ALT_TID_MASK); \
+ Assert(BTreeTupleIsPivot(itup)); \
ItemPointerSetOffsetNumber(&(itup)->t_tid, \
ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) | BT_HEAP_TID_ATTR); \
} while(0)
@@ -499,6 +693,13 @@ typedef struct BTInsertStateData
/* Buffer containing leaf page we're likely to insert itup on */
Buffer buf;
+ /*
+ * if _bt_binsrch_insert() found the location inside existing posting
+ * list, save the position inside the list. This will be -1 in rare cases
+ * where the overlapping posting list is LP_DEAD.
+ */
+ int in_posting_offset;
+
/*
* Cache of bounds within the current buffer. Only used for insertions
* where _bt_check_unique is called. See _bt_binsrch_insert and
@@ -534,7 +735,9 @@ typedef BTInsertStateData *BTInsertState;
* If we are doing an index-only scan, we save the entire IndexTuple for each
* matched item, otherwise only its heap TID and offset. The IndexTuples go
* into a separate workspace array; each BTScanPosItem stores its tuple's
- * offset within that array.
+ * offset within that array. Posting list tuples store a version of the
+ * tuple that does not include the posting list, allowing the same key to be
+ * returned for each logical tuple associated with the posting list.
*/
typedef struct BTScanPosItem /* what we remember about each match */
@@ -563,9 +766,13 @@ typedef struct BTScanPosData
/*
* If we are doing an index-only scan, nextTupleOffset is the first free
- * location in the associated tuple storage workspace.
+ * location in the associated tuple storage workspace. Posting list
+ * tuples need postingTupleOffset to store the current location of the
+ * tuple that is returned multiple times (once per heap TID in posting
+ * list).
*/
int nextTupleOffset;
+ int postingTupleOffset;
/*
* The items array is always ordered in index order (ie, increasing
@@ -578,7 +785,7 @@ typedef struct BTScanPosData
int lastItem; /* last valid index in items[] */
int itemIndex; /* current index in items[] */
- BTScanPosItem items[MaxIndexTuplesPerPage]; /* MUST BE LAST */
+ BTScanPosItem items[MaxPostingIndexTuplesPerPage]; /* MUST BE LAST */
} BTScanPosData;
typedef BTScanPosData *BTScanPos;
@@ -730,8 +937,14 @@ extern void _bt_parallel_advance_array_keys(IndexScanDesc scan);
*/
extern bool _bt_doinsert(Relation rel, IndexTuple itup,
IndexUniqueCheck checkUnique, Relation heapRel);
+extern IndexTuple _bt_posting_split(IndexTuple newitem, IndexTuple oposting,
+ OffsetNumber in_posting_offset);
extern void _bt_finish_split(Relation rel, Buffer bbuf, BTStack stack);
extern Buffer _bt_getstackbuf(Relation rel, BTStack stack, BlockNumber child);
+extern bool _bt_dedup_save_htid(BTDedupState *state, IndexTuple itup);
+extern void _bt_dedup_start_pending(BTDedupState *state, IndexTuple base,
+ OffsetNumber base_off);
+extern Size _bt_dedup_finish_pending(Page page, BTDedupState *state);
/*
* prototypes for functions in nbtsplitloc.c
@@ -762,6 +975,8 @@ extern void _bt_delitems_delete(Relation rel, Buffer buf,
OffsetNumber *itemnos, int nitems, Relation heapRel);
extern void _bt_delitems_vacuum(Relation rel, Buffer buf,
OffsetNumber *itemnos, int nitems,
+ OffsetNumber *updateitemnos,
+ IndexTuple *updated, int nupdateable,
BlockNumber lastBlockVacuumed);
extern int _bt_pagedel(Relation rel, Buffer buf);
@@ -812,6 +1027,8 @@ extern bool _bt_check_natts(Relation rel, bool heapkeyspace, Page page,
OffsetNumber offnum);
extern void _bt_check_third_page(Relation rel, Relation heap,
bool needheaptidspace, Page page, IndexTuple newtup);
+extern IndexTuple BTreeFormPostingTuple(IndexTuple tuple, ItemPointer htids,
+ int nhtids);
/*
* prototypes for functions in nbtvalidate.c
diff --git a/src/include/access/nbtxlog.h b/src/include/access/nbtxlog.h
index 91b9ee00cf..761073ada5 100644
--- a/src/include/access/nbtxlog.h
+++ b/src/include/access/nbtxlog.h
@@ -28,7 +28,8 @@
#define XLOG_BTREE_INSERT_META 0x20 /* same, plus update metapage */
#define XLOG_BTREE_SPLIT_L 0x30 /* add index tuple with split */
#define XLOG_BTREE_SPLIT_R 0x40 /* as above, new item on right */
-/* 0x50 and 0x60 are unused */
+#define XLOG_BTREE_DEDUP_PAGE 0x50 /* deduplicate tuples on leaf page */
+/* 0x60 is unused */
#define XLOG_BTREE_DELETE 0x70 /* delete leaf index tuples for a page */
#define XLOG_BTREE_UNLINK_PAGE 0x80 /* delete a half-dead page */
#define XLOG_BTREE_UNLINK_PAGE_META 0x90 /* same, and update metapage */
@@ -61,16 +62,21 @@ typedef struct xl_btree_metadata
* This data record is used for INSERT_LEAF, INSERT_UPPER, INSERT_META.
* Note that INSERT_META implies it's not a leaf page.
*
- * Backup Blk 0: original page (data contains the inserted tuple)
+ * Backup Blk 0: original page (data contains the inserted tuple);
+ * if in_posting_offset is set, this started out as an
+ * insertion into an existing posting tuple at the
+ * offset before offnum (i.e. it's a posting list split).
+ * (REDO will have to update split posting list, too.)
* Backup Blk 1: child's left sibling, if INSERT_UPPER or INSERT_META
* Backup Blk 2: xl_btree_metadata, if INSERT_META
*/
typedef struct xl_btree_insert
{
OffsetNumber offnum;
+ OffsetNumber in_posting_offset;
} xl_btree_insert;
-#define SizeOfBtreeInsert (offsetof(xl_btree_insert, offnum) + sizeof(OffsetNumber))
+#define SizeOfBtreeInsert (offsetof(xl_btree_insert, in_posting_offset) + sizeof(OffsetNumber))
/*
* On insert with split, we save all the items going into the right sibling
@@ -95,6 +101,13 @@ typedef struct xl_btree_insert
* An IndexTuple representing the high key of the left page must follow with
* either variant.
*
+ * The newitem is actually an "original" newitem when a posting list split
+ * occurs that happens to result in a page split. REDO recognizes this case
+ * when in_posting_offset is set, and must use the posting offset to do an
+ * in-place update of the existing posting list that was actually split, and
+ * change the newitem to the "final" newitem. This corresponds to the
+ * xl_btree_insert in_posting_offset-set case.
+ *
* Backup Blk 1: new right page
*
* The right page's data portion contains the right page's tuples in the form
@@ -112,9 +125,26 @@ typedef struct xl_btree_split
uint32 level; /* tree level of page being split */
OffsetNumber firstright; /* first item moved to right page */
OffsetNumber newitemoff; /* new item's offset (useful for _L variant) */
+ OffsetNumber in_posting_offset; /* offset inside orig posting tuple */
} xl_btree_split;
-#define SizeOfBtreeSplit (offsetof(xl_btree_split, newitemoff) + sizeof(OffsetNumber))
+#define SizeOfBtreeSplit (offsetof(xl_btree_split, in_posting_offset) + sizeof(OffsetNumber))
+
+/*
+ * When page is deduplicated, consecutive groups of tuples with equal keys are
+ * merged together into posting list tuples.
+ *
+ * The WAL record represents the number of posting tuples that should be added
+ * to the page using n_intervals. An array of dedupInterval structs follows.
+ */
+typedef struct xl_btree_dedup
+{
+ int n_intervals;
+
+ /* TARGET DEDUP INTERVALS FOLLOW AT THE END */
+} xl_btree_dedup;
+
+#define SizeOfBtreeDedup (offsetof(xl_btree_dedup, n_intervals) + sizeof(int))
/*
* This is what we need to know about delete of individual leaf index tuples.
@@ -166,16 +196,27 @@ typedef struct xl_btree_reuse_page
* block numbers aren't given.
*
* Note that the *last* WAL record in any vacuum of an index is allowed to
- * have a zero length array of offsets. Earlier records must have at least one.
+ * have a zero length array of target offsets (i.e. no deletes or updates).
+ * Earlier records must have at least one.
*/
typedef struct xl_btree_vacuum
{
BlockNumber lastBlockVacuumed;
- /* TARGET OFFSET NUMBERS FOLLOW */
+ /*
+ * This field helps us to find beginning of the updated versions of tuples
+ * which follow array of offset numbers, needed when a posting list is
+ * vacuumed without killing all of its logical tuples.
+ */
+ uint32 nupdated;
+ uint32 ndeleted;
+
+ /* UPDATED TARGET OFFSET NUMBERS FOLLOW (if any) */
+ /* UPDATED TUPLES TO ADD BACK FOLLOW (if any) */
+ /* DELETED TARGET OFFSET NUMBERS FOLLOW (if any) */
} xl_btree_vacuum;
-#define SizeOfBtreeVacuum (offsetof(xl_btree_vacuum, lastBlockVacuumed) + sizeof(BlockNumber))
+#define SizeOfBtreeVacuum (offsetof(xl_btree_vacuum, ndeleted) + sizeof(uint32))
/*
* This is what we need to know about marking an empty branch for deletion.
diff --git a/src/tools/valgrind.supp b/src/tools/valgrind.supp
index ec47a228ae..71a03e39d3 100644
--- a/src/tools/valgrind.supp
+++ b/src/tools/valgrind.supp
@@ -212,3 +212,24 @@
Memcheck:Cond
fun:PyObject_Realloc
}
+
+# Temporarily work around bug in datum_image_eq's handling of the cstring
+# (typLen == -2) case. datumIsEqual() is not affected, but also doesn't handle
+# TOAST'ed values correctly.
+#
+# FIXME: Remove both suppressions when bug is fixed on master branch
+{
+ temporary_workaround_1
+ Memcheck:Addr1
+ fun:bcmp
+ fun:datum_image_eq
+ fun:_bt_keep_natts_fast
+}
+
+{
+ temporary_workaround_8
+ Memcheck:Addr8
+ fun:bcmp
+ fun:datum_image_eq
+ fun:_bt_keep_natts_fast
+}
--
2.17.1
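For readers following the nbtree.h changes in the patch above, the way a posting tuple's item count and its status flags share the 16-bit t_tid offset field can be sketched in isolation. This is a standalone illustration rather than code from the patch: the mask values are copied from the diff, but the demo_* helpers are hypothetical names, and a plain uint16_t stands in for the offset stored in ItemPointerData.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Mask values copied from the patch's nbtree.h changes */
#define BT_RESERVED_OFFSET_MASK		0xF000
#define BT_N_POSTING_OFFSET_MASK	0x0FFF
#define BT_HEAP_TID_ATTR			0x1000
#define BT_IS_POSTING				0x2000

/*
 * Hypothetical helper: build the offset field of a posting tuple.  The low
 * 12 bits carry the number of posting items; BT_IS_POSTING marks the tuple
 * as a posting tuple rather than a pivot tuple.
 */
uint16_t
demo_pack_posting_offset(unsigned nposting)
{
	/* The count must fit in the 12 bits left over by the status bits */
	assert(nposting <= BT_N_POSTING_OFFSET_MASK);
	return (uint16_t) ((nposting & BT_N_POSTING_OFFSET_MASK) | BT_IS_POSTING);
}

/* Hypothetical helper: does this offset field belong to a posting tuple? */
bool
demo_is_posting(uint16_t off)
{
	return (off & BT_IS_POSTING) != 0;
}

/* Hypothetical helper: extract the number of posting items */
unsigned
demo_posting_count(uint16_t off)
{
	return off & BT_N_POSTING_OFFSET_MASK;
}
```

In the patch itself the same test is additionally guarded by INDEX_ALT_TID_MASK in t_info, which this sketch leaves out.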
On Wed, Sep 18, 2019 at 7:25 PM Peter Geoghegan <pg@bowt.ie> wrote:
> I attach version 16. This revision merges your recent work on WAL
> logging with my recent work on simplifying _bt_dedup_one_page(). See
> my e-mail from earlier today for details.
I attach version 17. This version has changes that are focussed on
further polishing certain things, including fixing some minor bugs. It
seemed worth creating a new version for that. (I didn't get very far
with the space utilization stuff I talked about, so no changes there.)
Changes in v17:
* nbtsort.c now has a loop structure that closely matches
_bt_dedup_one_page() (I put this off in v16).
We now reuse most of the nbtinsert.c deduplication routines.
* Further simplification of btree_xlog_dedup() loop.
Recovery no longer relies on local variables to track the progress of
deduplication -- it uses dedup state (the state managed by
nbtinsert.c's dedup routines) instead. This is easier to follow.
* Reworked _bt_split() comments on posting list splits that coincide
with page splits.
* Fixed memory leaks in recovery code by creating a dedicated memory
context that gets reset regularly. The context is created in a new rmgr
"startup" callback I created for the B-Tree rmgr. We already do this
for both GIN and GiST.
More specifically, the REDO code calls MemoryContextReset() against
its dedicated memory context after every record is processed by REDO,
no matter what. The MemoryContextReset() call usually won't have to
actually free anything, but that's okay because the no-free case does
almost no work. I think that it makes sense to keep things as simple
as possible for memory management during recovery -- it's too easy for
a new memory leak to get introduced when a small change is made to the
nbtinsert.c routines later on.
* Optimize VACUUMing of posting lists: we now only allocate memory for
an array of still-live posting list items when the array will actually
be needed. It is only needed when there are tuples to remove from the
posting list, because only then do we need to create a replacement
posting list that lacks the heap TIDs that VACUUM needs to delete.
It seemed like a really good idea to not allocate any memory in the
common case where VACUUM doesn't need to change a posting list tuple
at all. ginVacuumItemPointers() has exactly the same optimization.
* Fixed an accounting bug in the output of VACUUM VERBOSE by changing
some code in nbtree.c.
The tuples_removed and num_index_tuples fields in
IndexBulkDeleteResult are reported as "index row versions" by VACUUM
VERBOSE. Everything but the index pages stat works at the level of
"index row versions", which should not be affected by the
deduplication patch. Of course, deduplication only changes the
physical representation of items in the index -- never the logical
contents of the index. This is what GIN does already.
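The VACUUM optimization in the list above (allocate the replacement array only once a dead TID is actually found) can be sketched roughly as follows. This is a simplified standalone illustration, not the patch's code: DemoTid stands in for ItemPointerData, malloc() stands in for palloc(), and demo_vacuum_posting() and demo_dead_if_block_zero() are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Simplified stand-in for a heap TID; not PostgreSQL's ItemPointerData */
typedef struct DemoTid
{
	unsigned	block;
	unsigned	offset;
} DemoTid;

/* Hypothetical callback type: true when VACUUM must remove this TID */
typedef bool (*demo_dead_fn) (const DemoTid *tid);

/* Example predicate used for illustration only: block 0 counts as dead */
bool
demo_dead_if_block_zero(const DemoTid *tid)
{
	return tid->block == 0;
}

/*
 * Scan a posting list.  Returns NULL, without allocating anything, when
 * every TID is still live.  Otherwise returns a malloc'd array holding the
 * surviving TIDs.  *nlive is set to the number of live TIDs in both cases.
 */
DemoTid *
demo_vacuum_posting(const DemoTid *tids, int ntids, demo_dead_fn is_dead,
					int *nlive)
{
	DemoTid    *live = NULL;
	int			n = 0;

	for (int i = 0; i < ntids; i++)
	{
		if (!is_dead(&tids[i]))
		{
			/* Only copy once we know a replacement list is needed */
			if (live != NULL)
				live[n] = tids[i];
			n++;
		}
		else if (live == NULL)
		{
			/* First dead TID: allocate now and copy the live prefix */
			live = malloc(sizeof(DemoTid) * (size_t) ntids);
			if (live == NULL)
				abort();
			memcpy(live, tids, sizeof(DemoTid) * (size_t) n);
		}
	}

	*nlive = n;
	return live;
}
```

A NULL result with *nlive equal to the original count means the posting list tuple can be left untouched; a non-NULL result with *nlive equal to zero means the whole tuple can be deleted.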
Another infrastructure thing that the patch needs to handle to be committable:
We still haven't added an "off" switch to deduplication, which seems
necessary. I suppose that this should look like GIN's "fastupdate"
storage parameter. It's not obvious how to do this in a way that's
easy to work with, though. Maybe we could do something like copy GIN's
GinGetUseFastUpdate() macro, but the situation with nbtree is actually
quite different. There are two questions for nbtree when it comes to
deduplication within an index: 1) Does the user want to use
deduplication, because that will help performance?, and 2) Is it
safe/possible to use deduplication at all?
I think that we should probably stash this information (deduplication
is both possible and safe) in the metapage. Maybe we can copy it over
to our insertion scankey, just like the "heapkeyspace" field -- that
information also comes from the metapage (it's based on the nbtree
version). The "heapkeyspace" field is a bit ugly, so maybe we
shouldn't go further by adding something similar, but I don't see any
great alternative right now.
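One way to picture the idea is to treat the two questions as independent inputs that are combined once, when the insertion scankey is set up. The sketch below is only an illustration of that shape under assumed names: DemoMetaPage, DemoIndexOptions, DemoInsertKey, and demo_init_insert_key() are all hypothetical and do not correspond to existing nbtree structures or to anything in the patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for metapage contents; field names are assumed */
typedef struct DemoMetaPage
{
	bool		heapkeyspace;	/* derived from the nbtree version */
	bool		dedup_is_safe;	/* question 2: is deduplication possible? */
} DemoMetaPage;

/* Hypothetical stand-in for a per-index storage parameter */
typedef struct DemoIndexOptions
{
	bool		dedup_enabled;	/* question 1: does the user want it? */
} DemoIndexOptions;

/* Hypothetical stand-in for the insertion scankey */
typedef struct DemoInsertKey
{
	bool		heapkeyspace;
	bool		use_dedup;
} DemoInsertKey;

/*
 * Copy metapage-derived facts into the scankey, and decide once whether
 * deduplication should be attempted: only when it is both safe for this
 * index and requested by the user.
 */
void
demo_init_insert_key(DemoInsertKey *key, const DemoMetaPage *meta,
					 const DemoIndexOptions *opts)
{
	key->heapkeyspace = meta->heapkeyspace;
	key->use_dedup = meta->dedup_is_safe && opts->dedup_enabled;
}
```

Keeping the safety flag in the metapage and the preference in a storage parameter would let the "off" switch change without touching the on-disk safety information, though whether that split is the right one is exactly the open question here.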
--
Peter Geoghegan
Attachments:
v17-0001-Add-deduplication-to-nbtree.patch (application/octet-stream)
From 2e0ae900205fa421efabf2854d27e0810c3adf61 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Thu, 29 Aug 2019 14:35:35 -0700
Subject: [PATCH v17 1/4] Add deduplication to nbtree.
---
contrib/amcheck/verify_nbtree.c | 164 +++++-
src/backend/access/index/genam.c | 4 +
src/backend/access/nbtree/README | 74 ++-
src/backend/access/nbtree/nbtinsert.c | 751 +++++++++++++++++++++++-
src/backend/access/nbtree/nbtpage.c | 148 ++++-
src/backend/access/nbtree/nbtree.c | 168 +++++-
src/backend/access/nbtree/nbtsearch.c | 242 +++++++-
src/backend/access/nbtree/nbtsort.c | 138 ++++-
src/backend/access/nbtree/nbtsplitloc.c | 47 +-
src/backend/access/nbtree/nbtutils.c | 264 ++++++++-
src/backend/access/nbtree/nbtxlog.c | 268 ++++++++-
src/backend/access/rmgrdesc/nbtdesc.c | 26 +-
src/include/access/nbtree.h | 278 ++++++++-
src/include/access/nbtxlog.h | 68 ++-
src/include/access/rmgrlist.h | 2 +-
src/tools/valgrind.supp | 21 +
16 files changed, 2489 insertions(+), 174 deletions(-)
diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index 05e7d678ed..d65e2a76eb 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -145,6 +145,7 @@ static void bt_tuple_present_callback(Relation index, HeapTuple htup,
bool tupleIsAlive, void *checkstate);
static IndexTuple bt_normalize_tuple(BtreeCheckState *state,
IndexTuple itup);
+static inline IndexTuple bt_posting_logical_tuple(IndexTuple itup, int n);
static bool bt_rootdescend(BtreeCheckState *state, IndexTuple itup);
static inline bool offset_is_negative_infinity(BTPageOpaque opaque,
OffsetNumber offset);
@@ -419,12 +420,13 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
/*
* Size Bloom filter based on estimated number of tuples in index,
* while conservatively assuming that each block must contain at least
- * MaxIndexTuplesPerPage / 5 non-pivot tuples. (Non-leaf pages cannot
- * contain non-pivot tuples. That's okay because they generally make
- * up no more than about 1% of all pages in the index.)
+ * MaxPostingIndexTuplesPerPage / 3 "logical" tuples. heapallindexed
+ * verification fingerprints posting list heap TIDs as plain non-pivot
+ * tuples, complete with index keys. This allows its heap scan to
+ * behave as if posting lists do not exist.
*/
total_pages = RelationGetNumberOfBlocks(rel);
- total_elems = Max(total_pages * (MaxIndexTuplesPerPage / 5),
+ total_elems = Max(total_pages * (MaxPostingIndexTuplesPerPage / 3),
(int64) state->rel->rd_rel->reltuples);
/* Random seed relies on backend srandom() call to avoid repetition */
seed = random();
@@ -924,6 +926,7 @@ bt_target_page_check(BtreeCheckState *state)
size_t tupsize;
BTScanInsert skey;
bool lowersizelimit;
+ ItemPointer scantid;
CHECK_FOR_INTERRUPTS();
@@ -994,29 +997,73 @@ bt_target_page_check(BtreeCheckState *state)
/*
* Readonly callers may optionally verify that non-pivot tuples can
- * each be found by an independent search that starts from the root
+ * each be found by an independent search that starts from the root.
+ * Note that we deliberately don't do individual searches for each
+ * "logical" posting list tuple, since the posting list itself is
+ * validated by other checks.
*/
if (state->rootdescend && P_ISLEAF(topaque) &&
!bt_rootdescend(state, itup))
{
char *itid,
*htid;
+ ItemPointer tid = BTreeTupleGetHeapTID(itup);
itid = psprintf("(%u,%u)", state->targetblock, offset);
htid = psprintf("(%u,%u)",
- ItemPointerGetBlockNumber(&(itup->t_tid)),
- ItemPointerGetOffsetNumber(&(itup->t_tid)));
+ ItemPointerGetBlockNumber(tid),
+ ItemPointerGetOffsetNumber(tid));
ereport(ERROR,
(errcode(ERRCODE_INDEX_CORRUPTED),
errmsg("could not find tuple using search from root page in index \"%s\"",
RelationGetRelationName(state->rel)),
- errdetail_internal("Index tid=%s points to heap tid=%s page lsn=%X/%X.",
+ errdetail_internal("Index tid=%s min heap tid=%s page lsn=%X/%X.",
itid, htid,
(uint32) (state->targetlsn >> 32),
(uint32) state->targetlsn)));
}
+ /*
+ * If tuple is actually a posting list, make sure posting list TIDs
+ * are in order.
+ */
+ if (BTreeTupleIsPosting(itup))
+ {
+ ItemPointerData last;
+ ItemPointer current;
+
+ ItemPointerCopy(BTreeTupleGetHeapTID(itup), &last);
+
+ for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
+ {
+
+ current = BTreeTupleGetPostingN(itup, i);
+
+ if (ItemPointerCompare(current, &last) <= 0)
+ {
+ char *itid,
+ *htid;
+
+ itid = psprintf("(%u,%u)", state->targetblock, offset);
+ htid = psprintf("(%u,%u)",
+ ItemPointerGetBlockNumberNoCheck(current),
+ ItemPointerGetOffsetNumberNoCheck(current));
+
+ ereport(ERROR,
+ (errcode(ERRCODE_INDEX_CORRUPTED),
+ errmsg("posting list heap TIDs out of order in index \"%s\"",
+ RelationGetRelationName(state->rel)),
+ errdetail_internal("Index tid=%s min heap tid=%s page lsn=%X/%X.",
+ itid, htid,
+ (uint32) (state->targetlsn >> 32),
+ (uint32) state->targetlsn)));
+ }
+
+ ItemPointerCopy(current, &last);
+ }
+ }
+
/* Build insertion scankey for current page offset */
skey = bt_mkscankey_pivotsearch(state->rel, itup);
@@ -1074,12 +1121,32 @@ bt_target_page_check(BtreeCheckState *state)
{
IndexTuple norm;
- norm = bt_normalize_tuple(state, itup);
- bloom_add_element(state->filter, (unsigned char *) norm,
- IndexTupleSize(norm));
- /* Be tidy */
- if (norm != itup)
- pfree(norm);
+ if (BTreeTupleIsPosting(itup))
+ {
+ /* Fingerprint all elements as distinct "logical" tuples */
+ for (int i = 0; i < BTreeTupleGetNPosting(itup); i++)
+ {
+ IndexTuple logtuple;
+
+ logtuple = bt_posting_logical_tuple(itup, i);
+ norm = bt_normalize_tuple(state, logtuple);
+ bloom_add_element(state->filter, (unsigned char *) norm,
+ IndexTupleSize(norm));
+ /* Be tidy */
+ if (norm != logtuple)
+ pfree(norm);
+ pfree(logtuple);
+ }
+ }
+ else
+ {
+ norm = bt_normalize_tuple(state, itup);
+ bloom_add_element(state->filter, (unsigned char *) norm,
+ IndexTupleSize(norm));
+ /* Be tidy */
+ if (norm != itup)
+ pfree(norm);
+ }
}
/*
@@ -1087,7 +1154,8 @@ bt_target_page_check(BtreeCheckState *state)
*
* If there is a high key (if this is not the rightmost page on its
* entire level), check that high key actually is upper bound on all
- * page items.
+ * page items. If this is a posting list tuple, we'll need to set
+ * scantid to be highest TID in posting list.
*
* We prefer to check all items against high key rather than checking
* just the last and trusting that the operator class obeys the
@@ -1127,6 +1195,9 @@ bt_target_page_check(BtreeCheckState *state)
* tuple. (See also: "Notes About Data Representation" in the nbtree
* README.)
*/
+ scantid = skey->scantid;
+ if (state->heapkeyspace && !BTreeTupleIsPivot(itup))
+ skey->scantid = BTreeTupleGetMaxTID(itup);
if (!P_RIGHTMOST(topaque) &&
!(P_ISLEAF(topaque) ? invariant_leq_offset(state, skey, P_HIKEY) :
invariant_l_offset(state, skey, P_HIKEY)))
@@ -1150,6 +1221,7 @@ bt_target_page_check(BtreeCheckState *state)
(uint32) (state->targetlsn >> 32),
(uint32) state->targetlsn)));
}
+ skey->scantid = scantid;
/*
* * Item order check *
@@ -1164,11 +1236,13 @@ bt_target_page_check(BtreeCheckState *state)
*htid,
*nitid,
*nhtid;
+ ItemPointer tid;
itid = psprintf("(%u,%u)", state->targetblock, offset);
+ tid = BTreeTupleGetHeapTID(itup);
htid = psprintf("(%u,%u)",
- ItemPointerGetBlockNumberNoCheck(&(itup->t_tid)),
- ItemPointerGetOffsetNumberNoCheck(&(itup->t_tid)));
+ ItemPointerGetBlockNumberNoCheck(tid),
+ ItemPointerGetOffsetNumberNoCheck(tid));
nitid = psprintf("(%u,%u)", state->targetblock,
OffsetNumberNext(offset));
@@ -1177,9 +1251,11 @@ bt_target_page_check(BtreeCheckState *state)
state->target,
OffsetNumberNext(offset));
itup = (IndexTuple) PageGetItem(state->target, itemid);
+
+ tid = BTreeTupleGetHeapTID(itup);
nhtid = psprintf("(%u,%u)",
- ItemPointerGetBlockNumberNoCheck(&(itup->t_tid)),
- ItemPointerGetOffsetNumberNoCheck(&(itup->t_tid)));
+ ItemPointerGetBlockNumberNoCheck(tid),
+ ItemPointerGetOffsetNumberNoCheck(tid));
ereport(ERROR,
(errcode(ERRCODE_INDEX_CORRUPTED),
@@ -1189,10 +1265,10 @@ bt_target_page_check(BtreeCheckState *state)
"higher index tid=%s (points to %s tid=%s) "
"page lsn=%X/%X.",
itid,
- P_ISLEAF(topaque) ? "heap" : "index",
+ P_ISLEAF(topaque) ? "min heap" : "index",
htid,
nitid,
- P_ISLEAF(topaque) ? "heap" : "index",
+ P_ISLEAF(topaque) ? "min heap" : "index",
nhtid,
(uint32) (state->targetlsn >> 32),
(uint32) state->targetlsn)));
@@ -1953,10 +2029,10 @@ bt_tuple_present_callback(Relation index, HeapTuple htup, Datum *values,
* verification. In particular, it won't try to normalize opclass-equal
* datums with potentially distinct representations (e.g., btree/numeric_ops
* index datums will not get their display scale normalized-away here).
- * Normalization may need to be expanded to handle more cases in the future,
- * though. For example, it's possible that non-pivot tuples could in the
- * future have alternative logically equivalent representations due to using
- * the INDEX_ALT_TID_MASK bit to implement intelligent deduplication.
+ * Caller does normalization for non-pivot tuples that have a posting list,
+ * since dummy CREATE INDEX callback code generates new tuples with the same
+ * normalized representation. Deduplication is performed opportunistically,
+ * and in general there is no guarantee about how or when it will be applied.
*/
static IndexTuple
bt_normalize_tuple(BtreeCheckState *state, IndexTuple itup)
@@ -1969,6 +2045,9 @@ bt_normalize_tuple(BtreeCheckState *state, IndexTuple itup)
IndexTuple reformed;
int i;
+ /* Caller should only pass "logical" non-pivot tuples here */
+ Assert(!BTreeTupleIsPosting(itup) && !BTreeTupleIsPivot(itup));
+
/* Easy case: It's immediately clear that tuple has no varlena datums */
if (!IndexTupleHasVarwidths(itup))
return itup;
@@ -2031,6 +2110,30 @@ bt_normalize_tuple(BtreeCheckState *state, IndexTuple itup)
return reformed;
}
+/*
+ * Produce palloc()'d "logical" tuple for nth posting list entry.
+ *
+ * In general, deduplication is not supposed to change the logical contents of
+ * an index. Multiple logical index tuples are folded together into one
+ * physical posting list index tuple when convenient.
+ *
+ * heapallindexed verification must normalize-away this variation in
+ * representation by converting posting list tuples into two or more "logical"
+ * tuples. Each logical tuple must be fingerprinted separately -- there must
+ * be one logical tuple for each corresponding Bloom filter probe during the
+ * heap scan.
+ *
+ * Note: Caller needs to call bt_normalize_tuple() with returned tuple.
+ */
+static inline IndexTuple
+bt_posting_logical_tuple(IndexTuple itup, int n)
+{
+ Assert(BTreeTupleIsPosting(itup));
+
+ /* Returns non-posting-list tuple */
+ return BTreeFormPostingTuple(itup, BTreeTupleGetPostingN(itup, n), 1);
+}
+
/*
* Search for itup in index, starting from fast root page. itup must be a
* non-pivot tuple. This is only supported with heapkeyspace indexes, since
@@ -2087,6 +2190,7 @@ bt_rootdescend(BtreeCheckState *state, IndexTuple itup)
insertstate.itup = itup;
insertstate.itemsz = MAXALIGN(IndexTupleSize(itup));
insertstate.itup_key = key;
+ insertstate.postingoff = 0;
insertstate.bounds_valid = false;
insertstate.buf = lbuf;
@@ -2094,7 +2198,9 @@ bt_rootdescend(BtreeCheckState *state, IndexTuple itup)
offnum = _bt_binsrch_insert(state->rel, &insertstate);
/* Compare first >= matching item on leaf page, if any */
page = BufferGetPage(lbuf);
+ /* Should match on first heap TID when tuple has a posting list */
if (offnum <= PageGetMaxOffsetNumber(page) &&
+ insertstate.postingoff <= 0 &&
_bt_compare(state->rel, key, page, offnum) == 0)
exists = true;
_bt_relbuf(state->rel, lbuf);
@@ -2560,14 +2666,18 @@ static inline ItemPointer
BTreeTupleGetHeapTIDCareful(BtreeCheckState *state, IndexTuple itup,
bool nonpivot)
{
- ItemPointer result = BTreeTupleGetHeapTID(itup);
+ ItemPointer result;
BlockNumber targetblock = state->targetblock;
- if (result == NULL && nonpivot)
+ /* Shouldn't be called with !heapkeyspace index */
+ Assert(state->heapkeyspace);
+ if (BTreeTupleIsPivot(itup) == nonpivot)
ereport(ERROR,
(errcode(ERRCODE_INDEX_CORRUPTED),
errmsg("block %u or its right sibling block or child block in index \"%s\" contains non-pivot tuple that lacks a heap TID",
targetblock, RelationGetRelationName(state->rel))));
+ result = BTreeTupleGetHeapTID(itup);
+
return result;
}
diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
index 2599b5d342..6e1dc596e1 100644
--- a/src/backend/access/index/genam.c
+++ b/src/backend/access/index/genam.c
@@ -276,6 +276,10 @@ BuildIndexValueDescription(Relation indexRelation,
/*
* Get the latestRemovedXid from the table entries pointed at by the index
* tuples being deleted.
+ *
+ * Note: index access methods that don't consistently use the standard
+ * IndexTuple + heap TID item pointer representation will need to provide
+ * their own version of this function.
*/
TransactionId
index_compute_xid_horizon_for_tuples(Relation irel,
diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index 6db203e75c..54cb9db49d 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -432,7 +432,10 @@ because we allow LP_DEAD to be set with only a share lock (it's exactly
like a hint bit for a heap tuple), but physically removing tuples requires
exclusive lock. In the current code we try to remove LP_DEAD tuples when
we are otherwise faced with having to split a page to do an insertion (and
-hence have exclusive lock on it already).
+hence have exclusive lock on it already). Deduplication can also prevent
+a page split, but removing LP_DEAD tuples is the preferred approach.
+(Note that posting list tuples can only have their LP_DEAD bit set when
+every "logical" tuple represented within the posting list is known dead.)
This leaves the index in a state where it has no entry for a dead tuple
that still exists in the heap. This is not a problem for the current
@@ -710,6 +713,75 @@ the fallback strategy assumes that duplicates are mostly inserted in
ascending heap TID order. The page is split in a way that leaves the left
half of the page mostly full, and the right half of the page mostly empty.
+Notes about deduplication
+-------------------------
+
+We deduplicate non-pivot tuples in non-unique indexes to reduce storage
+overhead, and to avoid or at least delay page splits. Deduplication alters
+the physical representation of tuples without changing the logical contents
+of the index, and without adding overhead to read queries. Non-pivot
+tuples are folded together into a single physical tuple with a posting list
+(a simple array of heap TIDs with the standard item pointer format).
+Deduplication is always applied lazily, at the point where it would
+otherwise be necessary to perform a page split. It occurs only when
+LP_DEAD items have been removed, as our last line of defense against
+splitting a leaf page. We can set the LP_DEAD bit with posting list
+tuples, though only when all table tuples are known dead. (Bitmap scans
+cannot perform LP_DEAD bit setting, and are the common case with indexes
+that contain lots of duplicates, so this downside is considered
+acceptable.)
+
+Large groups of logical duplicates tend to appear together on the same leaf
+page due to the special duplicate logic used when choosing a split point.
+This facilitates lazy/dynamic deduplication. Deduplication can reliably
+deduplicate a large localized group of duplicates before it can span
+multiple leaf pages. Posting list tuples are subject to the same 1/3 of a
+page restriction as any other tuple.
+
+Lazy deduplication allows the page space accounting used during page splits
+to have absolutely minimal special case logic for posting lists. A posting
+list can be thought of as extra payload that suffix truncation will
+reliably truncate away as needed during page splits, just like non-key
+columns from an INCLUDE index tuple. An incoming tuple (which might cause
+a page split) can always be thought of as a non-posting-list tuple that
+must be inserted alongside existing items, without needing to consider
+deduplication. Most of the time, that's what actually happens: incoming
+tuples are either not duplicates, or are duplicates with a heap TID that
+doesn't overlap with any existing posting list tuple. When the incoming
+tuple really does overlap with an existing posting list, a posting list
+split is performed. Posting list splits work in a way that more or less
+preserves the illusion that all incoming tuples do not need to be merged
+with any existing posting list tuple.
+
+Posting list splits work by "overriding" the details of the incoming tuple.
+The heap TID of the incoming tuple is altered to make it match the
+rightmost heap TID from the existing/originally overlapping posting list.
+The offset number that the new/incoming tuple is to be inserted at is
+incremented so that it will be inserted to the right of the existing
+posting list. The insertion (or page split) operation that completes the
+insert does one extra step: an in-place update of the posting list. The
+update changes the posting list such that the "true" heap TID from the
+original incoming tuple is now contained in the posting list. We make
+space in the posting list by removing the heap TID that became the new
+item. The size of the posting list won't change, and so the page split
+space accounting does not need to care about posting lists. Also, overall
+space utilization is improved by keeping existing posting lists large.
+
+The representation of posting lists is identical to the posting lists used
+by GIN, so it would be straightforward to apply GIN's varbyte encoding
+compression scheme to individual posting lists. Posting list compression
+would break the assumptions made by posting list splits about page space
+accounting, though, so it's not clear how compression could be integrated
+with nbtree. Besides, posting list compression does not offer a compelling
+trade-off for nbtree, since in general nbtree is optimized for consistent
+performance with many concurrent readers and writers. A major goal of
+nbtree's lazy approach to deduplication is to limit the performance impact
+of deduplication with random updates. Even concurrent append-only inserts
+of the same key value will tend to have inserts of individual index tuples
+in an order that doesn't quite match heap TID order. In general, delaying
+deduplication avoids many unnecessary posting list splits, and minimizes
+page level fragmentation.
+
Notes About Data Representation
-------------------------------
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index b84bf1c3df..eb9655bb78 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -47,21 +47,26 @@ static void _bt_insertonpg(Relation rel, BTScanInsert itup_key,
BTStack stack,
IndexTuple itup,
OffsetNumber newitemoff,
+ int postingoff,
bool split_only_page);
static Buffer _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf,
Buffer cbuf, OffsetNumber newitemoff, Size newitemsz,
- IndexTuple newitem);
+ IndexTuple newitem, IndexTuple orignewitem,
+ IndexTuple nposting, OffsetNumber postingoff);
static void _bt_insert_parent(Relation rel, Buffer buf, Buffer rbuf,
BTStack stack, bool is_root, bool is_only);
static bool _bt_pgaddtup(Page page, Size itemsize, IndexTuple itup,
OffsetNumber itup_off);
static void _bt_vacuum_one_page(Relation rel, Buffer buffer, Relation heapRel);
+static void _bt_dedup_one_page(Relation rel, Buffer buffer, Relation heapRel,
+ Size newitemsz);
/*
* _bt_doinsert() -- Handle insertion of a single index tuple in the tree.
*
* This routine is called by the public interface routine, btinsert.
- * By here, itup is filled in, including the TID.
+ * By here, itup is filled in, including the TID. Caller should be
+ * prepared for us to scribble on 'itup'.
*
* If checkUnique is UNIQUE_CHECK_NO or UNIQUE_CHECK_PARTIAL, this
* will allow duplicates. Otherwise (UNIQUE_CHECK_YES or
@@ -123,6 +128,7 @@ _bt_doinsert(Relation rel, IndexTuple itup,
/* PageAddItem will MAXALIGN(), but be consistent */
insertstate.itemsz = MAXALIGN(IndexTupleSize(itup));
insertstate.itup_key = itup_key;
+ insertstate.postingoff = 0;
insertstate.bounds_valid = false;
insertstate.buf = InvalidBuffer;
@@ -300,7 +306,7 @@ top:
newitemoff = _bt_findinsertloc(rel, &insertstate, checkingunique,
stack, heapRel);
_bt_insertonpg(rel, itup_key, insertstate.buf, InvalidBuffer, stack,
- itup, newitemoff, false);
+ itup, newitemoff, insertstate.postingoff, false);
}
else
{
@@ -435,6 +441,7 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
/* okay, we gotta fetch the heap tuple ... */
curitup = (IndexTuple) PageGetItem(page, curitemid);
+ Assert(!BTreeTupleIsPosting(curitup));
htid = curitup->t_tid;
/*
@@ -689,6 +696,7 @@ _bt_findinsertloc(Relation rel,
BTScanInsert itup_key = insertstate->itup_key;
Page page = BufferGetPage(insertstate->buf);
BTPageOpaque lpageop;
+ OffsetNumber location;
lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -751,13 +759,23 @@ _bt_findinsertloc(Relation rel,
/*
* If the target page is full, see if we can obtain enough space by
- * erasing LP_DEAD items
+ * erasing LP_DEAD items. If that doesn't work out, and if the index
+ * isn't a unique index, try deduplication.
*/
- if (PageGetFreeSpace(page) < insertstate->itemsz &&
- P_HAS_GARBAGE(lpageop))
+ if (PageGetFreeSpace(page) < insertstate->itemsz)
{
- _bt_vacuum_one_page(rel, insertstate->buf, heapRel);
- insertstate->bounds_valid = false;
+ if (P_HAS_GARBAGE(lpageop))
+ {
+ _bt_vacuum_one_page(rel, insertstate->buf, heapRel);
+ insertstate->bounds_valid = false;
+ }
+
+ if (!checkingunique && PageGetFreeSpace(page) < insertstate->itemsz)
+ {
+ _bt_dedup_one_page(rel, insertstate->buf, heapRel,
+ insertstate->itemsz);
+ insertstate->bounds_valid = false; /* paranoia */
+ }
}
}
else
@@ -839,7 +857,31 @@ _bt_findinsertloc(Relation rel,
Assert(P_RIGHTMOST(lpageop) ||
_bt_compare(rel, itup_key, page, P_HIKEY) <= 0);
- return _bt_binsrch_insert(rel, insertstate);
+ location = _bt_binsrch_insert(rel, insertstate);
+
+ /*
+ * Insertion is not prepared for the case where an LP_DEAD posting list
+ * tuple must be split. In the unlikely event that this happens, call
+ * _bt_dedup_one_page() to force it to kill all LP_DEAD items.
+ */
+ if (unlikely(insertstate->postingoff == -1))
+ {
+ _bt_dedup_one_page(rel, insertstate->buf, heapRel, 0);
+ Assert(!P_HAS_GARBAGE(lpageop));
+
+ /* Must reset insertstate ahead of new _bt_binsrch_insert() call */
+ insertstate->bounds_valid = false;
+ insertstate->postingoff = 0;
+ location = _bt_binsrch_insert(rel, insertstate);
+
+ /*
+ * Might still have to split some other posting list now, but that
+ * should never be LP_DEAD
+ */
+ Assert(insertstate->postingoff >= 0);
+ }
+
+ return location;
}
/*
@@ -900,15 +942,81 @@ _bt_stepright(Relation rel, BTInsertState insertstate, BTStack stack)
insertstate->bounds_valid = false;
}
+/*
+ * Form a new posting list during a posting split.
+ *
+ * If caller determines that its new tuple 'newitem' is a duplicate with a
+ * heap TID that falls inside the range of an existing posting list tuple
+ * 'oposting', it must generate a new posting tuple to replace the original.
+ * The new posting list is guaranteed to be the same size as the original.
+ * Caller must also change newitem to have the heap TID of the rightmost TID
+ * in the original posting list. Both steps are always handled by calling
+ * here.
+ *
+ * Returns new posting list palloc()'d in caller's context. Also modifies
+ * caller's newitem to contain final/effective heap TID, which is what caller
+ * actually inserts on the page.
+ *
+ * Exported for use by recovery. Note that recovery path must recreate the
+ * same version of newitem that is passed here on the primary, even though
+ * that differs from the final newitem actually added to the page. This
+ * optimization avoids explicit WAL-logging of entire posting lists, which
+ * tend to be rather large.
+ */
+IndexTuple
+_bt_posting_split(IndexTuple newitem, IndexTuple oposting,
+ OffsetNumber postingoff)
+{
+ int nhtids;
+ char *replacepos;
+ char *rightpos;
+ Size nbytes;
+ IndexTuple nposting;
+
+ Assert(BTreeTupleIsPosting(oposting));
+ nhtids = BTreeTupleGetNPosting(oposting);
+ Assert(postingoff < nhtids);
+
+ nposting = CopyIndexTuple(oposting);
+ replacepos = (char *) BTreeTupleGetPostingN(nposting, postingoff);
+ rightpos = replacepos + sizeof(ItemPointerData);
+ nbytes = (nhtids - postingoff - 1) * sizeof(ItemPointerData);
+
+ /*
+ * Move item pointers in posting list to make a gap for the new item's
+ * heap TID (shift TIDs one place to the right, losing original rightmost
+ * TID).
+ */
+ memmove(rightpos, replacepos, nbytes);
+
+ /*
+ * Fill the gap with the TID of the new item.
+ */
+ ItemPointerCopy(&newitem->t_tid, (ItemPointer) replacepos);
+
+ /*
+ * Copy last TID of original posting list (oposting, not nposting) into new item
+ */
+ ItemPointerCopy(BTreeTupleGetPostingN(oposting, nhtids - 1),
+ &newitem->t_tid);
+ Assert(ItemPointerCompare(BTreeTupleGetMaxTID(nposting),
+ BTreeTupleGetHeapTID(newitem)) < 0);
+ Assert(BTreeTupleGetNPosting(nposting) == BTreeTupleGetNPosting(oposting));
+
+ return nposting;
+}
+
/*----------
* _bt_insertonpg() -- Insert a tuple on a particular page in the index.
*
* This recursive procedure does the following things:
*
+ * + if necessary, splits an existing posting list on page.
+ * This is only needed when 'postingoff' is non-zero.
* + if necessary, splits the target page, using 'itup_key' for
* suffix truncation on leaf pages (caller passes NULL for
* non-leaf pages).
- * + inserts the tuple.
+ * + inserts the new tuple (could be from split posting list).
* + if the page was split, pops the parent stack, and finds the
* right place to insert the new child pointer (by walking
* right using information stored in the parent stack).
@@ -918,7 +1026,8 @@ _bt_stepright(Relation rel, BTInsertState insertstate, BTStack stack)
*
* On entry, we must have the correct buffer in which to do the
* insertion, and the buffer must be pinned and write-locked. On return,
- * we will have dropped both the pin and the lock on the buffer.
+ * we will have dropped both the pin and the lock on the buffer. Caller
+ * should be prepared for us to scribble on 'itup'.
*
* This routine only performs retail tuple insertions. 'itup' should
* always be either a non-highkey leaf item, or a downlink (new high
@@ -936,11 +1045,15 @@ _bt_insertonpg(Relation rel,
BTStack stack,
IndexTuple itup,
OffsetNumber newitemoff,
+ int postingoff,
bool split_only_page)
{
Page page;
BTPageOpaque lpageop;
Size itemsz;
+ IndexTuple oposting;
+ IndexTuple origitup = NULL;
+ IndexTuple nposting = NULL;
page = BufferGetPage(buf);
lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -954,6 +1067,8 @@ _bt_insertonpg(Relation rel,
Assert(P_ISLEAF(lpageop) ||
BTreeTupleGetNAtts(itup, rel) <=
IndexRelationGetNumberOfKeyAttributes(rel));
+ /* retail insertions of posting list tuples are disallowed */
+ Assert(!BTreeTupleIsPosting(itup));
/* The caller should've finished any incomplete splits already. */
if (P_INCOMPLETE_SPLIT(lpageop))
@@ -964,6 +1079,46 @@ _bt_insertonpg(Relation rel,
itemsz = MAXALIGN(itemsz); /* be safe, PageAddItem will do this but we
* need to be consistent */
+ /*
+ * Do we need to split an existing posting list item?
+ */
+ if (postingoff != 0)
+ {
+ ItemId itemid = PageGetItemId(page, newitemoff);
+
+ /*
+ * The new tuple is a duplicate with a heap TID that falls inside the
+ * range of an existing posting list tuple, so split posting list.
+ *
+ * Posting list splits always replace some existing TID in the posting
+ * list with the new item's heap TID (based on a posting list offset
+ * from caller) by removing rightmost heap TID from posting list. The
+ * new item's heap TID is swapped with that rightmost heap TID, almost
+ * as if the tuple inserted never overlapped with a posting list in
+ * the first place. This allows the insertion and page split code to
+ * have minimal special case handling of posting lists.
+ *
+ * The only extra handling required is to overwrite the original
+ * posting list with nposting, which is guaranteed to be the same size
+ * as the original, keeping the page space accounting simple. This
+ * takes place in either the page insert or page split critical
+ * section.
+ */
+ Assert(P_ISLEAF(lpageop));
+ Assert(!ItemIdIsDead(itemid));
+ Assert(postingoff > 0);
+ oposting = (IndexTuple) PageGetItem(page, itemid);
+
+ /* save a copy of itup with unchanged TID to write it into xlog record */
+ origitup = CopyIndexTuple(itup);
+ nposting = _bt_posting_split(itup, oposting, postingoff);
+
+ Assert(BTreeTupleGetNPosting(nposting) ==
+ BTreeTupleGetNPosting(oposting));
+ /* Alter new item offset, since effective new item changed */
+ newitemoff = OffsetNumberNext(newitemoff);
+ }
+
/*
* Do we need to split the page to fit the item on it?
*
@@ -996,7 +1151,8 @@ _bt_insertonpg(Relation rel,
BlockNumberIsValid(RelationGetTargetBlock(rel))));
/* split the buffer into left and right halves */
- rbuf = _bt_split(rel, itup_key, buf, cbuf, newitemoff, itemsz, itup);
+ rbuf = _bt_split(rel, itup_key, buf, cbuf, newitemoff, itemsz, itup,
+ origitup, nposting, postingoff);
PredicateLockPageSplit(rel,
BufferGetBlockNumber(buf),
BufferGetBlockNumber(rbuf));
@@ -1075,6 +1231,18 @@ _bt_insertonpg(Relation rel,
elog(PANIC, "failed to add new item to block %u in index \"%s\"",
itup_blkno, RelationGetRelationName(rel));
+ if (nposting)
+ {
+ /*
+ * Posting list split requires an in-place update of the existing
+ * posting list
+ */
+ Assert(P_ISLEAF(lpageop));
+ Assert(MAXALIGN(IndexTupleSize(oposting)) ==
+ MAXALIGN(IndexTupleSize(nposting)));
+ memcpy(oposting, nposting, MAXALIGN(IndexTupleSize(nposting)));
+ }
+
MarkBufferDirty(buf);
if (BufferIsValid(metabuf))
@@ -1116,6 +1284,7 @@ _bt_insertonpg(Relation rel,
XLogRecPtr recptr;
xlrec.offnum = itup_off;
+ xlrec.postingoff = postingoff;
XLogBeginInsert();
XLogRegisterData((char *) &xlrec, SizeOfBtreeInsert);
@@ -1152,7 +1321,19 @@ _bt_insertonpg(Relation rel,
}
XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
- XLogRegisterBufData(0, (char *) itup, IndexTupleSize(itup));
+
+ /*
+ * We always write newitem to the page, but when there is an
+ * original newitem due to a posting list split then we log the
+ * original item instead. REDO routine must reconstruct the final
+ * newitem at the same time it reconstructs nposting.
+ */
+ if (postingoff == 0)
+ XLogRegisterBufData(0, (char *) itup,
+ IndexTupleSize(itup));
+ else
+ XLogRegisterBufData(0, (char *) origitup,
+ IndexTupleSize(origitup));
recptr = XLogInsert(RM_BTREE_ID, xlinfo);
@@ -1194,6 +1375,13 @@ _bt_insertonpg(Relation rel,
_bt_getrootheight(rel) >= BTREE_FASTPATH_MIN_LEVEL)
RelationSetTargetBlock(rel, cachedBlock);
}
+
+ /* be tidy */
+ if (postingoff != 0)
+ {
+ pfree(nposting);
+ pfree(origitup);
+ }
}
/*
@@ -1209,12 +1397,25 @@ _bt_insertonpg(Relation rel,
* This function will clear the INCOMPLETE_SPLIT flag on it, and
* release the buffer.
*
+ * orignewitem, nposting, and postingoff are needed when an insert of
+ * orignewitem results in both a posting list split and a page split.
+ * newitem and nposting are replacements for orignewitem and the
+ * existing posting list on the page respectively. These extra
+ * posting list split details are used here in the same way as they
+ * are used in the more common case where a posting list split does
+ * not coincide with a page split. We need to deal with posting list
+ * splits directly in order to ensure that everything that follows
+ * from the insert of orignewitem is handled as a single atomic
+ * operation (though caller's insert of a new pivot/downlink into
+ * parent page will still be a separate operation).
+ *
* Returns the new right sibling of buf, pinned and write-locked.
* The pin and lock on buf are maintained.
*/
static Buffer
_bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
- OffsetNumber newitemoff, Size newitemsz, IndexTuple newitem)
+ OffsetNumber newitemoff, Size newitemsz, IndexTuple newitem,
+ IndexTuple orignewitem, IndexTuple nposting, OffsetNumber postingoff)
{
Buffer rbuf;
Page origpage;
@@ -1236,12 +1437,20 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
OffsetNumber firstright;
OffsetNumber maxoff;
OffsetNumber i;
+ OffsetNumber replacepostingoff = InvalidOffsetNumber;
bool newitemonleft,
isleaf;
IndexTuple lefthikey;
int indnatts = IndexRelationGetNumberOfAttributes(rel);
int indnkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
+ /*
+ * Determine offset number of existing posting list on page when a split
+ * of a posting list needs to take place as the page is split
+ */
+ if (nposting != NULL)
+ replacepostingoff = OffsetNumberPrev(newitemoff);
+
/*
* origpage is the original page to be split. leftpage is a temporary
* buffer that receives the left-sibling data, which will be copied back
@@ -1273,6 +1482,13 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
* newitemoff == firstright. In all other cases it's clear which side of
* the split every tuple goes on from context. newitemonleft is usually
* (but not always) redundant information.
+ *
+ * Note: In theory, the split point choice logic should operate against a
+ * version of the page that already replaced the posting list at offset
+ * replacepostingoff with nposting where applicable. We don't bother with
+ * that, though. Both versions of the posting list must be the same size,
+ * and both will have the same base tuple key values, so split point
+ * choice is never affected.
*/
firstright = _bt_findsplitloc(rel, origpage, newitemoff, newitemsz,
newitem, &newitemonleft);
@@ -1340,6 +1556,9 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
itemid = PageGetItemId(origpage, firstright);
itemsz = ItemIdGetLength(itemid);
item = (IndexTuple) PageGetItem(origpage, itemid);
+ /* Behave as if origpage posting list has already been swapped */
+ if (firstright == replacepostingoff)
+ item = nposting;
}
/*
@@ -1373,6 +1592,9 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
Assert(lastleftoff >= P_FIRSTDATAKEY(oopaque));
itemid = PageGetItemId(origpage, lastleftoff);
lastleft = (IndexTuple) PageGetItem(origpage, itemid);
+ /* Behave as if origpage posting list has already been swapped */
+ if (lastleftoff == replacepostingoff)
+ lastleft = nposting;
}
Assert(lastleft != item);
@@ -1480,8 +1702,23 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
itemsz = ItemIdGetLength(itemid);
item = (IndexTuple) PageGetItem(origpage, itemid);
+ /*
+ * did caller pass new replacement posting list tuple due to posting
+ * list split?
+ */
+ if (i == replacepostingoff)
+ {
+ /*
+ * swap origpage posting list with post-posting-list-split version
+ * from caller
+ */
+ Assert(isleaf);
+ Assert(itemsz == MAXALIGN(IndexTupleSize(nposting)));
+ item = nposting;
+ }
+
/* does new item belong before this one? */
- if (i == newitemoff)
+ else if (i == newitemoff)
{
if (newitemonleft)
{
@@ -1650,8 +1887,12 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
XLogRecPtr recptr;
xlrec.level = ropaque->btpo.level;
+ /* See comments below on newitem, orignewitem, and posting lists */
xlrec.firstright = firstright;
xlrec.newitemoff = newitemoff;
+ xlrec.postingoff = InvalidOffsetNumber;
+ if (replacepostingoff < firstright)
+ xlrec.postingoff = postingoff;
XLogBeginInsert();
XLogRegisterData((char *) &xlrec, SizeOfBtreeSplit);
@@ -1670,11 +1911,46 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
* because it's included with all the other items on the right page.)
* Show the new item as belonging to the left page buffer, so that it
* is not stored if XLogInsert decides it needs a full-page image of
- * the left page. We store the offset anyway, though, to support
- * archive compression of these records.
+ * the left page. We always store newitemoff in record, though.
+ *
+ * The details are often slightly different for page splits that
+ * coincide with a posting list split. If both the replacement
+ * posting list and newitem go on the right page, then we don't need
+ * to log anything extra, just like the simple !newitemonleft
+ * no-posting-split case (postingoff isn't set in the WAL record, so
+ * recovery can't even tell the difference). Otherwise, we set
+ * postingoff and log orignewitem instead of newitem, despite having
+ * actually inserted newitem. Recovery must reconstruct nposting and
+ * newitem by repeating the actions of our caller (i.e. by passing
+ * original posting list and orignewitem to _bt_posting_split()).
+ *
+ * Note: It's possible that our page split point is the point that
+ * makes the posting list lastleft and newitem firstright. This is
+ * the only case where we log orignewitem despite newitem going on the
+ * right page. If XLogInsert decides that it can omit orignewitem due
+ * to logging a full-page image of the left page, everything still
+ * works out, since recovery only needs orignewitem for items
+ * on the left page (just like the regular newitem-logged case).
*/
- if (newitemonleft)
- XLogRegisterBufData(0, (char *) newitem, MAXALIGN(newitemsz));
+ if (newitemonleft || xlrec.postingoff != InvalidOffsetNumber)
+ {
+ if (xlrec.postingoff == InvalidOffsetNumber)
+ {
+ /* Must WAL-log newitem, since it's on left page */
+ Assert(newitemonleft);
+ Assert(orignewitem == NULL && nposting == NULL);
+ XLogRegisterBufData(0, (char *) newitem, MAXALIGN(newitemsz));
+ }
+ else
+ {
+ /* Must WAL-log orignewitem following posting list split */
+ Assert(newitemonleft || firstright == newitemoff);
+ Assert(ItemPointerCompare(&orignewitem->t_tid,
+ &newitem->t_tid) < 0);
+ XLogRegisterBufData(0, (char *) orignewitem,
+ MAXALIGN(IndexTupleSize(orignewitem)));
+ }
+ }
/* Log the left page's new high key */
itemid = PageGetItemId(origpage, P_HIKEY);
@@ -1834,7 +2110,7 @@ _bt_insert_parent(Relation rel,
/* Recursively insert into the parent */
_bt_insertonpg(rel, NULL, pbuf, buf, stack->bts_parent,
- new_item, stack->bts_offset + 1,
+ new_item, stack->bts_offset + 1, 0,
is_only);
/* be tidy */
@@ -2304,6 +2580,439 @@ _bt_vacuum_one_page(Relation rel, Buffer buffer, Relation heapRel)
* Note: if we didn't find any LP_DEAD items, then the page's
* BTP_HAS_GARBAGE hint bit is falsely set. We do not bother expending a
* separate write to clear it, however. We will clear it when we split
- * the page.
+ * the page (or when deduplication runs).
*/
}
+
+/*
+ * Try to deduplicate items to free some space. If we don't proceed with
+ * deduplication, buffer will contain old state of the page.
+ *
+ * 'newitemsz' is the size of the inserting caller's incoming/new tuple, not
+ * including line pointer overhead. This is the amount of space we'll need to
+ * free in order to let caller avoid splitting the page.
+ *
+ * This function should be called after LP_DEAD items were removed by
+ * _bt_vacuum_one_page() to prevent a page split. (It's possible that we'll
+ * have to kill additional LP_DEAD items, but that should be rare.)
+ */
+static void
+_bt_dedup_one_page(Relation rel, Buffer buffer, Relation heapRel,
+ Size newitemsz)
+{
+ OffsetNumber offnum,
+ minoff,
+ maxoff;
+ Page page = BufferGetPage(buffer);
+ Page newpage;
+ BTPageOpaque oopaque,
+ nopaque;
+ bool deduplicate;
+ BTDedupState *state = NULL;
+ int natts = IndexRelationGetNumberOfAttributes(rel);
+ OffsetNumber deletable[MaxIndexTuplesPerPage];
+ int ndeletable = 0;
+ Size pagesaving = 0;
+
+ /*
+ * Don't use deduplication for unique indexes or for indexes with INCLUDEd
+ * columns
+ */
+ deduplicate = (IndexRelationGetNumberOfKeyAttributes(rel) ==
+ IndexRelationGetNumberOfAttributes(rel) &&
+ !rel->rd_index->indisunique);
+ if (!deduplicate)
+ return;
+
+ oopaque = (BTPageOpaque) PageGetSpecialPointer(page);
+ /* init deduplication state needed to build posting tuples */
+ state = (BTDedupState *) palloc(sizeof(BTDedupState));
+ state->deduplicate = true;
+
+ state->maxitemsize = BTMaxItemSize(page);
+ /* Metadata about current pending posting list */
+ state->htids = NULL;
+ state->nhtids = 0;
+ state->nitems = 0;
+ state->alltupsize = 0;
+ /* Metadata about base tuple of current pending posting list */
+ state->base = NULL;
+ state->baseoff = InvalidOffsetNumber;
+ state->basetupsize = 0;
+ /* Finally, nintervals should be initialized to zero */
+ state->nintervals = 0;
+
+ minoff = P_FIRSTDATAKEY(oopaque);
+ maxoff = PageGetMaxOffsetNumber(page);
+
+ /*
+ * Delete dead tuples, if any. We cannot simply skip them in the loop
+ * below, because it's necessary to generate a special WAL record
+ * containing such tuples so that latestRemovedXid can be computed on a
+ * standby server later.
+ *
+ * This should not affect performance, since it can only happen in the rare
+ * case where the BTP_HAS_GARBAGE flag was not set and _bt_vacuum_one_page
+ * was not called, or _bt_vacuum_one_page didn't remove all dead items.
+ */
+ for (offnum = minoff;
+ offnum <= maxoff;
+ offnum = OffsetNumberNext(offnum))
+ {
+ ItemId itemid = PageGetItemId(page, offnum);
+
+ if (ItemIdIsDead(itemid))
+ deletable[ndeletable++] = offnum;
+ }
+
+ if (ndeletable > 0)
+ {
+ /*
+ * Skip deduplication in rare cases where there were LP_DEAD items
+ * encountered here when that frees sufficient space for caller to
+ * avoid a page split
+ */
+ _bt_delitems_delete(rel, buffer, deletable, ndeletable, heapRel);
+ if (PageGetFreeSpace(page) >= newitemsz)
+ {
+ pfree(state);
+ return;
+ }
+
+ /* Continue with deduplication */
+ minoff = P_FIRSTDATAKEY(oopaque);
+ maxoff = PageGetMaxOffsetNumber(page);
+ }
+
+ /*
+ * Scan over all items to see which ones can be deduplicated
+ */
+ newpage = PageGetTempPageCopySpecial(page);
+ nopaque = (BTPageOpaque) PageGetSpecialPointer(newpage);
+
+ /*
+ * Copy the original page's LSN into newpage, which will become the
+ * updated version of the page. We need this because XLogInsert will
+ * examine the LSN and possibly dump it in a page image.
+ */
+ PageSetLSN(newpage, PageGetLSN(page));
+
+ /* Make sure that new page won't have garbage flag set */
+ nopaque->btpo_flags &= ~BTP_HAS_GARBAGE;
+
+ /* Copy High Key if any */
+ if (!P_RIGHTMOST(oopaque))
+ {
+ ItemId hitemid = PageGetItemId(page, P_HIKEY);
+ Size hitemsz = ItemIdGetLength(hitemid);
+ IndexTuple hitem = (IndexTuple) PageGetItem(page, hitemid);
+
+ if (PageAddItem(newpage, (Item) hitem, hitemsz, P_HIKEY,
+ false, false) == InvalidOffsetNumber)
+ elog(ERROR, "failed to add highkey during deduplication");
+ }
+
+ /* Passed-in newitemsz is MAXALIGNED but does not include line pointer */
+ newitemsz += sizeof(ItemIdData);
+ /* Conservatively size array */
+ state->htids = palloc(state->maxitemsize);
+
+ /*
+ * Iterate over tuples on the page, try to deduplicate them into posting
+ * lists and insert into new page.
+ */
+ for (offnum = minoff;
+ offnum <= maxoff;
+ offnum = OffsetNumberNext(offnum))
+ {
+ ItemId itemid = PageGetItemId(page, offnum);
+ IndexTuple itup = (IndexTuple) PageGetItem(page, itemid);
+
+ Assert(!ItemIdIsDead(itemid));
+
+ if (offnum == minoff)
+ {
+ /*
+ * No previous/base tuple for first data item -- use first data
+ * item as base tuple of first pending posting list
+ */
+ _bt_dedup_start_pending(state, itup, offnum);
+ }
+ else if (state->deduplicate &&
+ _bt_keep_natts_fast(rel, state->base, itup) > natts &&
+ _bt_dedup_save_htid(state, itup))
+ {
+ /*
+ * Tuple is equal to base tuple of pending posting list, and
+ * merging itup into pending posting list won't exceed the
+ * BTMaxItemSize() limit. Heap TID(s) for itup have been saved in
+ * state. The next iteration will also end up here if it's
+ * possible to merge the next tuple into the same pending posting
+ * list.
+ */
+ }
+ else
+ {
+ /*
+ * Tuple is not equal to pending posting list tuple, or
+ * BTMaxItemSize() limit was reached
+ */
+ pagesaving += _bt_dedup_finish_pending(newpage, state);
+
+ /*
+ * When we have deduplicated enough to avoid page split, don't
+ * bother merging together existing tuples to create new posting
+ * lists.
+ *
+ * Note: We deliberately add as many heap TIDs as possible to a
+ * pending posting list by performing this check at this point
+ * (just before a new pending posting list is created). It would
+ * be possible to make the final new posting list for each
+ * successful page deduplication operation as small as possible
+ * while still avoiding a page split for caller. We don't want to
+ * repeatedly merge posting lists around the same range of heap
+ * TIDs, though.
+ *
+ * (Besides, the total number of new posting lists created is the
+ * cost that this check is supposed to minimize -- there is no
+ * great reason to be concerned about the absolute number of
+ * existing tuples that can be killed/replaced.)
+ */
+#if 0
+ /* Actually, don't do that */
+ /* TODO: Make a final decision on this */
+ if (pagesaving >= newitemsz)
+ state->deduplicate = false;
+#endif
+
+ /* itup starts new pending posting list */
+ _bt_dedup_start_pending(state, itup, offnum);
+ }
+ }
+
+ /* Handle the last item */
+ pagesaving += _bt_dedup_finish_pending(newpage, state);
+
+ /*
+ * If no items suitable for deduplication were found, newpage must be
+ * exactly the same as the original page, so just return from function.
+ */
+ if (state->nintervals == 0)
+ {
+ pfree(newpage);
+ pfree(state->htids);
+ pfree(state);
+ return;
+ }
+
+ START_CRIT_SECTION();
+
+ PageRestoreTempPage(newpage, page);
+ MarkBufferDirty(buffer);
+
+ /* Log deduplicated items */
+ if (RelationNeedsWAL(rel))
+ {
+ XLogRecPtr recptr;
+ xl_btree_dedup xlrec_dedup;
+
+ xlrec_dedup.nintervals = state->nintervals;
+
+ XLogBeginInsert();
+ XLogRegisterBuffer(0, buffer, REGBUF_STANDARD);
+ XLogRegisterData((char *) &xlrec_dedup, SizeOfBtreeDedup);
+
+ Assert(state->nintervals > 0);
+ XLogRegisterData((char *) state->intervals,
+ state->nintervals * sizeof(BTDedupInterval));
+
+ recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_DEDUP_PAGE);
+
+ PageSetLSN(page, recptr);
+ }
+
+ END_CRIT_SECTION();
+
+ /* be tidy */
+ pfree(state->htids);
+ pfree(state);
+}
+
+/*
+ * Create a new pending posting list tuple based on caller's tuple.
+ *
+ * Every tuple processed by the deduplication routines either becomes the base
+ * tuple for a posting list, or gets its heap TID(s) accepted into a pending
+ * posting list. A tuple that starts out as the base tuple for a posting list
+ * will only actually be rewritten within _bt_dedup_finish_pending() when
+ * there was at least one successful call to _bt_dedup_save_htid().
+ *
+ * Exported for use by nbtsort.c and recovery.
+ */
+void
+_bt_dedup_start_pending(BTDedupState *state, IndexTuple base,
+ OffsetNumber baseoff)
+{
+ Assert(state->nhtids == 0);
+ Assert(state->nitems == 0);
+
+ /*
+ * Copy heap TIDs from new base tuple for new candidate posting list into
+ * htids array. Assume that we'll eventually create a new posting tuple by
+ * merging later tuples with this existing one, though we may not.
+ */
+ if (!BTreeTupleIsPosting(base))
+ {
+ memcpy(state->htids, &base->t_tid, sizeof(ItemPointerData));
+ state->nhtids = 1;
+ /* Save size of tuple without any posting list */
+ state->basetupsize = IndexTupleSize(base);
+ }
+ else
+ {
+ int nposting;
+
+ nposting = BTreeTupleGetNPosting(base);
+ memcpy(state->htids, BTreeTupleGetPosting(base),
+ sizeof(ItemPointerData) * nposting);
+ state->nhtids = nposting;
+ /* Save size of tuple without any posting list */
+ state->basetupsize = BTreeTupleGetPostingOffset(base);
+ }
+
+ /*
+ * Save new base tuple itself -- it'll be needed if we actually create a
+ * new posting list from new pending posting list.
+ *
+ * Must maintain size of all tuples (including line pointer overhead) to
+ * calculate space savings on page within _bt_dedup_finish_pending().
+ * Also, save the number of logical tuples in the base tuple so that we can
+ * save cycles in the common case where an existing posting list can't or
+ * won't be merged with other tuples on the page.
+ */
+ state->nitems = 1;
+ state->base = base;
+ state->baseoff = baseoff;
+ state->alltupsize = MAXALIGN(IndexTupleSize(base)) + sizeof(ItemIdData);
+ /* Also save baseoff in pending state for interval */
+ state->intervals[state->nintervals].baseoff = state->baseoff;
+}
+
+/*
+ * Save itup heap TID(s) into pending posting list where possible.
+ *
+ * Returns bool indicating if the pending posting list managed by state has
+ * itup's heap TID(s) saved. When this is false, enlarging the pending
+ * posting list by the required amount would exceed the maxitemsize limit, so
+ * caller must finish the pending posting list tuple. (Generally itup becomes
+ * the base tuple of caller's new pending posting list).
+ *
+ * Exported for use by nbtsort.c and recovery.
+ */
+bool
+_bt_dedup_save_htid(BTDedupState *state, IndexTuple itup)
+{
+ int nhtids;
+ ItemPointer htids;
+ Size mergedtupsz;
+
+ if (!BTreeTupleIsPosting(itup))
+ {
+ nhtids = 1;
+ htids = &itup->t_tid;
+ }
+ else
+ {
+ nhtids = BTreeTupleGetNPosting(itup);
+ htids = BTreeTupleGetPosting(itup);
+ }
+
+ /*
+ * Don't append (have caller finish pending posting list as-is) if
+ * appending heap TID(s) from itup would put us over limit
+ */
+ mergedtupsz = MAXALIGN(state->basetupsize +
+ (state->nhtids + nhtids) *
+ sizeof(ItemPointerData));
+
+ if (mergedtupsz > state->maxitemsize)
+ return false;
+
+ /*
+ * Save heap TIDs to pending posting list tuple -- itup can be merged into
+ * pending posting list
+ */
+ state->nitems++;
+ memcpy(state->htids + state->nhtids, htids,
+ sizeof(ItemPointerData) * nhtids);
+ state->nhtids += nhtids;
+ state->alltupsize += MAXALIGN(IndexTupleSize(itup)) + sizeof(ItemIdData);
+
+ return true;
+}
+
+/*
+ * Finalize pending posting list tuple, and add it to the page. Final tuple
+ * is based on saved base tuple, and saved list of heap TIDs.
+ *
+ * Returns space saving from deduplicating to make a new posting list tuple.
+ * Note that this includes line pointer overhead. This is zero in the case
+ * where no deduplication was possible.
+ *
+ * Exported for use by recovery.
+ */
+Size
+_bt_dedup_finish_pending(Page page, BTDedupState *state)
+{
+ IndexTuple final;
+ Size finalsz;
+ OffsetNumber finaloff;
+ Size spacesaving;
+
+ Assert(state->nitems > 0);
+ Assert(state->nitems <= state->nhtids);
+ Assert(state->intervals[state->nintervals].baseoff == state->baseoff);
+
+ if (state->nitems == 1)
+ {
+ /* Use original, unchanged base tuple */
+ final = state->base;
+ spacesaving = 0;
+ finalsz = IndexTupleSize(final);
+
+ /* Do not increment nintervals -- skip WAL logging/replay */
+ }
+ else
+ {
+ /* Form a tuple with a posting list */
+ final = BTreeFormPostingTuple(state->base, state->htids,
+ state->nhtids);
+ finalsz = IndexTupleSize(final);
+ spacesaving = state->alltupsize - (finalsz + sizeof(ItemIdData));
+ /* Must have saved some space */
+ Assert(spacesaving > 0 && spacesaving < BLCKSZ);
+
+ /* Save final number of items for posting list */
+ state->intervals[state->nintervals].nitems = state->nitems;
+
+ /* Advance to next candidate */
+ state->nintervals++;
+ }
+
+ finaloff = OffsetNumberNext(PageGetMaxOffsetNumber(page));
+ Assert(finalsz <= state->maxitemsize);
+ Assert(finalsz == MAXALIGN(IndexTupleSize(final)));
+ if (PageAddItem(page, (Item) final, finalsz, finaloff, false,
+ false) == InvalidOffsetNumber)
+ elog(ERROR, "deduplication failed to add tuple to page");
+
+ if (final != state->base)
+ pfree(final);
+
+ /* Reset state for next pending posting list */
+ state->nhtids = 0;
+ state->nitems = 0;
+ state->alltupsize = 0;
+
+ return spacesaving;
+}
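To make the start/save/finish protocol of the three routines above easier to follow, here is a minimal standalone sketch. It is not nbtree code: `Item`, `DedupSketch`, `MAX_TIDS` and the `sketch_*` functions are hypothetical stand-ins that work on a sorted array of integer keys instead of index tuples on a page, and the size limit counts TIDs rather than bytes.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_TIDS 4				/* hypothetical cap standing in for maxitemsize */

typedef struct Item
{
	int			key;			/* stands in for the index key columns */
	int			tid;			/* stands in for a heap TID */
} Item;

typedef struct DedupSketch
{
	int			base_key;		/* key of the pending list's base item */
	int			tids[MAX_TIDS]; /* TIDs saved for the pending list */
	int			ntids;			/* number of TIDs saved so far */
	int			nitems;			/* input items merged into pending list */
	int			nintervals;		/* posting lists actually formed */
	int			noutput;		/* items written to the "new page" */
} DedupSketch;

/* Begin a pending posting list with base as its first (maybe only) member */
static void
sketch_start_pending(DedupSketch *s, const Item *base)
{
	assert(s->ntids == 0 && s->nitems == 0);
	s->base_key = base->key;
	s->tids[0] = base->tid;
	s->ntids = 1;
	s->nitems = 1;
}

/* Try to merge it into the pending list; returns 0 when the cap is hit */
static int
sketch_save_tid(DedupSketch *s, const Item *it)
{
	if (s->ntids + 1 > MAX_TIDS)
		return 0;
	s->tids[s->ntids++] = it->tid;
	s->nitems++;
	return 1;
}

/* Emit the pending list; returns how many input items were saved */
static int
sketch_finish_pending(DedupSketch *s)
{
	int			saved = s->nitems - 1;

	if (s->nitems > 1)
		s->nintervals++;		/* only real merges count as intervals */
	s->noutput++;
	s->ntids = 0;
	s->nitems = 0;
	return saved;
}

/* Drive the state machine over a sorted array, like the main loop above */
int
sketch_dedup(const Item *items, int n, DedupSketch *s)
{
	int			saved = 0;

	s->ntids = s->nitems = s->nintervals = s->noutput = 0;
	if (n == 0)
		return 0;

	for (int i = 0; i < n; i++)
	{
		if (i == 0)
			sketch_start_pending(s, &items[i]);
		else if (items[i].key == s->base_key &&
				 sketch_save_tid(s, &items[i]))
		{
			/* merged into the pending list, nothing else to do */
		}
		else
		{
			saved += sketch_finish_pending(s);
			sketch_start_pending(s, &items[i]);
		}
	}

	/* Handle the last pending list */
	saved += sketch_finish_pending(s);
	return saved;
}
```

As in the patch, an item that fails to merge only because the cap was reached simply becomes the base of the next pending list, and a pending list with a single member is emitted unchanged without counting as an interval.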
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 268f869a36..ecf75ef2c0 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -24,6 +24,7 @@
#include "access/nbtree.h"
#include "access/nbtxlog.h"
+#include "access/tableam.h"
#include "access/transam.h"
#include "access/xlog.h"
#include "access/xloginsert.h"
@@ -42,6 +43,11 @@ static bool _bt_lock_branch_parent(Relation rel, BlockNumber child,
BlockNumber *target, BlockNumber *rightsib);
static void _bt_log_reuse_page(Relation rel, BlockNumber blkno,
TransactionId latestRemovedXid);
+static TransactionId _bt_compute_xid_horizon_for_tuples(Relation rel,
+ Relation heapRel,
+ Buffer buf,
+ OffsetNumber *itemnos,
+ int nitems);
/*
* _bt_initmetapage() -- Fill a page buffer with a correct metapage image
@@ -983,14 +989,52 @@ _bt_page_recyclable(Page page)
void
_bt_delitems_vacuum(Relation rel, Buffer buf,
OffsetNumber *itemnos, int nitems,
+ OffsetNumber *updateitemnos,
+ IndexTuple *updated, int nupdatable,
BlockNumber lastBlockVacuumed)
{
Page page = BufferGetPage(buf);
BTPageOpaque opaque;
+ Size itemsz;
+ Size updated_sz = 0;
+ char *updated_buf = NULL;
+
+ /* XLOG stuff: build a buffer holding the updated tuples */
+ if (nupdatable > 0 && RelationNeedsWAL(rel))
+ {
+ Size offset = 0;
+
+ for (int i = 0; i < nupdatable; i++)
+ updated_sz += MAXALIGN(IndexTupleSize(updated[i]));
+
+ updated_buf = palloc(updated_sz);
+ for (int i = 0; i < nupdatable; i++)
+ {
+ itemsz = IndexTupleSize(updated[i]);
+ memcpy(updated_buf + offset, (char *) updated[i], itemsz);
+ offset += MAXALIGN(itemsz);
+ }
+ Assert(offset == updated_sz);
+ }
/* No ereport(ERROR) until changes are logged */
START_CRIT_SECTION();
+ /* Handle posting tuples here */
+ for (int i = 0; i < nupdatable; i++)
+ {
+ /* First, delete the old tuple */
+ PageIndexTupleDelete(page, updateitemnos[i]);
+
+ itemsz = IndexTupleSize(updated[i]);
+ itemsz = MAXALIGN(itemsz);
+
+ /* Add tuple with updated ItemPointers to the page. */
+ if (PageAddItem(page, (Item) updated[i], itemsz, updateitemnos[i],
+ false, false) == InvalidOffsetNumber)
+ elog(ERROR, "failed to rewrite posting list item in index while doing vacuum");
+ }
+
/* Fix the page */
if (nitems > 0)
PageIndexMultiDelete(page, itemnos, nitems);
@@ -1020,6 +1064,8 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
xl_btree_vacuum xlrec_vacuum;
xlrec_vacuum.lastBlockVacuumed = lastBlockVacuumed;
+ xlrec_vacuum.nupdated = nupdatable;
+ xlrec_vacuum.ndeleted = nitems;
XLogBeginInsert();
XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
@@ -1033,6 +1079,19 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
if (nitems > 0)
XLogRegisterBufData(0, (char *) itemnos, nitems * sizeof(OffsetNumber));
+ /*
+ * Save the offset numbers and the updated tuples themselves. It's
+ * important to restore them in the correct order: replay must first
+ * handle the updated tuples, and only then the other deleted items.
+ */
+ if (nupdatable > 0)
+ {
+ Assert(updated_buf != NULL);
+ XLogRegisterBufData(0, (char *) updateitemnos,
+ nupdatable * sizeof(OffsetNumber));
+ XLogRegisterBufData(0, updated_buf, updated_sz);
+ }
+
recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_VACUUM);
PageSetLSN(page, recptr);
@@ -1041,6 +1100,91 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
END_CRIT_SECTION();
}
+/*
+ * Get the latestRemovedXid from the table entries pointed at by the index
+ * tuples being deleted.
+ *
+ * This is a version of index_compute_xid_horizon_for_tuples() specialized to
+ * nbtree, which can handle posting lists.
+ */
+static TransactionId
+_bt_compute_xid_horizon_for_tuples(Relation rel, Relation heapRel,
+ Buffer buf, OffsetNumber *itemnos,
+ int nitems)
+{
+ ItemPointer htids;
+ TransactionId latestRemovedXid = InvalidTransactionId;
+ Page page = BufferGetPage(buf);
+ int arraynitems;
+ int finalnitems;
+
+ /*
+ * Initial size of array can fit everything when it turns out that there
+ * are no posting lists
+ */
+ arraynitems = nitems;
+ htids = (ItemPointer) palloc(sizeof(ItemPointerData) * arraynitems);
+
+ finalnitems = 0;
+ /* identify what the index tuples about to be deleted point to */
+ for (int i = 0; i < nitems; i++)
+ {
+ ItemId itemid;
+ IndexTuple itup;
+
+ itemid = PageGetItemId(page, itemnos[i]);
+ itup = (IndexTuple) PageGetItem(page, itemid);
+
+ Assert(ItemIdIsDead(itemid));
+
+ if (!BTreeTupleIsPosting(itup))
+ {
+ /* Make sure that we have space for additional heap TID */
+ if (finalnitems + 1 > arraynitems)
+ {
+ arraynitems = arraynitems * 2;
+ htids = (ItemPointer)
+ repalloc(htids, sizeof(ItemPointerData) * arraynitems);
+ }
+
+ Assert(ItemPointerIsValid(&itup->t_tid));
+ ItemPointerCopy(&itup->t_tid, &htids[finalnitems]);
+ finalnitems++;
+ }
+ else
+ {
+ int nposting = BTreeTupleGetNPosting(itup);
+
+ /* Make sure that we have space for additional heap TIDs */
+ if (finalnitems + nposting > arraynitems)
+ {
+ arraynitems = Max(arraynitems * 2, finalnitems + nposting);
+ htids = (ItemPointer)
+ repalloc(htids, sizeof(ItemPointerData) * arraynitems);
+ }
+
+ for (int j = 0; j < nposting; j++)
+ {
+ ItemPointer htid = BTreeTupleGetPostingN(itup, j);
+
+ Assert(ItemPointerIsValid(htid));
+ ItemPointerCopy(htid, &htids[finalnitems]);
+ finalnitems++;
+ }
+ }
+ }
+
+ Assert(finalnitems >= nitems);
+
+ /* determine the actual xid horizon */
+ latestRemovedXid =
+ table_compute_xid_horizon_for_tuples(heapRel, htids, finalnitems);
+
+ pfree(htids);
+
+ return latestRemovedXid;
+}
+
/*
* Delete item(s) from a btree page during single-page cleanup.
*
@@ -1067,8 +1211,8 @@ _bt_delitems_delete(Relation rel, Buffer buf,
if (XLogStandbyInfoActive() && RelationNeedsWAL(rel))
latestRemovedXid =
- index_compute_xid_horizon_for_tuples(rel, heapRel, buf,
- itemnos, nitems);
+ _bt_compute_xid_horizon_for_tuples(rel, heapRel, buf,
+ itemnos, nitems);
/* No ereport(ERROR) until changes are logged */
START_CRIT_SECTION();
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 4cfd5289ad..baea34ea74 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -97,6 +97,8 @@ static void btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
BTCycleId cycleid, TransactionId *oldestBtpoXact);
static void btvacuumpage(BTVacState *vstate, BlockNumber blkno,
BlockNumber orig_blkno);
+static ItemPointer btreevacuumposting(BTVacState *vstate, IndexTuple itup,
+ int *nremaining);
/*
@@ -263,8 +265,8 @@ btgettuple(IndexScanDesc scan, ScanDirection dir)
*/
if (so->killedItems == NULL)
so->killedItems = (int *)
- palloc(MaxIndexTuplesPerPage * sizeof(int));
- if (so->numKilled < MaxIndexTuplesPerPage)
+ palloc(MaxPostingIndexTuplesPerPage * sizeof(int));
+ if (so->numKilled < MaxPostingIndexTuplesPerPage)
so->killedItems[so->numKilled++] = so->currPos.itemIndex;
}
@@ -1069,7 +1071,8 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
RBM_NORMAL, info->strategy);
LockBufferForCleanup(buf);
_bt_checkpage(rel, buf);
- _bt_delitems_vacuum(rel, buf, NULL, 0, vstate.lastBlockVacuumed);
+ _bt_delitems_vacuum(rel, buf, NULL, 0, NULL, NULL, 0,
+ vstate.lastBlockVacuumed);
_bt_relbuf(rel, buf);
}
@@ -1188,8 +1191,17 @@ restart:
}
else if (P_ISLEAF(opaque))
{
+ /* Deletable item state */
OffsetNumber deletable[MaxOffsetNumber];
int ndeletable;
+ int nhtidsdead;
+ int nhtidslive;
+
+ /* Updatable item state (for posting lists) */
+ IndexTuple updated[MaxOffsetNumber];
+ OffsetNumber updatable[MaxOffsetNumber];
+ int nupdatable;
+
OffsetNumber offnum,
minoff,
maxoff;
@@ -1229,6 +1241,10 @@ restart:
* callback function.
*/
ndeletable = 0;
+ nupdatable = 0;
+ /* Maintain stats counters for index tuple versions/heap TIDs */
+ nhtidsdead = 0;
+ nhtidslive = 0;
minoff = P_FIRSTDATAKEY(opaque);
maxoff = PageGetMaxOffsetNumber(page);
if (callback)
@@ -1238,11 +1254,9 @@ restart:
offnum = OffsetNumberNext(offnum))
{
IndexTuple itup;
- ItemPointer htup;
itup = (IndexTuple) PageGetItem(page,
PageGetItemId(page, offnum));
- htup = &(itup->t_tid);
/*
* During Hot Standby we currently assume that
@@ -1265,8 +1279,71 @@ restart:
* applies to *any* type of index that marks index tuples as
* killed.
*/
- if (callback(htup, callback_state))
- deletable[ndeletable++] = offnum;
+ if (!BTreeTupleIsPosting(itup))
+ {
+ /* Regular tuple, standard heap TID representation */
+ ItemPointer htid = &(itup->t_tid);
+
+ if (callback(htid, callback_state))
+ {
+ deletable[ndeletable++] = offnum;
+ nhtidsdead++;
+ }
+ else
+ nhtidslive++;
+ }
+ else
+ {
+ ItemPointer newhtids;
+ int nremaining;
+
+ /*
+ * Posting list tuple, a physical tuple that represents
+ * two or more logical tuples, any of which could be an
+ * index row version that must be removed
+ */
+ newhtids = btreevacuumposting(vstate, itup, &nremaining);
+ if (newhtids == NULL)
+ {
+ /*
+ * All TIDs/logical tuples from the posting tuple
+ * remain, so no update or delete required
+ */
+ Assert(nremaining == BTreeTupleGetNPosting(itup));
+ }
+ else if (nremaining > 0)
+ {
+ IndexTuple updatedtuple;
+
+ /*
+ * Form new tuple that contains only remaining TIDs.
+ * Remember this tuple and the offset of the old tuple
+ * for when we update it in place
+ */
+ Assert(nremaining < BTreeTupleGetNPosting(itup));
+ updatedtuple = BTreeFormPostingTuple(itup, newhtids,
+ nremaining);
+ updated[nupdatable] = updatedtuple;
+ updatable[nupdatable++] = offnum;
+ nhtidsdead += BTreeTupleGetNPosting(itup) - nremaining;
+ pfree(newhtids);
+ }
+ else
+ {
+ /*
+ * All TIDs/logical tuples from the posting list must
+ * be deleted. We'll delete the physical tuple
+ * completely.
+ */
+ deletable[ndeletable++] = offnum;
+ nhtidsdead += BTreeTupleGetNPosting(itup);
+
+ /* Free empty array of live items */
+ pfree(newhtids);
+ }
+
+ nhtidslive += nremaining;
+ }
}
}
@@ -1274,7 +1351,7 @@ restart:
* Apply any needed deletes. We issue just one _bt_delitems_vacuum()
* call per page, so as to minimize WAL traffic.
*/
- if (ndeletable > 0)
+ if (ndeletable > 0 || nupdatable > 0)
{
/*
* Notice that the issued XLOG_BTREE_VACUUM WAL record includes
@@ -1290,7 +1367,8 @@ restart:
* doesn't seem worth the amount of bookkeeping it'd take to avoid
* that.
*/
- _bt_delitems_vacuum(rel, buf, deletable, ndeletable,
+ _bt_delitems_vacuum(rel, buf, deletable, ndeletable, updatable,
+ updated, nupdatable,
vstate->lastBlockVacuumed);
/*
@@ -1300,7 +1378,7 @@ restart:
if (blkno > vstate->lastBlockVacuumed)
vstate->lastBlockVacuumed = blkno;
- stats->tuples_removed += ndeletable;
+ stats->tuples_removed += nhtidsdead;
/* must recompute maxoff */
maxoff = PageGetMaxOffsetNumber(page);
}
@@ -1315,6 +1393,7 @@ restart:
* We treat this like a hint-bit update because there's no need to
* WAL-log it.
*/
+ Assert(nhtidsdead == 0);
if (vstate->cycleid != 0 &&
opaque->btpo_cycleid == vstate->cycleid)
{
@@ -1324,15 +1403,16 @@ restart:
}
/*
- * If it's now empty, try to delete; else count the live tuples. We
- * don't delete when recursing, though, to avoid putting entries into
+ * If it's now empty, try to delete; else count the live tuples (live
+ * heap TIDs in posting lists are counted as live tuples). We don't
+ * delete when recursing, though, to avoid putting entries into
* freePages out-of-order (doesn't seem worth any extra code to handle
* the case).
*/
if (minoff > maxoff)
delete_now = (blkno == orig_blkno);
else
- stats->num_index_tuples += maxoff - minoff + 1;
+ stats->num_index_tuples += nhtidslive;
}
if (delete_now)
@@ -1375,6 +1455,68 @@ restart:
}
}
+/*
+ * btreevacuumposting() -- determines which logical tuples must remain when
+ * VACUUMing a posting list tuple.
+ *
+ * Returns new palloc'd array of item pointers needed to build replacement
+ * posting list without the index row versions that are to be deleted.
+ *
+ * Note that returned array is NULL in the common case where there is nothing
+ * to delete in caller's posting list tuple. The number of TIDs that should
+ * remain in the posting list tuple is set for caller in *nremaining. This is
+ * also the size of the returned array (though only when array isn't just
+ * NULL).
+ */
+static ItemPointer
+btreevacuumposting(BTVacState *vstate, IndexTuple itup, int *nremaining)
+{
+ int live = 0;
+ int nitem = BTreeTupleGetNPosting(itup);
+ ItemPointer tmpitems = NULL,
+ items = BTreeTupleGetPosting(itup);
+
+ Assert(BTreeTupleIsPosting(itup));
+
+ /*
+ * Check each tuple in the posting list. Save live tuples into tmpitems,
+ * though try to avoid memory allocation as an optimization.
+ */
+ for (int i = 0; i < nitem; i++)
+ {
+ if (!vstate->callback(items + i, vstate->callback_state))
+ {
+ /*
+ * Live heap TID.
+ *
+ * Only save live TID when we know that we're going to have to
+ * kill at least one TID, and have already allocated memory.
+ */
+ if (tmpitems)
+ tmpitems[live] = items[i];
+ live++;
+ }
+
+ /* Dead heap TID */
+ else if (tmpitems == NULL)
+ {
+ /*
+ * Turns out we need to delete one or more dead heap TIDs, so
+ * start maintaining an array of live TIDs for caller to
+ * reconstruct smaller replacement posting list tuple
+ */
+ tmpitems = palloc(sizeof(ItemPointerData) * nitem);
+
+ /* Copy live heap TIDs from previous loop iterations */
+ if (live > 0)
+ memcpy(tmpitems, items, sizeof(ItemPointerData) * live);
+ }
+ }
+
+ *nremaining = live;
+ return tmpitems;
+}
+
/*
* btcanreturn() -- Check whether btree indexes support index-only scans.
*
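The lazy-allocation trick in btreevacuumposting() above can be shown in isolation. The following is only a sketch under simplifying assumptions: TIDs are plain ints, `filter_posting` and `tid_is_even` are hypothetical names, and malloc() stands in for palloc(). Nothing is allocated until the first dead TID is seen, at which point the live prefix is copied in one memcpy().

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical callback type: returns nonzero when tid is dead */
typedef int (*dead_cb) (int tid, void *arg);

/*
 * Returns NULL when every TID is live (nothing to rewrite).  Otherwise
 * returns a malloc'd array holding the live TIDs.  *nremaining is always
 * set to the number of live TIDs.
 */
int *
filter_posting(const int *tids, int n, dead_cb is_dead, void *arg,
			   int *nremaining)
{
	int			live = 0;
	int		   *tmp = NULL;

	for (int i = 0; i < n; i++)
	{
		if (!is_dead(tids[i], arg))
		{
			/* Live TID: only copy once we know a rewrite is needed */
			if (tmp)
				tmp[live] = tids[i];
			live++;
		}
		else if (tmp == NULL)
		{
			/* First dead TID: allocate, then copy the live prefix */
			tmp = malloc(sizeof(int) * n);
			if (tmp == NULL)
				abort();
			if (live > 0)
				memcpy(tmp, tids, sizeof(int) * live);
		}
	}

	*nremaining = live;
	return tmp;
}

/* Example callback for illustration: treat even TIDs as dead */
int
tid_is_even(int tid, void *arg)
{
	(void) arg;
	return tid % 2 == 0;
}
```

The caller distinguishes three outcomes the same way btvacuumpage() does: a NULL result means the tuple is untouched, a non-NULL result with *nremaining > 0 means the tuple must be rebuilt, and a non-NULL result with *nremaining == 0 means the whole tuple can be deleted (the empty array still has to be freed).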
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 8e512461a0..9022ee68ea 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -26,10 +26,18 @@
static void _bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp);
static OffsetNumber _bt_binsrch(Relation rel, BTScanInsert key, Buffer buf);
+static int _bt_binsrch_posting(BTScanInsert key, Page page,
+ OffsetNumber offnum);
static bool _bt_readpage(IndexScanDesc scan, ScanDirection dir,
OffsetNumber offnum);
static void _bt_saveitem(BTScanOpaque so, int itemIndex,
OffsetNumber offnum, IndexTuple itup);
+static void _bt_setuppostingitems(BTScanOpaque so, int itemIndex,
+ OffsetNumber offnum, ItemPointer heapTid,
+ IndexTuple itup);
+static inline void _bt_savepostingitem(BTScanOpaque so, int itemIndex,
+ OffsetNumber offnum,
+ ItemPointer heapTid);
static bool _bt_steppage(IndexScanDesc scan, ScanDirection dir);
static bool _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir);
static bool _bt_parallel_readpage(IndexScanDesc scan, BlockNumber blkno,
@@ -434,7 +442,10 @@ _bt_binsrch(Relation rel,
* low) makes bounds invalid.
*
* Caller is responsible for invalidating bounds when it modifies the page
- * before calling here a second time.
+ * before calling here a second time, and for dealing with posting list
+ * tuple matches (callers can use insertstate's postingoff field to
+ * determine which existing heap TID will need to be replaced by their
+ * scantid/new heap TID).
*/
OffsetNumber
_bt_binsrch_insert(Relation rel, BTInsertState insertstate)
@@ -453,6 +464,7 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
Assert(P_ISLEAF(opaque));
Assert(!key->nextkey);
+ Assert(insertstate->postingoff == 0);
if (!insertstate->bounds_valid)
{
@@ -509,6 +521,16 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
if (result != 0)
stricthigh = high;
}
+
+ /*
+ * If tuple at offset located by binary search is a posting list whose
+ * TID range overlaps with caller's scantid, perform posting list
+ * binary search to set postingoff for caller. Caller must split the
+ * posting list when postingoff is set. This should happen
+ * infrequently.
+ */
+ if (unlikely(result == 0 && key->scantid != NULL))
+ insertstate->postingoff = _bt_binsrch_posting(key, page, mid);
}
/*
@@ -528,6 +550,68 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
return low;
}
+/*----------
+ * _bt_binsrch_posting() -- posting list binary search.
+ *
+ * Returns offset into posting list where caller's scantid belongs.
+ *----------
+ */
+static int
+_bt_binsrch_posting(BTScanInsert key, Page page, OffsetNumber offnum)
+{
+ IndexTuple itup;
+ ItemId itemid;
+ int low,
+ high,
+ mid,
+ res;
+
+ /*
+ * If this isn't a posting tuple, then the index must be corrupt (if it is
+ * an ordinary non-pivot tuple then there must be an existing tuple with a
+ * heap TID that equals inserter's new heap TID/scantid). Defensively
+ * check that tuple is a posting list tuple whose posting list range
+ * includes caller's scantid.
+ *
+ * (This is also needed because contrib/amcheck's rootdescend option needs
+ * to be able to relocate a non-pivot tuple using _bt_binsrch_insert().)
+ */
+ Assert(P_ISLEAF((BTPageOpaque) PageGetSpecialPointer(page)));
+ Assert(!key->nextkey);
+ Assert(key->scantid != NULL);
+ itemid = PageGetItemId(page, offnum);
+ itup = (IndexTuple) PageGetItem(page, itemid);
+ if (!BTreeTupleIsPosting(itup))
+ return 0;
+
+ /*
+ * In the unlikely event that posting list tuple has LP_DEAD bit set,
+ * signal to caller that it should kill the item and restart its binary
+ * search.
+ */
+ if (ItemIdIsDead(itemid))
+ return -1;
+
+ /* "high" is past end of posting list for loop invariant */
+ low = 0;
+ high = BTreeTupleGetNPosting(itup);
+ Assert(high >= 2);
+
+ while (high > low)
+ {
+ mid = low + ((high - low) / 2);
+ res = ItemPointerCompare(key->scantid,
+ BTreeTupleGetPostingN(itup, mid));
+
+ if (res >= 1)
+ low = mid + 1;
+ else
+ high = mid;
+ }
+
+ return low;
+}
+
/*----------
* _bt_compare() -- Compare insertion-type scankey to tuple on a page.
*
@@ -537,9 +621,18 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
* <0 if scankey < tuple at offnum;
* 0 if scankey == tuple at offnum;
* >0 if scankey > tuple at offnum.
- * NULLs in the keys are treated as sortable values. Therefore
- * "equality" does not necessarily mean that the item should be
- * returned to the caller as a matching key!
+ *
+ * NULLs in the keys are treated as sortable values. Therefore
+ * "equality" does not necessarily mean that the item should be returned
+ * to the caller as a matching key. Similarly, an insertion scankey
+ * with its scantid set is treated as equal to a posting tuple whose TID
+ * range overlaps with its scantid. There generally won't be a
+ * matching TID in the posting tuple, which caller must handle
+ * themselves (e.g., by splitting the posting list tuple).
+ *
+ * It is generally guaranteed that any possible scankey with scantid set
+ * will have zero or one tuples in the index that are considered equal
+ * here.
*
* CRUCIAL NOTE: on a non-leaf page, the first data key is assumed to be
* "minus infinity": this routine will always claim it is less than the
@@ -563,6 +656,7 @@ _bt_compare(Relation rel,
ScanKey scankey;
int ncmpkey;
int ntupatts;
+ int32 result;
Assert(_bt_check_natts(rel, key->heapkeyspace, page, offnum));
Assert(key->keysz <= IndexRelationGetNumberOfKeyAttributes(rel));
@@ -597,7 +691,6 @@ _bt_compare(Relation rel,
{
Datum datum;
bool isNull;
- int32 result;
datum = index_getattr(itup, scankey->sk_attno, itupdesc, &isNull);
@@ -713,8 +806,24 @@ _bt_compare(Relation rel,
if (heapTid == NULL)
return 1;
+ /*
+ * scankey must be treated as equal to a posting list tuple if its scantid
+ * value falls within the range of the posting list. In all other cases
+ * there can only be a single heap TID value, which is compared directly
+ * as a simple scalar value.
+ */
Assert(ntupatts >= IndexRelationGetNumberOfKeyAttributes(rel));
- return ItemPointerCompare(key->scantid, heapTid);
+ result = ItemPointerCompare(key->scantid, heapTid);
+ if (!BTreeTupleIsPosting(itup) || result <= 0)
+ return result;
+ else
+ {
+ result = ItemPointerCompare(key->scantid, BTreeTupleGetMaxTID(itup));
+ if (result > 0)
+ return 1;
+ }
+
+ return 0;
}
/*
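For readers following the scantid handling, here is a small standalone sketch of the two pieces added above: the range-overlap comparison at the end of _bt_compare() and the lower-bound search in _bt_binsrch_posting(). It assumes TIDs can be modeled as ints in a sorted array; `compare_scantid_to_posting` and `posting_lower_bound` are hypothetical names, not functions from the patch.

```c
#include <assert.h>

/*
 * Mirror of the posting list branch of the comparison: returns <0 when
 * scantid sorts before the first TID, >0 when it sorts after the last TID,
 * and 0 when it falls anywhere inside the posting list's TID range (even
 * if no TID in the list matches exactly).  Requires n >= 1.
 */
int
compare_scantid_to_posting(const int *tids, int n, int scantid)
{
	if (scantid < tids[0])
		return -1;
	if (scantid == tids[0])
		return 0;
	if (scantid > tids[n - 1])
		return 1;
	return 0;
}

/*
 * Lower-bound binary search: returns the offset of the first TID that is
 * >= scantid, i.e. the position where scantid belongs.  "high" starts one
 * past the end, as in the loop invariant of the patch.
 */
int
posting_lower_bound(const int *tids, int n, int scantid)
{
	int			low = 0;
	int			high = n;

	while (high > low)
	{
		int			mid = low + (high - low) / 2;

		if (scantid > tids[mid])
			low = mid + 1;
		else
			high = mid;
	}

	return low;
}
```

With tids = {10, 20, 30, 40}, a scantid of 25 compares as equal to the posting list and the search returns offset 2, which is where a caller would have to split the list.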
@@ -1451,6 +1560,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
/* initialize tuple workspace to empty */
so->currPos.nextTupleOffset = 0;
+ so->currPos.postingTupleOffset = 0;
/*
* Now that the current page has been made consistent, the macro should be
@@ -1485,8 +1595,29 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
if (_bt_checkkeys(scan, itup, indnatts, dir, &continuescan))
{
/* tuple passes all scan key conditions, so remember it */
- _bt_saveitem(so, itemIndex, offnum, itup);
- itemIndex++;
+ if (!BTreeTupleIsPosting(itup))
+ {
+ _bt_saveitem(so, itemIndex, offnum, itup);
+ itemIndex++;
+ }
+ else
+ {
+ /*
+ * Setup state to return posting list, and save first
+ * "logical" tuple
+ */
+ _bt_setuppostingitems(so, itemIndex, offnum,
+ BTreeTupleGetPostingN(itup, 0),
+ itup);
+ itemIndex++;
+ /* Save additional posting list "logical" tuples */
+ for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
+ {
+ _bt_savepostingitem(so, itemIndex, offnum,
+ BTreeTupleGetPostingN(itup, i));
+ itemIndex++;
+ }
+ }
}
/* When !continuescan, there can't be any more matches, so stop */
if (!continuescan)
@@ -1519,7 +1650,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
if (!continuescan)
so->currPos.moreRight = false;
- Assert(itemIndex <= MaxIndexTuplesPerPage);
+ Assert(itemIndex <= MaxPostingIndexTuplesPerPage);
so->currPos.firstItem = 0;
so->currPos.lastItem = itemIndex - 1;
so->currPos.itemIndex = 0;
@@ -1527,7 +1658,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
else
{
/* load items[] in descending order */
- itemIndex = MaxIndexTuplesPerPage;
+ itemIndex = MaxPostingIndexTuplesPerPage;
offnum = Min(offnum, maxoff);
@@ -1569,8 +1700,36 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
if (passes_quals && tuple_alive)
{
/* tuple passes all scan key conditions, so remember it */
- itemIndex--;
- _bt_saveitem(so, itemIndex, offnum, itup);
+ if (!BTreeTupleIsPosting(itup))
+ {
+ itemIndex--;
+ _bt_saveitem(so, itemIndex, offnum, itup);
+ }
+ else
+ {
+ int i = BTreeTupleGetNPosting(itup) - 1;
+
+ /*
+ * Setup state to return posting list, and save last
+ * "logical" tuple from posting list (since it's the first
+ * that will be returned to scan).
+ */
+ itemIndex--;
+ _bt_setuppostingitems(so, itemIndex, offnum,
+ BTreeTupleGetPostingN(itup, i--),
+ itup);
+
+ /*
+ * Return posting list "logical" tuples -- do this in
+ * descending order, to match overall scan order
+ */
+ for (; i >= 0; i--)
+ {
+ itemIndex--;
+ _bt_savepostingitem(so, itemIndex, offnum,
+ BTreeTupleGetPostingN(itup, i));
+ }
+ }
}
if (!continuescan)
{
@@ -1584,8 +1743,8 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
Assert(itemIndex >= 0);
so->currPos.firstItem = itemIndex;
- so->currPos.lastItem = MaxIndexTuplesPerPage - 1;
- so->currPos.itemIndex = MaxIndexTuplesPerPage - 1;
+ so->currPos.lastItem = MaxPostingIndexTuplesPerPage - 1;
+ so->currPos.itemIndex = MaxPostingIndexTuplesPerPage - 1;
}
return (so->currPos.firstItem <= so->currPos.lastItem);
@@ -1598,6 +1757,8 @@ _bt_saveitem(BTScanOpaque so, int itemIndex,
{
BTScanPosItem *currItem = &so->currPos.items[itemIndex];
+ Assert(!BTreeTupleIsPosting(itup));
+
currItem->heapTid = itup->t_tid;
currItem->indexOffset = offnum;
if (so->currTuples)
@@ -1610,6 +1771,59 @@ _bt_saveitem(BTScanOpaque so, int itemIndex,
}
}
+/*
+ * Set up state to save posting items from a single posting list tuple, and
+ * save the logical tuple that will be returned to the scan first.
+ *
+ * Saves an index item into so->currPos.items[itemIndex] for that first
+ * logical tuple. The second and subsequent heap TIDs from the posting list
+ * should be saved by calling _bt_savepostingitem().
+ */
+static void
+_bt_setuppostingitems(BTScanOpaque so, int itemIndex, OffsetNumber offnum,
+ ItemPointer heapTid, IndexTuple itup)
+{
+ BTScanPosItem *currItem = &so->currPos.items[itemIndex];
+
+ currItem->heapTid = *heapTid;
+ currItem->indexOffset = offnum;
+
+ if (so->currTuples)
+ {
+ /* Save a base version of the IndexTuple */
+ Size itupsz = BTreeTupleGetPostingOffset(itup);
+
+ itupsz = MAXALIGN(itupsz);
+ currItem->tupleOffset = so->currPos.nextTupleOffset;
+ memcpy(so->currTuples + so->currPos.nextTupleOffset, itup, itupsz);
+ so->currPos.nextTupleOffset += itupsz;
+ so->currPos.postingTupleOffset = currItem->tupleOffset;
+ }
+}
+
+/*
+ * Save an index item into so->currPos.items[itemIndex] for posting tuple.
+ *
+ * Assumes that _bt_setuppostingitems() has already been called for current
+ * posting list tuple.
+ */
+static inline void
+_bt_savepostingitem(BTScanOpaque so, int itemIndex, OffsetNumber offnum,
+ ItemPointer heapTid)
+{
+ BTScanPosItem *currItem = &so->currPos.items[itemIndex];
+
+ currItem->heapTid = *heapTid;
+ currItem->indexOffset = offnum;
+
+ /*
+ * Have index-only scans return the same base IndexTuple for every logical
+ * tuple that originates from the same posting list
+ */
+ if (so->currTuples)
+ currItem->tupleOffset = so->currPos.postingTupleOffset;
+}
+
/*
* _bt_steppage() -- Step to next page containing valid data for scan
*
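For readers following the _bt_readpage() changes above: a posting list tuple is expanded into one items[] entry per heap TID, filled front-to-back on forward scans and back-to-front on backward scans. The following is only a minimal standalone sketch of that fill order, not code from the patch. SketchTid, SketchItem, SKETCH_MAX_ITEMS and both functions are hypothetical stand-ins for ItemPointerData, BTScanPosItem, MaxPostingIndexTuplesPerPage and the _bt_setuppostingitems()/_bt_savepostingitem() pair.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the nbtree types; not the real definitions */
typedef struct
{
	unsigned	block;
	unsigned short offset;
} SketchTid;

typedef struct
{
	SketchTid	heapTid;
	int			indexOffset;
} SketchItem;

#define SKETCH_MAX_ITEMS 16

/*
 * Forward scan: append one item per heap TID, in ascending TID order.
 * Returns the next free itemIndex.
 */
static int
sketch_save_forward(SketchItem *items, int itemIndex, int offnum,
					const SketchTid *posting, int nposting)
{
	for (int i = 0; i < nposting; i++)
	{
		items[itemIndex].heapTid = posting[i];
		items[itemIndex].indexOffset = offnum;
		itemIndex++;
	}
	return itemIndex;
}

/*
 * Backward scan: fill from the end of the array toward the front, visiting
 * the posting list from its last TID to its first.  Returns the index of
 * the first (lowest) item that was filled.
 */
static int
sketch_save_backward(SketchItem *items, int itemIndex, int offnum,
					 const SketchTid *posting, int nposting)
{
	for (int i = nposting - 1; i >= 0; i--)
	{
		itemIndex--;
		items[itemIndex].heapTid = posting[i];
		items[itemIndex].indexOffset = offnum;
	}
	return itemIndex;
}
```

In both directions the array ends up in ascending TID order. A backward scan simply starts reading at lastItem and walks down, which is why the patch saves the last heap TID first in that case.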
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index ab19692006..c51cbfb0ba 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -287,6 +287,9 @@ static void _bt_sortaddtup(Page page, Size itemsize,
IndexTuple itup, OffsetNumber itup_off);
static void _bt_buildadd(BTWriteState *wstate, BTPageState *state,
IndexTuple itup);
+static void _bt_sort_dedup_finish_pending(BTWriteState *wstate,
+ BTPageState *state,
+ BTDedupState *dstate);
static void _bt_uppershutdown(BTWriteState *wstate, BTPageState *state);
static void _bt_load(BTWriteState *wstate,
BTSpool *btspool, BTSpool *btspool2);
@@ -799,7 +802,8 @@ _bt_sortaddtup(Page page,
}
/*----------
- * Add an item to a disk page from the sort output.
+ * Add an item to a disk page from the sort output (or add a posting list
+ * item formed from the sort output).
*
* We must be careful to observe the page layout conventions of nbtsearch.c:
* - rightmost pages start data items at P_HIKEY instead of at P_FIRSTKEY.
@@ -1002,6 +1006,7 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
* the minimum key for the new page.
*/
state->btps_minkey = CopyIndexTuple(oitup);
+ Assert(BTreeTupleIsPivot(state->btps_minkey));
/*
* Set the sibling links for both pages.
@@ -1043,6 +1048,7 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
Assert(state->btps_minkey == NULL);
state->btps_minkey = CopyIndexTuple(itup);
/* _bt_sortaddtup() will perform full truncation later */
+ BTreeTupleClearBtIsPosting(state->btps_minkey);
BTreeTupleSetNAtts(state->btps_minkey, 0);
}
@@ -1057,6 +1063,42 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
state->btps_lastoff = last_off;
}
+/*
+ * Finalize pending posting list tuple, and add it to the index. Final tuple
+ * is based on saved base tuple, and saved list of heap TIDs.
+ *
+ * This is almost like nbtinsert.c's _bt_dedup_finish_pending(), but it adds a
+ * new tuple using _bt_buildadd() and does not maintain the intervals array.
+ */
+static void
+_bt_sort_dedup_finish_pending(BTWriteState *wstate, BTPageState *state,
+ BTDedupState *dstate)
+{
+ IndexTuple final;
+
+ Assert(dstate->nitems > 0);
+ if (dstate->nitems == 1)
+ final = dstate->base;
+ else
+ {
+ IndexTuple postingtuple;
+
+ /* form a tuple with a posting list */
+ postingtuple = BTreeFormPostingTuple(dstate->base,
+ dstate->htids,
+ dstate->nhtids);
+ final = postingtuple;
+ }
+
+ _bt_buildadd(wstate, state, final);
+
+ if (dstate->nitems > 1)
+ pfree(final);
+ /* Don't maintain dedup_intervals array, or alltupsize */
+ dstate->nhtids = 0;
+ dstate->nitems = 0;
+}
+
/*
* Finish writing out the completed btree.
*/
@@ -1144,6 +1186,11 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
keysz = IndexRelationGetNumberOfKeyAttributes(wstate->index);
SortSupport sortKeys;
int64 tuples_done = 0;
+ bool deduplicate;
+
+ /* Don't use deduplication for INCLUDE indexes or unique indexes */
+ deduplicate = (keysz == IndexRelationGetNumberOfAttributes(wstate->index) &&
+ !wstate->index->rd_index->indisunique);
if (merge)
{
@@ -1152,6 +1199,7 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
* btspool and btspool2.
*/
+ Assert(!deduplicate);
/* the preparation of merge */
itup = tuplesort_getindextuple(btspool->sortstate, true);
itup2 = tuplesort_getindextuple(btspool2->sortstate, true);
@@ -1255,9 +1303,95 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
}
pfree(sortKeys);
}
+ else if (deduplicate)
+ {
+ /* merge is unnecessary, deduplicate into posting lists */
+ BTDedupState *dstate;
+ IndexTuple newbase;
+
+ dstate = (BTDedupState *) palloc(sizeof(BTDedupState));
+ dstate->deduplicate = true; /* unused */
+ dstate->maxitemsize = 0; /* set later */
+ /* Metadata about current pending posting list */
+ dstate->htids = NULL;
+ dstate->nhtids = 0;
+ dstate->nitems = 0;
+ dstate->alltupsize = 0; /* unused */
+ /* Metadata about base tuple of current pending posting list */
+ dstate->base = NULL;
+ dstate->baseoff = InvalidOffsetNumber; /* unused */
+ dstate->basetupsize = 0;
+ dstate->nintervals = 0; /* unused */
+
+ while ((itup = tuplesort_getindextuple(btspool->sortstate,
+ true)) != NULL)
+ {
+ /* When we see first tuple, create first index page */
+ if (state == NULL)
+ {
+ state = _bt_pagestate(wstate, 0);
+ dstate->maxitemsize = BTMaxItemSize(state->btps_page);
+ /* Conservatively size array */
+ dstate->htids = palloc(dstate->maxitemsize);
+
+ /*
+ * No previous/base tuple, since itup is the first item
+ * returned by the tuplesort -- use itup as base tuple of
+ * first pending posting list for entire index build
+ */
+ newbase = CopyIndexTuple(itup);
+ _bt_dedup_start_pending(dstate, newbase, InvalidOffsetNumber);
+ }
+ else if (_bt_keep_natts_fast(wstate->index, dstate->base,
+ itup) > keysz &&
+ _bt_dedup_save_htid(dstate, itup))
+ {
+ /*
+ * Tuple is equal to base tuple of pending posting list, and
+ * merging itup into pending posting list won't exceed the
+ * BTMaxItemSize() limit. Heap TID(s) for itup have been
+ * saved in state. The next iteration will also end up here
+ * if it's possible to merge the next tuple into the same
+ * pending posting list.
+ */
+ }
+ else
+ {
+ /*
+ * Tuple is not equal to pending posting list tuple, or
+ * BTMaxItemSize() limit was reached
+ */
+ _bt_sort_dedup_finish_pending(wstate, state, dstate);
+ /* Base tuple is always a copy */
+ pfree(dstate->base);
+
+ /* itup starts new pending posting list */
+ newbase = CopyIndexTuple(itup);
+ _bt_dedup_start_pending(dstate, newbase, InvalidOffsetNumber);
+ }
+
+ /* Report progress */
+ pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
+ ++tuples_done);
+ }
+
+ /*
+ * Handle the last item (there must be a last item when the tuplesort
+ * returned one or more tuples)
+ */
+ if (state)
+ {
+ _bt_sort_dedup_finish_pending(wstate, state, dstate);
+ /* Base tuple is always a copy */
+ pfree(dstate->base);
+ pfree(dstate->htids);
+ }
+
+ pfree(dstate);
+ }
else
{
- /* merge is unnecessary */
+ /* merging and deduplication are both unnecessary */
while ((itup = tuplesort_getindextuple(btspool->sortstate,
true)) != NULL)
{
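The deduplicating branch of _bt_load() above is essentially a run-length grouping over the sorted tuplesort output, with a size cap. A rough standalone sketch follows, assuming integer keys and a fixed cap on TIDs per posting list in place of the real BTMaxItemSize() check. sketch_dedup_sorted() and SKETCH_MAX_TIDS are made up for illustration and do not appear in the patch.

```c
#include <assert.h>

/* Hypothetical cap standing in for the BTMaxItemSize() limit */
#define SKETCH_MAX_TIDS 3

/*
 * Walk keys[] (already sorted) and merge runs of equal keys into pending
 * "posting lists" of at most SKETCH_MAX_TIDS entries.  The size of each
 * finished group is written to groupsizes[], which must have room for
 * nkeys entries.  Returns the number of groups, i.e. the number of index
 * tuples that would be passed to _bt_buildadd().
 */
static int
sketch_dedup_sorted(const int *keys, int nkeys, int *groupsizes)
{
	int			ngroups = 0;
	int			pending = 0;	/* TIDs in the pending posting list */
	int			basekey = 0;	/* key of the pending posting list */

	for (int i = 0; i < nkeys; i++)
	{
		/* Equal to base key and still under the cap: merge and move on */
		if (pending > 0 && keys[i] == basekey && pending < SKETCH_MAX_TIDS)
		{
			pending++;
			continue;
		}

		/* Different key, or cap reached: finish the pending list */
		if (pending > 0)
			groupsizes[ngroups++] = pending;

		/* Current key starts a new pending posting list */
		basekey = keys[i];
		pending = 1;
	}

	/* Handle the last pending list, if there was any input at all */
	if (pending > 0)
		groupsizes[ngroups++] = pending;

	return ngroups;
}
```

As in the patch, a tuple that does not fit into the pending list is not dropped; it becomes the base tuple of the next pending list, so a long run of duplicates yields several posting list tuples.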
diff --git a/src/backend/access/nbtree/nbtsplitloc.c b/src/backend/access/nbtree/nbtsplitloc.c
index 1c1029b6c4..54cecc85c5 100644
--- a/src/backend/access/nbtree/nbtsplitloc.c
+++ b/src/backend/access/nbtree/nbtsplitloc.c
@@ -183,6 +183,9 @@ _bt_findsplitloc(Relation rel,
state.minfirstrightsz = SIZE_MAX;
state.newitemoff = newitemoff;
+ /* newitem cannot be a posting list item */
+ Assert(!BTreeTupleIsPosting(newitem));
+
/*
* maxsplits should never exceed maxoff because there will be at most as
* many candidate split points as there are points _between_ tuples, once
@@ -459,17 +462,52 @@ _bt_recsplitloc(FindSplitData *state,
int16 leftfree,
rightfree;
Size firstrightitemsz;
+ Size postingsubhikey = 0;
bool newitemisfirstonright;
/* Is the new item going to be the first item on the right page? */
newitemisfirstonright = (firstoldonright == state->newitemoff
&& !newitemonleft);
+ /*
+ * FIXME: Accessing every single tuple like this adds cycles to cases that
+ * cannot possibly benefit (i.e. cases where we know that there cannot be
+ * posting lists). Maybe we should add a way to not bother when we are
+ * certain that this is the case.
+ *
+ * We could either have _bt_split() pass us a flag, or invent a page flag
+ * that indicates that the page might have posting lists, as an
+ * optimization. There is no shortage of btpo_flags bits for stuff like
+ * this.
+ */
if (newitemisfirstonright)
+ {
firstrightitemsz = state->newitemsz;
+
+ /* Calculate posting list overhead, if any */
+ if (state->is_leaf && BTreeTupleIsPosting(state->newitem))
+ postingsubhikey = IndexTupleSize(state->newitem) -
+ BTreeTupleGetPostingOffset(state->newitem);
+ }
else
+ {
firstrightitemsz = firstoldonrightsz;
+ /* Calculate posting list overhead, if any */
+ if (state->is_leaf)
+ {
+ ItemId itemid;
+ IndexTuple newhighkey;
+
+ itemid = PageGetItemId(state->page, firstoldonright);
+ newhighkey = (IndexTuple) PageGetItem(state->page, itemid);
+
+ if (BTreeTupleIsPosting(newhighkey))
+ postingsubhikey = IndexTupleSize(newhighkey) -
+ BTreeTupleGetPostingOffset(newhighkey);
+ }
+ }
+
/* Account for all the old tuples */
leftfree = state->leftspace - olddataitemstoleft;
rightfree = state->rightspace -
@@ -492,9 +530,13 @@ _bt_recsplitloc(FindSplitData *state,
* adding a heap TID to the left half's new high key when splitting at the
* leaf level. In practice the new high key will often be smaller and
* will rarely be larger, but conservatively assume the worst case.
+ * Truncation always truncates away any posting list that appears in the
+ * first right tuple, though, so it's safe to subtract that overhead
+ * (while still conservatively assuming that truncation might have to add
+ * back a single heap TID using the pivot tuple heap TID representation).
*/
if (state->is_leaf)
- leftfree -= (int16) (firstrightitemsz +
+ leftfree -= (int16) ((firstrightitemsz - postingsubhikey) +
MAXALIGN(sizeof(ItemPointerData)));
else
leftfree -= (int16) firstrightitemsz;
@@ -691,7 +733,8 @@ _bt_afternewitemoff(FindSplitData *state, OffsetNumber maxoff,
itemid = PageGetItemId(state->page, OffsetNumberPrev(state->newitemoff));
tup = (IndexTuple) PageGetItem(state->page, itemid);
/* Do cheaper test first */
- if (!_bt_adjacenthtid(&tup->t_tid, &state->newitem->t_tid))
+ if (BTreeTupleIsPosting(tup) ||
+ !_bt_adjacenthtid(&tup->t_tid, &state->newitem->t_tid))
return false;
/* Check same conditions as rightmost item case, too */
keepnatts = _bt_keep_natts_fast(state->rel, tup, state->newitem);
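To make the _bt_recsplitloc() accounting above concrete, here is a small worked sketch of how much space the new high key is charged against the left page. It is only an illustration under stated assumptions: 8-byte MAXALIGN, a 6-byte ItemPointerData, and made-up helper names (sketch_posting_overhead, sketch_leaf_hikey_charge) that do not exist in the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed 8-byte maximum alignment, as on typical 64-bit platforms */
#define SKETCH_MAXALIGN(len) (((size_t) (len) + 7) & ~((size_t) 7))
/* Assumed sizeof(ItemPointerData) */
#define SKETCH_TID_SIZE 6

/*
 * Bytes of the first right tuple that belong to its posting list.  These
 * are always truncated away when the tuple becomes the new high key.
 */
static size_t
sketch_posting_overhead(size_t tuplesz, size_t postingoffset, int isposting)
{
	if (!isposting)
		return 0;
	return tuplesz - postingoffset;
}

/*
 * Worst-case space charged to the left page for its new high key at the
 * leaf level: the first right item minus any posting list, plus room for
 * one heap TID in the pivot tuple representation.
 */
static size_t
sketch_leaf_hikey_charge(size_t firstrightitemsz, size_t postingsubhikey)
{
	return (firstrightitemsz - postingsubhikey) +
		SKETCH_MAXALIGN(SKETCH_TID_SIZE);
}
```

For example, a 64-byte posting list tuple whose posting list starts at byte 16 carries 48 bytes of overhead. With a 4-byte line pointer included in firstrightitemsz (68 bytes), the charge drops from 76 bytes to 28 bytes once the posting list is subtracted.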
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index bc855dd25d..7460bf264d 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -97,8 +97,6 @@ _bt_mkscankey(Relation rel, IndexTuple itup)
indoption = rel->rd_indoption;
tupnatts = itup ? BTreeTupleGetNAtts(itup, rel) : 0;
- Assert(tupnatts <= IndexRelationGetNumberOfAttributes(rel));
-
/*
* We'll execute search using scan key constructed on key columns.
* Truncated attributes and non-key attributes are omitted from the final
@@ -110,9 +108,20 @@ _bt_mkscankey(Relation rel, IndexTuple itup)
key->anynullkeys = false; /* initial assumption */
key->nextkey = false;
key->pivotsearch = false;
+ key->scantid = NULL;
key->keysz = Min(indnkeyatts, tupnatts);
- key->scantid = key->heapkeyspace && itup ?
- BTreeTupleGetHeapTID(itup) : NULL;
+
+ Assert(tupnatts <= IndexRelationGetNumberOfAttributes(rel));
+ Assert(!itup || !BTreeTupleIsPosting(itup) || key->heapkeyspace);
+
+ /*
+ * When caller passes a tuple with a heap TID, use it to set scantid. Note
+ * that this handles posting list tuples by setting scantid to the lowest
+ * heap TID in the posting list.
+ */
+ if (itup && key->heapkeyspace)
+ key->scantid = BTreeTupleGetHeapTID(itup);
+
skey = key->scankeys;
for (i = 0; i < indnkeyatts; i++)
{
@@ -1386,6 +1395,7 @@ _bt_checkkeys(IndexScanDesc scan, IndexTuple tuple, int tupnatts,
* attribute passes the qual.
*/
Assert(ScanDirectionIsForward(dir));
+ Assert(BTreeTupleIsPivot(tuple));
continue;
}
@@ -1547,6 +1557,7 @@ _bt_check_rowcompare(ScanKey skey, IndexTuple tuple, int tupnatts,
* attribute passes the qual.
*/
Assert(ScanDirectionIsForward(dir));
+ Assert(BTreeTupleIsPivot(tuple));
cmpresult = 0;
if (subkey->sk_flags & SK_ROW_END)
break;
@@ -1786,10 +1797,35 @@ _bt_killitems(IndexScanDesc scan)
{
ItemId iid = PageGetItemId(page, offnum);
IndexTuple ituple = (IndexTuple) PageGetItem(page, iid);
+ bool killtuple = false;
- if (ItemPointerEquals(&ituple->t_tid, &kitem->heapTid))
+ if (BTreeTupleIsPosting(ituple))
{
- /* found the item */
+ int pi = i + 1;
+ int nposting = BTreeTupleGetNPosting(ituple);
+ int j;
+
+ for (j = 0; j < nposting; j++)
+ {
+ ItemPointer item = BTreeTupleGetPostingN(ituple, j);
+
+ if (!ItemPointerEquals(item, &kitem->heapTid))
+ break; /* out of posting list loop */
+
+ /* Read-ahead to later kitems */
+ if (pi < numKilled)
+ kitem = &so->currPos.items[so->killedItems[pi++]];
+ }
+
+ if (j == nposting)
+ killtuple = true;
+ }
+ else if (ItemPointerEquals(&ituple->t_tid, &kitem->heapTid))
+ killtuple = true;
+
+ if (killtuple)
+ {
+ /* found the item/all posting list items */
ItemIdMarkDead(iid);
killedsomething = true;
break; /* out of inner search loop */
@@ -2140,6 +2176,24 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
pivot = index_truncate_tuple(itupdesc, firstright, keepnatts);
+ if (BTreeTupleIsPosting(firstright))
+ {
+ BTreeTupleClearBtIsPosting(pivot);
+ BTreeTupleSetNAtts(pivot, keepnatts);
+ if (keepnatts == natts)
+ {
+ /*
+ * index_truncate_tuple() just returned a copy of the
+ * original, so make sure that the size of the new pivot tuple
+ * doesn't have posting list overhead
+ */
+ pivot->t_info &= ~INDEX_SIZE_MASK;
+ pivot->t_info |= MAXALIGN(BTreeTupleGetPostingOffset(firstright));
+ }
+ }
+
+ Assert(!BTreeTupleIsPosting(pivot));
+
/*
* If there is a distinguishing key attribute within new pivot tuple,
* there is no need to add an explicit heap TID attribute
@@ -2156,6 +2210,8 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
* attribute to the new pivot tuple.
*/
Assert(natts != nkeyatts);
+ Assert(!BTreeTupleIsPosting(lastleft) &&
+ !BTreeTupleIsPosting(firstright));
newsize = IndexTupleSize(pivot) + MAXALIGN(sizeof(ItemPointerData));
tidpivot = palloc0(newsize);
memcpy(tidpivot, pivot, IndexTupleSize(pivot));
@@ -2163,6 +2219,24 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
pfree(pivot);
pivot = tidpivot;
}
+ else if (BTreeTupleIsPosting(firstright))
+ {
+ /*
+ * No truncation was possible, since key attributes are all equal. We
+ * can always truncate away a posting list, though.
+ *
+ * It's necessary to add a heap TID attribute to the new pivot tuple.
+ */
+ newsize = MAXALIGN(BTreeTupleGetPostingOffset(firstright)) +
+ MAXALIGN(sizeof(ItemPointerData));
+ pivot = palloc0(newsize);
+ memcpy(pivot, firstright, BTreeTupleGetPostingOffset(firstright));
+
+ pivot->t_info &= ~INDEX_SIZE_MASK;
+ pivot->t_info |= newsize;
+ BTreeTupleClearBtIsPosting(pivot);
+ BTreeTupleSetAltHeapTID(pivot);
+ }
else
{
/*
@@ -2170,7 +2244,8 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
* It's necessary to add a heap TID attribute to the new pivot tuple.
*/
Assert(natts == nkeyatts);
- newsize = IndexTupleSize(firstright) + MAXALIGN(sizeof(ItemPointerData));
+ newsize = MAXALIGN(IndexTupleSize(firstright)) +
+ MAXALIGN(sizeof(ItemPointerData));
pivot = palloc0(newsize);
memcpy(pivot, firstright, IndexTupleSize(firstright));
}
@@ -2188,6 +2263,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
* nbtree (e.g., there is no pg_attribute entry).
*/
Assert(itup_key->heapkeyspace);
+ Assert(!BTreeTupleIsPosting(pivot));
pivot->t_info &= ~INDEX_SIZE_MASK;
pivot->t_info |= newsize;
@@ -2200,7 +2276,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
*/
pivotheaptid = (ItemPointer) ((char *) pivot + newsize -
sizeof(ItemPointerData));
- ItemPointerCopy(&lastleft->t_tid, pivotheaptid);
+ ItemPointerCopy(BTreeTupleGetMaxTID(lastleft), pivotheaptid);
/*
* Lehman and Yao require that the downlink to the right page, which is to
@@ -2211,9 +2287,12 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
* tiebreaker.
*/
#ifndef DEBUG_NO_TRUNCATE
- Assert(ItemPointerCompare(&lastleft->t_tid, &firstright->t_tid) < 0);
- Assert(ItemPointerCompare(pivotheaptid, &lastleft->t_tid) >= 0);
- Assert(ItemPointerCompare(pivotheaptid, &firstright->t_tid) < 0);
+ Assert(ItemPointerCompare(BTreeTupleGetMaxTID(lastleft),
+ BTreeTupleGetHeapTID(firstright)) < 0);
+ Assert(ItemPointerCompare(pivotheaptid,
+ BTreeTupleGetHeapTID(lastleft)) >= 0);
+ Assert(ItemPointerCompare(pivotheaptid,
+ BTreeTupleGetHeapTID(firstright)) < 0);
#else
/*
@@ -2226,7 +2305,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
* attribute values along with lastleft's heap TID value when lastleft's
* TID happens to be greater than firstright's TID.
*/
- ItemPointerCopy(&firstright->t_tid, pivotheaptid);
+ ItemPointerCopy(BTreeTupleGetHeapTID(firstright), pivotheaptid);
/*
* Pivot heap TID should never be fully equal to firstright. Note that
@@ -2235,7 +2314,8 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
*/
ItemPointerSetOffsetNumber(pivotheaptid,
OffsetNumberPrev(ItemPointerGetOffsetNumber(pivotheaptid)));
- Assert(ItemPointerCompare(pivotheaptid, &firstright->t_tid) < 0);
+ Assert(ItemPointerCompare(pivotheaptid,
+ BTreeTupleGetHeapTID(firstright)) < 0);
#endif
BTreeTupleSetNAtts(pivot, nkeyatts);
@@ -2316,15 +2396,25 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
* The approach taken here usually provides the same answer as _bt_keep_natts
* will (for the same pair of tuples from a heapkeyspace index), since the
* majority of btree opclasses can never indicate that two datums are equal
- * unless they're bitwise equal (once detoasted). Similarly, result may
- * differ from the _bt_keep_natts result when either tuple has TOASTed datums,
- * though this is barely possible in practice.
+ * unless they're bitwise equal after detoasting.
*
* These issues must be acceptable to callers, typically because they're only
* concerned about making suffix truncation as effective as possible without
* leaving excessive amounts of free space on either side of page split.
* Callers can rely on the fact that attributes considered equal here are
* definitely also equal according to _bt_keep_natts.
+ *
+ * When an index only uses opclasses where equality is "precise", this
+ * function is guaranteed to give the same result as _bt_keep_natts(). This
+ * makes it safe to use this function to determine whether or not two tuples
+ * can be folded together into a single posting tuple. Posting list
+ * deduplication cannot be used with nondeterministic collations for this
+ * reason.
+ *
+ * FIXME: Actually invent the needed "equality-is-precise" opclass
+ * infrastructure. See dedicated -hackers thread:
+ *
+ * https://postgr.es/m/CAH2-Wzn3Ee49Gmxb7V1VJ3-AC8fWn-Fr8pfWQebHe8rYRxt5OQ@mail.gmail.com
*/
int
_bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
@@ -2349,8 +2439,38 @@ _bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
if (isNull1 != isNull2)
break;
+ /*
+ * XXX: The ideal outcome from the point of view of the posting list
+ * patch is that the definition of an opclass with "precise equality"
+ * becomes: "equality operator function must give exactly the same
+ * answer as datum_image_eq() would, provided that we aren't using a
+ * nondeterministic collation". (Nondeterministic collations are
+ * clearly not compatible with deduplication.)
+ *
+ * This will be a lot faster than actually using the authoritative
+ * insertion scankey in some cases. This approach also seems more
+ * elegant, since suffix truncation gets to follow exactly the same
+ * definition of "equal" as posting list deduplication -- there is a
+ * subtle interplay between deduplication and suffix truncation, and
+ * it would be nice to know for sure that they have exactly the same
+ * idea about what equality is.
+ *
+ * This ideal outcome still avoids problems with TOAST. We cannot
+ * repeat bugs like the amcheck bug that was fixed in bugfix commit
+ * eba775345d23d2c999bbb412ae658b6dab36e3e8. datum_image_eq()
+ * considers binary equality, though only _after_ each datum is
+ * decompressed.
+ *
+ * If this ideal solution isn't possible, then we can fall back on
+ * defining "precise equality" as: "type's output function must
+ * produce identical textual output for any two datums that compare
+ * equal when using a safe/equality-is-precise operator class (unless
+ * using a nondeterministic collation)". That would mean that we'd
+ * have to make deduplication call _bt_keep_natts() instead (or some
+ * other function that uses authoritative insertion scankey).
+ */
if (!isNull1 &&
- !datumIsEqual(datum1, datum2, att->attbyval, att->attlen))
+ !datum_image_eq(datum1, datum2, att->attbyval, att->attlen))
break;
keepnatts++;
@@ -2402,22 +2522,30 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
tupnatts = BTreeTupleGetNAtts(itup, rel);
+ /* !heapkeyspace indexes do not support deduplication */
+ if (!heapkeyspace && BTreeTupleIsPosting(itup))
+ return false;
+
+ /* INCLUDE indexes do not support deduplication */
+ if (natts != nkeyatts && BTreeTupleIsPosting(itup))
+ return false;
+
if (P_ISLEAF(opaque))
{
if (offnum >= P_FIRSTDATAKEY(opaque))
{
/*
- * Non-pivot tuples currently never use alternative heap TID
- * representation -- even those within heapkeyspace indexes
+ * Non-pivot tuple should never be explicitly marked as a pivot
+ * tuple
*/
- if ((itup->t_info & INDEX_ALT_TID_MASK) != 0)
+ if (BTreeTupleIsPivot(itup))
return false;
/*
* Leaf tuples that are not the page high key (non-pivot tuples)
* should never be truncated. (Note that tupnatts must have been
- * inferred, rather than coming from an explicit on-disk
- * representation.)
+ * inferred, even with a posting list tuple, because only pivot
+ * tuples store tupnatts directly.)
*/
return tupnatts == natts;
}
@@ -2461,12 +2589,12 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
* non-zero, or when there is no explicit representation and the
* tuple is evidently not a pre-pg_upgrade tuple.
*
- * Prior to v11, downlinks always had P_HIKEY as their offset. Use
- * that to decide if the tuple is a pre-v11 tuple.
+ * Prior to v11, downlinks always had P_HIKEY as their offset.
+ * Accept that as an alternative indication of a valid
+ * !heapkeyspace negative infinity tuple.
*/
return tupnatts == 0 ||
- ((itup->t_info & INDEX_ALT_TID_MASK) == 0 &&
- ItemPointerGetOffsetNumber(&(itup->t_tid)) == P_HIKEY);
+ ItemPointerGetOffsetNumber(&(itup->t_tid)) == P_HIKEY;
}
else
{
@@ -2492,7 +2620,11 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
* heapkeyspace index pivot tuples, regardless of whether or not there are
* non-key attributes.
*/
- if ((itup->t_info & INDEX_ALT_TID_MASK) == 0)
+ if (!BTreeTupleIsPivot(itup))
+ return false;
+
+ /* Pivot tuple should not use posting list representation (redundant) */
+ if (BTreeTupleIsPosting(itup))
return false;
/*
@@ -2562,11 +2694,85 @@ _bt_check_third_page(Relation rel, Relation heap, bool needheaptidspace,
BTMaxItemSizeNoHeapTid(page),
RelationGetRelationName(rel)),
errdetail("Index row references tuple (%u,%u) in relation \"%s\".",
- ItemPointerGetBlockNumber(&newtup->t_tid),
- ItemPointerGetOffsetNumber(&newtup->t_tid),
+ ItemPointerGetBlockNumber(BTreeTupleGetHeapTID(newtup)),
+ ItemPointerGetOffsetNumber(BTreeTupleGetHeapTID(newtup)),
RelationGetRelationName(heap)),
errhint("Values larger than 1/3 of a buffer page cannot be indexed.\n"
"Consider a function index of an MD5 hash of the value, "
"or use full text indexing."),
errtableconstraint(heap, RelationGetRelationName(rel))));
}
+
+/*
+ * Given a base tuple that contains the key datum(s), build a posting tuple
+ * from it. Caller's "htids" array must be sorted in ascending order.
+ *
+ * Base tuple can be a posting tuple, but we only use the key part of it; all
+ * ItemPointers must be passed via htids.
+ *
+ * If nhtids == 1, just build a non-posting tuple. This is necessary to avoid
+ * storage overhead after a posting tuple has been vacuumed.
+ */
+IndexTuple
+BTreeFormPostingTuple(IndexTuple tuple, ItemPointer htids, int nhtids)
+{
+ uint32 keysize,
+ newsize = 0;
+ IndexTuple itup;
+
+ /* We only need key part of the tuple */
+ if (BTreeTupleIsPosting(tuple))
+ keysize = BTreeTupleGetPostingOffset(tuple);
+ else
+ keysize = IndexTupleSize(tuple);
+
+ Assert(nhtids > 0);
+
+ /* Add space needed for posting list */
+ if (nhtids > 1)
+ newsize = SHORTALIGN(keysize) + sizeof(ItemPointerData) * nhtids;
+ else
+ newsize = keysize;
+
+ newsize = MAXALIGN(newsize);
+ itup = palloc0(newsize);
+ memcpy(itup, tuple, keysize);
+ itup->t_info &= ~INDEX_SIZE_MASK;
+ itup->t_info |= newsize;
+
+ if (nhtids > 1)
+ {
+ /* Form posting tuple, fill posting fields */
+
+ itup->t_info |= INDEX_ALT_TID_MASK;
+ BTreeSetPostingMeta(itup, nhtids, SHORTALIGN(keysize));
+ /* Copy posting list into the posting tuple */
+ memcpy(BTreeTupleGetPosting(itup), htids,
+ sizeof(ItemPointerData) * nhtids);
+
+#ifdef USE_ASSERT_CHECKING
+ {
+ /* Assert that htid array is sorted and has unique TIDs */
+ ItemPointerData last;
+ ItemPointer current;
+
+ ItemPointerCopy(BTreeTupleGetHeapTID(itup), &last);
+
+ for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
+ {
+ current = BTreeTupleGetPostingN(itup, i);
+ Assert(ItemPointerCompare(current, &last) > 0);
+ ItemPointerCopy(current, &last);
+ }
+ }
+#endif
+ }
+ else
+ {
+ /* Finish building a non-posting tuple by copying its TID from htids */
+ itup->t_info &= ~INDEX_ALT_TID_MASK;
+ ItemPointerCopy(htids, &itup->t_tid);
+ }
+
+ return itup;
+}
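The size computation in BTreeFormPostingTuple() above can be checked by hand. The sketch below reproduces only that arithmetic, assuming 2-byte SHORTALIGN, 8-byte MAXALIGN and a 6-byte ItemPointerData. The macro and function names are hypothetical and are not part of the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed alignment rules and TID size; stand-ins for the real macros */
#define SKETCH_SHORTALIGN(len) (((size_t) (len) + 1) & ~((size_t) 1))
#define SKETCH_MAXALIGN(len)   (((size_t) (len) + 7) & ~((size_t) 7))
#define SKETCH_TID_SIZE 6

/*
 * Total size of the tuple formed from a key part of "keysize" bytes and
 * "nhtids" heap TIDs.  With a single TID no posting list is stored, since
 * the TID goes into the tuple header's t_tid field instead.
 */
static size_t
sketch_posting_tuple_size(size_t keysize, int nhtids)
{
	size_t		newsize;

	if (nhtids > 1)
		newsize = SKETCH_SHORTALIGN(keysize) +
			SKETCH_TID_SIZE * (size_t) nhtids;
	else
		newsize = keysize;

	return SKETCH_MAXALIGN(newsize);
}
```

With a 16-byte key part, three TIDs give 16 + 18 = 34 bytes, rounded up to 40. Three separate 16-byte tuples would need 48 bytes plus three line pointers, which is where the space saving comes from.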
diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index dd5315c1aa..365f0b4c79 100644
--- a/src/backend/access/nbtree/nbtxlog.c
+++ b/src/backend/access/nbtree/nbtxlog.c
@@ -21,8 +21,11 @@
#include "access/xlog.h"
#include "access/xlogutils.h"
#include "storage/procarray.h"
+#include "utils/memutils.h"
#include "miscadmin.h"
+static MemoryContext opCtx; /* working memory for operations */
+
/*
* _bt_restore_page -- re-enter all the index tuples on a page
*
@@ -181,9 +184,46 @@ btree_xlog_insert(bool isleaf, bool ismeta, XLogReaderState *record)
page = BufferGetPage(buffer);
- if (PageAddItem(page, (Item) datapos, datalen, xlrec->offnum,
- false, false) == InvalidOffsetNumber)
- elog(PANIC, "btree_xlog_insert: failed to add item");
+ if (xlrec->postingoff == InvalidOffsetNumber)
+ {
+ /* Simple retail insertion */
+ if (PageAddItem(page, (Item) datapos, datalen, xlrec->offnum,
+ false, false) == InvalidOffsetNumber)
+ elog(PANIC, "btree_xlog_insert: failed to add item");
+ }
+ else
+ {
+ ItemId itemid;
+ IndexTuple oposting,
+ newitem,
+ nposting;
+
+ /*
+ * A posting list split occurred during insertion.
+ *
+ * Use _bt_posting_split() to repeat posting list split steps from
+ * primary. Note that newitem from WAL record is 'orignewitem',
+ * not the final version of newitem that is actually inserted on
+ * page.
+ */
+ Assert(isleaf);
+ itemid = PageGetItemId(page, OffsetNumberPrev(xlrec->offnum));
+ oposting = (IndexTuple) PageGetItem(page, itemid);
+
+ /* newitem must be mutable copy for _bt_posting_split() */
+ newitem = CopyIndexTuple((IndexTuple) datapos);
+ nposting = _bt_posting_split(newitem, oposting,
+ xlrec->postingoff);
+
+ /* Replace existing posting list with post-split version */
+ memcpy(oposting, nposting, MAXALIGN(IndexTupleSize(nposting)));
+
+ /* insert new item */
+ Assert(IndexTupleSize(newitem) == datalen);
+ if (PageAddItem(page, (Item) newitem, datalen, xlrec->offnum,
+ false, false) == InvalidOffsetNumber)
+ elog(PANIC, "btree_xlog_insert: failed to add posting split new item");
+ }
PageSetLSN(page, lsn);
MarkBufferDirty(buffer);
@@ -265,20 +305,42 @@ btree_xlog_split(bool onleft, XLogReaderState *record)
BTPageOpaque lopaque = (BTPageOpaque) PageGetSpecialPointer(lpage);
OffsetNumber off;
IndexTuple newitem = NULL,
- left_hikey = NULL;
+ left_hikey = NULL,
+ nposting = NULL;
Size newitemsz = 0,
left_hikeysz = 0;
Page newlpage;
- OffsetNumber leftoff;
+ OffsetNumber leftoff,
+ replacepostingoff = InvalidOffsetNumber;
datapos = XLogRecGetBlockData(record, 0, &datalen);
- if (onleft)
+ if (onleft || xlrec->postingoff != 0)
{
newitem = (IndexTuple) datapos;
newitemsz = MAXALIGN(IndexTupleSize(newitem));
datapos += newitemsz;
datalen -= newitemsz;
+
+ if (xlrec->postingoff != 0)
+ {
+ /*
+ * Use _bt_posting_split() to repeat posting list split steps
+ * from primary
+ */
+ ItemId itemid;
+ IndexTuple oposting;
+
+ /* Posting list must be at offset number before new item's */
+ replacepostingoff = OffsetNumberPrev(xlrec->newitemoff);
+
+ /* newitem must be mutable copy for _bt_posting_split() */
+ newitem = CopyIndexTuple(newitem);
+ itemid = PageGetItemId(lpage, replacepostingoff);
+ oposting = (IndexTuple) PageGetItem(lpage, itemid);
+ nposting = _bt_posting_split(newitem, oposting,
+ xlrec->postingoff);
+ }
}
/* Extract left hikey and its size (assuming 16-bit alignment) */
@@ -304,8 +366,20 @@ btree_xlog_split(bool onleft, XLogReaderState *record)
Size itemsz;
IndexTuple item;
+ /* Add replacement posting list when required */
+ if (off == replacepostingoff)
+ {
+ Assert(onleft || xlrec->firstright == xlrec->newitemoff);
+ if (PageAddItem(newlpage, (Item) nposting,
+ MAXALIGN(IndexTupleSize(nposting)), leftoff,
+ false, false) == InvalidOffsetNumber)
+ elog(ERROR, "failed to add new posting list item to left page after split");
+ leftoff = OffsetNumberNext(leftoff);
+ continue;
+ }
+
/* add the new item if it was inserted on left page */
- if (onleft && off == xlrec->newitemoff)
+ else if (onleft && off == xlrec->newitemoff)
{
if (PageAddItem(newlpage, (Item) newitem, newitemsz, leftoff,
false, false) == InvalidOffsetNumber)
@@ -379,6 +453,130 @@ btree_xlog_split(bool onleft, XLogReaderState *record)
}
}
+static void
+btree_xlog_dedup(XLogReaderState *record)
+{
+ XLogRecPtr lsn = record->EndRecPtr;
+ Buffer buf;
+ Page newpage;
+ xl_btree_dedup *xlrec = (xl_btree_dedup *) XLogRecGetData(record);
+
+ if (XLogReadBufferForRedo(record, 0, &buf) == BLK_NEEDS_REDO)
+ {
+ /*
+ * Initialize a temporary empty page and copy all the items to that in
+ * item number order.
+ */
+ Page page = (Page) BufferGetPage(buf);
+ BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+ BTPageOpaque nopaque;
+ OffsetNumber offnum,
+ minoff,
+ maxoff;
+ BTDedupState *state;
+ BTDedupInterval *intervals;
+
+ /* Get 'nintervals'-sized array of intervals to process */
+ intervals = (BTDedupInterval *) ((char *) xlrec + SizeOfBtreeDedup);
+
+ state = (BTDedupState *) palloc(sizeof(BTDedupState));
+
+ state->deduplicate = true; /* unused */
+ state->maxitemsize = BTMaxItemSize(page);
+ /* Metadata about current pending posting list */
+ state->htids = NULL;
+ state->nhtids = 0;
+ state->nitems = 0;
+ state->alltupsize = 0;
+ /* Metadata about base tuple of current pending posting list */
+ state->base = NULL;
+ state->baseoff = InvalidOffsetNumber;
+ state->basetupsize = 0;
+ state->nintervals = 0;
+
+ /* Scan over all items to see which ones can be deduplicated */
+ minoff = P_FIRSTDATAKEY(opaque);
+ maxoff = PageGetMaxOffsetNumber(page);
+ newpage = PageGetTempPageCopySpecial(page);
+ nopaque = (BTPageOpaque) PageGetSpecialPointer(newpage);
+
+ /* Make sure that new page won't have garbage flag set */
+ nopaque->btpo_flags &= ~BTP_HAS_GARBAGE;
+
+ /* Copy High Key if any */
+ if (!P_RIGHTMOST(opaque))
+ {
+ ItemId itemid = PageGetItemId(page, P_HIKEY);
+ Size itemsz = ItemIdGetLength(itemid);
+ IndexTuple item = (IndexTuple) PageGetItem(page, itemid);
+
+ if (PageAddItem(newpage, (Item) item, itemsz, P_HIKEY,
+ false, false) == InvalidOffsetNumber)
+ elog(ERROR, "failed to add highkey during deduplication");
+ }
+
+ /* Conservatively size array */
+ state->htids = palloc(state->maxitemsize);
+
+ /*
+ * Iterate over tuples on the page to deduplicate them into posting
+ * lists and insert into new page
+ */
+ for (offnum = minoff;
+ offnum <= maxoff;
+ offnum = OffsetNumberNext(offnum))
+ {
+ ItemId itemid = PageGetItemId(page, offnum);
+ IndexTuple itup = (IndexTuple) PageGetItem(page, itemid);
+
+ Assert(!ItemIdIsDead(itemid));
+
+ if (offnum == minoff)
+ {
+ /*
+ * No previous/base tuple for first data item -- use first
+ * data item as base tuple of first pending posting list
+ */
+ _bt_dedup_start_pending(state, itup, offnum);
+ }
+ else if (state->nintervals < xlrec->nintervals &&
+ state->baseoff == intervals[state->nintervals].baseoff &&
+ state->nitems < intervals[state->nintervals].nitems)
+ {
+ /* Heap TID(s) for itup will be saved in state */
+ if (!_bt_dedup_save_htid(state, itup))
+ elog(ERROR, "could not add heap tid to pending posting list");
+ }
+ else
+ {
+ /*
+ * Tuple was not equal to pending posting list tuple on
+ * primary, or BTMaxItemSize() limit was reached on primary
+ */
+ _bt_dedup_finish_pending(newpage, state);
+
+ /* itup starts new pending posting list */
+ _bt_dedup_start_pending(state, itup, offnum);
+ }
+ }
+
+ /* Handle the last item */
+ _bt_dedup_finish_pending(newpage, state);
+
+ /* Assert that final working state matches WAL record state */
+ Assert(state->nintervals == xlrec->nintervals);
+ Assert(memcmp(state->intervals, intervals,
+ state->nintervals * sizeof(BTDedupInterval)) == 0);
+
+ PageRestoreTempPage(newpage, page);
+ PageSetLSN(page, lsn);
+ MarkBufferDirty(buf);
+ }
+
+ if (BufferIsValid(buf))
+ UnlockReleaseBuffer(buf);
+}
+
static void
btree_xlog_vacuum(XLogReaderState *record)
{
@@ -386,8 +584,8 @@ btree_xlog_vacuum(XLogReaderState *record)
Buffer buffer;
Page page;
BTPageOpaque opaque;
-#ifdef UNUSED
xl_btree_vacuum *xlrec = (xl_btree_vacuum *) XLogRecGetData(record);
+#ifdef UNUSED
/*
* This section of code is thought to be no longer needed, after analysis
@@ -478,14 +676,34 @@ btree_xlog_vacuum(XLogReaderState *record)
if (len > 0)
{
- OffsetNumber *unused;
- OffsetNumber *unend;
+ if (xlrec->nupdated > 0)
+ {
+ OffsetNumber *updatedoffsets;
+ IndexTuple updated;
+ Size itemsz;
- unused = (OffsetNumber *) ptr;
- unend = (OffsetNumber *) ((char *) ptr + len);
+ updatedoffsets = (OffsetNumber *)
+ (ptr + xlrec->ndeleted * sizeof(OffsetNumber));
+ updated = (IndexTuple) ((char *) updatedoffsets +
+ xlrec->nupdated * sizeof(OffsetNumber));
- if ((unend - unused) > 0)
- PageIndexMultiDelete(page, unused, unend - unused);
+ /* Handle posting tuples */
+ for (int i = 0; i < xlrec->nupdated; i++)
+ {
+ PageIndexTupleDelete(page, updatedoffsets[i]);
+
+ itemsz = MAXALIGN(IndexTupleSize(updated));
+
+ if (PageAddItem(page, (Item) updated, itemsz, updatedoffsets[i],
+ false, false) == InvalidOffsetNumber)
+ elog(PANIC, "btree_xlog_vacuum: failed to add updated posting list item");
+
+ updated = (IndexTuple) ((char *) updated + itemsz);
+ }
+ }
+
+ if (xlrec->ndeleted)
+ PageIndexMultiDelete(page, (OffsetNumber *) ptr, xlrec->ndeleted);
}
/*
@@ -820,7 +1038,9 @@ void
btree_redo(XLogReaderState *record)
{
uint8 info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+ MemoryContext oldCtx;
+ oldCtx = MemoryContextSwitchTo(opCtx);
switch (info)
{
case XLOG_BTREE_INSERT_LEAF:
@@ -838,6 +1058,9 @@ btree_redo(XLogReaderState *record)
case XLOG_BTREE_SPLIT_R:
btree_xlog_split(false, record);
break;
+ case XLOG_BTREE_DEDUP_PAGE:
+ btree_xlog_dedup(record);
+ break;
case XLOG_BTREE_VACUUM:
btree_xlog_vacuum(record);
break;
@@ -863,6 +1086,23 @@ btree_redo(XLogReaderState *record)
default:
elog(PANIC, "btree_redo: unknown op code %u", info);
}
+ MemoryContextSwitchTo(oldCtx);
+ MemoryContextReset(opCtx);
+}
+
+void
+btree_xlog_startup(void)
+{
+ opCtx = AllocSetContextCreate(CurrentMemoryContext,
+ "Btree recovery temporary context",
+ ALLOCSET_DEFAULT_SIZES);
+}
+
+void
+btree_xlog_cleanup(void)
+{
+ MemoryContextDelete(opCtx);
+ opCtx = NULL;
}
/*
diff --git a/src/backend/access/rmgrdesc/nbtdesc.c b/src/backend/access/rmgrdesc/nbtdesc.c
index 4ee6d04a68..177875224a 100644
--- a/src/backend/access/rmgrdesc/nbtdesc.c
+++ b/src/backend/access/rmgrdesc/nbtdesc.c
@@ -30,7 +30,8 @@ btree_desc(StringInfo buf, XLogReaderState *record)
{
xl_btree_insert *xlrec = (xl_btree_insert *) rec;
- appendStringInfo(buf, "off %u", xlrec->offnum);
+ appendStringInfo(buf, "off %u; postingoff %u",
+ xlrec->offnum, xlrec->postingoff);
break;
}
case XLOG_BTREE_SPLIT_L:
@@ -38,16 +39,28 @@ btree_desc(StringInfo buf, XLogReaderState *record)
{
xl_btree_split *xlrec = (xl_btree_split *) rec;
- appendStringInfo(buf, "level %u, firstright %d, newitemoff %d",
- xlrec->level, xlrec->firstright, xlrec->newitemoff);
+ appendStringInfo(buf, "level %u, firstright %d, newitemoff %d, postingoff %d",
+ xlrec->level,
+ xlrec->firstright,
+ xlrec->newitemoff,
+ xlrec->postingoff);
+ break;
+ }
+ case XLOG_BTREE_DEDUP_PAGE:
+ {
+ xl_btree_dedup *xlrec = (xl_btree_dedup *) rec;
+
+ appendStringInfo(buf, "nintervals %d", xlrec->nintervals);
break;
}
case XLOG_BTREE_VACUUM:
{
xl_btree_vacuum *xlrec = (xl_btree_vacuum *) rec;
- appendStringInfo(buf, "lastBlockVacuumed %u",
- xlrec->lastBlockVacuumed);
+ appendStringInfo(buf, "lastBlockVacuumed %u; nupdated %u; ndeleted %u",
+ xlrec->lastBlockVacuumed,
+ xlrec->nupdated,
+ xlrec->ndeleted);
break;
}
case XLOG_BTREE_DELETE:
@@ -131,6 +144,9 @@ btree_identify(uint8 info)
case XLOG_BTREE_SPLIT_R:
id = "SPLIT_R";
break;
+ case XLOG_BTREE_DEDUP_PAGE:
+ id = "DEDUPLICATE";
+ break;
case XLOG_BTREE_VACUUM:
id = "VACUUM";
break;
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 4a80e84aa7..da3c8f76a3 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -234,8 +234,7 @@ typedef struct BTMetaPageData
* t_tid | t_info | key values | INCLUDE columns, if any
*
* t_tid points to the heap TID, which is a tiebreaker key column as of
- * BTREE_VERSION 4. Currently, the INDEX_ALT_TID_MASK status bit is never
- * set for non-pivot tuples.
+ * BTREE_VERSION 4.
*
* All other types of index tuples ("pivot" tuples) only have key columns,
* since pivot tuples only exist to represent how the key space is
@@ -252,6 +251,38 @@ typedef struct BTMetaPageData
* omitted rather than truncated, since its representation is different to
* the non-pivot representation.)
*
+ * Non-pivot posting tuple format:
+ * t_tid | t_info | key values | INCLUDE columns, if any | posting_list[]
+ *
+ * In order to store duplicate keys more efficiently, we use a special tuple
+ * format: posting tuples. posting_list is an array of ItemPointerData.
+ *
+ * Deduplication never applies to unique indexes or indexes with INCLUDEd
+ * columns.
+ *
+ * To distinguish posting tuples we use the INDEX_ALT_TID_MASK flag in t_info
+ * and the BT_IS_POSTING flag in t_tid.
+ * These flags redefine the content of the posting tuple's tid:
+ * - t_tid.ip_blkid contains offset of the posting list.
+ * - t_tid offset field contains number of posting items this tuple contains
+ *
+ * The 12 least significant offset bits from t_tid are used to represent
+ * the number of posting items in posting tuples, leaving 4 status
+ * bits (BT_RESERVED_OFFSET_MASK bits), 3 of which are reserved for
+ * future use.
+ * BT_N_POSTING_OFFSET_MASK is large enough to store any number of posting
+ * items, which is constrained by BTMaxItemSize.
+ *
+ * If page contains so many duplicates, that they do not fit into one posting
+ * tuple (bounded by BTMaxItemSize and ), page may contain several posting
+ * tuples with the same key.
+ * Also page can contain both posting and non-posting tuples with the same key.
+ * Currently, posting tuples always contain at least two TIDs in the posting
+ * list.
+ *
+ * Posting tuples always have the same number of attributes as the index has
+ * generally.
+ *
* Pivot tuple format:
*
* t_tid | t_info | key values | [heap TID]
@@ -281,23 +312,150 @@ typedef struct BTMetaPageData
* bits (BT_RESERVED_OFFSET_MASK bits), 3 of which that are reserved for
* future use. BT_N_KEYS_OFFSET_MASK should be large enough to store any
* number of columns/attributes <= INDEX_MAX_KEYS.
+ * BT_IS_POSTING bit must be unset for pivot tuples, since we use it
+ * to distinguish posting tuples from pivot tuples.
*
* Note well: The macros that deal with the number of attributes in tuples
- * assume that a tuple with INDEX_ALT_TID_MASK set must be a pivot tuple,
- * and that a tuple without INDEX_ALT_TID_MASK set must be a non-pivot
- * tuple (or must have the same number of attributes as the index has
- * generally in the case of !heapkeyspace indexes). They will need to be
- * updated if non-pivot tuples ever get taught to use INDEX_ALT_TID_MASK
- * for something else.
+ * assume that a tuple with INDEX_ALT_TID_MASK set must be a pivot tuple or
+ * non-pivot posting tuple, and that a tuple without INDEX_ALT_TID_MASK set
+ * must be a non-pivot tuple (or must have the same number of attributes as
+ * the index has generally in the case of !heapkeyspace indexes).
*/
#define INDEX_ALT_TID_MASK INDEX_AM_RESERVED_BIT
/* Item pointer offset bits */
#define BT_RESERVED_OFFSET_MASK 0xF000
#define BT_N_KEYS_OFFSET_MASK 0x0FFF
+#define BT_N_POSTING_OFFSET_MASK 0x0FFF
#define BT_HEAP_TID_ATTR 0x1000
+#define BT_IS_POSTING 0x2000
-/* Get/set downlink block number */
+/*
+ * MaxPostingIndexTuplesPerPage is an upper bound on the number of tuples
+ * that can fit on one btree leaf page.
+ *
+ * Btree leaf pages may contain posting tuples, which store duplicates
+ * in a more effective way, so MaxPostingIndexTuplesPerPage is larger than
+ * MaxIndexTuplesPerPage.
+ *
+ * Each leaf page must contain at least three items, so estimate it as
+ * if we have three posting tuples with minimal size keys.
+ */
+#define MaxPostingIndexTuplesPerPage \
+ ((int) ((BLCKSZ - SizeOfPageHeaderData - \
+ 3*((MAXALIGN(sizeof(IndexTupleData) + 1) + sizeof(ItemIdData))) )) / \
+ (sizeof(ItemPointerData)))
+
+/*
+ * State used to represent a pending posting list during deduplication.
+ *
+ * Each entry represents a group of consecutive items from the page, starting
+ * from page offset number 'baseoff', which is the offset number of the "base"
+ * tuple on the page undergoing deduplication. 'nitems' is the total number
+ * of items from the page that will be merged to make a new posting tuple.
+ *
+ * Note: 'nitems' means the number of physical index tuples/line pointers on
+ * the page, starting with and including the item at offset number 'baseoff'
+ * (so nitems should be at least 2 when interval is used). These existing
+ * tuples may be posting list tuples or regular tuples.
+ */
+typedef struct BTDedupInterval
+{
+ OffsetNumber baseoff;
+ OffsetNumber nitems;
+} BTDedupInterval;
+
+/*
+ * Btree-private state needed to build posting tuples. htids is an array of
+ * ItemPointers for pending posting list.
+ *
+ * Iterating over tuples during index build or applying deduplication to a
+ * single page, we remember a "base" tuple, then compare the next one with it.
+ * If tuples are equal, save their TIDs in the posting list.
+ */
+typedef struct BTDedupState
+{
+ /* Deduplication status info for entire page/operation */
+ bool deduplicate; /* Still deduplicating page? */
+ Size maxitemsize; /* BTMaxItemSize() limit for page */
+
+ /* Metadata about current pending posting list */
+ ItemPointer htids; /* Heap TIDs in pending posting list */
+ int nhtids; /* # valid heap TIDs in htids array */
+ int nitems; /* See BTDedupInterval definition */
+ Size alltupsize; /* Includes line pointer overhead */
+
+ /* Metadata about base tuple of current pending posting list */
+ IndexTuple base; /* Use to form new posting list */
+ OffsetNumber baseoff; /* original page offset of base */
+ Size basetupsize; /* base size without posting list */
+
+ /*
+ * Array of pending posting lists. Contains one entry for each group of
+ * consecutive items that will be deduplicated by creating a new posting
+ * list tuple.
+ */
+ int nintervals; /* current size of intervals array */
+ BTDedupInterval intervals[MaxIndexTuplesPerPage];
+} BTDedupState;
+
+/*
+ * N.B.: BTreeTupleIsPivot() should only be used in code that deals with
+ * heapkeyspace indexes specifically. BTreeTupleIsPosting() works with all
+ * nbtree indexes, though.
+ */
+#define BTreeTupleIsPivot(itup) \
+ ( \
+ ((itup)->t_info & INDEX_ALT_TID_MASK && \
+ ((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) == 0))\
+ )
+#define BTreeTupleIsPosting(itup) \
+ ( \
+ ((itup)->t_info & INDEX_ALT_TID_MASK && \
+ ((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) != 0))\
+ )
+
+#define BTreeTupleClearBtIsPosting(itup) \
+ do { \
+ ItemPointerSetOffsetNumber(&(itup)->t_tid, \
+ ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & ~BT_IS_POSTING); \
+ } while(0)
+
+#define BTreeTupleGetNPosting(itup) \
+ ( \
+ AssertMacro(BTreeTupleIsPosting(itup)), \
+ ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_N_POSTING_OFFSET_MASK \
+ )
+#define BTreeTupleSetNPosting(itup, n) \
+ do { \
+ ItemPointerSetOffsetNumber(&(itup)->t_tid, (n) & BT_N_POSTING_OFFSET_MASK); \
+ Assert((itup)->t_info & INDEX_ALT_TID_MASK); \
+ Assert(!((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) != 0)); \
+ ItemPointerSetOffsetNumber(&(itup)->t_tid, \
+ ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) | BT_IS_POSTING); \
+ } while(0)
+
+/*
+ * If tuple is posting, t_tid.ip_blkid contains offset of the posting list
+ */
+#define BTreeTupleGetPostingOffset(itup) \
+ ( \
+ AssertMacro(BTreeTupleIsPosting(itup)), \
+ ItemPointerGetBlockNumberNoCheck(&((itup)->t_tid)) \
+ )
+#define BTreeSetPostingMeta(itup, nposting, off) \
+ do { \
+ BTreeTupleSetNPosting(itup, nposting); \
+ Assert(BTreeTupleIsPosting(itup)); \
+ ItemPointerSetBlockNumber(&((itup)->t_tid), (off)); \
+ } while(0)
+
+#define BTreeTupleGetPosting(itup) \
+ (ItemPointer) ((char*) (itup) + BTreeTupleGetPostingOffset(itup))
+#define BTreeTupleGetPostingN(itup,n) \
+ (BTreeTupleGetPosting(itup) + (n))
+
+/* Get/set downlink block number */
#define BTreeInnerTupleGetDownLink(itup) \
ItemPointerGetBlockNumberNoCheck(&((itup)->t_tid))
#define BTreeInnerTupleSetDownLink(itup, blkno) \
@@ -326,40 +484,73 @@ typedef struct BTMetaPageData
*/
#define BTreeTupleGetNAtts(itup, rel) \
( \
- (itup)->t_info & INDEX_ALT_TID_MASK ? \
+ ((itup)->t_info & INDEX_ALT_TID_MASK && \
+ ((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) == 0)) ? \
( \
ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_N_KEYS_OFFSET_MASK \
) \
: \
IndexRelationGetNumberOfAttributes(rel) \
)
-#define BTreeTupleSetNAtts(itup, n) \
- do { \
- (itup)->t_info |= INDEX_ALT_TID_MASK; \
- ItemPointerSetOffsetNumber(&(itup)->t_tid, (n) & BT_N_KEYS_OFFSET_MASK); \
- } while(0)
+
+static inline void
+BTreeTupleSetNAtts(IndexTuple itup, int n)
+{
+ Assert(!BTreeTupleIsPosting(itup));
+ itup->t_info |= INDEX_ALT_TID_MASK;
+ ItemPointerSetOffsetNumber(&itup->t_tid, n & BT_N_KEYS_OFFSET_MASK);
+}
/*
- * Get tiebreaker heap TID attribute, if any. Macro works with both pivot
- * and non-pivot tuples, despite differences in how heap TID is represented.
+ * Get tiebreaker heap TID attribute, if any. Works with both pivot and
+ * non-pivot tuples, despite differences in how heap TID is represented.
+ *
+ * This returns the first/lowest heap TID in the case of a posting list tuple.
*/
-#define BTreeTupleGetHeapTID(itup) \
- ( \
- (itup)->t_info & INDEX_ALT_TID_MASK && \
- (ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_HEAP_TID_ATTR) != 0 ? \
- ( \
- (ItemPointer) (((char *) (itup) + IndexTupleSize(itup)) - \
- sizeof(ItemPointerData)) \
- ) \
- : (itup)->t_info & INDEX_ALT_TID_MASK ? NULL : (ItemPointer) &((itup)->t_tid) \
- )
+static inline ItemPointer
+BTreeTupleGetHeapTID(IndexTuple itup)
+{
+ if (BTreeTupleIsPivot(itup))
+ {
+ /* Pivot tuple heap TID representation? */
+ if ((ItemPointerGetOffsetNumberNoCheck(&itup->t_tid) &
+ BT_HEAP_TID_ATTR) != 0)
+ return (ItemPointer) ((char *) itup + IndexTupleSize(itup) -
+ sizeof(ItemPointerData));
+
+ /* Heap TID attribute was truncated */
+ return NULL;
+ }
+ else if (BTreeTupleIsPosting(itup))
+ return BTreeTupleGetPosting(itup);
+
+ return &(itup->t_tid);
+}
+
+/*
+ * Get maximum heap TID attribute, which could be the only TID in the case of
+ * a non-pivot tuple that is not a posting list tuple. Works with
+ * non-pivot tuples only.
+ */
+static inline ItemPointer
+BTreeTupleGetMaxTID(IndexTuple itup)
+{
+ Assert(!BTreeTupleIsPivot(itup));
+
+ if (BTreeTupleIsPosting(itup))
+ return (ItemPointer) (BTreeTupleGetPosting(itup) +
+ (BTreeTupleGetNPosting(itup) - 1));
+
+ return &(itup->t_tid);
+}
+
/*
* Set the heap TID attribute for a tuple that uses the INDEX_ALT_TID_MASK
- * representation (currently limited to pivot tuples)
+ * representation
*/
#define BTreeTupleSetAltHeapTID(itup) \
do { \
- Assert((itup)->t_info & INDEX_ALT_TID_MASK); \
+ Assert(BTreeTupleIsPivot(itup)); \
ItemPointerSetOffsetNumber(&(itup)->t_tid, \
ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) | BT_HEAP_TID_ATTR); \
} while(0)
@@ -499,6 +690,13 @@ typedef struct BTInsertStateData
/* Buffer containing leaf page we're likely to insert itup on */
Buffer buf;
+ /*
+ * If _bt_binsrch_insert() found the location inside an existing posting
+ * list, save the position inside the list. This will be -1 in rare cases
+ * where the overlapping posting list is LP_DEAD.
+ */
+ int postingoff;
+
/*
* Cache of bounds within the current buffer. Only used for insertions
* where _bt_check_unique is called. See _bt_binsrch_insert and
@@ -534,7 +732,9 @@ typedef BTInsertStateData *BTInsertState;
* If we are doing an index-only scan, we save the entire IndexTuple for each
* matched item, otherwise only its heap TID and offset. The IndexTuples go
* into a separate workspace array; each BTScanPosItem stores its tuple's
- * offset within that array.
+ * offset within that array. Posting list tuples store a version of the
+ * tuple that does not include the posting list, allowing the same key to be
+ * returned for each logical tuple associated with the posting list.
*/
typedef struct BTScanPosItem /* what we remember about each match */
@@ -563,9 +763,13 @@ typedef struct BTScanPosData
/*
* If we are doing an index-only scan, nextTupleOffset is the first free
- * location in the associated tuple storage workspace.
+ * location in the associated tuple storage workspace. Posting list
+ * tuples need postingTupleOffset to store the current location of the
+ * tuple that is returned multiple times (once per heap TID in posting
+ * list).
*/
int nextTupleOffset;
+ int postingTupleOffset;
/*
* The items array is always ordered in index order (ie, increasing
@@ -578,7 +782,7 @@ typedef struct BTScanPosData
int lastItem; /* last valid index in items[] */
int itemIndex; /* current index in items[] */
- BTScanPosItem items[MaxIndexTuplesPerPage]; /* MUST BE LAST */
+ BTScanPosItem items[MaxPostingIndexTuplesPerPage]; /* MUST BE LAST */
} BTScanPosData;
typedef BTScanPosData *BTScanPos;
@@ -730,8 +934,14 @@ extern void _bt_parallel_advance_array_keys(IndexScanDesc scan);
*/
extern bool _bt_doinsert(Relation rel, IndexTuple itup,
IndexUniqueCheck checkUnique, Relation heapRel);
+extern IndexTuple _bt_posting_split(IndexTuple newitem, IndexTuple oposting,
+ OffsetNumber postingoff);
extern void _bt_finish_split(Relation rel, Buffer bbuf, BTStack stack);
extern Buffer _bt_getstackbuf(Relation rel, BTStack stack, BlockNumber child);
+extern void _bt_dedup_start_pending(BTDedupState *state, IndexTuple base,
+ OffsetNumber base_off);
+extern bool _bt_dedup_save_htid(BTDedupState *state, IndexTuple itup);
+extern Size _bt_dedup_finish_pending(Page page, BTDedupState *state);
/*
* prototypes for functions in nbtsplitloc.c
@@ -762,6 +972,8 @@ extern void _bt_delitems_delete(Relation rel, Buffer buf,
OffsetNumber *itemnos, int nitems, Relation heapRel);
extern void _bt_delitems_vacuum(Relation rel, Buffer buf,
OffsetNumber *itemnos, int nitems,
+ OffsetNumber *updateitemnos,
+ IndexTuple *updated, int nupdateable,
BlockNumber lastBlockVacuumed);
extern int _bt_pagedel(Relation rel, Buffer buf);
@@ -812,6 +1024,8 @@ extern bool _bt_check_natts(Relation rel, bool heapkeyspace, Page page,
OffsetNumber offnum);
extern void _bt_check_third_page(Relation rel, Relation heap,
bool needheaptidspace, Page page, IndexTuple newtup);
+extern IndexTuple BTreeFormPostingTuple(IndexTuple tuple, ItemPointer htids,
+ int nhtids);
/*
* prototypes for functions in nbtvalidate.c
diff --git a/src/include/access/nbtxlog.h b/src/include/access/nbtxlog.h
index 91b9ee00cf..affdd910ec 100644
--- a/src/include/access/nbtxlog.h
+++ b/src/include/access/nbtxlog.h
@@ -28,7 +28,8 @@
#define XLOG_BTREE_INSERT_META 0x20 /* same, plus update metapage */
#define XLOG_BTREE_SPLIT_L 0x30 /* add index tuple with split */
#define XLOG_BTREE_SPLIT_R 0x40 /* as above, new item on right */
-/* 0x50 and 0x60 are unused */
+#define XLOG_BTREE_DEDUP_PAGE 0x50 /* deduplicate tuples on leaf page */
+/* 0x60 is unused */
#define XLOG_BTREE_DELETE 0x70 /* delete leaf index tuples for a page */
#define XLOG_BTREE_UNLINK_PAGE 0x80 /* delete a half-dead page */
#define XLOG_BTREE_UNLINK_PAGE_META 0x90 /* same, and update metapage */
@@ -61,16 +62,21 @@ typedef struct xl_btree_metadata
* This data record is used for INSERT_LEAF, INSERT_UPPER, INSERT_META.
* Note that INSERT_META implies it's not a leaf page.
*
- * Backup Blk 0: original page (data contains the inserted tuple)
+ * Backup Blk 0: original page (data contains the inserted tuple);
+ * if postingoff is set, this started out as an insertion
+ * into an existing posting tuple at the offset before
+ * offnum (i.e. it's a posting list split). (REDO will
+ * have to update split posting list, too.)
* Backup Blk 1: child's left sibling, if INSERT_UPPER or INSERT_META
* Backup Blk 2: xl_btree_metadata, if INSERT_META
*/
typedef struct xl_btree_insert
{
OffsetNumber offnum;
+ OffsetNumber postingoff;
} xl_btree_insert;
-#define SizeOfBtreeInsert (offsetof(xl_btree_insert, offnum) + sizeof(OffsetNumber))
+#define SizeOfBtreeInsert (offsetof(xl_btree_insert, postingoff) + sizeof(OffsetNumber))
/*
* On insert with split, we save all the items going into the right sibling
@@ -91,9 +97,19 @@ typedef struct xl_btree_insert
*
* Backup Blk 0: original page / new left page
*
- * The left page's data portion contains the new item, if it's the _L variant.
- * An IndexTuple representing the high key of the left page must follow with
- * either variant.
+ * The left page's data portion contains the new item, if it's the _L variant
+ * (though _R variant page split records with a posting list split sometimes
+ * need to include newitem). An IndexTuple representing the high key of the
+ * left page must follow in all cases.
+ *
+ * The newitem is actually an "original" newitem when a posting list split
+ * occurs that requires that the original posting list be updated in passing.
+ * Recovery recognizes this case when postingoff is set, and must use the
+ * posting offset to do an in-place update of the existing posting list that
+ * was actually split, and change the newitem to the "final" newitem. This
+ * corresponds to the xl_btree_insert postingoff-is-set case. postingoff
+ * won't be set when a posting list split occurs where both original posting
+ * list and newitem go on the right page.
*
* Backup Blk 1: new right page
*
@@ -111,10 +127,27 @@ typedef struct xl_btree_split
{
uint32 level; /* tree level of page being split */
OffsetNumber firstright; /* first item moved to right page */
- OffsetNumber newitemoff; /* new item's offset (useful for _L variant) */
+ OffsetNumber newitemoff; /* new item's offset */
+ OffsetNumber postingoff; /* offset inside orig posting tuple */
} xl_btree_split;
-#define SizeOfBtreeSplit (offsetof(xl_btree_split, newitemoff) + sizeof(OffsetNumber))
+#define SizeOfBtreeSplit (offsetof(xl_btree_split, postingoff) + sizeof(OffsetNumber))
+
+/*
+ * When page is deduplicated, consecutive groups of tuples with equal keys are
+ * merged together into posting list tuples.
+ *
+ * The WAL record represents the number of posting tuples that should be added
+ * to the page using nintervals. An array of BTDedupInterval structs follows.
+ */
+typedef struct xl_btree_dedup
+{
+ int nintervals;
+
+ /* TARGET DEDUP INTERVALS FOLLOW AT THE END */
+} xl_btree_dedup;
+
+#define SizeOfBtreeDedup (offsetof(xl_btree_dedup, nintervals) + sizeof(int))
/*
* This is what we need to know about delete of individual leaf index tuples.
@@ -166,16 +199,27 @@ typedef struct xl_btree_reuse_page
* block numbers aren't given.
*
* Note that the *last* WAL record in any vacuum of an index is allowed to
- * have a zero length array of offsets. Earlier records must have at least one.
+ * have a zero length array of target offsets (i.e. no deletes or updates).
+ * Earlier records must have at least one.
*/
typedef struct xl_btree_vacuum
{
BlockNumber lastBlockVacuumed;
- /* TARGET OFFSET NUMBERS FOLLOW */
+ /*
+ * This field helps us to find beginning of the updated versions of tuples
+ * which follow array of offset numbers, needed when a posting list is
+ * vacuumed without killing all of its logical tuples.
+ */
+ uint32 nupdated;
+ uint32 ndeleted;
+
+ /* DELETED TARGET OFFSET NUMBERS FOLLOW (if any) */
+ /* UPDATED TARGET OFFSET NUMBERS FOLLOW (if any) */
+ /* UPDATED TUPLES TO ADD BACK FOLLOW (if any) */
} xl_btree_vacuum;
-#define SizeOfBtreeVacuum (offsetof(xl_btree_vacuum, lastBlockVacuumed) + sizeof(BlockNumber))
+#define SizeOfBtreeVacuum (offsetof(xl_btree_vacuum, ndeleted) + sizeof(uint32))
/*
* This is what we need to know about marking an empty branch for deletion.
@@ -256,6 +300,8 @@ typedef struct xl_btree_newroot
extern void btree_redo(XLogReaderState *record);
extern void btree_desc(StringInfo buf, XLogReaderState *record);
extern const char *btree_identify(uint8 info);
+extern void btree_xlog_startup(void);
+extern void btree_xlog_cleanup(void);
extern void btree_mask(char *pagedata, BlockNumber blkno);
#endif /* NBTXLOG_H */
diff --git a/src/include/access/rmgrlist.h b/src/include/access/rmgrlist.h
index 3c0db2ccf5..2b8c6c7fc8 100644
--- a/src/include/access/rmgrlist.h
+++ b/src/include/access/rmgrlist.h
@@ -36,7 +36,7 @@ PG_RMGR(RM_RELMAP_ID, "RelMap", relmap_redo, relmap_desc, relmap_identify, NULL,
PG_RMGR(RM_STANDBY_ID, "Standby", standby_redo, standby_desc, standby_identify, NULL, NULL, NULL)
PG_RMGR(RM_HEAP2_ID, "Heap2", heap2_redo, heap2_desc, heap2_identify, NULL, NULL, heap_mask)
PG_RMGR(RM_HEAP_ID, "Heap", heap_redo, heap_desc, heap_identify, NULL, NULL, heap_mask)
-PG_RMGR(RM_BTREE_ID, "Btree", btree_redo, btree_desc, btree_identify, NULL, NULL, btree_mask)
+PG_RMGR(RM_BTREE_ID, "Btree", btree_redo, btree_desc, btree_identify, btree_xlog_startup, btree_xlog_cleanup, btree_mask)
PG_RMGR(RM_HASH_ID, "Hash", hash_redo, hash_desc, hash_identify, NULL, NULL, hash_mask)
PG_RMGR(RM_GIN_ID, "Gin", gin_redo, gin_desc, gin_identify, gin_xlog_startup, gin_xlog_cleanup, gin_mask)
PG_RMGR(RM_GIST_ID, "Gist", gist_redo, gist_desc, gist_identify, gist_xlog_startup, gist_xlog_cleanup, gist_mask)
diff --git a/src/tools/valgrind.supp b/src/tools/valgrind.supp
index ec47a228ae..71a03e39d3 100644
--- a/src/tools/valgrind.supp
+++ b/src/tools/valgrind.supp
@@ -212,3 +212,24 @@
Memcheck:Cond
fun:PyObject_Realloc
}
+
+# Temporarily work around bug in datum_image_eq's handling of the cstring
+# (typLen == -2) case. datumIsEqual() is not affected, but also doesn't handle
+# TOAST'ed values correctly.
+#
+# FIXME: Remove both suppressions when bug is fixed on master branch
+{
+ temporary_workaround_1
+ Memcheck:Addr1
+ fun:bcmp
+ fun:datum_image_eq
+ fun:_bt_keep_natts_fast
+}
+
+{
+ temporary_workaround_8
+ Memcheck:Addr8
+ fun:bcmp
+ fun:datum_image_eq
+ fun:_bt_keep_natts_fast
+}
--
2.17.1
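
The offset-field encoding described in the nbtree.h comments of the patch above can be illustrated in isolation. This is only a minimal sketch: the two mask values are copied from the patch, but the sketch_* helpers are hypothetical names that operate on a bare 16-bit value, and they ignore the INDEX_ALT_TID_MASK bit in t_info that the real macros also check.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Mask values copied from the nbtree.h hunk of the patch */
#define BT_N_POSTING_OFFSET_MASK 0x0FFF
#define BT_IS_POSTING            0x2000

/*
 * Hypothetical helper: pack a posting item count into the 16-bit offset
 * field, using the low 12 bits for the count and one status bit as the
 * "this is a posting tuple" marker.
 */
static uint16_t
sketch_pack_nposting(unsigned nposting)
{
    assert(nposting <= BT_N_POSTING_OFFSET_MASK);
    return (uint16_t) ((nposting & BT_N_POSTING_OFFSET_MASK) | BT_IS_POSTING);
}

/* Hypothetical helper: does the offset field carry the posting marker? */
static bool
sketch_is_posting(uint16_t offset_field)
{
    return (offset_field & BT_IS_POSTING) != 0;
}

/* Hypothetical helper: recover the posting item count from the low bits */
static unsigned
sketch_unpack_nposting(uint16_t offset_field)
{
    return offset_field & BT_N_POSTING_OFFSET_MASK;
}
```

Packing a count of 5 yields 0x2005, and a pivot-style value such as 0x1003 (BT_HEAP_TID_ATTR plus three key columns) is not mistaken for a posting tuple, since its BT_IS_POSTING bit is clear.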
On Mon, Sep 23, 2019 at 5:13 PM Peter Geoghegan <pg@bowt.ie> wrote:
> I attach version 17.
I attach a patch that applies on top of v17. It adds support for
deduplication within unique indexes. Actually, this is a snippet of
code that appeared in my prototype from August 5 (we need very little
extra code for this now). Unique index support kind of looked like a
bad idea at the time, but things have changed a lot since then.
I benchmarked this overnight using a custom pgbench-based test that
used a Zipfian distribution, with a single-row SELECT and an UPDATE of
pgbench_accounts per xact. I used regular/logged tables this time
around, since WAL-logging is now fairly efficient. I added a separate
low cardinality index on pgbench_accounts(abalance). A low cardinality
index is the most interesting case for this patch, obviously, but it
also serves to prevent all HOT updates, increasing bloat for both
indexes. We want a realistic case that creates a lot of index bloat.
This wasn't a rigorous enough benchmark to present here in full, but
the results were very encouraging. With reasonable client counts for
the underlying hardware, we seem to have a small increase in TPS, and
a small decrease in latency. There is a regression with 128 clients,
when contention is ridiculously high (this is my home server, which
only has 4 cores). More importantly:
* The low cardinality index is almost 3x smaller with the patch -- no
surprises there.
* The read latency is where latency goes up, if it goes up at all.
Whatever it is that might increase latency, it doesn't look like it's
deduplication itself. Yeah, deduplication is expensive, but it's not
nearly as expensive as a page split. (I'm talking about the immediate
cost, not the bigger picture, though the bigger picture matters even
more.)
* The growth in primary key size over time is the thing I find really
interesting. The patch seems to really control the number of page
splits over many hours with lots of non-HOT updates. I think that a
timeline of days or weeks could be really interesting.
I am now about 75% convinced that adding deduplication to unique
indexes is a good idea, at least as an option that is disabled by
default. We're already doing well here, even though there has been no
work on optimizing deduplication in unique indexes. Further
optimizations seem quite possible, though. I'm mostly thinking about
optimizing non-HOT updates by teaching nbtree some basic things about
versioning with unique indexes.
For example, we could remember "recently dead" duplicates of the value
we are about to insert (as part of an UPDATE statement) from within
_bt_check_unique(). Then, when it looks like a page split may be
necessary, we can hint to _bt_dedup_one_page(): "please just
deduplicate the group of duplicates starting from this offset, which
are duplicates of this new item I am inserting -- do not create a
posting list that I will have to split, though". We already cache the
binary search bounds established within _bt_check_unique() in
insertstate, so perhaps we could reuse that to make this work. The
goal here is that the old/recently dead versions end up together
in their own posting list (or maybe two posting lists), whereas our
new/most current tuple is on its own. There is a very good chance that
our transaction will commit, leaving somebody else to set the LP_DEAD
bit on the posting list that contains those old versions. In short,
we'd be making deduplication and opportunistic garbage collection
cooperate closely.
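
To make the grouping idea above concrete, here is a minimal sketch in C. It is a toy model, not nbtree code: ToyTid, ToyItem and toy_dedup_range() are hypothetical names invented for illustration, and the "page" is just an array. It assumes the caller already knows the offset and length of the run of old duplicates (as cached binary search bounds might provide) and merges only that run, so the newly inserted tuple can stay on its own.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for a heap TID; not PostgreSQL's ItemPointerData. */
typedef struct ToyTid
{
	unsigned	block;
	unsigned	offset;
} ToyTid;

#define TOY_MAX_TIDS 16

/* Toy stand-in for a leaf page item; a plain tuple has ntids == 1. */
typedef struct ToyItem
{
	int			key;
	size_t		ntids;
	ToyTid		tids[TOY_MAX_TIDS];
} ToyItem;

/*
 * Merge the run of plain duplicates items[start .. start + count - 1] into
 * a single posting list stored at items[start], then shift the remaining
 * items left to close the gap.  Returns the new number of items.
 */
static size_t
toy_dedup_range(ToyItem *items, size_t nitems, size_t start, size_t count)
{
	assert(count >= 1 && count <= TOY_MAX_TIDS);
	assert(start + count <= nitems);
	assert(items[start].ntids == 1);

	for (size_t i = 1; i < count; i++)
	{
		/* Only plain tuples with the same key may join the posting list */
		assert(items[start + i].key == items[start].key);
		assert(items[start + i].ntids == 1);
		items[start].tids[items[start].ntids++] = items[start + i].tids[0];
	}

	/* Move the items that followed the run up against the posting list */
	for (size_t i = start + count; i < nitems; i++)
		items[i - count + 1] = items[i];

	return nitems - (count - 1);
}
```

A caller would pass only the range covering the old versions, leaving the offset where the new tuple goes untouched.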
This can work both ways -- maybe we should also teach
_bt_vacuum_one_page() to cooperate with _bt_dedup_one_page(). That is,
if we add the mechanism I just described in the last paragraph, maybe
_bt_dedup_one_page() marks the posting list that is likely to get its
LP_DEAD bit set soon with a new hint bit -- the LP_REDIRECT bit. Here,
LP_REDIRECT means "somebody is probably going to set the LP_DEAD bit
on this posting list tuple very soon". That way, if nobody actually
does set the LP_DEAD bit, _bt_vacuum_one_page() still has options. If
it goes to the heap and finds the latest version, and that
version is visible to any possible MVCC snapshot, that means that it's
safe to kill all the other versions, even without the LP_DEAD bit set
-- this is a unique index. So, it often gets to kill lots of extra
garbage that it wouldn't get to kill, preventing page splits. The cost
is pretty low: the risk that the single heap page check will be a
wasted effort. (Of course, we still have to visit the heap pages of
things that we go on to kill, to get the XIDs to generate recovery
conflicts -- the important point is that we only need to visit one
heap page in _bt_vacuum_one_page(), to *decide* if it's possible to do
all this -- cases that don't benefit at all also don't pay very much.)
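
The decision rule described above can be sketched as follows. This is a toy model under stated assumptions, not a proposal for the real _bt_vacuum_one_page(): ToyVersion and toy_kill_old_versions() are hypothetical, and the visible_to_all flag stands in for the result of the single heap page visit. The point it illustrates is that one cheap check on the latest version decides whether every other version of the same unique key can be treated as garbage.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for one version of a row that shares a unique key. */
typedef struct ToyVersion
{
	bool		is_latest;		/* newest version of the logical row */
	bool		visible_to_all; /* hypothetical result of one heap visit */
	bool		lp_dead;		/* already marked dead in the index */
} ToyVersion;

/*
 * If the latest version is visible to every possible snapshot, mark all
 * other versions dead.  Returns the number of versions newly marked dead;
 * returns 0 (having changed nothing) when the check does not pass.
 */
static size_t
toy_kill_old_versions(ToyVersion *versions, size_t nversions)
{
	size_t		latest = nversions; /* "not found" sentinel */
	size_t		killed = 0;

	assert(versions != NULL || nversions == 0);

	for (size_t i = 0; i < nversions; i++)
	{
		if (versions[i].is_latest)
		{
			latest = i;
			break;
		}
	}

	/* The single check that decides whether any of this is possible */
	if (latest == nversions || !versions[latest].visible_to_all)
		return 0;

	for (size_t i = 0; i < nversions; i++)
	{
		if (i != latest && !versions[i].lp_dead)
		{
			versions[i].lp_dead = true;
			killed++;
		}
	}

	return killed;
}
```

The sketch deliberately leaves out the follow-up heap visits needed to compute XIDs for recovery conflicts; it only models the up-front decision.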
I don't think that we need to do either of these two other things to
justify committing the patch with unique index support. But, teaching
nbtree a little bit about versioning like this could work rather well
in practice, without it really mattering that it will get the wrong
idea at times (e.g. when transactions abort a lot). This all seems
promising as a family of techniques for unique indexes. It's worth
doing extra work if it might delay a page split, since delaying can
actually fully prevent page splits that are mostly caused by non-HOT
updates. Most primary key indexes are serial PKs, or some kind of
counter. Postgres should mostly do page splits for these kinds of
primary key indexes in the places that make sense based on the
dataset, and not because of "write amplification".
--
Peter Geoghegan
Attachments:
v17-0005-Reintroduce-unique-index-support.patch (application/octet-stream)
From 4884272644e6772a3f5ae9d87fae2236e5ac1f01 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Mon, 23 Sep 2019 20:28:20 -0700
Subject: [PATCH v17 5/5] Reintroduce unique index support
---
src/backend/access/nbtree/nbtinsert.c | 70 +++++++++++++++++++++++----
1 file changed, 60 insertions(+), 10 deletions(-)
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index eb9655bb78..1912fe9ee4 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -434,15 +434,36 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
if (!ItemIdIsDead(curitemid))
{
ItemPointerData htid;
+ bool posting;
bool all_dead;
+ bool posting_all_dead;
+ int npost;
+
if (_bt_compare(rel, itup_key, page, offset) != 0)
break; /* we're past all the equal tuples */
/* okay, we gotta fetch the heap tuple ... */
curitup = (IndexTuple) PageGetItem(page, curitemid);
- Assert(!BTreeTupleIsPosting(curitup));
- htid = curitup->t_tid;
+
+ if (!BTreeTupleIsPosting(curitup))
+ {
+ htid = curitup->t_tid;
+ posting = false;
+ posting_all_dead = true;
+ }
+ else
+ {
+ posting = true;
+ /* Initial assumption */
+ posting_all_dead = true;
+ }
+
+ npost = 0;
+doposttup:
+ if (posting)
+ htid = *BTreeTupleGetPostingN(curitup, npost);
+
/*
* If we are doing a recheck, we expect to find the tuple we
@@ -453,6 +474,9 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
ItemPointerCompare(&htid, &itup->t_tid) == 0)
{
found = true;
+ posting_all_dead = false;
+ if (posting)
+ goto nextpost;
}
/*
@@ -518,8 +542,7 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
* not part of this chain because it had a different index
* entry.
*/
- htid = itup->t_tid;
- if (table_index_fetch_tuple_check(heapRel, &htid,
+ if (table_index_fetch_tuple_check(heapRel, &itup->t_tid,
SnapshotSelf, NULL))
{
/* Normal case --- it's still live */
@@ -577,7 +600,7 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
RelationGetRelationName(rel))));
}
}
- else if (all_dead)
+ else if (all_dead && !posting)
{
/*
* The conflicting tuple (or whole HOT chain) is dead to
@@ -596,6 +619,35 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
else
MarkBufferDirtyHint(insertstate->buf, true);
}
+ else if (posting)
+ {
+nextpost:
+ if (!all_dead)
+ posting_all_dead = false;
+
+ /* Iterate over single posting list tuple */
+ npost++;
+ if (npost < BTreeTupleGetNPosting(curitup))
+ goto doposttup;
+
+ /*
+ * Mark posting tuple dead if all hot chains whose root is
+ * contained in posting tuple have tuples that are all
+ * dead
+ */
+ if (posting_all_dead)
+ {
+ ItemIdMarkDead(curitemid);
+ opaque->btpo_flags |= BTP_HAS_GARBAGE;
+
+ if (nbuf != InvalidBuffer)
+ MarkBufferDirtyHint(nbuf, true);
+ else
+ MarkBufferDirtyHint(insertstate->buf, true);
+ }
+
+ /* Move on to next index tuple */
+ }
}
}
@@ -770,7 +822,7 @@ _bt_findinsertloc(Relation rel,
insertstate->bounds_valid = false;
}
- if (!checkingunique && PageGetFreeSpace(page) < insertstate->itemsz)
+ if (PageGetFreeSpace(page) < insertstate->itemsz)
{
_bt_dedup_one_page(rel, insertstate->buf, heapRel,
insertstate->itemsz);
@@ -2615,12 +2667,10 @@ _bt_dedup_one_page(Relation rel, Buffer buffer, Relation heapRel,
Size pagesaving = 0;
/*
- * Don't use deduplication for indexes with INCLUDEd columns and unique
- * indexes
+ * Don't use deduplication for indexes with INCLUDEd columns
*/
deduplicate = (IndexRelationGetNumberOfKeyAttributes(rel) ==
- IndexRelationGetNumberOfAttributes(rel) &&
- !rel->rd_index->indisunique);
+ IndexRelationGetNumberOfAttributes(rel));
if (!deduplicate)
return;
--
2.17.1
24.09.2019 3:13, Peter Geoghegan wrote:
On Wed, Sep 18, 2019 at 7:25 PM Peter Geoghegan <pg@bowt.ie> wrote:
I attach version 16. This revision merges your recent work on WAL
logging with my recent work on simplifying _bt_dedup_one_page(). See
my e-mail from earlier today for details.
I attach version 17. This version has changes that are focussed on
further polishing certain things, including fixing some minor bugs. It
seemed worth creating a new version for that. (I didn't get very far
with the space utilization stuff I talked about, so no changes there.)
Attached is v18. In this version _bt_dedup_one_page() is refactored so that:
- no temp page is used; all updates are applied to the original page.
- each posting tuple is WAL-logged separately.
This also made it possible to simplify btree_xlog_dedup significantly.
Another infrastructure thing that the patch needs to handle to be committable:
We still haven't added an "off" switch to deduplication, which seems
necessary. I suppose that this should look like GIN's "fastupdate"
storage parameter. It's not obvious how to do this in a way that's
easy to work with, though. Maybe we could do something like copy GIN's
GinGetUseFastUpdate() macro, but the situation with nbtree is actually
quite different. There are two questions for nbtree when it comes to
deduplication within an index: 1) Does the user want to use
deduplication, because that will help performance?, and 2) Is it
safe/possible to use deduplication at all?
I'll send another version with dedup option soon.
I think that we should probably stash this information (deduplication
is both possible and safe) in the metapage. Maybe we can copy it over
to our insertion scankey, just like the "heapkeyspace" field -- that
information also comes from the metapage (it's based on the nbtree
version). The "heapkeyspace" field is a bit ugly, so maybe we
shouldn't go further by adding something similar, but I don't see any
great alternative right now.
Why is it necessary to store this information anywhere other than
rel->rd_options, when we can easily access that field from
_bt_findinsertloc() and _bt_load()?
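
The two questions raised above (does the user want deduplication, and is it safe at all) can be combined in a small sketch like the one below. This is only an illustration: ToyIndexInfo, toy_dedup_is_safe() and toy_use_dedup() are hypothetical names, and user_wants_dedup stands in for a storage parameter that does not exist in v18. The safety test mirrors the patch's existing rule that indexes with INCLUDEd columns are not deduplicated.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for the per-index facts that the decision depends on. */
typedef struct ToyIndexInfo
{
	int			nkeyatts;		/* number of key attributes */
	int			natts;			/* key attributes plus INCLUDEd columns */
	bool		user_wants_dedup;	/* hypothetical storage parameter */
} ToyIndexInfo;

/* Question 2: is deduplication safe/possible for this index at all? */
static bool
toy_dedup_is_safe(const ToyIndexInfo *info)
{
	assert(info->nkeyatts <= info->natts);

	/* Same rule as the patch: no INCLUDEd (non-key) columns */
	return info->nkeyatts == info->natts;
}

/* Both answers must be "yes" before deduplication is attempted. */
static bool
toy_use_dedup(const ToyIndexInfo *info)
{
	return info->user_wants_dedup && toy_dedup_is_safe(info);
}
```

Keeping the two predicates separate leaves open where each answer is stored: the user's preference could come from reloptions, while the safety answer could be cached elsewhere (for example the metapage), as discussed in this thread.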
--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachments:
v18-0001-Add-deduplication-to-nbtree.patch (text/x-patch)
diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index 05e7d67..d65e2a7 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -145,6 +145,7 @@ static void bt_tuple_present_callback(Relation index, HeapTuple htup,
bool tupleIsAlive, void *checkstate);
static IndexTuple bt_normalize_tuple(BtreeCheckState *state,
IndexTuple itup);
+static inline IndexTuple bt_posting_logical_tuple(IndexTuple itup, int n);
static bool bt_rootdescend(BtreeCheckState *state, IndexTuple itup);
static inline bool offset_is_negative_infinity(BTPageOpaque opaque,
OffsetNumber offset);
@@ -419,12 +420,13 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
/*
* Size Bloom filter based on estimated number of tuples in index,
* while conservatively assuming that each block must contain at least
- * MaxIndexTuplesPerPage / 5 non-pivot tuples. (Non-leaf pages cannot
- * contain non-pivot tuples. That's okay because they generally make
- * up no more than about 1% of all pages in the index.)
+ * MaxPostingIndexTuplesPerPage / 3 "logical" tuples. heapallindexed
+ * verification fingerprints posting list heap TIDs as plain non-pivot
+ * tuples, complete with index keys. This allows its heap scan to
+ * behave as if posting lists do not exist.
*/
total_pages = RelationGetNumberOfBlocks(rel);
- total_elems = Max(total_pages * (MaxIndexTuplesPerPage / 5),
+ total_elems = Max(total_pages * (MaxPostingIndexTuplesPerPage / 3),
(int64) state->rel->rd_rel->reltuples);
/* Random seed relies on backend srandom() call to avoid repetition */
seed = random();
@@ -924,6 +926,7 @@ bt_target_page_check(BtreeCheckState *state)
size_t tupsize;
BTScanInsert skey;
bool lowersizelimit;
+ ItemPointer scantid;
CHECK_FOR_INTERRUPTS();
@@ -994,29 +997,73 @@ bt_target_page_check(BtreeCheckState *state)
/*
* Readonly callers may optionally verify that non-pivot tuples can
- * each be found by an independent search that starts from the root
+ * each be found by an independent search that starts from the root.
+ * Note that we deliberately don't do individual searches for each
+ * "logical" posting list tuple, since the posting list itself is
+ * validated by other checks.
*/
if (state->rootdescend && P_ISLEAF(topaque) &&
!bt_rootdescend(state, itup))
{
char *itid,
*htid;
+ ItemPointer tid = BTreeTupleGetHeapTID(itup);
itid = psprintf("(%u,%u)", state->targetblock, offset);
htid = psprintf("(%u,%u)",
- ItemPointerGetBlockNumber(&(itup->t_tid)),
- ItemPointerGetOffsetNumber(&(itup->t_tid)));
+ ItemPointerGetBlockNumber(tid),
+ ItemPointerGetOffsetNumber(tid));
ereport(ERROR,
(errcode(ERRCODE_INDEX_CORRUPTED),
errmsg("could not find tuple using search from root page in index \"%s\"",
RelationGetRelationName(state->rel)),
- errdetail_internal("Index tid=%s points to heap tid=%s page lsn=%X/%X.",
+ errdetail_internal("Index tid=%s min heap tid=%s page lsn=%X/%X.",
itid, htid,
(uint32) (state->targetlsn >> 32),
(uint32) state->targetlsn)));
}
+ /*
+ * If tuple is actually a posting list, make sure posting list TIDs
+ * are in order.
+ */
+ if (BTreeTupleIsPosting(itup))
+ {
+ ItemPointerData last;
+ ItemPointer current;
+
+ ItemPointerCopy(BTreeTupleGetHeapTID(itup), &last);
+
+ for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
+ {
+
+ current = BTreeTupleGetPostingN(itup, i);
+
+ if (ItemPointerCompare(current, &last) <= 0)
+ {
+ char *itid,
+ *htid;
+
+ itid = psprintf("(%u,%u)", state->targetblock, offset);
+ htid = psprintf("(%u,%u)",
+ ItemPointerGetBlockNumberNoCheck(current),
+ ItemPointerGetOffsetNumberNoCheck(current));
+
+ ereport(ERROR,
+ (errcode(ERRCODE_INDEX_CORRUPTED),
+ errmsg("posting list heap TIDs out of order in index \"%s\"",
+ RelationGetRelationName(state->rel)),
+ errdetail_internal("Index tid=%s min heap tid=%s page lsn=%X/%X.",
+ itid, htid,
+ (uint32) (state->targetlsn >> 32),
+ (uint32) state->targetlsn)));
+ }
+
+ ItemPointerCopy(current, &last);
+ }
+ }
+
/* Build insertion scankey for current page offset */
skey = bt_mkscankey_pivotsearch(state->rel, itup);
@@ -1074,12 +1121,32 @@ bt_target_page_check(BtreeCheckState *state)
{
IndexTuple norm;
- norm = bt_normalize_tuple(state, itup);
- bloom_add_element(state->filter, (unsigned char *) norm,
- IndexTupleSize(norm));
- /* Be tidy */
- if (norm != itup)
- pfree(norm);
+ if (BTreeTupleIsPosting(itup))
+ {
+ /* Fingerprint all elements as distinct "logical" tuples */
+ for (int i = 0; i < BTreeTupleGetNPosting(itup); i++)
+ {
+ IndexTuple logtuple;
+
+ logtuple = bt_posting_logical_tuple(itup, i);
+ norm = bt_normalize_tuple(state, logtuple);
+ bloom_add_element(state->filter, (unsigned char *) norm,
+ IndexTupleSize(norm));
+ /* Be tidy */
+ if (norm != logtuple)
+ pfree(norm);
+ pfree(logtuple);
+ }
+ }
+ else
+ {
+ norm = bt_normalize_tuple(state, itup);
+ bloom_add_element(state->filter, (unsigned char *) norm,
+ IndexTupleSize(norm));
+ /* Be tidy */
+ if (norm != itup)
+ pfree(norm);
+ }
}
/*
@@ -1087,7 +1154,8 @@ bt_target_page_check(BtreeCheckState *state)
*
* If there is a high key (if this is not the rightmost page on its
* entire level), check that high key actually is upper bound on all
- * page items.
+ * page items. If this is a posting list tuple, we'll need to set
+ * scantid to be highest TID in posting list.
*
* We prefer to check all items against high key rather than checking
* just the last and trusting that the operator class obeys the
@@ -1127,6 +1195,9 @@ bt_target_page_check(BtreeCheckState *state)
* tuple. (See also: "Notes About Data Representation" in the nbtree
* README.)
*/
+ scantid = skey->scantid;
+ if (state->heapkeyspace && !BTreeTupleIsPivot(itup))
+ skey->scantid = BTreeTupleGetMaxTID(itup);
if (!P_RIGHTMOST(topaque) &&
!(P_ISLEAF(topaque) ? invariant_leq_offset(state, skey, P_HIKEY) :
invariant_l_offset(state, skey, P_HIKEY)))
@@ -1150,6 +1221,7 @@ bt_target_page_check(BtreeCheckState *state)
(uint32) (state->targetlsn >> 32),
(uint32) state->targetlsn)));
}
+ skey->scantid = scantid;
/*
* * Item order check *
@@ -1164,11 +1236,13 @@ bt_target_page_check(BtreeCheckState *state)
*htid,
*nitid,
*nhtid;
+ ItemPointer tid;
itid = psprintf("(%u,%u)", state->targetblock, offset);
+ tid = BTreeTupleGetHeapTID(itup);
htid = psprintf("(%u,%u)",
- ItemPointerGetBlockNumberNoCheck(&(itup->t_tid)),
- ItemPointerGetOffsetNumberNoCheck(&(itup->t_tid)));
+ ItemPointerGetBlockNumberNoCheck(tid),
+ ItemPointerGetOffsetNumberNoCheck(tid));
nitid = psprintf("(%u,%u)", state->targetblock,
OffsetNumberNext(offset));
@@ -1177,9 +1251,11 @@ bt_target_page_check(BtreeCheckState *state)
state->target,
OffsetNumberNext(offset));
itup = (IndexTuple) PageGetItem(state->target, itemid);
+
+ tid = BTreeTupleGetHeapTID(itup);
nhtid = psprintf("(%u,%u)",
- ItemPointerGetBlockNumberNoCheck(&(itup->t_tid)),
- ItemPointerGetOffsetNumberNoCheck(&(itup->t_tid)));
+ ItemPointerGetBlockNumberNoCheck(tid),
+ ItemPointerGetOffsetNumberNoCheck(tid));
ereport(ERROR,
(errcode(ERRCODE_INDEX_CORRUPTED),
@@ -1189,10 +1265,10 @@ bt_target_page_check(BtreeCheckState *state)
"higher index tid=%s (points to %s tid=%s) "
"page lsn=%X/%X.",
itid,
- P_ISLEAF(topaque) ? "heap" : "index",
+ P_ISLEAF(topaque) ? "min heap" : "index",
htid,
nitid,
- P_ISLEAF(topaque) ? "heap" : "index",
+ P_ISLEAF(topaque) ? "min heap" : "index",
nhtid,
(uint32) (state->targetlsn >> 32),
(uint32) state->targetlsn)));
@@ -1953,10 +2029,10 @@ bt_tuple_present_callback(Relation index, HeapTuple htup, Datum *values,
* verification. In particular, it won't try to normalize opclass-equal
* datums with potentially distinct representations (e.g., btree/numeric_ops
* index datums will not get their display scale normalized-away here).
- * Normalization may need to be expanded to handle more cases in the future,
- * though. For example, it's possible that non-pivot tuples could in the
- * future have alternative logically equivalent representations due to using
- * the INDEX_ALT_TID_MASK bit to implement intelligent deduplication.
+ * Caller does normalization for non-pivot tuples that have a posting list,
+ * since dummy CREATE INDEX callback code generates new tuples with the same
+ * normalized representation. Deduplication is performed opportunistically,
+ * and in general there is no guarantee about how or when it will be applied.
*/
static IndexTuple
bt_normalize_tuple(BtreeCheckState *state, IndexTuple itup)
@@ -1969,6 +2045,9 @@ bt_normalize_tuple(BtreeCheckState *state, IndexTuple itup)
IndexTuple reformed;
int i;
+ /* Caller should only pass "logical" non-pivot tuples here */
+ Assert(!BTreeTupleIsPosting(itup) && !BTreeTupleIsPivot(itup));
+
/* Easy case: It's immediately clear that tuple has no varlena datums */
if (!IndexTupleHasVarwidths(itup))
return itup;
@@ -2032,6 +2111,30 @@ bt_normalize_tuple(BtreeCheckState *state, IndexTuple itup)
}
/*
+ * Produce palloc()'d "logical" tuple for nth posting list entry.
+ *
+ * In general, deduplication is not supposed to change the logical contents of
+ * an index. Multiple logical index tuples are folded together into one
+ * physical posting list index tuple when convenient.
+ *
+ * heapallindexed verification must normalize-away this variation in
+ * representation by converting posting list tuples into two or more "logical"
+ * tuples. Each logical tuple must be fingerprinted separately -- there must
+ * be one logical tuple for each corresponding Bloom filter probe during the
+ * heap scan.
+ *
+ * Note: Caller needs to call bt_normalize_tuple() with returned tuple.
+ */
+static inline IndexTuple
+bt_posting_logical_tuple(IndexTuple itup, int n)
+{
+ Assert(BTreeTupleIsPosting(itup));
+
+ /* Returns non-posting-list tuple */
+ return BTreeFormPostingTuple(itup, BTreeTupleGetPostingN(itup, n), 1);
+}
+
+/*
* Search for itup in index, starting from fast root page. itup must be a
* non-pivot tuple. This is only supported with heapkeyspace indexes, since
* we rely on having fully unique keys to find a match with only a single
@@ -2087,6 +2190,7 @@ bt_rootdescend(BtreeCheckState *state, IndexTuple itup)
insertstate.itup = itup;
insertstate.itemsz = MAXALIGN(IndexTupleSize(itup));
insertstate.itup_key = key;
+ insertstate.postingoff = 0;
insertstate.bounds_valid = false;
insertstate.buf = lbuf;
@@ -2094,7 +2198,9 @@ bt_rootdescend(BtreeCheckState *state, IndexTuple itup)
offnum = _bt_binsrch_insert(state->rel, &insertstate);
/* Compare first >= matching item on leaf page, if any */
page = BufferGetPage(lbuf);
+ /* Should match on first heap TID when tuple has a posting list */
if (offnum <= PageGetMaxOffsetNumber(page) &&
+ insertstate.postingoff <= 0 &&
_bt_compare(state->rel, key, page, offnum) == 0)
exists = true;
_bt_relbuf(state->rel, lbuf);
@@ -2560,14 +2666,18 @@ static inline ItemPointer
BTreeTupleGetHeapTIDCareful(BtreeCheckState *state, IndexTuple itup,
bool nonpivot)
{
- ItemPointer result = BTreeTupleGetHeapTID(itup);
+ ItemPointer result;
BlockNumber targetblock = state->targetblock;
- if (result == NULL && nonpivot)
+ /* Shouldn't be called with !heapkeyspace index */
+ Assert(state->heapkeyspace);
+ if (BTreeTupleIsPivot(itup) == nonpivot)
ereport(ERROR,
(errcode(ERRCODE_INDEX_CORRUPTED),
errmsg("block %u or its right sibling block or child block in index \"%s\" contains non-pivot tuple that lacks a heap TID",
targetblock, RelationGetRelationName(state->rel))));
+ result = BTreeTupleGetHeapTID(itup);
+
return result;
}
diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
index 2599b5d..6e1dc59 100644
--- a/src/backend/access/index/genam.c
+++ b/src/backend/access/index/genam.c
@@ -276,6 +276,10 @@ BuildIndexValueDescription(Relation indexRelation,
/*
* Get the latestRemovedXid from the table entries pointed at by the index
* tuples being deleted.
+ *
+ * Note: index access methods that don't consistently use the standard
+ * IndexTuple + heap TID item pointer representation will need to provide
+ * their own version of this function.
*/
TransactionId
index_compute_xid_horizon_for_tuples(Relation irel,
diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index 6db203e..54cb9db 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -432,7 +432,10 @@ because we allow LP_DEAD to be set with only a share lock (it's exactly
like a hint bit for a heap tuple), but physically removing tuples requires
exclusive lock. In the current code we try to remove LP_DEAD tuples when
we are otherwise faced with having to split a page to do an insertion (and
-hence have exclusive lock on it already).
+hence have exclusive lock on it already). Deduplication can also prevent
+a page split, but removing LP_DEAD tuples is the preferred approach.
+(Note that posting list tuples can only have their LP_DEAD bit set when
+every "logical" tuple represented within the posting list is known dead.)
This leaves the index in a state where it has no entry for a dead tuple
that still exists in the heap. This is not a problem for the current
@@ -710,6 +713,75 @@ the fallback strategy assumes that duplicates are mostly inserted in
ascending heap TID order. The page is split in a way that leaves the left
half of the page mostly full, and the right half of the page mostly empty.
+Notes about deduplication
+-------------------------
+
+We deduplicate non-pivot tuples in non-unique indexes to reduce storage
+overhead, and to avoid or at least delay page splits. Deduplication alters
+the physical representation of tuples without changing the logical contents
+of the index, and without adding overhead to read queries. Non-pivot
+tuples are folded together into a single physical tuple with a posting list
+(a simple array of heap TIDs with the standard item pointer format).
+Deduplication is always applied lazily, at the point where it would
+otherwise be necessary to perform a page split. It occurs only when
+LP_DEAD items have been removed, as our last line of defense against
+splitting a leaf page. We can set the LP_DEAD bit with posting list
+tuples, though only when all table tuples are known dead. (Bitmap scans
+cannot perform LP_DEAD bit setting, and are the common case with indexes
+that contain lots of duplicates, so this downside is considered
+acceptable.)
+
+Large groups of logical duplicates tend to appear together on the same leaf
+page due to the special duplicate logic used when choosing a split point.
+This facilitates lazy/dynamic deduplication. Deduplication can reliably
+deduplicate a large localized group of duplicates before it can span
+multiple leaf pages. Posting list tuples are subject to the same 1/3 of a
+page restriction as any other tuple.
+
+Lazy deduplication allows the page space accounting used during page splits
+to have absolutely minimal special case logic for posting lists. A posting
+list can be thought of as extra payload that suffix truncation will
+reliably truncate away as needed during page splits, just like non-key
+columns from an INCLUDE index tuple. An incoming tuple (which might cause
+a page split) can always be thought of as a non-posting-list tuple that
+must be inserted alongside existing items, without needing to consider
+deduplication. Most of the time, that's what actually happens: incoming
+tuples are either not duplicates, or are duplicates with a heap TID that
+doesn't overlap with any existing posting list tuple. When the incoming
+tuple really does overlap with an existing posting list, a posting list
+split is performed. Posting list splits work in a way that more or less
+preserves the illusion that all incoming tuples do not need to be merged
+with any existing posting list tuple.
+
+Posting list splits work by "overriding" the details of the incoming tuple.
+The heap TID of the incoming tuple is altered to make it match the
+rightmost heap TID from the existing/originally overlapping posting list.
+The offset number that the new/incoming tuple is to be inserted at is
+incremented so that it will be inserted to the right of the existing
+posting list. The insertion (or page split) operation that completes the
+insert does one extra step: an in-place update of the posting list. The
+update changes the posting list such that the "true" heap TID from the
+original incoming tuple is now contained in the posting list. We make
+space in the posting list by removing the heap TID that became the new
+item. The size of the posting list won't change, and so the page split
+space accounting does not need to care about posting lists. Also, overall
+space utilization is improved by keeping existing posting lists large.
+
+The representation of posting lists is identical to the posting lists used
+by GIN, so it would be straightforward to apply GIN's varbyte encoding
+compression scheme to individual posting lists. Posting list compression
+would break the assumptions made by posting list splits about page space
+accounting, though, so it's not clear how compression could be integrated
+with nbtree. Besides, posting list compression does not offer a compelling
+trade-off for nbtree, since in general nbtree is optimized for consistent
+performance with many concurrent readers and writers. A major goal of
+nbtree's lazy approach to deduplication is to limit the performance impact
+of deduplication with random updates. Even concurrent append-only inserts
+of the same key value will tend to have inserts of individual index tuples
+in an order that doesn't quite match heap TID order. In general, delaying
+deduplication avoids many unnecessary posting list splits, and minimizes
+page level fragmentation.
+
Notes About Data Representation
-------------------------------
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index b84bf1c..c81f545 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -47,21 +47,26 @@ static void _bt_insertonpg(Relation rel, BTScanInsert itup_key,
BTStack stack,
IndexTuple itup,
OffsetNumber newitemoff,
+ int postingoff,
bool split_only_page);
static Buffer _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf,
Buffer cbuf, OffsetNumber newitemoff, Size newitemsz,
- IndexTuple newitem);
+ IndexTuple newitem, IndexTuple orignewitem,
+ IndexTuple nposting, OffsetNumber postingoff);
static void _bt_insert_parent(Relation rel, Buffer buf, Buffer rbuf,
BTStack stack, bool is_root, bool is_only);
static bool _bt_pgaddtup(Page page, Size itemsize, IndexTuple itup,
OffsetNumber itup_off);
static void _bt_vacuum_one_page(Relation rel, Buffer buffer, Relation heapRel);
+static void _bt_dedup_one_page(Relation rel, Buffer buffer, Relation heapRel,
+ Size newitemsz);
/*
* _bt_doinsert() -- Handle insertion of a single index tuple in the tree.
*
* This routine is called by the public interface routine, btinsert.
- * By here, itup is filled in, including the TID.
+ * By here, itup is filled in, including the TID. Caller should be
+ * prepared for us to scribble on 'itup'.
*
* If checkUnique is UNIQUE_CHECK_NO or UNIQUE_CHECK_PARTIAL, this
* will allow duplicates. Otherwise (UNIQUE_CHECK_YES or
@@ -123,6 +128,7 @@ _bt_doinsert(Relation rel, IndexTuple itup,
/* PageAddItem will MAXALIGN(), but be consistent */
insertstate.itemsz = MAXALIGN(IndexTupleSize(itup));
insertstate.itup_key = itup_key;
+ insertstate.postingoff = 0;
insertstate.bounds_valid = false;
insertstate.buf = InvalidBuffer;
@@ -300,7 +306,7 @@ top:
newitemoff = _bt_findinsertloc(rel, &insertstate, checkingunique,
stack, heapRel);
_bt_insertonpg(rel, itup_key, insertstate.buf, InvalidBuffer, stack,
- itup, newitemoff, false);
+ itup, newitemoff, insertstate.postingoff, false);
}
else
{
@@ -435,6 +441,7 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
/* okay, we gotta fetch the heap tuple ... */
curitup = (IndexTuple) PageGetItem(page, curitemid);
+ Assert(!BTreeTupleIsPosting(curitup));
htid = curitup->t_tid;
/*
@@ -689,6 +696,7 @@ _bt_findinsertloc(Relation rel,
BTScanInsert itup_key = insertstate->itup_key;
Page page = BufferGetPage(insertstate->buf);
BTPageOpaque lpageop;
+ OffsetNumber location;
lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -751,13 +759,23 @@ _bt_findinsertloc(Relation rel,
/*
* If the target page is full, see if we can obtain enough space by
- * erasing LP_DEAD items
+ * erasing LP_DEAD items. If that doesn't work out, and if the index
+ * isn't a unique index, try deduplication.
*/
- if (PageGetFreeSpace(page) < insertstate->itemsz &&
- P_HAS_GARBAGE(lpageop))
+ if (PageGetFreeSpace(page) < insertstate->itemsz)
{
- _bt_vacuum_one_page(rel, insertstate->buf, heapRel);
- insertstate->bounds_valid = false;
+ if (P_HAS_GARBAGE(lpageop))
+ {
+ _bt_vacuum_one_page(rel, insertstate->buf, heapRel);
+ insertstate->bounds_valid = false;
+ }
+
+ if (!checkingunique && PageGetFreeSpace(page) < insertstate->itemsz)
+ {
+ _bt_dedup_one_page(rel, insertstate->buf, heapRel,
+ insertstate->itemsz);
+ insertstate->bounds_valid = false; /* paranoia */
+ }
}
}
else
@@ -839,7 +857,31 @@ _bt_findinsertloc(Relation rel,
Assert(P_RIGHTMOST(lpageop) ||
_bt_compare(rel, itup_key, page, P_HIKEY) <= 0);
- return _bt_binsrch_insert(rel, insertstate);
+ location = _bt_binsrch_insert(rel, insertstate);
+
+ /*
+ * Insertion is not prepared for the case where an LP_DEAD posting list
+ * tuple must be split. In the unlikely event that this happens, call
+ * _bt_dedup_one_page() to force it to kill all LP_DEAD items.
+ */
+ if (unlikely(insertstate->postingoff == -1))
+ {
+ _bt_dedup_one_page(rel, insertstate->buf, heapRel, 0);
+ Assert(!P_HAS_GARBAGE(lpageop));
+
+ /* Must reset insertstate ahead of new _bt_binsrch_insert() call */
+ insertstate->bounds_valid = false;
+ insertstate->postingoff = 0;
+ location = _bt_binsrch_insert(rel, insertstate);
+
+ /*
+ * Might still have to split some other posting list now, but that
+ * should never be LP_DEAD
+ */
+ Assert(insertstate->postingoff >= 0);
+ }
+
+ return location;
}
/*
@@ -900,15 +942,81 @@ _bt_stepright(Relation rel, BTInsertState insertstate, BTStack stack)
insertstate->bounds_valid = false;
}
+/*
+ * Form a new posting list during a posting split.
+ *
+ * If caller determines that its new tuple 'newitem' is a duplicate with a
+ * heap TID that falls inside the range of an existing posting list tuple
+ * 'oposting', it must generate a new posting tuple to replace the original.
+ * The new posting list is guaranteed to be the same size as the original.
+ * Caller must also change newitem to have the heap TID of the rightmost TID
+ * in the original posting list. Both steps are always handled by calling
+ * here.
+ *
+ * Returns new posting list palloc()'d in caller's context. Also modifies
+ * caller's newitem to contain final/effective heap TID, which is what caller
+ * actually inserts on the page.
+ *
+ * Exported for use by recovery. Note that recovery path must recreate the
+ * same version of newitem that is passed here on the primary, even though
+ * that differs from the final newitem actually added to the page. This
+ * optimization avoids explicit WAL-logging of entire posting lists, which
+ * tend to be rather large.
+ */
+IndexTuple
+_bt_posting_split(IndexTuple newitem, IndexTuple oposting,
+ OffsetNumber postingoff)
+{
+ int nhtids;
+ char *replacepos;
+ char *rightpos;
+ Size nbytes;
+ IndexTuple nposting;
+
+ Assert(BTreeTupleIsPosting(oposting));
+ nhtids = BTreeTupleGetNPosting(oposting);
+ Assert(postingoff < nhtids);
+
+ nposting = CopyIndexTuple(oposting);
+ replacepos = (char *) BTreeTupleGetPostingN(nposting, postingoff);
+ rightpos = replacepos + sizeof(ItemPointerData);
+ nbytes = (nhtids - postingoff - 1) * sizeof(ItemPointerData);
+
+ /*
+ * Move item pointers in posting list to make a gap for the new item's
+ * heap TID (shift TIDs one place to the right, losing original rightmost
+ * TID).
+ */
+ memmove(rightpos, replacepos, nbytes);
+
+ /*
+ * Fill the gap with the TID of the new item.
+ */
+ ItemPointerCopy(&newitem->t_tid, (ItemPointer) replacepos);
+
+ /*
+ * Copy last TID of oposting (the original, not nposting) into new item
+ */
+ ItemPointerCopy(BTreeTupleGetPostingN(oposting, nhtids - 1),
+ &newitem->t_tid);
+ Assert(ItemPointerCompare(BTreeTupleGetMaxTID(nposting),
+ BTreeTupleGetHeapTID(newitem)) < 0);
+ Assert(BTreeTupleGetNPosting(nposting) == BTreeTupleGetNPosting(oposting));
+
+ return nposting;
+}
+
/*----------
* _bt_insertonpg() -- Insert a tuple on a particular page in the index.
*
* This recursive procedure does the following things:
*
+ * + if necessary, splits an existing posting list on page.
+ * This is only needed when 'postingoff' is non-zero.
* + if necessary, splits the target page, using 'itup_key' for
* suffix truncation on leaf pages (caller passes NULL for
* non-leaf pages).
- * + inserts the tuple.
+ * + inserts the new tuple (could be from split posting list).
* + if the page was split, pops the parent stack, and finds the
* right place to insert the new child pointer (by walking
* right using information stored in the parent stack).
@@ -918,7 +1026,8 @@ _bt_stepright(Relation rel, BTInsertState insertstate, BTStack stack)
*
* On entry, we must have the correct buffer in which to do the
* insertion, and the buffer must be pinned and write-locked. On return,
- * we will have dropped both the pin and the lock on the buffer.
+ * we will have dropped both the pin and the lock on the buffer. Caller
+ * should be prepared for us to scribble on 'itup'.
*
* This routine only performs retail tuple insertions. 'itup' should
* always be either a non-highkey leaf item, or a downlink (new high
@@ -936,11 +1045,15 @@ _bt_insertonpg(Relation rel,
BTStack stack,
IndexTuple itup,
OffsetNumber newitemoff,
+ int postingoff,
bool split_only_page)
{
Page page;
BTPageOpaque lpageop;
Size itemsz;
+ IndexTuple oposting;
+ IndexTuple origitup = NULL;
+ IndexTuple nposting = NULL;
page = BufferGetPage(buf);
lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -954,6 +1067,8 @@ _bt_insertonpg(Relation rel,
Assert(P_ISLEAF(lpageop) ||
BTreeTupleGetNAtts(itup, rel) <=
IndexRelationGetNumberOfKeyAttributes(rel));
+ /* retail insertions of posting list tuples are disallowed */
+ Assert(!BTreeTupleIsPosting(itup));
/* The caller should've finished any incomplete splits already. */
if (P_INCOMPLETE_SPLIT(lpageop))
@@ -965,6 +1080,46 @@ _bt_insertonpg(Relation rel,
* need to be consistent */
/*
+ * Do we need to split an existing posting list item?
+ */
+ if (postingoff != 0)
+ {
+ ItemId itemid = PageGetItemId(page, newitemoff);
+
+ /*
+ * The new tuple is a duplicate with a heap TID that falls inside the
+ * range of an existing posting list tuple, so split posting list.
+ *
+ * Posting list splits always replace some existing TID in the posting
+ * list with the new item's heap TID (based on a posting list offset
+ * from caller) by removing rightmost heap TID from posting list. The
+ * new item's heap TID is swapped with that rightmost heap TID, almost
+ * as if the tuple inserted never overlapped with a posting list in
+ * the first place. This allows the insertion and page split code to
+ * have minimal special case handling of posting lists.
+ *
+ * The only extra handling required is to overwrite the original
+ * posting list with nposting, which is guaranteed to be the same size
+ * as the original, keeping the page space accounting simple. This
+ * takes place in either the page insert or page split critical
+ * section.
+ */
+ Assert(P_ISLEAF(lpageop));
+ Assert(!ItemIdIsDead(itemid));
+ Assert(postingoff > 0);
+ oposting = (IndexTuple) PageGetItem(page, itemid);
+
+ /* save a copy of itup with unchanged TID to write it into xlog record */
+ origitup = CopyIndexTuple(itup);
+ nposting = _bt_posting_split(itup, oposting, postingoff);
+
+ Assert(BTreeTupleGetNPosting(nposting) ==
+ BTreeTupleGetNPosting(oposting));
+ /* Alter new item offset, since effective new item changed */
+ newitemoff = OffsetNumberNext(newitemoff);
+ }
+
+ /*
* Do we need to split the page to fit the item on it?
*
* Note: PageGetFreeSpace() subtracts sizeof(ItemIdData) from its result,
@@ -996,7 +1151,8 @@ _bt_insertonpg(Relation rel,
BlockNumberIsValid(RelationGetTargetBlock(rel))));
/* split the buffer into left and right halves */
- rbuf = _bt_split(rel, itup_key, buf, cbuf, newitemoff, itemsz, itup);
+ rbuf = _bt_split(rel, itup_key, buf, cbuf, newitemoff, itemsz, itup,
+ origitup, nposting, postingoff);
PredicateLockPageSplit(rel,
BufferGetBlockNumber(buf),
BufferGetBlockNumber(rbuf));
@@ -1075,6 +1231,18 @@ _bt_insertonpg(Relation rel,
elog(PANIC, "failed to add new item to block %u in index \"%s\"",
itup_blkno, RelationGetRelationName(rel));
+ if (nposting)
+ {
+ /*
+ * Posting list split requires an in-place update of the existing
+ * posting list
+ */
+ Assert(P_ISLEAF(lpageop));
+ Assert(MAXALIGN(IndexTupleSize(oposting)) ==
+ MAXALIGN(IndexTupleSize(nposting)));
+ memcpy(oposting, nposting, MAXALIGN(IndexTupleSize(nposting)));
+ }
+
MarkBufferDirty(buf);
if (BufferIsValid(metabuf))
@@ -1116,6 +1284,7 @@ _bt_insertonpg(Relation rel,
XLogRecPtr recptr;
xlrec.offnum = itup_off;
+ xlrec.postingoff = postingoff;
XLogBeginInsert();
XLogRegisterData((char *) &xlrec, SizeOfBtreeInsert);
@@ -1152,7 +1321,19 @@ _bt_insertonpg(Relation rel,
}
XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
- XLogRegisterBufData(0, (char *) itup, IndexTupleSize(itup));
+
+ /*
+ * We always write newitem to the page, but when there is an
+ * original newitem due to a posting list split then we log the
+ * original item instead. REDO routine must reconstruct the final
+ * newitem at the same time it reconstructs nposting.
+ */
+ if (postingoff == 0)
+ XLogRegisterBufData(0, (char *) itup,
+ IndexTupleSize(itup));
+ else
+ XLogRegisterBufData(0, (char *) origitup,
+ IndexTupleSize(origitup));
recptr = XLogInsert(RM_BTREE_ID, xlinfo);
@@ -1194,6 +1375,13 @@ _bt_insertonpg(Relation rel,
_bt_getrootheight(rel) >= BTREE_FASTPATH_MIN_LEVEL)
RelationSetTargetBlock(rel, cachedBlock);
}
+
+ /* be tidy */
+ if (postingoff != 0)
+ {
+ pfree(nposting);
+ pfree(origitup);
+ }
}
/*
@@ -1209,12 +1397,25 @@ _bt_insertonpg(Relation rel,
* This function will clear the INCOMPLETE_SPLIT flag on it, and
* release the buffer.
*
+ * orignewitem, nposting, and postingoff are needed when an insert of
+ * orignewitem results in both a posting list split and a page split.
+ * newitem and nposting are replacements for orignewitem and the
+ * existing posting list on the page respectively. These extra
+ * posting list split details are used here in the same way as they
+ * are used in the more common case where a posting list split does
+ * not coincide with a page split. We need to deal with posting list
+ * splits directly in order to ensure that everything that follows
+ * from the insert of orignewitem is handled as a single atomic
+ * operation (though caller's insert of a new pivot/downlink into
+ * parent page will still be a separate operation).
+ *
* Returns the new right sibling of buf, pinned and write-locked.
* The pin and lock on buf are maintained.
*/
static Buffer
_bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
- OffsetNumber newitemoff, Size newitemsz, IndexTuple newitem)
+ OffsetNumber newitemoff, Size newitemsz, IndexTuple newitem,
+ IndexTuple orignewitem, IndexTuple nposting, OffsetNumber postingoff)
{
Buffer rbuf;
Page origpage;
@@ -1236,6 +1437,7 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
OffsetNumber firstright;
OffsetNumber maxoff;
OffsetNumber i;
+ OffsetNumber replacepostingoff = InvalidOffsetNumber;
bool newitemonleft,
isleaf;
IndexTuple lefthikey;
@@ -1243,6 +1445,13 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
int indnkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
/*
+ * Determine offset number of existing posting list on page when a split
+ * of a posting list needs to take place as the page is split
+ */
+ if (nposting != NULL)
+ replacepostingoff = OffsetNumberPrev(newitemoff);
+
+ /*
* origpage is the original page to be split. leftpage is a temporary
* buffer that receives the left-sibling data, which will be copied back
* into origpage on success. rightpage is the new page that will receive
@@ -1273,6 +1482,13 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
* newitemoff == firstright. In all other cases it's clear which side of
* the split every tuple goes on from context. newitemonleft is usually
* (but not always) redundant information.
+ *
+ * Note: In theory, the split point choice logic should operate against a
+ * version of the page that already replaced the posting list at offset
+ * replacepostingoff with nposting where applicable. We don't bother with
+ * that, though. Both versions of the posting list must be the same size,
+ * and both will have the same base tuple key values, so split point
+ * choice is never affected.
*/
firstright = _bt_findsplitloc(rel, origpage, newitemoff, newitemsz,
newitem, &newitemonleft);
@@ -1340,6 +1556,9 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
itemid = PageGetItemId(origpage, firstright);
itemsz = ItemIdGetLength(itemid);
item = (IndexTuple) PageGetItem(origpage, itemid);
+ /* Behave as if origpage posting list has already been swapped */
+ if (firstright == replacepostingoff)
+ item = nposting;
}
/*
@@ -1373,6 +1592,9 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
Assert(lastleftoff >= P_FIRSTDATAKEY(oopaque));
itemid = PageGetItemId(origpage, lastleftoff);
lastleft = (IndexTuple) PageGetItem(origpage, itemid);
+ /* Behave as if origpage posting list has already been swapped */
+ if (lastleftoff == replacepostingoff)
+ lastleft = nposting;
}
Assert(lastleft != item);
@@ -1480,8 +1702,23 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
itemsz = ItemIdGetLength(itemid);
item = (IndexTuple) PageGetItem(origpage, itemid);
+ /*
+ * did caller pass new replacement posting list tuple due to posting
+ * list split?
+ */
+ if (i == replacepostingoff)
+ {
+ /*
+ * swap origpage posting list with post-posting-list-split version
+ * from caller
+ */
+ Assert(isleaf);
+ Assert(itemsz == MAXALIGN(IndexTupleSize(nposting)));
+ item = nposting;
+ }
+
/* does new item belong before this one? */
- if (i == newitemoff)
+ else if (i == newitemoff)
{
if (newitemonleft)
{
@@ -1650,8 +1887,12 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
XLogRecPtr recptr;
xlrec.level = ropaque->btpo.level;
+ /* See comments below on newitem, orignewitem, and posting lists */
xlrec.firstright = firstright;
xlrec.newitemoff = newitemoff;
+ xlrec.postingoff = InvalidOffsetNumber;
+ if (replacepostingoff < firstright)
+ xlrec.postingoff = postingoff;
XLogBeginInsert();
XLogRegisterData((char *) &xlrec, SizeOfBtreeSplit);
@@ -1670,11 +1911,46 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
* because it's included with all the other items on the right page.)
* Show the new item as belonging to the left page buffer, so that it
* is not stored if XLogInsert decides it needs a full-page image of
- * the left page. We store the offset anyway, though, to support
- * archive compression of these records.
+ * the left page. We always store newitemoff in record, though.
+ *
+ * The details are often slightly different for page splits that
+ * coincide with a posting list split. If both the replacement
+ * posting list and newitem go on the right page, then we don't need
+ * to log anything extra, just like the simple !newitemonleft
+ * no-posting-split case (postingoff isn't set in the WAL record, so
+ * recovery can't even tell the difference). Otherwise, we set
+ * postingoff and log orignewitem instead of newitem, despite having
+ * actually inserted newitem. Recovery must reconstruct nposting and
+ * newitem by repeating the actions of our caller (i.e. by passing
+ * original posting list and orignewitem to _bt_posting_split()).
+ *
+ * Note: It's possible that our page split point is the point that
+ * makes the posting list lastleft and newitem firstright. This is
+ * the only case where we log orignewitem despite newitem going on the
+ * right page. If XLogInsert decides that it can omit orignewitem due
+ * to logging a full-page image of the left page, everything still
+ * works out, since recovery only needs orignewitem for items on the
+ * left page (just like the regular newitem-logged case).
*/
- if (newitemonleft)
- XLogRegisterBufData(0, (char *) newitem, MAXALIGN(newitemsz));
+ if (newitemonleft || xlrec.postingoff != InvalidOffsetNumber)
+ {
+ if (xlrec.postingoff == InvalidOffsetNumber)
+ {
+ /* Must WAL-log newitem, since it's on left page */
+ Assert(newitemonleft);
+ Assert(orignewitem == NULL && nposting == NULL);
+ XLogRegisterBufData(0, (char *) newitem, MAXALIGN(newitemsz));
+ }
+ else
+ {
+ /* Must WAL-log orignewitem following posting list split */
+ Assert(newitemonleft || firstright == newitemoff);
+ Assert(ItemPointerCompare(&orignewitem->t_tid,
+ &newitem->t_tid) < 0);
+ XLogRegisterBufData(0, (char *) orignewitem,
+ MAXALIGN(IndexTupleSize(orignewitem)));
+ }
+ }
/* Log the left page's new high key */
itemid = PageGetItemId(origpage, P_HIKEY);
@@ -1834,7 +2110,7 @@ _bt_insert_parent(Relation rel,
/* Recursively insert into the parent */
_bt_insertonpg(rel, NULL, pbuf, buf, stack->bts_parent,
- new_item, stack->bts_offset + 1,
+ new_item, stack->bts_offset + 1, 0,
is_only);
/* be tidy */
@@ -2304,6 +2580,405 @@ _bt_vacuum_one_page(Relation rel, Buffer buffer, Relation heapRel)
* Note: if we didn't find any LP_DEAD items, then the page's
* BTP_HAS_GARBAGE hint bit is falsely set. We do not bother expending a
* separate write to clear it, however. We will clear it when we split
- * the page.
+ * the page (or when deduplication runs).
*/
}
+
+/*
+ * Try to deduplicate items to free some space. If we don't proceed with
+ * deduplication, buffer will contain old state of the page.
+ *
+ * 'itemsz' is the size of the inserter caller's incoming/new tuple, not
+ * including line pointer overhead. This is the amount of space we'll need to
+ * free in order to let caller avoid splitting the page.
+ *
+ * This function should be called after LP_DEAD items were removed by
+ * _bt_vacuum_one_page() to prevent a page split. (It's possible that we'll
+ * have to kill additional LP_DEAD items, but that should be rare.)
+ */
+static void
+_bt_dedup_one_page(Relation rel, Buffer buffer, Relation heapRel,
+ Size newitemsz)
+{
+ OffsetNumber offnum,
+ minoff,
+ maxoff;
+ Page page = BufferGetPage(buffer);
+ BTPageOpaque oopaque;
+ bool deduplicate;
+ BTDedupState *state = NULL;
+ int natts = IndexRelationGetNumberOfAttributes(rel);
+ OffsetNumber deletable[MaxIndexTuplesPerPage];
+ int ndeletable = 0;
+ Size pagesaving = 0;
+
+ /*
+ * Don't use deduplication for indexes with INCLUDEd columns or for unique
+ * indexes
+ */
+ deduplicate = (IndexRelationGetNumberOfKeyAttributes(rel) ==
+ IndexRelationGetNumberOfAttributes(rel) &&
+ !rel->rd_index->indisunique);
+ if (!deduplicate)
+ return;
+
+ oopaque = (BTPageOpaque) PageGetSpecialPointer(page);
+ /* init deduplication state needed to build posting tuples */
+ state = (BTDedupState *) palloc(sizeof(BTDedupState));
+ state->deduplicate = true;
+
+ state->maxitemsize = BTMaxItemSize(page);
+ /* Metadata about current pending posting list */
+ state->htids = NULL;
+ state->nhtids = 0;
+ state->nitems = 0;
+ state->alltupsize = 0;
+ /* Metadata about base tuple of current pending posting list */
+ state->base = NULL;
+ state->baseoff = InvalidOffsetNumber;
+ state->basetupsize = 0;
+
+ minoff = P_FIRSTDATAKEY(oopaque);
+ maxoff = PageGetMaxOffsetNumber(page);
+
+ /*
+ * Delete dead tuples if any. We cannot simply skip them in the loop
+ * below, because it's necessary to generate a special WAL record
+ * containing such tuples to compute latestRemovedXid on a standby server
+ * later.
+ *
+ * This should not affect performance, since it only can happen in a rare
+ * situation when BTP_HAS_GARBAGE flag was not set and _bt_vacuum_one_page
+ * was not called, or _bt_vacuum_one_page didn't remove all dead items.
+ */
+ for (offnum = minoff;
+ offnum <= maxoff;
+ offnum = OffsetNumberNext(offnum))
+ {
+ ItemId itemid = PageGetItemId(page, offnum);
+
+ if (ItemIdIsDead(itemid))
+ deletable[ndeletable++] = offnum;
+ }
+
+ if (ndeletable > 0)
+ {
+ /*
+ * Skip deduplication in rare cases where there were LP_DEAD items
+ * encountered here when that frees sufficient space for caller to
+ * avoid a page split
+ */
+ _bt_delitems_delete(rel, buffer, deletable, ndeletable, heapRel);
+ if (PageGetFreeSpace(page) >= newitemsz)
+ {
+ pfree(state);
+ return;
+ }
+
+ /* Continue with deduplication */
+ minoff = P_FIRSTDATAKEY(oopaque);
+ maxoff = PageGetMaxOffsetNumber(page);
+ }
+
+ /* Make sure that new page won't have garbage flag set */
+ oopaque->btpo_flags &= ~BTP_HAS_GARBAGE;
+
+ /* Passed-in newitemsz is MAXALIGNED but does not include line pointer */
+ newitemsz += sizeof(ItemIdData);
+ /* Conservatively size array */
+ state->htids = palloc(state->maxitemsize);
+
+ /*
+ * Iterate over tuples on the page, try to deduplicate them into posting
+ * lists and insert into new page.
+ * NOTE: It's essential to recalculate max offset on each iteration,
+ * since it could have changed if several items were replaced with a
+ * single posting tuple.
+ */
+ offnum = minoff;
+ while (offnum <= PageGetMaxOffsetNumber(page))
+ {
+ ItemId itemid = PageGetItemId(page, offnum);
+ IndexTuple itup = (IndexTuple) PageGetItem(page, itemid);
+
+ Assert(!ItemIdIsDead(itemid));
+
+ if (state->nitems == 0)
+ {
+ /*
+ * No previous/base tuple for the data item -- use the data
+ * item as base tuple of pending posting list
+ */
+ _bt_dedup_start_pending(state, itup, offnum);
+ }
+ else if (state->deduplicate &&
+ _bt_keep_natts_fast(rel, state->base, itup) > natts &&
+ _bt_dedup_save_htid(state, itup))
+ {
+ /*
+ * Tuple is equal to base tuple of pending posting list, and
+ * merging itup into pending posting list won't exceed the
+ * BTMaxItemSize() limit. Heap TID(s) for itup have been saved in
+ * state. The next iteration will also end up here if it's
+ * possible to merge the next tuple into the same pending posting
+ * list.
+ */
+ }
+ else
+ {
+ /*
+ * Tuple is not equal to pending posting list tuple, or
+ * BTMaxItemSize() limit was reached.
+ *
+ * If state contains pending posting list with more than one item,
+ * form new posting tuple, and update the page,
+ * otherwise, just reset the state and move on.
+ */
+ pagesaving += _bt_dedup_finish_pending(buffer, state, RelationNeedsWAL(rel));
+ /*
+ * When we have deduplicated enough to avoid page split, don't
+ * bother merging together existing tuples to create new posting
+ * lists.
+ *
+ * Note: We deliberately add as many heap TIDs as possible to a
+ * pending posting list by performing this check at this point
+ * (just before a new pending posting list is created). It would
+ * be possible to make the final new posting list for each
+ * successful page deduplication operation as small as possible
+ * while still avoiding a page split for caller. We don't want to
+ * repeatedly merge posting lists around the same range of heap
+ * TIDs, though.
+ *
+ * (Besides, the total number of new posting lists created is the
+ * cost that this check is supposed to minimize -- there is no
+ * great reason to be concerned about the absolute number of
+ * existing tuples that can be killed/replaced.)
+ */
+#if 0
+ /* Actually, don't do that */
+ /* TODO: Make a final decision on this */
+ if (pagesaving >= newitemsz)
+ state->deduplicate = false;
+#endif
+
+ /* Continue iteration from base tuple's offnum */
+ offnum = state->baseoff;
+
+ }
+
+ offnum = OffsetNumberNext(offnum);
+ }
+
+ /*
+ * Handle the last item, if pending posting list is not empty.
+ */
+ if (state->nitems != 0)
+ pagesaving += _bt_dedup_finish_pending(buffer, state, RelationNeedsWAL(rel));
+
+ /* be tidy */
+ pfree(state->htids);
+ pfree(state);
+}
+
+/*
+ * Create a new pending posting list tuple based on caller's tuple.
+ *
+ * Every tuple processed by the deduplication routines either becomes the base
+ * tuple for a posting list, or gets its heap TID(s) accepted into a pending
+ * posting list. A tuple that starts out as the base tuple for a posting list
+ * will only actually be rewritten within _bt_dedup_finish_pending() when
+ * there was at least one successful call to _bt_dedup_save_htid().
+ *
+ * Exported for use by nbtsort.c and recovery.
+ */
+void
+_bt_dedup_start_pending(BTDedupState *state, IndexTuple base,
+ OffsetNumber baseoff)
+{
+ Assert(state->nhtids == 0);
+ Assert(state->nitems == 0);
+
+ /*
+ * Copy heap TIDs from new base tuple for new candidate posting list into
+ * htids array. Assume that we'll eventually create a new posting tuple by
+ * merging later tuples with this existing one, though we may not.
+ */
+ if (!BTreeTupleIsPosting(base))
+ {
+ memcpy(state->htids, &base->t_tid, sizeof(ItemPointerData));
+ state->nhtids = 1;
+ /* Save size of tuple without any posting list */
+ state->basetupsize = IndexTupleSize(base);
+ }
+ else
+ {
+ int nposting;
+
+ nposting = BTreeTupleGetNPosting(base);
+ memcpy(state->htids, BTreeTupleGetPosting(base),
+ sizeof(ItemPointerData) * nposting);
+ state->nhtids = nposting;
+ /* Save size of tuple without any posting list */
+ state->basetupsize = BTreeTupleGetPostingOffset(base);
+ }
+
+ /*
+ * Save new base tuple itself -- it'll be needed if we actually create a
+ * new posting list from new pending posting list.
+ *
+ * Must maintain size of all tuples (including line pointer overhead) to
+ * calculate space savings on page within _bt_dedup_finish_pending().
+ * Also, save number of base tuple logical tuples so that we can save
+ * cycles in the common case where an existing posting list can't or won't
+ * be merged with other tuples on the page.
+ */
+ state->nitems = 1;
+ state->base = base;
+ state->baseoff = baseoff;
+ state->alltupsize = MAXALIGN(IndexTupleSize(base)) + sizeof(ItemIdData);
+ /* Also save baseoff in pending state for interval */
+ state->interval.baseoff = state->baseoff;
+}
+
+/*
+ * Save itup heap TID(s) into pending posting list where possible.
+ *
+ * Returns bool indicating if the pending posting list managed by state has
+ * itup's heap TID(s) saved. When this is false, enlarging the pending
+ * posting list by the required amount would exceed the maxitemsize limit, so
+ * caller must finish the pending posting list tuple. (Generally itup becomes
+ * the base tuple of caller's new pending posting list).
+ *
+ * Exported for use by nbtsort.c and recovery.
+ */
+bool
+_bt_dedup_save_htid(BTDedupState *state, IndexTuple itup)
+{
+ int nhtids;
+ ItemPointer htids;
+ Size mergedtupsz;
+
+ if (!BTreeTupleIsPosting(itup))
+ {
+ nhtids = 1;
+ htids = &itup->t_tid;
+ }
+ else
+ {
+ nhtids = BTreeTupleGetNPosting(itup);
+ htids = BTreeTupleGetPosting(itup);
+ }
+
+ /*
+ * Don't append (have caller finish pending posting list as-is) if
+ * appending heap TID(s) from itup would put us over limit
+ */
+ mergedtupsz = MAXALIGN(state->basetupsize +
+ (state->nhtids + nhtids) *
+ sizeof(ItemPointerData));
+
+ if (mergedtupsz > state->maxitemsize)
+ return false;
+
+ /*
+ * Save heap TIDs to pending posting list tuple -- itup can be merged into
+ * pending posting list
+ */
+ state->nitems++;
+ memcpy(state->htids + state->nhtids, htids,
+ sizeof(ItemPointerData) * nhtids);
+ state->nhtids += nhtids;
+ state->alltupsize += MAXALIGN(IndexTupleSize(itup)) + sizeof(ItemIdData);
+
+ return true;
+}
+
+/*
+ * Finalize pending posting list tuple, and add it to the page. Final tuple
+ * is based on saved base tuple, and saved list of heap TIDs.
+ *
+ * Returns space saving from deduplicating to make a new posting list tuple.
+ * Note that this includes line pointer overhead. This is zero in the case
+ * where no deduplication was possible.
+ *
+ * Exported for use by recovery.
+ */
+Size
+_bt_dedup_finish_pending(Buffer buffer, BTDedupState *state, bool need_wal)
+{
+ Size spacesaving = 0;
+ Page page = BufferGetPage(buffer);
+
+ Assert(state->nitems > 0);
+ Assert(state->nitems <= state->nhtids);
+ Assert(state->interval.baseoff == state->baseoff);
+
+ if (state->nitems > 1)
+ {
+ IndexTuple final;
+ Size finalsz;
+ OffsetNumber offnum;
+ OffsetNumber deletable[MaxOffsetNumber];
+ int ndeletable = 0;
+
+ /* find all tuples that will be replaced with this new posting tuple */
+ for (offnum = state->baseoff;
+ offnum < state->baseoff + state->nitems;
+ offnum = OffsetNumberNext(offnum))
+ deletable[ndeletable++] = offnum;
+
+ /* Form a tuple with a posting list */
+ final = BTreeFormPostingTuple(state->base, state->htids,
+ state->nhtids);
+ finalsz = IndexTupleSize(final);
+ spacesaving = state->alltupsize - (finalsz + sizeof(ItemIdData));
+ /* Must have saved some space */
+ Assert(spacesaving > 0 && spacesaving < BLCKSZ);
+
+ /* Save final number of items for posting list */
+ state->interval.nitems = state->nitems;
+
+ Assert(finalsz <= state->maxitemsize);
+ Assert(finalsz == MAXALIGN(IndexTupleSize(final)));
+
+ START_CRIT_SECTION();
+
+ /* Delete items to replace */
+ PageIndexMultiDelete(page, deletable, ndeletable);
+ /* Insert posting tuple */
+ if (PageAddItem(page, (Item) final, finalsz, state->baseoff, false,
+ false) == InvalidOffsetNumber)
+ elog(ERROR, "deduplication failed to add tuple to page");
+
+ MarkBufferDirty(buffer);
+
+ /* Log deduplicated items */
+ if (need_wal)
+ {
+ XLogRecPtr recptr;
+ xl_btree_dedup xlrec_dedup;
+
+ xlrec_dedup.baseoff = state->interval.baseoff;
+ xlrec_dedup.nitems = state->interval.nitems;
+
+ XLogBeginInsert();
+ XLogRegisterBuffer(0, buffer, REGBUF_STANDARD);
+ XLogRegisterData((char *) &xlrec_dedup, SizeOfBtreeDedup);
+
+ recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_DEDUP_PAGE);
+
+ PageSetLSN(page, recptr);
+ }
+
+ END_CRIT_SECTION();
+
+ pfree(final);
+ }
+
+ /* Reset state for next pending posting list */
+ state->nhtids = 0;
+ state->nitems = 0;
+ state->alltupsize = 0;
+
+ return spacesaving;
+}
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 268f869..ecf75ef 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -24,6 +24,7 @@
#include "access/nbtree.h"
#include "access/nbtxlog.h"
+#include "access/tableam.h"
#include "access/transam.h"
#include "access/xlog.h"
#include "access/xloginsert.h"
@@ -42,6 +43,11 @@ static bool _bt_lock_branch_parent(Relation rel, BlockNumber child,
BlockNumber *target, BlockNumber *rightsib);
static void _bt_log_reuse_page(Relation rel, BlockNumber blkno,
TransactionId latestRemovedXid);
+static TransactionId _bt_compute_xid_horizon_for_tuples(Relation rel,
+ Relation heapRel,
+ Buffer buf,
+ OffsetNumber *itemnos,
+ int nitems);
/*
* _bt_initmetapage() -- Fill a page buffer with a correct metapage image
@@ -983,14 +989,52 @@ _bt_page_recyclable(Page page)
void
_bt_delitems_vacuum(Relation rel, Buffer buf,
OffsetNumber *itemnos, int nitems,
+ OffsetNumber *updateitemnos,
+ IndexTuple *updated, int nupdatable,
BlockNumber lastBlockVacuumed)
{
Page page = BufferGetPage(buf);
BTPageOpaque opaque;
+ Size itemsz;
+ Size updated_sz = 0;
+ char *updated_buf = NULL;
+
+ /* XLOG stuff: build buffer of updated tuples */
+ if (nupdatable > 0 && RelationNeedsWAL(rel))
+ {
+ Size offset = 0;
+
+ for (int i = 0; i < nupdatable; i++)
+ updated_sz += MAXALIGN(IndexTupleSize(updated[i]));
+
+ updated_buf = palloc(updated_sz);
+ for (int i = 0; i < nupdatable; i++)
+ {
+ itemsz = IndexTupleSize(updated[i]);
+ memcpy(updated_buf + offset, (char *) updated[i], itemsz);
+ offset += MAXALIGN(itemsz);
+ }
+ Assert(offset == updated_sz);
+ }
/* No ereport(ERROR) until changes are logged */
START_CRIT_SECTION();
+ /* Handle posting tuples here */
+ for (int i = 0; i < nupdatable; i++)
+ {
+ /* First, delete the old tuple. */
+ PageIndexTupleDelete(page, updateitemnos[i]);
+
+ itemsz = IndexTupleSize(updated[i]);
+ itemsz = MAXALIGN(itemsz);
+
+ /* Add tuple with updated ItemPointers to the page. */
+ if (PageAddItem(page, (Item) updated[i], itemsz, updateitemnos[i],
+ false, false) == InvalidOffsetNumber)
+ elog(ERROR, "failed to rewrite posting list item in index while doing vacuum");
+ }
+
/* Fix the page */
if (nitems > 0)
PageIndexMultiDelete(page, itemnos, nitems);
@@ -1020,6 +1064,8 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
xl_btree_vacuum xlrec_vacuum;
xlrec_vacuum.lastBlockVacuumed = lastBlockVacuumed;
+ xlrec_vacuum.nupdated = nupdatable;
+ xlrec_vacuum.ndeleted = nitems;
XLogBeginInsert();
XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
@@ -1033,6 +1079,19 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
if (nitems > 0)
XLogRegisterBufData(0, (char *) itemnos, nitems * sizeof(OffsetNumber));
+ /*
+ * Here we must save the offnums and the updated tuples themselves. It's
+ * important to restore them in the correct order: first the updated
+ * tuples, and only after that the other deleted items.
+ */
+ if (nupdatable > 0)
+ {
+ Assert(updated_buf != NULL);
+ XLogRegisterBufData(0, (char *) updateitemnos,
+ nupdatable * sizeof(OffsetNumber));
+ XLogRegisterBufData(0, updated_buf, updated_sz);
+ }
+
recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_VACUUM);
PageSetLSN(page, recptr);
@@ -1042,6 +1101,91 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
}
/*
+ * Get the latestRemovedXid from the table entries pointed at by the index
+ * tuples being deleted.
+ *
+ * This is a version of index_compute_xid_horizon_for_tuples() specialized to
+ * nbtree, which can handle posting lists.
+ */
+static TransactionId
+_bt_compute_xid_horizon_for_tuples(Relation rel, Relation heapRel,
+ Buffer buf, OffsetNumber *itemnos,
+ int nitems)
+{
+ ItemPointer htids;
+ TransactionId latestRemovedXid = InvalidTransactionId;
+ Page page = BufferGetPage(buf);
+ int arraynitems;
+ int finalnitems;
+
+ /*
+ * Initial size of array can fit everything when it turns out that there are
+ * no posting lists
+ */
+ arraynitems = nitems;
+ htids = (ItemPointer) palloc(sizeof(ItemPointerData) * arraynitems);
+
+ finalnitems = 0;
+ /* identify what the index tuples about to be deleted point to */
+ for (int i = 0; i < nitems; i++)
+ {
+ ItemId itemid;
+ IndexTuple itup;
+
+ itemid = PageGetItemId(page, itemnos[i]);
+ itup = (IndexTuple) PageGetItem(page, itemid);
+
+ Assert(ItemIdIsDead(itemid));
+
+ if (!BTreeTupleIsPosting(itup))
+ {
+ /* Make sure that we have space for additional heap TID */
+ if (finalnitems + 1 > arraynitems)
+ {
+ arraynitems = arraynitems * 2;
+ htids = (ItemPointer)
+ repalloc(htids, sizeof(ItemPointerData) * arraynitems);
+ }
+
+ Assert(ItemPointerIsValid(&itup->t_tid));
+ ItemPointerCopy(&itup->t_tid, &htids[finalnitems]);
+ finalnitems++;
+ }
+ else
+ {
+ int nposting = BTreeTupleGetNPosting(itup);
+
+ /* Make sure that we have space for additional heap TIDs */
+ if (finalnitems + nposting > arraynitems)
+ {
+ arraynitems = Max(arraynitems * 2, finalnitems + nposting);
+ htids = (ItemPointer)
+ repalloc(htids, sizeof(ItemPointerData) * arraynitems);
+ }
+
+ for (int j = 0; j < nposting; j++)
+ {
+ ItemPointer htid = BTreeTupleGetPostingN(itup, j);
+
+ Assert(ItemPointerIsValid(htid));
+ ItemPointerCopy(htid, &htids[finalnitems]);
+ finalnitems++;
+ }
+ }
+ }
+
+ Assert(finalnitems >= nitems);
+
+ /* determine the actual xid horizon */
+ latestRemovedXid =
+ table_compute_xid_horizon_for_tuples(heapRel, htids, finalnitems);
+
+ pfree(htids);
+
+ return latestRemovedXid;
+}
+
+/*
* Delete item(s) from a btree page during single-page cleanup.
*
* As above, must only be used on leaf pages.
@@ -1067,8 +1211,8 @@ _bt_delitems_delete(Relation rel, Buffer buf,
if (XLogStandbyInfoActive() && RelationNeedsWAL(rel))
latestRemovedXid =
- index_compute_xid_horizon_for_tuples(rel, heapRel, buf,
- itemnos, nitems);
+ _bt_compute_xid_horizon_for_tuples(rel, heapRel, buf,
+ itemnos, nitems);
/* No ereport(ERROR) until changes are logged */
START_CRIT_SECTION();
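A note on the _bt_compute_xid_horizon_for_tuples() hunk above: the heap TID array starts out sized for the "no posting lists" case and grows only when a posting list tuple needs more room. What follows is only a standalone sketch of that growth policy, not code from the patch. DemoTuple, demo_collect_tids() and the use of plain ints for TIDs are hypothetical stand-ins, and malloc/realloc replace palloc/repalloc.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for an index tuple: one TID, or a posting list. */
typedef struct DemoTuple
{
	int			ntids;		/* 1 for a plain tuple, >= 2 for a posting list */
	const int  *tids;		/* heap TIDs, simplified to plain ints */
} DemoTuple;

/*
 * Flatten the TIDs of all tuples into one malloc'd array, in tuple order.
 * The array is initially sized for one TID per tuple and is grown to
 * Max(cap * 2, needed) whenever a tuple does not fit.
 */
static int *
demo_collect_tids(const DemoTuple *tuples, int ntuples, int *ntotal)
{
	int			cap = (ntuples > 0) ? ntuples : 1;
	int			n = 0;
	int		   *out = malloc(sizeof(int) * cap);

	if (out == NULL)
		abort();

	for (int i = 0; i < ntuples; i++)
	{
		int			need = n + tuples[i].ntids;

		assert(tuples[i].ntids >= 1);

		if (need > cap)
		{
			int		   *bigger;

			cap = (cap * 2 > need) ? cap * 2 : need;
			bigger = realloc(out, sizeof(int) * cap);
			if (bigger == NULL)
				abort();
			out = bigger;
		}

		for (int j = 0; j < tuples[i].ntids; j++)
			out[n++] = tuples[i].tids[j];
	}

	*ntotal = n;
	return out;
}
```

Doubling keeps the number of reallocations logarithmic, while the Max() guard covers a single posting list that is larger than the doubled capacity.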
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 4cfd528..baea34e 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -97,6 +97,8 @@ static void btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
BTCycleId cycleid, TransactionId *oldestBtpoXact);
static void btvacuumpage(BTVacState *vstate, BlockNumber blkno,
BlockNumber orig_blkno);
+static ItemPointer btreevacuumposting(BTVacState *vstate, IndexTuple itup,
+ int *nremaining);
/*
@@ -263,8 +265,8 @@ btgettuple(IndexScanDesc scan, ScanDirection dir)
*/
if (so->killedItems == NULL)
so->killedItems = (int *)
- palloc(MaxIndexTuplesPerPage * sizeof(int));
- if (so->numKilled < MaxIndexTuplesPerPage)
+ palloc(MaxPostingIndexTuplesPerPage * sizeof(int));
+ if (so->numKilled < MaxPostingIndexTuplesPerPage)
so->killedItems[so->numKilled++] = so->currPos.itemIndex;
}
@@ -1069,7 +1071,8 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
RBM_NORMAL, info->strategy);
LockBufferForCleanup(buf);
_bt_checkpage(rel, buf);
- _bt_delitems_vacuum(rel, buf, NULL, 0, vstate.lastBlockVacuumed);
+ _bt_delitems_vacuum(rel, buf, NULL, 0, NULL, NULL, 0,
+ vstate.lastBlockVacuumed);
_bt_relbuf(rel, buf);
}
@@ -1188,8 +1191,17 @@ restart:
}
else if (P_ISLEAF(opaque))
{
+ /* Deletable item state */
OffsetNumber deletable[MaxOffsetNumber];
int ndeletable;
+ int nhtidsdead;
+ int nhtidslive;
+
+ /* Updatable item state (for posting lists) */
+ IndexTuple updated[MaxOffsetNumber];
+ OffsetNumber updatable[MaxOffsetNumber];
+ int nupdatable;
+
OffsetNumber offnum,
minoff,
maxoff;
@@ -1229,6 +1241,10 @@ restart:
* callback function.
*/
ndeletable = 0;
+ nupdatable = 0;
+ /* Maintain stats counters for index tuple versions/heap TIDs */
+ nhtidsdead = 0;
+ nhtidslive = 0;
minoff = P_FIRSTDATAKEY(opaque);
maxoff = PageGetMaxOffsetNumber(page);
if (callback)
@@ -1238,11 +1254,9 @@ restart:
offnum = OffsetNumberNext(offnum))
{
IndexTuple itup;
- ItemPointer htup;
itup = (IndexTuple) PageGetItem(page,
PageGetItemId(page, offnum));
- htup = &(itup->t_tid);
/*
* During Hot Standby we currently assume that
@@ -1265,8 +1279,71 @@ restart:
* applies to *any* type of index that marks index tuples as
* killed.
*/
- if (callback(htup, callback_state))
- deletable[ndeletable++] = offnum;
+ if (!BTreeTupleIsPosting(itup))
+ {
+ /* Regular tuple, standard heap TID representation */
+ ItemPointer htid = &(itup->t_tid);
+
+ if (callback(htid, callback_state))
+ {
+ deletable[ndeletable++] = offnum;
+ nhtidsdead++;
+ }
+ else
+ nhtidslive++;
+ }
+ else
+ {
+ ItemPointer newhtids;
+ int nremaining;
+
+ /*
+ * Posting list tuple, a physical tuple that represents
+ * two or more logical tuples, any of which could be an
+ * index row version that must be removed
+ */
+ newhtids = btreevacuumposting(vstate, itup, &nremaining);
+ if (newhtids == NULL)
+ {
+ /*
+ * All TIDs/logical tuples from the posting tuple
+ * remain, so no update or delete required
+ */
+ Assert(nremaining == BTreeTupleGetNPosting(itup));
+ }
+ else if (nremaining > 0)
+ {
+ IndexTuple updatedtuple;
+
+ /*
+ * Form new tuple that contains only remaining TIDs.
+ * Remember this tuple and the offset of the old tuple
+ * for when we update it in place
+ */
+ Assert(nremaining < BTreeTupleGetNPosting(itup));
+ updatedtuple = BTreeFormPostingTuple(itup, newhtids,
+ nremaining);
+ updated[nupdatable] = updatedtuple;
+ updatable[nupdatable++] = offnum;
+ nhtidsdead += BTreeTupleGetNPosting(itup) - nremaining;
+ pfree(newhtids);
+ }
+ else
+ {
+ /*
+ * All TIDs/logical tuples from the posting list must
+ * be deleted. We'll delete the physical tuple
+ * completely.
+ */
+ deletable[ndeletable++] = offnum;
+ nhtidsdead += BTreeTupleGetNPosting(itup);
+
+ /* Free empty array of live items */
+ pfree(newhtids);
+ }
+
+ nhtidslive += nremaining;
+ }
}
}
@@ -1274,7 +1351,7 @@ restart:
* Apply any needed deletes. We issue just one _bt_delitems_vacuum()
* call per page, so as to minimize WAL traffic.
*/
- if (ndeletable > 0)
+ if (ndeletable > 0 || nupdatable > 0)
{
/*
* Notice that the issued XLOG_BTREE_VACUUM WAL record includes
@@ -1290,7 +1367,8 @@ restart:
* doesn't seem worth the amount of bookkeeping it'd take to avoid
* that.
*/
- _bt_delitems_vacuum(rel, buf, deletable, ndeletable,
+ _bt_delitems_vacuum(rel, buf, deletable, ndeletable, updatable,
+ updated, nupdatable,
vstate->lastBlockVacuumed);
/*
@@ -1300,7 +1378,7 @@ restart:
if (blkno > vstate->lastBlockVacuumed)
vstate->lastBlockVacuumed = blkno;
- stats->tuples_removed += ndeletable;
+ stats->tuples_removed += nhtidsdead;
/* must recompute maxoff */
maxoff = PageGetMaxOffsetNumber(page);
}
@@ -1315,6 +1393,7 @@ restart:
* We treat this like a hint-bit update because there's no need to
* WAL-log it.
*/
+ Assert(nhtidsdead == 0);
if (vstate->cycleid != 0 &&
opaque->btpo_cycleid == vstate->cycleid)
{
@@ -1324,15 +1403,16 @@ restart:
}
/*
- * If it's now empty, try to delete; else count the live tuples. We
- * don't delete when recursing, though, to avoid putting entries into
+ * If it's now empty, try to delete; else count the live tuples (live
+ * heap TIDs in posting lists are counted as live tuples). We don't
+ * delete when recursing, though, to avoid putting entries into
* freePages out-of-order (doesn't seem worth any extra code to handle
* the case).
*/
if (minoff > maxoff)
delete_now = (blkno == orig_blkno);
else
- stats->num_index_tuples += maxoff - minoff + 1;
+ stats->num_index_tuples += nhtidslive;
}
if (delete_now)
@@ -1376,6 +1456,68 @@ restart:
}
/*
+ * btreevacuumposting() -- determines which logical tuples must remain when
+ * VACUUMing a posting list tuple.
+ *
+ * Returns new palloc'd array of item pointers needed to build replacement
+ * posting list without the index row versions that are to be deleted.
+ *
+ * Note that returned array is NULL in the common case where there is nothing
+ * to delete in caller's posting list tuple. The number of TIDs that should
+ * remain in the posting list tuple is set for caller in *nremaining. This is
+ * also the size of the returned array (though only when array isn't just
+ * NULL).
+ */
+static ItemPointer
+btreevacuumposting(BTVacState *vstate, IndexTuple itup, int *nremaining)
+{
+ int live = 0;
+ int nitem = BTreeTupleGetNPosting(itup);
+ ItemPointer tmpitems = NULL,
+ items = BTreeTupleGetPosting(itup);
+
+ Assert(BTreeTupleIsPosting(itup));
+
+ /*
+ * Check each tuple in the posting list. Save live tuples into tmpitems,
+ * though try to avoid memory allocation as an optimization.
+ */
+ for (int i = 0; i < nitem; i++)
+ {
+ if (!vstate->callback(items + i, vstate->callback_state))
+ {
+ /*
+ * Live heap TID.
+ *
+ * Only save live TID when we know that we're going to have to
+ * kill at least one TID, and have already allocated memory.
+ */
+ if (tmpitems)
+ tmpitems[live] = items[i];
+ live++;
+ }
+
+ /* Dead heap TID */
+ else if (tmpitems == NULL)
+ {
+ /*
+ * Turns out we need to delete one or more dead heap TIDs, so
+ * start maintaining an array of live TIDs for caller to
+ * reconstruct smaller replacement posting list tuple
+ */
+ tmpitems = palloc(sizeof(ItemPointerData) * nitem);
+
+ /* Copy live heap TIDs from previous loop iterations */
+ if (live > 0)
+ memcpy(tmpitems, items, sizeof(ItemPointerData) * live);
+ }
+ }
+
+ *nremaining = live;
+ return tmpitems;
+}
+
+/*
* btcanreturn() -- Check whether btree indexes support index-only scans.
*
* btrees always do, so this is trivial.
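A note on btreevacuumposting() above: the interesting part is that no memory is allocated in the common case where every TID in the posting list is still live. The sketch below tries to show that pattern in isolation. It is an illustration under assumptions, not the patch's code: DemoTid, demo_dead_cb, demo_odd_offset_is_dead() and demo_vacuum_posting() are hypothetical names, and malloc stands in for palloc.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical simplified heap TID */
typedef struct DemoTid
{
	unsigned	block;
	unsigned	offset;
} DemoTid;

/* Hypothetical callback type: returns nonzero when the TID is dead */
typedef int (*demo_dead_cb) (const DemoTid *tid, void *state);

/* Example callback for illustration only: odd offsets count as dead */
static int
demo_odd_offset_is_dead(const DemoTid *tid, void *state)
{
	(void) state;
	return (tid->offset % 2) == 1;
}

/*
 * Returns NULL when nothing is dead (no allocation happens at all).
 * Otherwise returns a malloc'd array whose first *nremaining entries are
 * the live TIDs, in their original order.
 */
static DemoTid *
demo_vacuum_posting(const DemoTid *items, int nitem,
					demo_dead_cb is_dead, void *state, int *nremaining)
{
	DemoTid    *tmpitems = NULL;
	int			live = 0;

	assert(nitem >= 0);

	for (int i = 0; i < nitem; i++)
	{
		if (!is_dead(&items[i], state))
		{
			/* Live TID: only copy once we know a copy is needed */
			if (tmpitems)
				tmpitems[live] = items[i];
			live++;
		}
		else if (tmpitems == NULL)
		{
			/* First dead TID: allocate, then backfill earlier live TIDs */
			tmpitems = malloc(sizeof(DemoTid) * nitem);
			if (tmpitems == NULL)
				abort();
			if (live > 0)
				memcpy(tmpitems, items, sizeof(DemoTid) * live);
		}
	}

	*nremaining = live;
	return tmpitems;
}
```

The backfill with memcpy() is valid because, until the first dead TID is seen, the live TIDs are exactly the first `live` entries of the input. A non-NULL result with *nremaining == 0 corresponds to the case where the caller deletes the physical tuple entirely.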
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 8e51246..9022ee6 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -26,10 +26,18 @@
static void _bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp);
static OffsetNumber _bt_binsrch(Relation rel, BTScanInsert key, Buffer buf);
+static int _bt_binsrch_posting(BTScanInsert key, Page page,
+ OffsetNumber offnum);
static bool _bt_readpage(IndexScanDesc scan, ScanDirection dir,
OffsetNumber offnum);
static void _bt_saveitem(BTScanOpaque so, int itemIndex,
OffsetNumber offnum, IndexTuple itup);
+static void _bt_setuppostingitems(BTScanOpaque so, int itemIndex,
+ OffsetNumber offnum, ItemPointer heapTid,
+ IndexTuple itup);
+static inline void _bt_savepostingitem(BTScanOpaque so, int itemIndex,
+ OffsetNumber offnum,
+ ItemPointer heapTid);
static bool _bt_steppage(IndexScanDesc scan, ScanDirection dir);
static bool _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir);
static bool _bt_parallel_readpage(IndexScanDesc scan, BlockNumber blkno,
@@ -434,7 +442,10 @@ _bt_binsrch(Relation rel,
* low) makes bounds invalid.
*
* Caller is responsible for invalidating bounds when it modifies the page
- * before calling here a second time.
+ * before calling here a second time, and for dealing with posting list
+ * tuple matches (callers can use insertstate's postingoff field to
+ * determine which existing heap TID will need to be replaced by their
+ * scantid/new heap TID).
*/
OffsetNumber
_bt_binsrch_insert(Relation rel, BTInsertState insertstate)
@@ -453,6 +464,7 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
Assert(P_ISLEAF(opaque));
Assert(!key->nextkey);
+ Assert(insertstate->postingoff == 0);
if (!insertstate->bounds_valid)
{
@@ -509,6 +521,16 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
if (result != 0)
stricthigh = high;
}
+
+ /*
+ * If tuple at offset located by binary search is a posting list whose
+ * TID range overlaps with caller's scantid, perform posting list
+ * binary search to set postingoff for caller. Caller must split the
+ * posting list when postingoff is set. This should happen
+ * infrequently.
+ */
+ if (unlikely(result == 0 && key->scantid != NULL))
+ insertstate->postingoff = _bt_binsrch_posting(key, page, mid);
}
/*
@@ -529,6 +551,68 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
}
/*----------
+ * _bt_binsrch_posting() -- posting list binary search.
+ *
+ * Returns offset into posting list where caller's scantid belongs.
+ *----------
+ */
+static int
+_bt_binsrch_posting(BTScanInsert key, Page page, OffsetNumber offnum)
+{
+ IndexTuple itup;
+ ItemId itemid;
+ int low,
+ high,
+ mid,
+ res;
+
+ /*
+ * If this isn't a posting tuple, then the index must be corrupt (if it is
+ * an ordinary non-pivot tuple then there must be an existing tuple with a
+ * heap TID that equals inserter's new heap TID/scantid). Defensively
+ * check that tuple is a posting list tuple whose posting list range
+ * includes caller's scantid.
+ *
+ * (This is also needed because contrib/amcheck's rootdescend option needs
+ * to be able to relocate a non-pivot tuple using _bt_binsrch_insert().)
+ */
+ Assert(P_ISLEAF((BTPageOpaque) PageGetSpecialPointer(page)));
+ Assert(!key->nextkey);
+ Assert(key->scantid != NULL);
+ itemid = PageGetItemId(page, offnum);
+ itup = (IndexTuple) PageGetItem(page, itemid);
+ if (!BTreeTupleIsPosting(itup))
+ return 0;
+
+ /*
+ * In the unlikely event that posting list tuple has LP_DEAD bit set,
+ * signal to caller that it should kill the item and restart its binary
+ * search.
+ */
+ if (ItemIdIsDead(itemid))
+ return -1;
+
+ /* "high" is past end of posting list for loop invariant */
+ low = 0;
+ high = BTreeTupleGetNPosting(itup);
+ Assert(high >= 2);
+
+ while (high > low)
+ {
+ mid = low + ((high - low) / 2);
+ res = ItemPointerCompare(key->scantid,
+ BTreeTupleGetPostingN(itup, mid));
+
+ if (res >= 1)
+ low = mid + 1;
+ else
+ high = mid;
+ }
+
+ return low;
+}
+
+/*----------
* _bt_compare() -- Compare insertion-type scankey to tuple on a page.
*
* page/offnum: location of btree item to be compared to.
@@ -537,9 +621,18 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
* <0 if scankey < tuple at offnum;
* 0 if scankey == tuple at offnum;
* >0 if scankey > tuple at offnum.
- * NULLs in the keys are treated as sortable values. Therefore
- * "equality" does not necessarily mean that the item should be
- * returned to the caller as a matching key!
+ *
+ * NULLs in the keys are treated as sortable values. Therefore
+ * "equality" does not necessarily mean that the item should be returned
+ * to the caller as a matching key. Similarly, an insertion scankey
+ * with its scantid set is treated as equal to a posting tuple whose TID
+ * range overlaps with its scantid. There generally won't be a
+ * matching TID in the posting tuple, which caller must handle
+ * themselves (e.g., by splitting the posting list tuple).
+ *
+ * It is generally guaranteed that any possible scankey with scantid set
+ * will have zero or one tuples in the index that are considered equal
+ * here.
*
* CRUCIAL NOTE: on a non-leaf page, the first data key is assumed to be
* "minus infinity": this routine will always claim it is less than the
@@ -563,6 +656,7 @@ _bt_compare(Relation rel,
ScanKey scankey;
int ncmpkey;
int ntupatts;
+ int32 result;
Assert(_bt_check_natts(rel, key->heapkeyspace, page, offnum));
Assert(key->keysz <= IndexRelationGetNumberOfKeyAttributes(rel));
@@ -597,7 +691,6 @@ _bt_compare(Relation rel,
{
Datum datum;
bool isNull;
- int32 result;
datum = index_getattr(itup, scankey->sk_attno, itupdesc, &isNull);
@@ -713,8 +806,24 @@ _bt_compare(Relation rel,
if (heapTid == NULL)
return 1;
+ /*
+ * scankey must be treated as equal to a posting list tuple if its scantid
+ * value falls within the range of the posting list. In all other cases
+ * there can only be a single heap TID value, which is compared directly
+ * as a simple scalar value.
+ */
Assert(ntupatts >= IndexRelationGetNumberOfKeyAttributes(rel));
- return ItemPointerCompare(key->scantid, heapTid);
+ result = ItemPointerCompare(key->scantid, heapTid);
+ if (!BTreeTupleIsPosting(itup) || result <= 0)
+ return result;
+ else
+ {
+ result = ItemPointerCompare(key->scantid, BTreeTupleGetMaxTID(itup));
+ if (result > 0)
+ return 1;
+ }
+
+ return 0;
}
/*
@@ -1451,6 +1560,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
/* initialize tuple workspace to empty */
so->currPos.nextTupleOffset = 0;
+ so->currPos.postingTupleOffset = 0;
/*
* Now that the current page has been made consistent, the macro should be
@@ -1485,8 +1595,29 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
if (_bt_checkkeys(scan, itup, indnatts, dir, &continuescan))
{
/* tuple passes all scan key conditions, so remember it */
- _bt_saveitem(so, itemIndex, offnum, itup);
- itemIndex++;
+ if (!BTreeTupleIsPosting(itup))
+ {
+ _bt_saveitem(so, itemIndex, offnum, itup);
+ itemIndex++;
+ }
+ else
+ {
+ /*
+ * Setup state to return posting list, and save first
+ * "logical" tuple
+ */
+ _bt_setuppostingitems(so, itemIndex, offnum,
+ BTreeTupleGetPostingN(itup, 0),
+ itup);
+ itemIndex++;
+ /* Save additional posting list "logical" tuples */
+ for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
+ {
+ _bt_savepostingitem(so, itemIndex, offnum,
+ BTreeTupleGetPostingN(itup, i));
+ itemIndex++;
+ }
+ }
}
/* When !continuescan, there can't be any more matches, so stop */
if (!continuescan)
@@ -1519,7 +1650,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
if (!continuescan)
so->currPos.moreRight = false;
- Assert(itemIndex <= MaxIndexTuplesPerPage);
+ Assert(itemIndex <= MaxPostingIndexTuplesPerPage);
so->currPos.firstItem = 0;
so->currPos.lastItem = itemIndex - 1;
so->currPos.itemIndex = 0;
@@ -1527,7 +1658,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
else
{
/* load items[] in descending order */
- itemIndex = MaxIndexTuplesPerPage;
+ itemIndex = MaxPostingIndexTuplesPerPage;
offnum = Min(offnum, maxoff);
@@ -1569,8 +1700,36 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
if (passes_quals && tuple_alive)
{
/* tuple passes all scan key conditions, so remember it */
- itemIndex--;
- _bt_saveitem(so, itemIndex, offnum, itup);
+ if (!BTreeTupleIsPosting(itup))
+ {
+ itemIndex--;
+ _bt_saveitem(so, itemIndex, offnum, itup);
+ }
+ else
+ {
+ int i = BTreeTupleGetNPosting(itup) - 1;
+
+ /*
+ * Setup state to return posting list, and save last
+ * "logical" tuple from posting list (since it's the first
+ * that will be returned to scan).
+ */
+ itemIndex--;
+ _bt_setuppostingitems(so, itemIndex, offnum,
+ BTreeTupleGetPostingN(itup, i--),
+ itup);
+
+ /*
+ * Return posting list "logical" tuples -- do this in
+ * descending order, to match overall scan order
+ */
+ for (; i >= 0; i--)
+ {
+ itemIndex--;
+ _bt_savepostingitem(so, itemIndex, offnum,
+ BTreeTupleGetPostingN(itup, i));
+ }
+ }
}
if (!continuescan)
{
@@ -1584,8 +1743,8 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
Assert(itemIndex >= 0);
so->currPos.firstItem = itemIndex;
- so->currPos.lastItem = MaxIndexTuplesPerPage - 1;
- so->currPos.itemIndex = MaxIndexTuplesPerPage - 1;
+ so->currPos.lastItem = MaxPostingIndexTuplesPerPage - 1;
+ so->currPos.itemIndex = MaxPostingIndexTuplesPerPage - 1;
}
return (so->currPos.firstItem <= so->currPos.lastItem);
@@ -1598,6 +1757,8 @@ _bt_saveitem(BTScanOpaque so, int itemIndex,
{
BTScanPosItem *currItem = &so->currPos.items[itemIndex];
+ Assert(!BTreeTupleIsPosting(itup));
+
currItem->heapTid = itup->t_tid;
currItem->indexOffset = offnum;
if (so->currTuples)
@@ -1611,6 +1772,59 @@ _bt_saveitem(BTScanOpaque so, int itemIndex,
}
/*
+ * Setup state to save posting items from a single posting list tuple. Saves
+ * the logical tuple that will be returned to scan first in passing.
+ *
+ * Saves an index item into so->currPos.items[itemIndex] for logical tuple
+ * that is returned to scan first. Second or subsequent heap TID for posting
+ * list should be saved by calling _bt_savepostingitem().
+ */
+static void
+_bt_setuppostingitems(BTScanOpaque so, int itemIndex, OffsetNumber offnum,
+ ItemPointer heapTid, IndexTuple itup)
+{
+ BTScanPosItem *currItem = &so->currPos.items[itemIndex];
+
+ currItem->heapTid = *heapTid;
+ currItem->indexOffset = offnum;
+
+ if (so->currTuples)
+ {
+ /* Save a base version of the IndexTuple */
+ Size itupsz = BTreeTupleGetPostingOffset(itup);
+
+ itupsz = MAXALIGN(itupsz);
+ currItem->tupleOffset = so->currPos.nextTupleOffset;
+ memcpy(so->currTuples + so->currPos.nextTupleOffset, itup, itupsz);
+ so->currPos.nextTupleOffset += itupsz;
+ so->currPos.postingTupleOffset = currItem->tupleOffset;
+ }
+}
+
+/*
+ * Save an index item into so->currPos.items[itemIndex] for posting tuple.
+ *
+ * Assumes that _bt_setuppostingitems() has already been called for current
+ * posting list tuple.
+ */
+static inline void
+_bt_savepostingitem(BTScanOpaque so, int itemIndex, OffsetNumber offnum,
+ ItemPointer heapTid)
+{
+ BTScanPosItem *currItem = &so->currPos.items[itemIndex];
+
+ currItem->heapTid = *heapTid;
+ currItem->indexOffset = offnum;
+
+ /*
+ * Have index-only scans return the same base IndexTuple for every logical
+ * tuple that originates from the same posting list
+ */
+ if (so->currTuples)
+ currItem->tupleOffset = so->currPos.postingTupleOffset;
+}
+
+/*
* _bt_steppage() -- Step to next page containing valid data for scan
*
* On entry, if so->currPos.buf is valid the buffer is pinned but not locked;
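A note on the _bt_compare() and _bt_binsrch_posting() changes above: taken together they implement a two-step lookup. First a scantid is treated as "equal" to a posting list tuple when it falls inside the tuple's TID range, then a binary search finds the offset inside the posting list where the scantid belongs. The following is a minimal sketch of both steps, assuming a simplified TID type. DemoTid and the demo_* functions are hypothetical names used only for illustration.

```c
#include <assert.h>

/* Hypothetical simplified heap TID, ordered by (block, offset) */
typedef struct DemoTid
{
	unsigned	block;
	unsigned	offset;
} DemoTid;

static int
demo_tid_compare(const DemoTid *a, const DemoTid *b)
{
	if (a->block != b->block)
		return (a->block < b->block) ? -1 : 1;
	if (a->offset != b->offset)
		return (a->offset < b->offset) ? -1 : 1;
	return 0;
}

/*
 * Range comparison: <0 when scantid sorts before the posting list, >0 when
 * it sorts after it, and 0 when it falls within [first TID, last TID].
 */
static int
demo_compare_posting(const DemoTid *scantid,
					 const DemoTid *posting, int nposting)
{
	int			result;

	assert(nposting >= 2);

	result = demo_tid_compare(scantid, &posting[0]);
	if (result <= 0)
		return result;
	if (demo_tid_compare(scantid, &posting[nposting - 1]) > 0)
		return 1;
	return 0;
}

/*
 * Lower-bound binary search: returns the offset of the first posting list
 * entry that is >= scantid ("high" is one past the end, as a loop invariant).
 */
static int
demo_binsrch_posting(const DemoTid *scantid,
					 const DemoTid *posting, int nposting)
{
	int			low = 0;
	int			high = nposting;

	while (high > low)
	{
		int			mid = low + (high - low) / 2;

		if (demo_tid_compare(scantid, &posting[mid]) > 0)
			low = mid + 1;
		else
			high = mid;
	}

	return low;
}
```

A caller would run the binary search only after the range comparison returned 0, which mirrors how _bt_binsrch_insert() sets postingoff only when the outer search reports equality.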
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index ab19692..f6ca690 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -287,6 +287,9 @@ static void _bt_sortaddtup(Page page, Size itemsize,
IndexTuple itup, OffsetNumber itup_off);
static void _bt_buildadd(BTWriteState *wstate, BTPageState *state,
IndexTuple itup);
+static void _bt_sort_dedup_finish_pending(BTWriteState *wstate,
+ BTPageState *state,
+ BTDedupState *dstate);
static void _bt_uppershutdown(BTWriteState *wstate, BTPageState *state);
static void _bt_load(BTWriteState *wstate,
BTSpool *btspool, BTSpool *btspool2);
@@ -799,7 +802,8 @@ _bt_sortaddtup(Page page,
}
/*----------
- * Add an item to a disk page from the sort output.
+ * Add an item to a disk page from the sort output (or add a posting list
+ * item formed from the sort output).
*
* We must be careful to observe the page layout conventions of nbtsearch.c:
* - rightmost pages start data items at P_HIKEY instead of at P_FIRSTKEY.
@@ -1002,6 +1006,7 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
* the minimum key for the new page.
*/
state->btps_minkey = CopyIndexTuple(oitup);
+ Assert(BTreeTupleIsPivot(state->btps_minkey));
/*
* Set the sibling links for both pages.
@@ -1043,6 +1048,7 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
Assert(state->btps_minkey == NULL);
state->btps_minkey = CopyIndexTuple(itup);
/* _bt_sortaddtup() will perform full truncation later */
+ BTreeTupleClearBtIsPosting(state->btps_minkey);
BTreeTupleSetNAtts(state->btps_minkey, 0);
}
@@ -1058,6 +1064,42 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
}
/*
+ * Finalize pending posting list tuple, and add it to the index. Final tuple
+ * is based on saved base tuple, and saved list of heap TIDs.
+ *
+ * This is almost like nbtinsert.c's _bt_dedup_finish_pending(), but it adds a
+ * new tuple using _bt_buildadd() and does not maintain the intervals array.
+ */
+static void
+_bt_sort_dedup_finish_pending(BTWriteState *wstate, BTPageState *state,
+ BTDedupState *dstate)
+{
+ IndexTuple final;
+
+ Assert(dstate->nitems > 0);
+ if (dstate->nitems == 1)
+ final = dstate->base;
+ else
+ {
+ IndexTuple postingtuple;
+
+ /* form a tuple with a posting list */
+ postingtuple = BTreeFormPostingTuple(dstate->base,
+ dstate->htids,
+ dstate->nhtids);
+ final = postingtuple;
+ }
+
+ _bt_buildadd(wstate, state, final);
+
+ if (dstate->nitems > 1)
+ pfree(final);
+ /* Don't maintain dedup_intervals array, or alltupsize */
+ dstate->nhtids = 0;
+ dstate->nitems = 0;
+}
+
+/*
* Finish writing out the completed btree.
*/
static void
@@ -1144,6 +1186,11 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
keysz = IndexRelationGetNumberOfKeyAttributes(wstate->index);
SortSupport sortKeys;
int64 tuples_done = 0;
+ bool deduplicate;
+
+ /* Don't use deduplication for INCLUDE indexes or unique indexes */
+ deduplicate = (keysz == IndexRelationGetNumberOfAttributes(wstate->index) &&
+ !wstate->index->rd_index->indisunique);
if (merge)
{
@@ -1152,6 +1199,7 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
* btspool and btspool2.
*/
+ Assert(!deduplicate);
/* the preparation of merge */
itup = tuplesort_getindextuple(btspool->sortstate, true);
itup2 = tuplesort_getindextuple(btspool2->sortstate, true);
@@ -1255,9 +1303,94 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
}
pfree(sortKeys);
}
+ else if (deduplicate)
+ {
+ /* merge is unnecessary, deduplicate into posting lists */
+ BTDedupState *dstate;
+ IndexTuple newbase;
+
+ dstate = (BTDedupState *) palloc(sizeof(BTDedupState));
+ dstate->deduplicate = true; /* unused */
+ dstate->maxitemsize = 0; /* set later */
+ /* Metadata about current pending posting list */
+ dstate->htids = NULL;
+ dstate->nhtids = 0;
+ dstate->nitems = 0;
+ dstate->alltupsize = 0; /* unused */
+ /* Metadata about base tuple of current pending posting list */
+ dstate->base = NULL;
+ dstate->baseoff = InvalidOffsetNumber; /* unused */
+ dstate->basetupsize = 0;
+
+ while ((itup = tuplesort_getindextuple(btspool->sortstate,
+ true)) != NULL)
+ {
+ /* When we see first tuple, create first index page */
+ if (state == NULL)
+ {
+ state = _bt_pagestate(wstate, 0);
+ dstate->maxitemsize = BTMaxItemSize(state->btps_page);
+ /* Conservatively size array */
+ dstate->htids = palloc(dstate->maxitemsize);
+
+ /*
+ * No previous/base tuple, since itup is the first item
+ * returned by the tuplesort -- use itup as base tuple of
+ * first pending posting list for entire index build
+ */
+ newbase = CopyIndexTuple(itup);
+ _bt_dedup_start_pending(dstate, newbase, InvalidOffsetNumber);
+ }
+ else if (_bt_keep_natts_fast(wstate->index, dstate->base,
+ itup) > keysz &&
+ _bt_dedup_save_htid(dstate, itup))
+ {
+ /*
+ * Tuple is equal to base tuple of pending posting list, and
+ * merging itup into pending posting list won't exceed the
+ * BTMaxItemSize() limit. Heap TID(s) for itup have been
+ * saved in state. The next iteration will also end up here
+ * if it's possible to merge the next tuple into the same
+ * pending posting list.
+ */
+ }
+ else
+ {
+ /*
+ * Tuple is not equal to pending posting list tuple, or
+ * BTMaxItemSize() limit was reached
+ */
+ _bt_sort_dedup_finish_pending(wstate, state, dstate);
+ /* Base tuple is always a copy */
+ pfree(dstate->base);
+
+ /* itup starts new pending posting list */
+ newbase = CopyIndexTuple(itup);
+ _bt_dedup_start_pending(dstate, newbase, InvalidOffsetNumber);
+ }
+
+ /* Report progress */
+ pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
+ ++tuples_done);
+ }
+
+ /*
+ * Handle the last item (there must be a last item when the tuplesort
+ * returned one or more tuples)
+ */
+ if (state)
+ {
+ _bt_sort_dedup_finish_pending(wstate, state, dstate);
+ /* Base tuple is always a copy */
+ pfree(dstate->base);
+ pfree(dstate->htids);
+ }
+
+ pfree(dstate);
+ }
else
{
- /* merge is unnecessary */
+ /* merging and deduplication are both unnecessary */
while ((itup = tuplesort_getindextuple(btspool->sortstate,
true)) != NULL)
{
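A note on the deduplicating branch of _bt_load() above: it is a single pass over sorted input that keeps one "pending" posting list. Each incoming tuple either joins the pending list (same key, and the size limit is not hit) or forces the pending list to be finished and a new one started. The sketch below models only that control flow. It assumes integer keys and a plain cap on TIDs per list in place of the BTMaxItemSize() check; DemoPosting and demo_dedup_sorted() are hypothetical names.

```c
#include <assert.h>

/* Hypothetical output record: one physical tuple after deduplication */
typedef struct DemoPosting
{
	int			key;		/* key shared by all merged input tuples */
	int			first;		/* index of the first merged input tuple */
	int			ntids;		/* number of input tuples merged together */
} DemoPosting;

/*
 * Merge runs of equal keys from sorted input into posting lists of at most
 * maxtids entries each.  out[] must have room for nkeys entries.  Returns
 * the number of physical tuples produced.
 */
static int
demo_dedup_sorted(const int *keys, int nkeys, int maxtids, DemoPosting *out)
{
	int			nout = 0;

	assert(maxtids >= 1);

	for (int i = 0; i < nkeys; i++)
	{
		if (nout > 0 &&
			out[nout - 1].key == keys[i] &&
			out[nout - 1].ntids < maxtids)
		{
			/* Equal to pending base tuple and still fits: save its TID */
			out[nout - 1].ntids++;
		}
		else
		{
			/* Pending list is finished; this tuple starts a new one */
			out[nout].key = keys[i];
			out[nout].first = i;
			out[nout].ntids = 1;
			nout++;
		}
	}

	return nout;
}
```

With keys {1, 1, 1, 2, 3, 3} and a cap of 2, the three tuples with key 1 become two physical tuples because the cap is hit, which is the analogue of the "BTMaxItemSize() limit was reached" branch.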
diff --git a/src/backend/access/nbtree/nbtsplitloc.c b/src/backend/access/nbtree/nbtsplitloc.c
index 1c1029b..54cecc8 100644
--- a/src/backend/access/nbtree/nbtsplitloc.c
+++ b/src/backend/access/nbtree/nbtsplitloc.c
@@ -183,6 +183,9 @@ _bt_findsplitloc(Relation rel,
state.minfirstrightsz = SIZE_MAX;
state.newitemoff = newitemoff;
+ /* newitem cannot be a posting list item */
+ Assert(!BTreeTupleIsPosting(newitem));
+
/*
* maxsplits should never exceed maxoff because there will be at most as
* many candidate split points as there are points _between_ tuples, once
@@ -459,17 +462,52 @@ _bt_recsplitloc(FindSplitData *state,
int16 leftfree,
rightfree;
Size firstrightitemsz;
+ Size postingsubhikey = 0;
bool newitemisfirstonright;
/* Is the new item going to be the first item on the right page? */
newitemisfirstonright = (firstoldonright == state->newitemoff
&& !newitemonleft);
+ /*
+ * FIXME: Accessing every single tuple like this adds cycles to cases that
+ * cannot possibly benefit (i.e. cases where we know that there cannot be
+ * posting lists). Maybe we should add a way to not bother when we are
+ * certain that this is the case.
+ *
+ * We could either have _bt_split() pass us a flag, or invent a page flag
+ * that indicates that the page might have posting lists, as an
+ * optimization. There is no shortage of btpo_flags bits for stuff like
+ * this.
+ */
if (newitemisfirstonright)
+ {
firstrightitemsz = state->newitemsz;
+
+ /* Calculate posting list overhead, if any */
+ if (state->is_leaf && BTreeTupleIsPosting(state->newitem))
+ postingsubhikey = IndexTupleSize(state->newitem) -
+ BTreeTupleGetPostingOffset(state->newitem);
+ }
else
+ {
firstrightitemsz = firstoldonrightsz;
+ /* Calculate posting list overhead, if any */
+ if (state->is_leaf)
+ {
+ ItemId itemid;
+ IndexTuple newhighkey;
+
+ itemid = PageGetItemId(state->page, firstoldonright);
+ newhighkey = (IndexTuple) PageGetItem(state->page, itemid);
+
+ if (BTreeTupleIsPosting(newhighkey))
+ postingsubhikey = IndexTupleSize(newhighkey) -
+ BTreeTupleGetPostingOffset(newhighkey);
+ }
+ }
+
/* Account for all the old tuples */
leftfree = state->leftspace - olddataitemstoleft;
rightfree = state->rightspace -
@@ -492,9 +530,13 @@ _bt_recsplitloc(FindSplitData *state,
* adding a heap TID to the left half's new high key when splitting at the
* leaf level. In practice the new high key will often be smaller and
* will rarely be larger, but conservatively assume the worst case.
+ * Truncation always truncates away any posting list that appears in the
+ * first right tuple, though, so it's safe to subtract that overhead
+ * (while still conservatively assuming that truncation might have to add
+ * back a single heap TID using the pivot tuple heap TID representation).
*/
if (state->is_leaf)
- leftfree -= (int16) (firstrightitemsz +
+ leftfree -= (int16) ((firstrightitemsz - postingsubhikey) +
MAXALIGN(sizeof(ItemPointerData)));
else
leftfree -= (int16) firstrightitemsz;
@@ -691,7 +733,8 @@ _bt_afternewitemoff(FindSplitData *state, OffsetNumber maxoff,
itemid = PageGetItemId(state->page, OffsetNumberPrev(state->newitemoff));
tup = (IndexTuple) PageGetItem(state->page, itemid);
/* Do cheaper test first */
- if (!_bt_adjacenthtid(&tup->t_tid, &state->newitem->t_tid))
+ if (BTreeTupleIsPosting(tup) ||
+ !_bt_adjacenthtid(&tup->t_tid, &state->newitem->t_tid))
return false;
/* Check same conditions as rightmost item case, too */
keepnatts = _bt_keep_natts_fast(state->rel, tup, state->newitem);
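A note on the _bt_recsplitloc() change above: the space reserved on the left page for the new high key is the first right tuple's size, minus its posting list (which truncation always removes), plus room for one heap TID. The arithmetic can be checked with a small sketch. It assumes 8-byte MAXALIGN and a 6-byte ItemPointerData, and uses a posting offset of 0 to mean "not a posting list tuple", which is a convention of this example only. DEMO_MAXALIGN, DEMO_TID_SIZE and demo_hikey_reserve() are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed alignment and TID size for this sketch */
#define DEMO_MAXALIGN(len)	(((size_t) (len) + 7) & ~(size_t) 7)
#define DEMO_TID_SIZE		6

/*
 * Space to reserve on the left page for the new high key at the leaf level.
 *
 * firstrightitemsz: size of the tuple that will be first on the right page.
 * postingoffset: byte offset where its posting list starts, or 0 when the
 * tuple has no posting list.
 */
static size_t
demo_hikey_reserve(size_t firstrightitemsz, size_t postingoffset)
{
	size_t		postingsubhikey = 0;

	if (postingoffset > 0)
	{
		assert(postingoffset <= firstrightitemsz);
		/* Truncation removes the whole posting list from the high key */
		postingsubhikey = firstrightitemsz - postingoffset;
	}

	/* Still assume truncation may need to add back one heap TID */
	return (firstrightitemsz - postingsubhikey) +
		DEMO_MAXALIGN(DEMO_TID_SIZE);
}
```

For example, a 96-byte posting list tuple whose posting list starts at byte 24 reserves 24 + 8 = 32 bytes, the same as a plain 24-byte tuple, instead of the 104 bytes that would be reserved without the subtraction.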
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index bc855dd..7460bf2 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -97,8 +97,6 @@ _bt_mkscankey(Relation rel, IndexTuple itup)
indoption = rel->rd_indoption;
tupnatts = itup ? BTreeTupleGetNAtts(itup, rel) : 0;
- Assert(tupnatts <= IndexRelationGetNumberOfAttributes(rel));
-
/*
* We'll execute search using scan key constructed on key columns.
* Truncated attributes and non-key attributes are omitted from the final
@@ -110,9 +108,20 @@ _bt_mkscankey(Relation rel, IndexTuple itup)
key->anynullkeys = false; /* initial assumption */
key->nextkey = false;
key->pivotsearch = false;
+ key->scantid = NULL;
key->keysz = Min(indnkeyatts, tupnatts);
- key->scantid = key->heapkeyspace && itup ?
- BTreeTupleGetHeapTID(itup) : NULL;
+
+ Assert(tupnatts <= IndexRelationGetNumberOfAttributes(rel));
+ Assert(!itup || !BTreeTupleIsPosting(itup) || key->heapkeyspace);
+
+ /*
+ * When caller passes a tuple with a heap TID, use it to set scantid. Note
+ * that this handles posting list tuples by setting scantid to the lowest
+ * heap TID in the posting list.
+ */
+ if (itup && key->heapkeyspace)
+ key->scantid = BTreeTupleGetHeapTID(itup);
+
skey = key->scankeys;
for (i = 0; i < indnkeyatts; i++)
{
@@ -1386,6 +1395,7 @@ _bt_checkkeys(IndexScanDesc scan, IndexTuple tuple, int tupnatts,
* attribute passes the qual.
*/
Assert(ScanDirectionIsForward(dir));
+ Assert(BTreeTupleIsPivot(tuple));
continue;
}
@@ -1547,6 +1557,7 @@ _bt_check_rowcompare(ScanKey skey, IndexTuple tuple, int tupnatts,
* attribute passes the qual.
*/
Assert(ScanDirectionIsForward(dir));
+ Assert(BTreeTupleIsPivot(tuple));
cmpresult = 0;
if (subkey->sk_flags & SK_ROW_END)
break;
@@ -1786,10 +1797,35 @@ _bt_killitems(IndexScanDesc scan)
{
ItemId iid = PageGetItemId(page, offnum);
IndexTuple ituple = (IndexTuple) PageGetItem(page, iid);
+ bool killtuple = false;
- if (ItemPointerEquals(&ituple->t_tid, &kitem->heapTid))
+ if (BTreeTupleIsPosting(ituple))
{
- /* found the item */
+ int pi = i + 1;
+ int nposting = BTreeTupleGetNPosting(ituple);
+ int j;
+
+ for (j = 0; j < nposting; j++)
+ {
+ ItemPointer item = BTreeTupleGetPostingN(ituple, j);
+
+ if (!ItemPointerEquals(item, &kitem->heapTid))
+ break; /* out of posting list loop */
+
+ /* Read-ahead to later kitems */
+ if (pi < numKilled)
+ kitem = &so->currPos.items[so->killedItems[pi++]];
+ }
+
+ if (j == nposting)
+ killtuple = true;
+ }
+ else if (ItemPointerEquals(&ituple->t_tid, &kitem->heapTid))
+ killtuple = true;
+
+ if (killtuple)
+ {
+ /* found the item/all posting list items */
ItemIdMarkDead(iid);
killedsomething = true;
break; /* out of inner search loop */
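The rule the _bt_killitems() change applies to posting list tuples (mark the line pointer dead only when every heap TID in the posting list is among the killed items, in order) can be sketched on its own. This is a simplified restatement under assumptions: SketchTid and the sketch_ functions are hypothetical stand-ins, and the killed TIDs are passed as a flat array rather than through so->killedItems.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for ItemPointerData */
typedef struct SketchTid
{
    unsigned int   block;
    unsigned short offset;
} SketchTid;

static bool
sketch_tid_eq(const SketchTid *a, const SketchTid *b)
{
    return a->block == b->block && a->offset == b->offset;
}

/*
 * Returns true only if posting[0..nposting-1] matches the killed TIDs
 * killed[start], killed[start + 1], ... one for one.  A single surviving
 * TID in the posting list means the whole tuple must be kept.
 */
static bool
sketch_posting_all_killed(const SketchTid *posting, int nposting,
                          const SketchTid *killed, int nkilled, int start)
{
    for (int j = 0; j < nposting; j++)
    {
        if (start + j >= nkilled)
            return false;       /* ran out of killed items */
        if (!sketch_tid_eq(&posting[j], &killed[start + j]))
            return false;       /* this TID was not killed */
    }
    return true;
}
```

The consequence worth noting is that LP_DEAD stays all-or-nothing per line pointer: a posting tuple with one live TID is left alone until vacuum rewrites it.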
@@ -2140,6 +2176,24 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
pivot = index_truncate_tuple(itupdesc, firstright, keepnatts);
+ if (BTreeTupleIsPosting(firstright))
+ {
+ BTreeTupleClearBtIsPosting(pivot);
+ BTreeTupleSetNAtts(pivot, keepnatts);
+ if (keepnatts == natts)
+ {
+ /*
+ * index_truncate_tuple() just returned a copy of the
+ * original, so make sure that the size of the new pivot tuple
+ * doesn't have posting list overhead
+ */
+ pivot->t_info &= ~INDEX_SIZE_MASK;
+ pivot->t_info |= MAXALIGN(BTreeTupleGetPostingOffset(firstright));
+ }
+ }
+
+ Assert(!BTreeTupleIsPosting(pivot));
+
/*
* If there is a distinguishing key attribute within new pivot tuple,
* there is no need to add an explicit heap TID attribute
@@ -2156,6 +2210,8 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
* attribute to the new pivot tuple.
*/
Assert(natts != nkeyatts);
+ Assert(!BTreeTupleIsPosting(lastleft) &&
+ !BTreeTupleIsPosting(firstright));
newsize = IndexTupleSize(pivot) + MAXALIGN(sizeof(ItemPointerData));
tidpivot = palloc0(newsize);
memcpy(tidpivot, pivot, IndexTupleSize(pivot));
@@ -2163,6 +2219,24 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
pfree(pivot);
pivot = tidpivot;
}
+ else if (BTreeTupleIsPosting(firstright))
+ {
+ /*
+ * No truncation was possible, since key attributes are all equal. We
+ * can always truncate away a posting list, though.
+ *
+ * It's necessary to add a heap TID attribute to the new pivot tuple.
+ */
+ newsize = MAXALIGN(BTreeTupleGetPostingOffset(firstright)) +
+ MAXALIGN(sizeof(ItemPointerData));
+ pivot = palloc0(newsize);
+ memcpy(pivot, firstright, BTreeTupleGetPostingOffset(firstright));
+
+ pivot->t_info &= ~INDEX_SIZE_MASK;
+ pivot->t_info |= newsize;
+ BTreeTupleClearBtIsPosting(pivot);
+ BTreeTupleSetAltHeapTID(pivot);
+ }
else
{
/*
@@ -2170,7 +2244,8 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
* It's necessary to add a heap TID attribute to the new pivot tuple.
*/
Assert(natts == nkeyatts);
- newsize = IndexTupleSize(firstright) + MAXALIGN(sizeof(ItemPointerData));
+ newsize = MAXALIGN(IndexTupleSize(firstright)) +
+ MAXALIGN(sizeof(ItemPointerData));
pivot = palloc0(newsize);
memcpy(pivot, firstright, IndexTupleSize(firstright));
}
@@ -2188,6 +2263,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
* nbtree (e.g., there is no pg_attribute entry).
*/
Assert(itup_key->heapkeyspace);
+ Assert(!BTreeTupleIsPosting(pivot));
pivot->t_info &= ~INDEX_SIZE_MASK;
pivot->t_info |= newsize;
@@ -2200,7 +2276,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
*/
pivotheaptid = (ItemPointer) ((char *) pivot + newsize -
sizeof(ItemPointerData));
- ItemPointerCopy(&lastleft->t_tid, pivotheaptid);
+ ItemPointerCopy(BTreeTupleGetMaxTID(lastleft), pivotheaptid);
/*
* Lehman and Yao require that the downlink to the right page, which is to
@@ -2211,9 +2287,12 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
* tiebreaker.
*/
#ifndef DEBUG_NO_TRUNCATE
- Assert(ItemPointerCompare(&lastleft->t_tid, &firstright->t_tid) < 0);
- Assert(ItemPointerCompare(pivotheaptid, &lastleft->t_tid) >= 0);
- Assert(ItemPointerCompare(pivotheaptid, &firstright->t_tid) < 0);
+ Assert(ItemPointerCompare(BTreeTupleGetMaxTID(lastleft),
+ BTreeTupleGetHeapTID(firstright)) < 0);
+ Assert(ItemPointerCompare(pivotheaptid,
+ BTreeTupleGetHeapTID(lastleft)) >= 0);
+ Assert(ItemPointerCompare(pivotheaptid,
+ BTreeTupleGetHeapTID(firstright)) < 0);
#else
/*
@@ -2226,7 +2305,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
* attribute values along with lastleft's heap TID value when lastleft's
* TID happens to be greater than firstright's TID.
*/
- ItemPointerCopy(&firstright->t_tid, pivotheaptid);
+ ItemPointerCopy(BTreeTupleGetHeapTID(firstright), pivotheaptid);
/*
* Pivot heap TID should never be fully equal to firstright. Note that
@@ -2235,7 +2314,8 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
*/
ItemPointerSetOffsetNumber(pivotheaptid,
OffsetNumberPrev(ItemPointerGetOffsetNumber(pivotheaptid)));
- Assert(ItemPointerCompare(pivotheaptid, &firstright->t_tid) < 0);
+ Assert(ItemPointerCompare(pivotheaptid,
+ BTreeTupleGetHeapTID(firstright)) < 0);
#endif
BTreeTupleSetNAtts(pivot, nkeyatts);
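The heap TID invariants that the updated assertions in _bt_truncate() check can be illustrated with plain sorted arrays standing in for posting lists. This is a sketch under assumptions: SketchTid and the sketch_ functions are hypothetical, and each tuple is modelled only by its ascending TID array (one element for a non-posting tuple).

```c
#include <assert.h>

/* Hypothetical stand-in for ItemPointerData */
typedef struct SketchTid
{
    unsigned int   block;
    unsigned short offset;
} SketchTid;

static int
sketch_tid_cmp(const SketchTid *a, const SketchTid *b)
{
    if (a->block != b->block)
        return (a->block < b->block) ? -1 : 1;
    if (a->offset != b->offset)
        return (a->offset < b->offset) ? -1 : 1;
    return 0;
}

/*
 * Pick the heap TID stored in the new pivot: the highest TID of lastleft
 * (what BTreeTupleGetMaxTID() returns in the patch).  It must sort before
 * the lowest TID of firstright.
 */
static SketchTid
sketch_pivot_heap_tid(const SketchTid *lastleft, int nleft,
                      const SketchTid *firstright, int nright)
{
    assert(nleft > 0 && nright > 0);
    /* every TID on the left must sort before every TID on the right */
    assert(sketch_tid_cmp(&lastleft[nleft - 1], &firstright[0]) < 0);
    return lastleft[nleft - 1];
}
```

Using the maximum rather than the first TID matters once lastleft can be a posting list: the pivot has to be greater than or equal to everything that stays on the left page.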
@@ -2316,15 +2396,25 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
* The approach taken here usually provides the same answer as _bt_keep_natts
* will (for the same pair of tuples from a heapkeyspace index), since the
* majority of btree opclasses can never indicate that two datums are equal
- * unless they're bitwise equal (once detoasted). Similarly, result may
- * differ from the _bt_keep_natts result when either tuple has TOASTed datums,
- * though this is barely possible in practice.
+ * unless they're bitwise equal after detoasting.
*
* These issues must be acceptable to callers, typically because they're only
* concerned about making suffix truncation as effective as possible without
* leaving excessive amounts of free space on either side of page split.
* Callers can rely on the fact that attributes considered equal here are
* definitely also equal according to _bt_keep_natts.
+ *
+ * When an index only uses opclasses where equality is "precise", this
+ * function is guaranteed to give the same result as _bt_keep_natts(). This
+ * makes it safe to use this function to determine whether or not two tuples
+ * can be folded together into a single posting tuple. Posting list
+ * deduplication cannot be used with nondeterministic collations for this
+ * reason.
+ *
+ * FIXME: Actually invent the needed "equality-is-precise" opclass
+ * infrastructure. See dedicated -hackers thread:
+ *
+ * https://postgr.es/m/CAH2-Wzn3Ee49Gmxb7V1VJ3-AC8fWn-Fr8pfWQebHe8rYRxt5OQ@mail.gmail.com
*/
int
_bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
@@ -2349,8 +2439,38 @@ _bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
if (isNull1 != isNull2)
break;
+ /*
+ * XXX: The ideal outcome from the point of view of the posting list
+ * patch is that the definition of an opclass with "precise equality"
+ * becomes: "equality operator function must give exactly the same
+ * answer as datum_image_eq() would, provided that we aren't using a
+ * nondeterministic collation". (Nondeterministic collations are
+ * clearly not compatible with deduplication.)
+ *
+ * This will be a lot faster than actually using the authoritative
+ * insertion scankey in some cases. This approach also seems more
+ * elegant, since suffix truncation gets to follow exactly the same
+ * definition of "equal" as posting list deduplication -- there is a
+ * subtle interplay between deduplication and suffix truncation, and
+ * it would be nice to know for sure that they have exactly the same
+ * idea about what equality is.
+ *
+ * This ideal outcome still avoids problems with TOAST. We cannot
+ * repeat bugs like the amcheck bug that was fixed in bugfix commit
+ * eba775345d23d2c999bbb412ae658b6dab36e3e8. datum_image_eq()
+ * considers binary equality, though only _after_ each datum is
+ * decompressed.
+ *
+ * If this ideal solution isn't possible, then we can fall back on
+ * defining "precise equality" as: "type's output function must
+ * produce identical textual output for any two datums that compare
+ * equal when using a safe/equality-is-precise operator class (unless
+ * using a nondeterministic collation)". That would mean that we'd
+ * have to make deduplication call _bt_keep_natts() instead (or some
+ * other function that uses authoritative insertion scankey).
+ */
if (!isNull1 &&
- !datumIsEqual(datum1, datum2, att->attbyval, att->attlen))
+ !datum_image_eq(datum1, datum2, att->attbyval, att->attlen))
break;
keepnatts++;
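Why "precise equality" matters for folding tuples together can be shown without any PostgreSQL code. The sketch below uses C doubles purely as a familiar illustration, assuming IEEE 754 arithmetic; the sketch_ names are hypothetical, and it makes no claim about which opclasses the proposed infrastructure would classify as precise.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Image equality: byte-for-byte comparison of the stored representation */
static bool
sketch_image_eq(double a, double b)
{
    return memcmp(&a, &b, sizeof(double)) == 0;
}

/* Operator equality: what a comparison function would report */
static bool
sketch_operator_eq(double a, double b)
{
    return a == b;
}
```

Positive and negative zero compare equal under the operator but have different images, so merging them into one posting tuple would silently discard one of the two representations. That is the case a datum_image_eq()-style test rejects.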
@@ -2402,22 +2522,30 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
tupnatts = BTreeTupleGetNAtts(itup, rel);
+ /* !heapkeyspace indexes do not support deduplication */
+ if (!heapkeyspace && BTreeTupleIsPosting(itup))
+ return false;
+
+ /* INCLUDE indexes do not support deduplication */
+ if (natts != nkeyatts && BTreeTupleIsPosting(itup))
+ return false;
+
if (P_ISLEAF(opaque))
{
if (offnum >= P_FIRSTDATAKEY(opaque))
{
/*
- * Non-pivot tuples currently never use alternative heap TID
- * representation -- even those within heapkeyspace indexes
+ * Non-pivot tuple should never be explicitly marked as a pivot
+ * tuple
*/
- if ((itup->t_info & INDEX_ALT_TID_MASK) != 0)
+ if (BTreeTupleIsPivot(itup))
return false;
/*
* Leaf tuples that are not the page high key (non-pivot tuples)
* should never be truncated. (Note that tupnatts must have been
- * inferred, rather than coming from an explicit on-disk
- * representation.)
+ * inferred, even with a posting list tuple, because only pivot
+ * tuples store tupnatts directly.)
*/
return tupnatts == natts;
}
@@ -2461,12 +2589,12 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
* non-zero, or when there is no explicit representation and the
* tuple is evidently not a pre-pg_upgrade tuple.
*
- * Prior to v11, downlinks always had P_HIKEY as their offset. Use
- * that to decide if the tuple is a pre-v11 tuple.
+ * Prior to v11, downlinks always had P_HIKEY as their offset.
+ * Accept that as an alternative indication of a valid
+ * !heapkeyspace negative infinity tuple.
*/
return tupnatts == 0 ||
- ((itup->t_info & INDEX_ALT_TID_MASK) == 0 &&
- ItemPointerGetOffsetNumber(&(itup->t_tid)) == P_HIKEY);
+ ItemPointerGetOffsetNumber(&(itup->t_tid)) == P_HIKEY;
}
else
{
@@ -2492,7 +2620,11 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
* heapkeyspace index pivot tuples, regardless of whether or not there are
* non-key attributes.
*/
- if ((itup->t_info & INDEX_ALT_TID_MASK) == 0)
+ if (!BTreeTupleIsPivot(itup))
+ return false;
+
+ /* Pivot tuple should not use posting list representation (redundant) */
+ if (BTreeTupleIsPosting(itup))
return false;
/*
@@ -2562,11 +2694,85 @@ _bt_check_third_page(Relation rel, Relation heap, bool needheaptidspace,
BTMaxItemSizeNoHeapTid(page),
RelationGetRelationName(rel)),
errdetail("Index row references tuple (%u,%u) in relation \"%s\".",
- ItemPointerGetBlockNumber(&newtup->t_tid),
- ItemPointerGetOffsetNumber(&newtup->t_tid),
+ ItemPointerGetBlockNumber(BTreeTupleGetHeapTID(newtup)),
+ ItemPointerGetOffsetNumber(BTreeTupleGetHeapTID(newtup)),
RelationGetRelationName(heap)),
errhint("Values larger than 1/3 of a buffer page cannot be indexed.\n"
"Consider a function index of an MD5 hash of the value, "
"or use full text indexing."),
errtableconstraint(heap, RelationGetRelationName(rel))));
}
+
+/*
+ * Given a basic tuple that contains key datum and posting list, build a
+ * posting tuple. Caller's "htids" array must be sorted in ascending order.
+ *
+ * The basic tuple may itself be a posting tuple, but only its key part is
+ * used; all ItemPointers must be passed via htids.
+ *
+ * If nhtids == 1, just build a non-posting tuple. This avoids posting list
+ * storage overhead once vacuum has reduced a posting tuple to a single TID.
+ */
+IndexTuple
+BTreeFormPostingTuple(IndexTuple tuple, ItemPointer htids, int nhtids)
+{
+ uint32 keysize,
+ newsize = 0;
+ IndexTuple itup;
+
+ /* We only need key part of the tuple */
+ if (BTreeTupleIsPosting(tuple))
+ keysize = BTreeTupleGetPostingOffset(tuple);
+ else
+ keysize = IndexTupleSize(tuple);
+
+ Assert(nhtids > 0);
+
+ /* Add space needed for posting list */
+ if (nhtids > 1)
+ newsize = SHORTALIGN(keysize) + sizeof(ItemPointerData) * nhtids;
+ else
+ newsize = keysize;
+
+ newsize = MAXALIGN(newsize);
+ itup = palloc0(newsize);
+ memcpy(itup, tuple, keysize);
+ itup->t_info &= ~INDEX_SIZE_MASK;
+ itup->t_info |= newsize;
+
+ if (nhtids > 1)
+ {
+ /* Form posting tuple, fill posting fields */
+
+ itup->t_info |= INDEX_ALT_TID_MASK;
+ BTreeSetPostingMeta(itup, nhtids, SHORTALIGN(keysize));
+ /* Copy posting list into the posting tuple */
+ memcpy(BTreeTupleGetPosting(itup), htids,
+ sizeof(ItemPointerData) * nhtids);
+
+#ifdef USE_ASSERT_CHECKING
+ {
+ /* Assert that htid array is sorted and has unique TIDs */
+ ItemPointerData last;
+ ItemPointer current;
+
+ ItemPointerCopy(BTreeTupleGetHeapTID(itup), &last);
+
+ for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
+ {
+ current = BTreeTupleGetPostingN(itup, i);
+ Assert(ItemPointerCompare(current, &last) > 0);
+ ItemPointerCopy(current, &last);
+ }
+ }
+#endif
+ }
+ else
+ {
+ /* To finish building a non-posting tuple, copy the TID from htids */
+ itup->t_info &= ~INDEX_ALT_TID_MASK;
+ ItemPointerCopy(htids, &itup->t_tid);
+ }
+
+ return itup;
+}
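The size arithmetic and t_tid encoding used by BTreeFormPostingTuple() can be checked in isolation. A minimal sketch, assuming 8-byte MAXALIGN, 2-byte SHORTALIGN and a 6-byte ItemPointerData; the mask values are the ones this patch defines in nbtree.h, while the sketch_ names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumptions: alignment and TID size as on common 64-bit builds */
#define SKETCH_SHORTALIGN(len) (((size_t) (len) + 1) & ~((size_t) 1))
#define SKETCH_MAXALIGN(len)   (((size_t) (len) + 7) & ~((size_t) 7))
#define SKETCH_TID_SIZE 6

/* Values taken from the nbtree.h part of this patch */
#define SKETCH_BT_N_POSTING_OFFSET_MASK 0x0FFF
#define SKETCH_BT_IS_POSTING            0x2000

/* Total tuple size: key part, then the TID array, MAXALIGN'd as a whole */
static size_t
sketch_posting_tuple_size(size_t keysize, int nhtids)
{
    size_t newsize;

    if (nhtids > 1)
        newsize = SKETCH_SHORTALIGN(keysize) +
            SKETCH_TID_SIZE * (size_t) nhtids;
    else
        newsize = keysize;      /* plain tuple, no posting list */

    return SKETCH_MAXALIGN(newsize);
}

/* Offset field of t_tid: TID count in the low 12 bits plus the flag bit */
static uint16_t
sketch_posting_offset_field(int nhtids)
{
    return (uint16_t) ((nhtids & SKETCH_BT_N_POSTING_OFFSET_MASK) |
                       SKETCH_BT_IS_POSTING);
}

static int
sketch_nposting(uint16_t offset_field)
{
    return offset_field & SKETCH_BT_N_POSTING_OFFSET_MASK;
}
```

For a 16-byte key with three heap TIDs this gives 16 + 18 = 34 bytes, rounded up to 40, against 3 * 24 = 72 bytes of tuple data for three separate 16-byte key tuples plus their line pointers.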
diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index dd5315c..2f741e1 100644
--- a/src/backend/access/nbtree/nbtxlog.c
+++ b/src/backend/access/nbtree/nbtxlog.c
@@ -21,8 +21,11 @@
#include "access/xlog.h"
#include "access/xlogutils.h"
#include "storage/procarray.h"
+#include "utils/memutils.h"
#include "miscadmin.h"
+static MemoryContext opCtx; /* working memory for operations */
+
/*
* _bt_restore_page -- re-enter all the index tuples on a page
*
@@ -181,9 +184,46 @@ btree_xlog_insert(bool isleaf, bool ismeta, XLogReaderState *record)
page = BufferGetPage(buffer);
- if (PageAddItem(page, (Item) datapos, datalen, xlrec->offnum,
- false, false) == InvalidOffsetNumber)
- elog(PANIC, "btree_xlog_insert: failed to add item");
+ if (xlrec->postingoff == InvalidOffsetNumber)
+ {
+ /* Simple retail insertion */
+ if (PageAddItem(page, (Item) datapos, datalen, xlrec->offnum,
+ false, false) == InvalidOffsetNumber)
+ elog(PANIC, "btree_xlog_insert: failed to add item");
+ }
+ else
+ {
+ ItemId itemid;
+ IndexTuple oposting,
+ newitem,
+ nposting;
+
+ /*
+ * A posting list split occurred during insertion.
+ *
+ * Use _bt_posting_split() to repeat posting list split steps from
+ * primary. Note that newitem from WAL record is 'orignewitem',
+ * not the final version of newitem that is actually inserted on
+ * page.
+ */
+ Assert(isleaf);
+ itemid = PageGetItemId(page, OffsetNumberPrev(xlrec->offnum));
+ oposting = (IndexTuple) PageGetItem(page, itemid);
+
+ /* newitem must be mutable copy for _bt_posting_split() */
+ newitem = CopyIndexTuple((IndexTuple) datapos);
+ nposting = _bt_posting_split(newitem, oposting,
+ xlrec->postingoff);
+
+ /* Replace existing posting list with post-split version */
+ memcpy(oposting, nposting, MAXALIGN(IndexTupleSize(nposting)));
+
+ /* insert new item */
+ Assert(IndexTupleSize(newitem) == datalen);
+ if (PageAddItem(page, (Item) newitem, datalen, xlrec->offnum,
+ false, false) == InvalidOffsetNumber)
+ elog(PANIC, "btree_xlog_insert: failed to add posting split new item");
+ }
PageSetLSN(page, lsn);
MarkBufferDirty(buffer);
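The redo code above relies on two properties of _bt_posting_split(): the replacement posting list has the same size as the original, and the new item is modified before it is inserted after the posting tuple. One way to satisfy both, assumed here and not confirmed by this part of the patch, is to place the incoming heap TID inside the posting list and hand the former maximum TID to the new item. SketchTid and sketch_posting_split() are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for ItemPointerData */
typedef struct SketchTid
{
    unsigned int   block;
    unsigned short offset;
} SketchTid;

/*
 * posting[0..nposting-1] is sorted ascending, and *newtid sorts between
 * posting[postingoff - 1] and posting[postingoff].  On return, posting[]
 * holds *newtid at its sorted position (same length as before), and
 * *newtid holds the former maximum TID.
 */
static void
sketch_posting_split(SketchTid *posting, int nposting, int postingoff,
                     SketchTid *newtid)
{
    SketchTid   oldmax;

    assert(postingoff > 0 && postingoff < nposting);
    oldmax = posting[nposting - 1];

    /* shift posting[postingoff .. nposting-2] right, dropping the old max */
    memmove(&posting[postingoff + 1], &posting[postingoff],
            (size_t) (nposting - postingoff - 1) * sizeof(SketchTid));
    posting[postingoff] = *newtid;
    *newtid = oldmax;
}
```

Under that assumption the WAL record only needs the original new item and postingoff, since replay can recompute both final tuples deterministically from the page contents.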
@@ -265,20 +305,42 @@ btree_xlog_split(bool onleft, XLogReaderState *record)
BTPageOpaque lopaque = (BTPageOpaque) PageGetSpecialPointer(lpage);
OffsetNumber off;
IndexTuple newitem = NULL,
- left_hikey = NULL;
+ left_hikey = NULL,
+ nposting = NULL;
Size newitemsz = 0,
left_hikeysz = 0;
Page newlpage;
- OffsetNumber leftoff;
+ OffsetNumber leftoff,
+ replacepostingoff = InvalidOffsetNumber;
datapos = XLogRecGetBlockData(record, 0, &datalen);
- if (onleft)
+ if (onleft || xlrec->postingoff != 0)
{
newitem = (IndexTuple) datapos;
newitemsz = MAXALIGN(IndexTupleSize(newitem));
datapos += newitemsz;
datalen -= newitemsz;
+
+ if (xlrec->postingoff != 0)
+ {
+ /*
+ * Use _bt_posting_split() to repeat posting list split steps
+ * from primary
+ */
+ ItemId itemid;
+ IndexTuple oposting;
+
+ /* Posting list must be at offset number before new item's */
+ replacepostingoff = OffsetNumberPrev(xlrec->newitemoff);
+
+ /* newitem must be mutable copy for _bt_posting_split() */
+ newitem = CopyIndexTuple(newitem);
+ itemid = PageGetItemId(lpage, replacepostingoff);
+ oposting = (IndexTuple) PageGetItem(lpage, itemid);
+ nposting = _bt_posting_split(newitem, oposting,
+ xlrec->postingoff);
+ }
}
/* Extract left hikey and its size (assuming 16-bit alignment) */
@@ -304,8 +366,20 @@ btree_xlog_split(bool onleft, XLogReaderState *record)
Size itemsz;
IndexTuple item;
+ /* Add replacement posting list when required */
+ if (off == replacepostingoff)
+ {
+ Assert(onleft || xlrec->firstright == xlrec->newitemoff);
+ if (PageAddItem(newlpage, (Item) nposting,
+ MAXALIGN(IndexTupleSize(nposting)), leftoff,
+ false, false) == InvalidOffsetNumber)
+ elog(ERROR, "failed to add new posting list item to left page after split");
+ leftoff = OffsetNumberNext(leftoff);
+ continue;
+ }
+
/* add the new item if it was inserted on left page */
- if (onleft && off == xlrec->newitemoff)
+ else if (onleft && off == xlrec->newitemoff)
{
if (PageAddItem(newlpage, (Item) newitem, newitemsz, leftoff,
false, false) == InvalidOffsetNumber)
@@ -380,14 +454,89 @@ btree_xlog_split(bool onleft, XLogReaderState *record)
}
static void
+btree_xlog_dedup(XLogReaderState *record)
+{
+ XLogRecPtr lsn = record->EndRecPtr;
+ Buffer buf;
+ xl_btree_dedup *xlrec = (xl_btree_dedup *) XLogRecGetData(record);
+
+ if (XLogReadBufferForRedo(record, 0, &buf) == BLK_NEEDS_REDO)
+ {
+ /*
+ * Rebuild the posting list tuple on the page by merging the interval of
+ * items described by the WAL record, in item number order.
+ */
+ Page page = (Page) BufferGetPage(buf);
+ OffsetNumber offnum;
+ BTDedupState *state;
+
+ state = (BTDedupState *) palloc(sizeof(BTDedupState));
+
+ state->deduplicate = true; /* unused */
+ state->maxitemsize = BTMaxItemSize(page);
+ /* Metadata about current pending posting list */
+ state->htids = NULL;
+ state->nhtids = 0;
+ state->nitems = 0;
+ state->alltupsize = 0;
+ /* Metadata about base tuple of current pending posting list */
+ state->base = NULL;
+ state->baseoff = InvalidOffsetNumber;
+ state->basetupsize = 0;
+
+ /* Conservatively size array */
+ state->htids = palloc(state->maxitemsize);
+
+ /*
+ * Iterate over tuples on the page belonging to the interval
+ * to deduplicate them into a posting list.
+ */
+ for (offnum = xlrec->baseoff;
+ offnum < xlrec->baseoff + xlrec->nitems;
+ offnum = OffsetNumberNext(offnum))
+ {
+ ItemId itemid = PageGetItemId(page, offnum);
+ IndexTuple itup = (IndexTuple) PageGetItem(page, itemid);
+
+ Assert(!ItemIdIsDead(itemid));
+
+ if (offnum == xlrec->baseoff)
+ {
+ /*
+ * No previous/base tuple for first data item -- use first
+ * data item as base tuple of first pending posting list
+ */
+ _bt_dedup_start_pending(state, itup, offnum);
+ }
+ else
+ {
+ /* Heap TID(s) for itup will be saved in state */
+ if (!_bt_dedup_save_htid(state, itup))
+ elog(ERROR, "could not add heap tid to pending posting list");
+ }
+ }
+
+ Assert(state->nitems == xlrec->nitems);
+ /* Handle the last item */
+ _bt_dedup_finish_pending(buf, state, false);
+
+ PageSetLSN(page, lsn);
+ MarkBufferDirty(buf);
+ }
+
+ if (BufferIsValid(buf))
+ UnlockReleaseBuffer(buf);
+}
+
+static void
btree_xlog_vacuum(XLogReaderState *record)
{
XLogRecPtr lsn = record->EndRecPtr;
Buffer buffer;
Page page;
BTPageOpaque opaque;
-#ifdef UNUSED
xl_btree_vacuum *xlrec = (xl_btree_vacuum *) XLogRecGetData(record);
+#ifdef UNUSED
/*
* This section of code is thought to be no longer needed, after analysis
@@ -478,14 +627,34 @@ btree_xlog_vacuum(XLogReaderState *record)
if (len > 0)
{
- OffsetNumber *unused;
- OffsetNumber *unend;
+ if (xlrec->nupdated > 0)
+ {
+ OffsetNumber *updatedoffsets;
+ IndexTuple updated;
+ Size itemsz;
+
+ updatedoffsets = (OffsetNumber *)
+ (ptr + xlrec->ndeleted * sizeof(OffsetNumber));
+ updated = (IndexTuple) ((char *) updatedoffsets +
+ xlrec->nupdated * sizeof(OffsetNumber));
- unused = (OffsetNumber *) ptr;
- unend = (OffsetNumber *) ((char *) ptr + len);
+ /* Handle posting tuples */
+ for (int i = 0; i < xlrec->nupdated; i++)
+ {
+ PageIndexTupleDelete(page, updatedoffsets[i]);
+
+ itemsz = MAXALIGN(IndexTupleSize(updated));
+
+ if (PageAddItem(page, (Item) updated, itemsz, updatedoffsets[i],
+ false, false) == InvalidOffsetNumber)
+ elog(PANIC, "btree_xlog_vacuum: failed to add updated posting list item");
+
+ updated = (IndexTuple) ((char *) updated + itemsz);
+ }
+ }
- if ((unend - unused) > 0)
- PageIndexMultiDelete(page, unused, unend - unused);
+ if (xlrec->ndeleted)
+ PageIndexMultiDelete(page, (OffsetNumber *) ptr, xlrec->ndeleted);
}
/*
@@ -820,7 +989,9 @@ void
btree_redo(XLogReaderState *record)
{
uint8 info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+ MemoryContext oldCtx;
+ oldCtx = MemoryContextSwitchTo(opCtx);
switch (info)
{
case XLOG_BTREE_INSERT_LEAF:
@@ -838,6 +1009,9 @@ btree_redo(XLogReaderState *record)
case XLOG_BTREE_SPLIT_R:
btree_xlog_split(false, record);
break;
+ case XLOG_BTREE_DEDUP_PAGE:
+ btree_xlog_dedup(record);
+ break;
case XLOG_BTREE_VACUUM:
btree_xlog_vacuum(record);
break;
@@ -863,6 +1037,23 @@ btree_redo(XLogReaderState *record)
default:
elog(PANIC, "btree_redo: unknown op code %u", info);
}
+ MemoryContextSwitchTo(oldCtx);
+ MemoryContextReset(opCtx);
+}
+
+void
+btree_xlog_startup(void)
+{
+ opCtx = AllocSetContextCreate(CurrentMemoryContext,
+ "Btree recovery temporary context",
+ ALLOCSET_DEFAULT_SIZES);
+}
+
+void
+btree_xlog_cleanup(void)
+{
+ MemoryContextDelete(opCtx);
+ opCtx = NULL;
}
/*
diff --git a/src/backend/access/rmgrdesc/nbtdesc.c b/src/backend/access/rmgrdesc/nbtdesc.c
index 4ee6d04..1dde2da 100644
--- a/src/backend/access/rmgrdesc/nbtdesc.c
+++ b/src/backend/access/rmgrdesc/nbtdesc.c
@@ -30,7 +30,8 @@ btree_desc(StringInfo buf, XLogReaderState *record)
{
xl_btree_insert *xlrec = (xl_btree_insert *) rec;
- appendStringInfo(buf, "off %u", xlrec->offnum);
+ appendStringInfo(buf, "off %u; postingoff %u",
+ xlrec->offnum, xlrec->postingoff);
break;
}
case XLOG_BTREE_SPLIT_L:
@@ -38,16 +39,30 @@ btree_desc(StringInfo buf, XLogReaderState *record)
{
xl_btree_split *xlrec = (xl_btree_split *) rec;
- appendStringInfo(buf, "level %u, firstright %d, newitemoff %d",
- xlrec->level, xlrec->firstright, xlrec->newitemoff);
+ appendStringInfo(buf, "level %u, firstright %d, newitemoff %d, postingoff %d",
+ xlrec->level,
+ xlrec->firstright,
+ xlrec->newitemoff,
+ xlrec->postingoff);
+ break;
+ }
+ case XLOG_BTREE_DEDUP_PAGE:
+ {
+ xl_btree_dedup *xlrec = (xl_btree_dedup *) rec;
+
+ appendStringInfo(buf, "baseoff %u; nitems %u",
+ xlrec->baseoff,
+ xlrec->nitems);
break;
}
case XLOG_BTREE_VACUUM:
{
xl_btree_vacuum *xlrec = (xl_btree_vacuum *) rec;
- appendStringInfo(buf, "lastBlockVacuumed %u",
- xlrec->lastBlockVacuumed);
+ appendStringInfo(buf, "lastBlockVacuumed %u; nupdated %u; ndeleted %u",
+ xlrec->lastBlockVacuumed,
+ xlrec->nupdated,
+ xlrec->ndeleted);
break;
}
case XLOG_BTREE_DELETE:
@@ -131,6 +146,9 @@ btree_identify(uint8 info)
case XLOG_BTREE_SPLIT_R:
id = "SPLIT_R";
break;
+ case XLOG_BTREE_DEDUP_PAGE:
+ id = "DEDUPLICATE";
+ break;
case XLOG_BTREE_VACUUM:
id = "VACUUM";
break;
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 4a80e84..22b2e93 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -234,8 +234,7 @@ typedef struct BTMetaPageData
* t_tid | t_info | key values | INCLUDE columns, if any
*
* t_tid points to the heap TID, which is a tiebreaker key column as of
- * BTREE_VERSION 4. Currently, the INDEX_ALT_TID_MASK status bit is never
- * set for non-pivot tuples.
+ * BTREE_VERSION 4.
*
* All other types of index tuples ("pivot" tuples) only have key columns,
* since pivot tuples only exist to represent how the key space is
@@ -252,6 +251,38 @@ typedef struct BTMetaPageData
* omitted rather than truncated, since its representation is different to
* the non-pivot representation.)
*
+ * Non-pivot posting tuple format:
+ * t_tid | t_info | key values | INCLUDE columns, if any | posting_list[]
+ *
+ * In order to store duplicate keys more efficiently, we use a special tuple
+ * format: posting tuples. posting_list is an array of ItemPointerData.
+ *
+ * Deduplication never applies to unique indexes or indexes with INCLUDEd
+ * columns.
+ *
+ * To distinguish posting tuples, we use the INDEX_ALT_TID_MASK flag in t_info
+ * and the BT_IS_POSTING flag in t_tid.
+ * These flags redefine the content of the posting tuple's t_tid:
+ * - t_tid.ip_blkid contains the offset of the posting list.
+ * - the t_tid offset field contains the number of posting items in the tuple.
+ *
+ * The 12 least significant offset bits from t_tid are used to represent
+ * the number of posting items in posting tuples, leaving 4 status
+ * bits (BT_RESERVED_OFFSET_MASK bits), 3 of which are reserved for
+ * future use.
+ * BT_N_POSTING_OFFSET_MASK is large enough to store any number of posting
+ * items, which is constrained by BTMaxItemSize.
+ *
+ * If a page contains so many duplicates that they do not fit into one posting
+ * tuple (bounded by BTMaxItemSize and ), the page may contain several posting
+ * tuples with the same key.
+ * A page can also contain both posting and non-posting tuples with the same key.
+ * Currently, posting tuples always contain at least two TIDs in the posting
+ * list.
+ *
+ * Posting tuples always have the same number of attributes as the index has
+ * generally.
+ *
* Pivot tuple format:
*
* t_tid | t_info | key values | [heap TID]
@@ -281,23 +312,149 @@ typedef struct BTMetaPageData
* bits (BT_RESERVED_OFFSET_MASK bits), 3 of which that are reserved for
* future use. BT_N_KEYS_OFFSET_MASK should be large enough to store any
* number of columns/attributes <= INDEX_MAX_KEYS.
+ * BT_IS_POSTING bit must be unset for pivot tuples, since we use it
+ * to distinguish posting tuples from pivot tuples.
*
* Note well: The macros that deal with the number of attributes in tuples
- * assume that a tuple with INDEX_ALT_TID_MASK set must be a pivot tuple,
- * and that a tuple without INDEX_ALT_TID_MASK set must be a non-pivot
- * tuple (or must have the same number of attributes as the index has
- * generally in the case of !heapkeyspace indexes). They will need to be
- * updated if non-pivot tuples ever get taught to use INDEX_ALT_TID_MASK
- * for something else.
+ * assume that a tuple with INDEX_ALT_TID_MASK set must be a pivot tuple or
+ * non-pivot posting tuple, and that a tuple without INDEX_ALT_TID_MASK set
+ * must be a non-pivot tuple (or must have the same number of attributes as
+ * the index has generally in the case of !heapkeyspace indexes).
*/
#define INDEX_ALT_TID_MASK INDEX_AM_RESERVED_BIT
/* Item pointer offset bits */
#define BT_RESERVED_OFFSET_MASK 0xF000
#define BT_N_KEYS_OFFSET_MASK 0x0FFF
+#define BT_N_POSTING_OFFSET_MASK 0x0FFF
#define BT_HEAP_TID_ATTR 0x1000
+#define BT_IS_POSTING 0x2000
+
+/*
+ * MaxPostingIndexTuplesPerPage is an upper bound on the number of tuples
+ * that can fit on one btree leaf page.
+ *
+ * Btree leaf pages may contain posting tuples, which store duplicates
+ * more efficiently, so MaxPostingIndexTuplesPerPage is larger than
+ * MaxIndexTuplesPerPage.
+ *
+ * Each leaf page must contain at least three items, so estimate it as
+ * if we have three posting tuples with minimal size keys.
+ */
+#define MaxPostingIndexTuplesPerPage \
+ ((int) ((BLCKSZ - SizeOfPageHeaderData - \
+ 3*((MAXALIGN(sizeof(IndexTupleData) + 1) + sizeof(ItemIdData))) )) / \
+ (sizeof(ItemPointerData)))
+
+/*
+ * State used to represent a pending posting list during deduplication.
+ *
+ * Each entry represents a group of consecutive items from the page, starting
+ * from page offset number 'baseoff', which is the offset number of the "base"
+ * tuple on the page undergoing deduplication. 'nitems' is the total number
+ * of items from the page that will be merged to make a new posting tuple.
+ *
+ * Note: 'nitems' means the number of physical index tuples/line pointers on
+ * the page, starting with and including the item at offset number 'baseoff'
+ * (so nitems should be at least 2 when interval is used). These existing
+ * tuples may be posting list tuples or regular tuples.
+ */
+typedef struct BTDedupInterval
+{
+ OffsetNumber baseoff;
+ OffsetNumber nitems;
+} BTDedupInterval;
+
+/*
+ * Btree-private state needed to build posting tuples. htids is an array of
+ * ItemPointers for pending posting list.
+ *
+ * While iterating over tuples during an index build, or while deduplicating a
+ * single page, we remember a "base" tuple and compare each following tuple
+ * with it. If the tuples are equal, their TIDs are saved in the posting list.
+ */
+typedef struct BTDedupState
+{
+ /* Deduplication status info for entire page/operation */
+ bool deduplicate; /* Still deduplicating page? */
+ Size maxitemsize; /* BTMaxItemSize() limit for page */
+
+ /* Metadata about current pending posting list */
+ ItemPointer htids; /* Heap TIDs in pending posting list */
+ int nhtids; /* # valid heap TIDs in htids array */
+ int nitems; /* See BTDedupInterval definition */
+ Size alltupsize; /* Includes line pointer overhead */
+
+ /* Metadata about base tuple of current pending posting list */
+ IndexTuple base; /* Use to form new posting list */
+ OffsetNumber baseoff; /* original page offset of base */
+ Size basetupsize; /* base size without posting list */
+
+ /*
+ * Pending posting list. Contains information about a group of
+ * consecutive items that will be deduplicated by creating a new posting
+ * list tuple.
+ */
+ BTDedupInterval interval;
+} BTDedupState;
+
+/*
+ * N.B.: BTreeTupleIsPivot() should only be used in code that deals with
+ * heapkeyspace indexes specifically. BTreeTupleIsPosting() works with all
+ * nbtree indexes, though.
+ */
+#define BTreeTupleIsPivot(itup) \
+ ( \
+ ((itup)->t_info & INDEX_ALT_TID_MASK && \
+ ((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) == 0))\
+ )
+#define BTreeTupleIsPosting(itup) \
+ ( \
+ ((itup)->t_info & INDEX_ALT_TID_MASK && \
+ ((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) != 0))\
+ )
+
+#define BTreeTupleClearBtIsPosting(itup) \
+ do { \
+ ItemPointerSetOffsetNumber(&(itup)->t_tid, \
+ ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & ~BT_IS_POSTING); \
+ } while(0)
+
+#define BTreeTupleGetNPosting(itup) \
+ ( \
+ AssertMacro(BTreeTupleIsPosting(itup)), \
+ ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_N_POSTING_OFFSET_MASK \
+ )
+#define BTreeTupleSetNPosting(itup, n) \
+ do { \
+ ItemPointerSetOffsetNumber(&(itup)->t_tid, (n) & BT_N_POSTING_OFFSET_MASK); \
+ Assert((itup)->t_info & INDEX_ALT_TID_MASK); \
+ Assert(!((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) != 0)); \
+ ItemPointerSetOffsetNumber(&(itup)->t_tid, \
+ ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) | BT_IS_POSTING); \
+ } while(0)
+
+/*
+ * If tuple is posting, t_tid.ip_blkid contains offset of the posting list
+ */
+#define BTreeTupleGetPostingOffset(itup) \
+ ( \
+ AssertMacro(BTreeTupleIsPosting(itup)), \
+ ItemPointerGetBlockNumberNoCheck(&((itup)->t_tid)) \
+ )
+#define BTreeSetPostingMeta(itup, nposting, off) \
+ do { \
+ BTreeTupleSetNPosting(itup, nposting); \
+ Assert(BTreeTupleIsPosting(itup)); \
+ ItemPointerSetBlockNumber(&((itup)->t_tid), (off)); \
+ } while(0)
+
+#define BTreeTupleGetPosting(itup) \
+ (ItemPointer) ((char*) (itup) + BTreeTupleGetPostingOffset(itup))
+#define BTreeTupleGetPostingN(itup,n) \
+ (BTreeTupleGetPosting(itup) + (n))
-/* Get/set downlink block number */
+/* Get/set downlink block number */
#define BTreeInnerTupleGetDownLink(itup) \
ItemPointerGetBlockNumberNoCheck(&((itup)->t_tid))
#define BTreeInnerTupleSetDownLink(itup, blkno) \
@@ -326,40 +483,73 @@ typedef struct BTMetaPageData
*/
#define BTreeTupleGetNAtts(itup, rel) \
( \
- (itup)->t_info & INDEX_ALT_TID_MASK ? \
+ ((itup)->t_info & INDEX_ALT_TID_MASK && \
+ ((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) == 0)) ? \
( \
ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_N_KEYS_OFFSET_MASK \
) \
: \
IndexRelationGetNumberOfAttributes(rel) \
)
-#define BTreeTupleSetNAtts(itup, n) \
- do { \
- (itup)->t_info |= INDEX_ALT_TID_MASK; \
- ItemPointerSetOffsetNumber(&(itup)->t_tid, (n) & BT_N_KEYS_OFFSET_MASK); \
- } while(0)
+
+static inline void
+BTreeTupleSetNAtts(IndexTuple itup, int n)
+{
+ Assert(!BTreeTupleIsPosting(itup));
+ itup->t_info |= INDEX_ALT_TID_MASK;
+ ItemPointerSetOffsetNumber(&itup->t_tid, n & BT_N_KEYS_OFFSET_MASK);
+}
/*
- * Get tiebreaker heap TID attribute, if any. Macro works with both pivot
- * and non-pivot tuples, despite differences in how heap TID is represented.
+ * Get tiebreaker heap TID attribute, if any. Works with both pivot and
+ * non-pivot tuples, despite differences in how heap TID is represented.
+ *
+ * This returns the first/lowest heap TID in the case of a posting list tuple.
*/
-#define BTreeTupleGetHeapTID(itup) \
- ( \
- (itup)->t_info & INDEX_ALT_TID_MASK && \
- (ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_HEAP_TID_ATTR) != 0 ? \
- ( \
- (ItemPointer) (((char *) (itup) + IndexTupleSize(itup)) - \
- sizeof(ItemPointerData)) \
- ) \
- : (itup)->t_info & INDEX_ALT_TID_MASK ? NULL : (ItemPointer) &((itup)->t_tid) \
- )
+static inline ItemPointer
+BTreeTupleGetHeapTID(IndexTuple itup)
+{
+ if (BTreeTupleIsPivot(itup))
+ {
+ /* Pivot tuple heap TID representation? */
+ if ((ItemPointerGetOffsetNumberNoCheck(&itup->t_tid) &
+ BT_HEAP_TID_ATTR) != 0)
+ return (ItemPointer) ((char *) itup + IndexTupleSize(itup) -
+ sizeof(ItemPointerData));
+
+ /* Heap TID attribute was truncated */
+ return NULL;
+ }
+ else if (BTreeTupleIsPosting(itup))
+ return BTreeTupleGetPosting(itup);
+
+ return &(itup->t_tid);
+}
+
+/*
+ * Get maximum heap TID attribute, which could be the only TID in the case of
+ * a non-pivot tuple that does not have a posting list. Works with
+ * non-pivot tuples only.
+ */
+static inline ItemPointer
+BTreeTupleGetMaxTID(IndexTuple itup)
+{
+ Assert(!BTreeTupleIsPivot(itup));
+
+ if (BTreeTupleIsPosting(itup))
+ return (ItemPointer) (BTreeTupleGetPosting(itup) +
+ (BTreeTupleGetNPosting(itup) - 1));
+
+ return &(itup->t_tid);
+}
+
/*
* Set the heap TID attribute for a tuple that uses the INDEX_ALT_TID_MASK
- * representation (currently limited to pivot tuples)
+ * representation
*/
#define BTreeTupleSetAltHeapTID(itup) \
do { \
- Assert((itup)->t_info & INDEX_ALT_TID_MASK); \
+ Assert(BTreeTupleIsPivot(itup)); \
ItemPointerSetOffsetNumber(&(itup)->t_tid, \
ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) | BT_HEAP_TID_ATTR); \
} while(0)
@@ -500,6 +690,13 @@ typedef struct BTInsertStateData
Buffer buf;
/*
+ * If _bt_binsrch_insert() found the location inside existing posting
+ * list, save the position inside the list. This will be -1 in rare cases
+ * where the overlapping posting list is LP_DEAD.
+ */
+ int postingoff;
+
+ /*
* Cache of bounds within the current buffer. Only used for insertions
* where _bt_check_unique is called. See _bt_binsrch_insert and
* _bt_findinsertloc for details.
@@ -534,7 +731,9 @@ typedef BTInsertStateData *BTInsertState;
* If we are doing an index-only scan, we save the entire IndexTuple for each
* matched item, otherwise only its heap TID and offset. The IndexTuples go
* into a separate workspace array; each BTScanPosItem stores its tuple's
- * offset within that array.
+ * offset within that array. Posting list tuples store a version of the
+ * tuple that does not include the posting list, allowing the same key to be
+ * returned for each logical tuple associated with the posting list.
*/
typedef struct BTScanPosItem /* what we remember about each match */
@@ -563,9 +762,13 @@ typedef struct BTScanPosData
/*
* If we are doing an index-only scan, nextTupleOffset is the first free
- * location in the associated tuple storage workspace.
+ * location in the associated tuple storage workspace. Posting list
+ * tuples need postingTupleOffset to store the current location of the
+ * tuple that is returned multiple times (once per heap TID in posting
+ * list).
*/
int nextTupleOffset;
+ int postingTupleOffset;
/*
* The items array is always ordered in index order (ie, increasing
@@ -578,7 +781,7 @@ typedef struct BTScanPosData
int lastItem; /* last valid index in items[] */
int itemIndex; /* current index in items[] */
- BTScanPosItem items[MaxIndexTuplesPerPage]; /* MUST BE LAST */
+ BTScanPosItem items[MaxPostingIndexTuplesPerPage]; /* MUST BE LAST */
} BTScanPosData;
typedef BTScanPosData *BTScanPos;
@@ -730,8 +933,14 @@ extern void _bt_parallel_advance_array_keys(IndexScanDesc scan);
*/
extern bool _bt_doinsert(Relation rel, IndexTuple itup,
IndexUniqueCheck checkUnique, Relation heapRel);
+extern IndexTuple _bt_posting_split(IndexTuple newitem, IndexTuple oposting,
+ OffsetNumber postingoff);
extern void _bt_finish_split(Relation rel, Buffer bbuf, BTStack stack);
extern Buffer _bt_getstackbuf(Relation rel, BTStack stack, BlockNumber child);
+extern void _bt_dedup_start_pending(BTDedupState *state, IndexTuple base,
+ OffsetNumber base_off);
+extern bool _bt_dedup_save_htid(BTDedupState *state, IndexTuple itup);
+extern Size _bt_dedup_finish_pending(Buffer buffer, BTDedupState *state, bool need_wal);
/*
* prototypes for functions in nbtsplitloc.c
@@ -762,6 +971,8 @@ extern void _bt_delitems_delete(Relation rel, Buffer buf,
OffsetNumber *itemnos, int nitems, Relation heapRel);
extern void _bt_delitems_vacuum(Relation rel, Buffer buf,
OffsetNumber *itemnos, int nitems,
+ OffsetNumber *updateitemnos,
+ IndexTuple *updated, int nupdateable,
BlockNumber lastBlockVacuumed);
extern int _bt_pagedel(Relation rel, Buffer buf);
@@ -812,6 +1023,8 @@ extern bool _bt_check_natts(Relation rel, bool heapkeyspace, Page page,
OffsetNumber offnum);
extern void _bt_check_third_page(Relation rel, Relation heap,
bool needheaptidspace, Page page, IndexTuple newtup);
+extern IndexTuple BTreeFormPostingTuple(IndexTuple tuple, ItemPointer htids,
+ int nhtids);
/*
* prototypes for functions in nbtvalidate.c
diff --git a/src/include/access/nbtxlog.h b/src/include/access/nbtxlog.h
index 91b9ee0..ebb39de 100644
--- a/src/include/access/nbtxlog.h
+++ b/src/include/access/nbtxlog.h
@@ -28,7 +28,8 @@
#define XLOG_BTREE_INSERT_META 0x20 /* same, plus update metapage */
#define XLOG_BTREE_SPLIT_L 0x30 /* add index tuple with split */
#define XLOG_BTREE_SPLIT_R 0x40 /* as above, new item on right */
-/* 0x50 and 0x60 are unused */
+#define XLOG_BTREE_DEDUP_PAGE 0x50 /* deduplicate tuples on leaf page */
+/* 0x60 is unused */
#define XLOG_BTREE_DELETE 0x70 /* delete leaf index tuples for a page */
#define XLOG_BTREE_UNLINK_PAGE 0x80 /* delete a half-dead page */
#define XLOG_BTREE_UNLINK_PAGE_META 0x90 /* same, and update metapage */
@@ -61,16 +62,21 @@ typedef struct xl_btree_metadata
* This data record is used for INSERT_LEAF, INSERT_UPPER, INSERT_META.
* Note that INSERT_META implies it's not a leaf page.
*
- * Backup Blk 0: original page (data contains the inserted tuple)
+ * Backup Blk 0: original page (data contains the inserted tuple);
+ * if postingoff is set, this started out as an insertion
+ * into an existing posting tuple at the offset before
+ * offnum (i.e. it's a posting list split). (REDO will
+ * have to update split posting list, too.)
* Backup Blk 1: child's left sibling, if INSERT_UPPER or INSERT_META
* Backup Blk 2: xl_btree_metadata, if INSERT_META
*/
typedef struct xl_btree_insert
{
OffsetNumber offnum;
+ OffsetNumber postingoff;
} xl_btree_insert;
-#define SizeOfBtreeInsert (offsetof(xl_btree_insert, offnum) + sizeof(OffsetNumber))
+#define SizeOfBtreeInsert (offsetof(xl_btree_insert, postingoff) + sizeof(OffsetNumber))
/*
* On insert with split, we save all the items going into the right sibling
@@ -91,9 +97,19 @@ typedef struct xl_btree_insert
*
* Backup Blk 0: original page / new left page
*
- * The left page's data portion contains the new item, if it's the _L variant.
- * An IndexTuple representing the high key of the left page must follow with
- * either variant.
+ * The left page's data portion contains the new item, if it's the _L variant
+ * (though _R variant page split records with a posting list split sometimes
+ * need to include newitem). An IndexTuple representing the high key of the
+ * left page must follow in all cases.
+ *
+ * The newitem is actually an "original" newitem when a posting list split
+ * occurs that requires that the original posting list be updated in passing.
+ * Recovery recognizes this case when postingoff is set, and must use the
+ * posting offset to do an in-place update of the existing posting list that
+ * was actually split, and change the newitem to the "final" newitem. This
+ * corresponds to the xl_btree_insert postingoff-is-set case. postingoff
+ * won't be set when a posting list split occurs where both original posting
+ * list and newitem go on the right page.
*
* Backup Blk 1: new right page
*
@@ -111,10 +127,26 @@ typedef struct xl_btree_split
{
uint32 level; /* tree level of page being split */
OffsetNumber firstright; /* first item moved to right page */
- OffsetNumber newitemoff; /* new item's offset (useful for _L variant) */
+ OffsetNumber newitemoff; /* new item's offset */
+ OffsetNumber postingoff; /* offset inside orig posting tuple */
} xl_btree_split;
-#define SizeOfBtreeSplit (offsetof(xl_btree_split, newitemoff) + sizeof(OffsetNumber))
+#define SizeOfBtreeSplit (offsetof(xl_btree_split, postingoff) + sizeof(OffsetNumber))
+
+/*
+ * When page is deduplicated, consecutive groups of tuples with equal keys are
+ * merged together into posting list tuples.
+ *
+ * The WAL record represents the interval that describes the posting tuple
+ * that should be added to the page.
+ */
+typedef struct xl_btree_dedup
+{
+ OffsetNumber baseoff;
+ OffsetNumber nitems;
+} xl_btree_dedup;
+
+#define SizeOfBtreeDedup (offsetof(xl_btree_dedup, nitems) + sizeof(OffsetNumber))
/*
* This is what we need to know about delete of individual leaf index tuples.
@@ -166,16 +198,27 @@ typedef struct xl_btree_reuse_page
* block numbers aren't given.
*
* Note that the *last* WAL record in any vacuum of an index is allowed to
- * have a zero length array of offsets. Earlier records must have at least one.
+ * have a zero length array of target offsets (i.e. no deletes or updates).
+ * Earlier records must have at least one.
*/
typedef struct xl_btree_vacuum
{
BlockNumber lastBlockVacuumed;
- /* TARGET OFFSET NUMBERS FOLLOW */
+ /*
+ * This field helps us find the beginning of the updated versions of
+ * tuples, which follow the array of offset numbers. It is needed when a
+ * posting list is vacuumed without killing all of its logical tuples.
+ */
+ uint32 nupdated;
+ uint32 ndeleted;
+
+ /* UPDATED TARGET OFFSET NUMBERS FOLLOW (if any) */
+ /* UPDATED TUPLES TO ADD BACK FOLLOW (if any) */
+ /* DELETED TARGET OFFSET NUMBERS FOLLOW (if any) */
} xl_btree_vacuum;
-#define SizeOfBtreeVacuum (offsetof(xl_btree_vacuum, lastBlockVacuumed) + sizeof(BlockNumber))
+#define SizeOfBtreeVacuum (offsetof(xl_btree_vacuum, ndeleted) + sizeof(uint32))
/*
* This is what we need to know about marking an empty branch for deletion.
@@ -256,6 +299,8 @@ typedef struct xl_btree_newroot
extern void btree_redo(XLogReaderState *record);
extern void btree_desc(StringInfo buf, XLogReaderState *record);
extern const char *btree_identify(uint8 info);
+extern void btree_xlog_startup(void);
+extern void btree_xlog_cleanup(void);
extern void btree_mask(char *pagedata, BlockNumber blkno);
#endif /* NBTXLOG_H */
diff --git a/src/include/access/rmgrlist.h b/src/include/access/rmgrlist.h
index 3c0db2c..2b8c6c7 100644
--- a/src/include/access/rmgrlist.h
+++ b/src/include/access/rmgrlist.h
@@ -36,7 +36,7 @@ PG_RMGR(RM_RELMAP_ID, "RelMap", relmap_redo, relmap_desc, relmap_identify, NULL,
PG_RMGR(RM_STANDBY_ID, "Standby", standby_redo, standby_desc, standby_identify, NULL, NULL, NULL)
PG_RMGR(RM_HEAP2_ID, "Heap2", heap2_redo, heap2_desc, heap2_identify, NULL, NULL, heap_mask)
PG_RMGR(RM_HEAP_ID, "Heap", heap_redo, heap_desc, heap_identify, NULL, NULL, heap_mask)
-PG_RMGR(RM_BTREE_ID, "Btree", btree_redo, btree_desc, btree_identify, NULL, NULL, btree_mask)
+PG_RMGR(RM_BTREE_ID, "Btree", btree_redo, btree_desc, btree_identify, btree_xlog_startup, btree_xlog_cleanup, btree_mask)
PG_RMGR(RM_HASH_ID, "Hash", hash_redo, hash_desc, hash_identify, NULL, NULL, hash_mask)
PG_RMGR(RM_GIN_ID, "Gin", gin_redo, gin_desc, gin_identify, gin_xlog_startup, gin_xlog_cleanup, gin_mask)
PG_RMGR(RM_GIST_ID, "Gist", gist_redo, gist_desc, gist_identify, gist_xlog_startup, gist_xlog_cleanup, gist_mask)
diff --git a/src/tools/valgrind.supp b/src/tools/valgrind.supp
index ec47a22..71a03e3 100644
--- a/src/tools/valgrind.supp
+++ b/src/tools/valgrind.supp
@@ -212,3 +212,24 @@
Memcheck:Cond
fun:PyObject_Realloc
}
+
+# Temporarily work around bug in datum_image_eq's handling of the cstring
+# (typLen == -2) case. datumIsEqual() is not affected, but also doesn't handle
+# TOAST'ed values correctly.
+#
+# FIXME: Remove both suppressions when bug is fixed on master branch
+{
+ temporary_workaround_1
+ Memcheck:Addr1
+ fun:bcmp
+ fun:datum_image_eq
+ fun:_bt_keep_natts_fast
+}
+
+{
+ temporary_workaround_8
+ Memcheck:Addr8
+ fun:bcmp
+ fun:datum_image_eq
+ fun:_bt_keep_natts_fast
+}
On Wed, Sep 25, 2019 at 8:05 AM Anastasia Lubennikova
<a.lubennikova@postgrespro.ru> wrote:
Attached is v18. In this version _bt_dedup_one_page() is refactored so that:
- no temp page is used; all updates are applied to the original page.
- each posting tuple is WAL-logged separately.
This also made it possible to simplify btree_xlog_dedup significantly.
This looks great! Even if it isn't faster than using a temp page
buffer, the flexibility seems like an important advantage. We can do
things like have the _bt_dedup_one_page() caller hint that
deduplication should start at a particular offset number. If that
doesn't work out by the time the end of the page is reached (whatever
"works out" may mean), then we can just start at the beginning of the
page, and work through the items we skipped over initially.
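To make the "start at a hinted offset, then come back for what was skipped" idea concrete, here is a minimal standalone sketch in plain C. It is not nbtree code: dedup_visit_order(), its parameters, and the plain int offsets are all hypothetical, and it only shows the visiting order, not any actual merging of tuples.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical sketch, not nbtree code. Fills 'order' with the sequence in
 * which page offsets would be visited when a pass starts at 'hint' and then
 * wraps around to the items that were skipped over initially. Offsets are
 * 1-based, in the spirit of OffsetNumber. Returns the number of offsets
 * written; 'order' must have room for (maxoff - minoff + 1) entries.
 */
static int
dedup_visit_order(int minoff, int maxoff, int hint, int *order)
{
    int n = 0;

    assert(order != NULL && minoff <= maxoff);

    /* An out-of-range hint degrades to a plain front-to-back pass */
    if (hint < minoff || hint > maxoff)
        hint = minoff;

    /* First pass: from the hinted offset to the end of the page */
    for (int off = hint; off <= maxoff; off++)
        order[n++] = off;

    /* Second pass: the items before the hint that were skipped */
    for (int off = minoff; off < hint; off++)
        order[n++] = off;

    return n;
}
```

With minoff = 1, maxoff = 6 and hint = 4, the offsets come out as 4, 5, 6, 1, 2, 3. A real implementation could stop after the first pass if enough space had already been freed; that early exit is left out here because "works out" is still undefined.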
We still haven't added an "off" switch to deduplication, which seems
necessary. I suppose that this should look like GIN's "fastupdate"
storage parameter.
Why is it necessary to save this information anywhere other than rel->rd_options,
when we can easily access that field from _bt_findinsertloc() and
_bt_load()?
Maybe, but we also need to access a flag that says it's safe to use
deduplication. Obviously deduplication is not safe for datatypes like
numeric and text with a nondeterministic collation. The "is
deduplication safe for this index?" mechanism will probably work by
doing several catalog lookups. This doesn't seem like something we
want to do very often, especially with a buffer lock held -- ideally
it will be somewhere that's convenient to access.
Do we want to do that separately, and have a storage parameter that
says "I would like to use deduplication in principle, if it's safe"?
Or, do we store both pieces of information together, and forbid
setting the storage parameter to on when it's known to be unsafe for
the underlying opclasses used by the index? I don't know.
I think that you can start working on this without knowing exactly how
we'll do those catalog lookups. What you come up with has to work with
that before the patch can be committed, though.
--
Peter Geoghegan
25.09.2019 22:14, Peter Geoghegan wrote:
We still haven't added an "off" switch to deduplication, which seems
necessary. I suppose that this should look like GIN's "fastupdate"
storage parameter.
Why is it necessary to save this information anywhere other than rel->rd_options,
when we can easily access that field from _bt_findinsertloc() and
_bt_load()?
Maybe, but we also need to access a flag that says it's safe to use
deduplication. Obviously deduplication is not safe for datatypes like
numeric and text with a nondeterministic collation. The "is
deduplication safe for this index?" mechanism will probably work by
doing several catalog lookups. This doesn't seem like something we
want to do very often, especially with a buffer lock held -- ideally
it will be somewhere that's convenient to access.
Do we want to do that separately, and have a storage parameter that
says "I would like to use deduplication in principle, if it's safe"?
Or, do we store both pieces of information together, and forbid
setting the storage parameter to on when it's known to be unsafe for
the underlying opclasses used by the index? I don't know.
I think that you can start working on this without knowing exactly how
we'll do those catalog lookups. What you come up with has to work with
that before the patch can be committed, though.
Attached is v19.
* It adds a new btree reloption, "deduplication".
I decided to refactor the code and move BtreeOptions into a separate
structure, rather than adding a new btree-specific value to StdRdOptions.
For now the option can be set even for indexes that do not support
deduplication; in that case it is ignored. Should we add this check to
option validation?
* By default deduplication is on for non-unique indexes and off for
unique ones.
* New function _bt_dedup_is_possible() is intended to be the single place
that performs all the checks. For now it's just a stub to ensure that it works.
Is there a way to extract this from existing opclass information,
or do we need to add a new opclass field? Have you already started this work?
I recall there was another thread, but I didn't manage to find it.
* I also integrated your latest patch, which enables deduplication on
unique indexes, into this version,
since it can now be easily switched on/off.
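The rules listed above could be combined roughly as in the following standalone sketch. It is one reading of the description, not code from the patch: DedupOption, use_deduplication() and the dedup_possible argument (standing in for the result of _bt_dedup_is_possible()) are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical tri-state for the "deduplication" reloption */
typedef enum DedupOption
{
    DEDUP_OPT_UNSET,    /* user did not set the reloption */
    DEDUP_OPT_OFF,
    DEDUP_OPT_ON
} DedupOption;

/*
 * Decide whether deduplication should be applied to an index.
 *
 * 'dedup_possible' stands in for the result of _bt_dedup_is_possible().
 * When it is false the reloption is simply ignored, as described above.
 */
static bool
use_deduplication(DedupOption opt, bool is_unique, bool dedup_possible)
{
    assert(opt == DEDUP_OPT_UNSET || opt == DEDUP_OPT_OFF ||
           opt == DEDUP_OPT_ON);

    /* The index cannot support deduplication: ignore the setting */
    if (!dedup_possible)
        return false;

    /* Default: on for non-unique indexes, off for unique ones */
    if (opt == DEDUP_OPT_UNSET)
        return !is_unique;

    /* An explicit setting overrides the default */
    return opt == DEDUP_OPT_ON;
}
```

If the check were moved into option validation instead, the !dedup_possible branch would become an error raised when the reloption is set rather than a silent "off".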
--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachments:
v19-0001-Add-deduplication-to-nbtree.patch (text/x-patch)
diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index 05e7d67..d65e2a7 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -145,6 +145,7 @@ static void bt_tuple_present_callback(Relation index, HeapTuple htup,
bool tupleIsAlive, void *checkstate);
static IndexTuple bt_normalize_tuple(BtreeCheckState *state,
IndexTuple itup);
+static inline IndexTuple bt_posting_logical_tuple(IndexTuple itup, int n);
static bool bt_rootdescend(BtreeCheckState *state, IndexTuple itup);
static inline bool offset_is_negative_infinity(BTPageOpaque opaque,
OffsetNumber offset);
@@ -419,12 +420,13 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
/*
* Size Bloom filter based on estimated number of tuples in index,
* while conservatively assuming that each block must contain at least
- * MaxIndexTuplesPerPage / 5 non-pivot tuples. (Non-leaf pages cannot
- * contain non-pivot tuples. That's okay because they generally make
- * up no more than about 1% of all pages in the index.)
+ * MaxPostingIndexTuplesPerPage / 3 "logical" tuples. heapallindexed
+ * verification fingerprints posting list heap TIDs as plain non-pivot
+ * tuples, complete with index keys. This allows its heap scan to
+ * behave as if posting lists do not exist.
*/
total_pages = RelationGetNumberOfBlocks(rel);
- total_elems = Max(total_pages * (MaxIndexTuplesPerPage / 5),
+ total_elems = Max(total_pages * (MaxPostingIndexTuplesPerPage / 3),
(int64) state->rel->rd_rel->reltuples);
/* Random seed relies on backend srandom() call to avoid repetition */
seed = random();
@@ -924,6 +926,7 @@ bt_target_page_check(BtreeCheckState *state)
size_t tupsize;
BTScanInsert skey;
bool lowersizelimit;
+ ItemPointer scantid;
CHECK_FOR_INTERRUPTS();
@@ -994,29 +997,73 @@ bt_target_page_check(BtreeCheckState *state)
/*
* Readonly callers may optionally verify that non-pivot tuples can
- * each be found by an independent search that starts from the root
+ * each be found by an independent search that starts from the root.
+ * Note that we deliberately don't do individual searches for each
+ * "logical" posting list tuple, since the posting list itself is
+ * validated by other checks.
*/
if (state->rootdescend && P_ISLEAF(topaque) &&
!bt_rootdescend(state, itup))
{
char *itid,
*htid;
+ ItemPointer tid = BTreeTupleGetHeapTID(itup);
itid = psprintf("(%u,%u)", state->targetblock, offset);
htid = psprintf("(%u,%u)",
- ItemPointerGetBlockNumber(&(itup->t_tid)),
- ItemPointerGetOffsetNumber(&(itup->t_tid)));
+ ItemPointerGetBlockNumber(tid),
+ ItemPointerGetOffsetNumber(tid));
ereport(ERROR,
(errcode(ERRCODE_INDEX_CORRUPTED),
errmsg("could not find tuple using search from root page in index \"%s\"",
RelationGetRelationName(state->rel)),
- errdetail_internal("Index tid=%s points to heap tid=%s page lsn=%X/%X.",
+ errdetail_internal("Index tid=%s min heap tid=%s page lsn=%X/%X.",
itid, htid,
(uint32) (state->targetlsn >> 32),
(uint32) state->targetlsn)));
}
+ /*
+ * If tuple is actually a posting list, make sure posting list TIDs
+ * are in order.
+ */
+ if (BTreeTupleIsPosting(itup))
+ {
+ ItemPointerData last;
+ ItemPointer current;
+
+ ItemPointerCopy(BTreeTupleGetHeapTID(itup), &last);
+
+ for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
+ {
+
+ current = BTreeTupleGetPostingN(itup, i);
+
+ if (ItemPointerCompare(current, &last) <= 0)
+ {
+ char *itid,
+ *htid;
+
+ itid = psprintf("(%u,%u)", state->targetblock, offset);
+ htid = psprintf("(%u,%u)",
+ ItemPointerGetBlockNumberNoCheck(current),
+ ItemPointerGetOffsetNumberNoCheck(current));
+
+ ereport(ERROR,
+ (errcode(ERRCODE_INDEX_CORRUPTED),
+ errmsg("posting list heap TIDs out of order in index \"%s\"",
+ RelationGetRelationName(state->rel)),
+ errdetail_internal("Index tid=%s min heap tid=%s page lsn=%X/%X.",
+ itid, htid,
+ (uint32) (state->targetlsn >> 32),
+ (uint32) state->targetlsn)));
+ }
+
+ ItemPointerCopy(current, &last);
+ }
+ }
+
/* Build insertion scankey for current page offset */
skey = bt_mkscankey_pivotsearch(state->rel, itup);
@@ -1074,12 +1121,32 @@ bt_target_page_check(BtreeCheckState *state)
{
IndexTuple norm;
- norm = bt_normalize_tuple(state, itup);
- bloom_add_element(state->filter, (unsigned char *) norm,
- IndexTupleSize(norm));
- /* Be tidy */
- if (norm != itup)
- pfree(norm);
+ if (BTreeTupleIsPosting(itup))
+ {
+ /* Fingerprint all elements as distinct "logical" tuples */
+ for (int i = 0; i < BTreeTupleGetNPosting(itup); i++)
+ {
+ IndexTuple logtuple;
+
+ logtuple = bt_posting_logical_tuple(itup, i);
+ norm = bt_normalize_tuple(state, logtuple);
+ bloom_add_element(state->filter, (unsigned char *) norm,
+ IndexTupleSize(norm));
+ /* Be tidy */
+ if (norm != logtuple)
+ pfree(norm);
+ pfree(logtuple);
+ }
+ }
+ else
+ {
+ norm = bt_normalize_tuple(state, itup);
+ bloom_add_element(state->filter, (unsigned char *) norm,
+ IndexTupleSize(norm));
+ /* Be tidy */
+ if (norm != itup)
+ pfree(norm);
+ }
}
/*
@@ -1087,7 +1154,8 @@ bt_target_page_check(BtreeCheckState *state)
*
* If there is a high key (if this is not the rightmost page on its
* entire level), check that high key actually is upper bound on all
- * page items.
+ * page items. If this is a posting list tuple, we'll need to set
+ * scantid to be highest TID in posting list.
*
* We prefer to check all items against high key rather than checking
* just the last and trusting that the operator class obeys the
@@ -1127,6 +1195,9 @@ bt_target_page_check(BtreeCheckState *state)
* tuple. (See also: "Notes About Data Representation" in the nbtree
* README.)
*/
+ scantid = skey->scantid;
+ if (state->heapkeyspace && !BTreeTupleIsPivot(itup))
+ skey->scantid = BTreeTupleGetMaxTID(itup);
if (!P_RIGHTMOST(topaque) &&
!(P_ISLEAF(topaque) ? invariant_leq_offset(state, skey, P_HIKEY) :
invariant_l_offset(state, skey, P_HIKEY)))
@@ -1150,6 +1221,7 @@ bt_target_page_check(BtreeCheckState *state)
(uint32) (state->targetlsn >> 32),
(uint32) state->targetlsn)));
}
+ skey->scantid = scantid;
/*
* * Item order check *
@@ -1164,11 +1236,13 @@ bt_target_page_check(BtreeCheckState *state)
*htid,
*nitid,
*nhtid;
+ ItemPointer tid;
itid = psprintf("(%u,%u)", state->targetblock, offset);
+ tid = BTreeTupleGetHeapTID(itup);
htid = psprintf("(%u,%u)",
- ItemPointerGetBlockNumberNoCheck(&(itup->t_tid)),
- ItemPointerGetOffsetNumberNoCheck(&(itup->t_tid)));
+ ItemPointerGetBlockNumberNoCheck(tid),
+ ItemPointerGetOffsetNumberNoCheck(tid));
nitid = psprintf("(%u,%u)", state->targetblock,
OffsetNumberNext(offset));
@@ -1177,9 +1251,11 @@ bt_target_page_check(BtreeCheckState *state)
state->target,
OffsetNumberNext(offset));
itup = (IndexTuple) PageGetItem(state->target, itemid);
+
+ tid = BTreeTupleGetHeapTID(itup);
nhtid = psprintf("(%u,%u)",
- ItemPointerGetBlockNumberNoCheck(&(itup->t_tid)),
- ItemPointerGetOffsetNumberNoCheck(&(itup->t_tid)));
+ ItemPointerGetBlockNumberNoCheck(tid),
+ ItemPointerGetOffsetNumberNoCheck(tid));
ereport(ERROR,
(errcode(ERRCODE_INDEX_CORRUPTED),
@@ -1189,10 +1265,10 @@ bt_target_page_check(BtreeCheckState *state)
"higher index tid=%s (points to %s tid=%s) "
"page lsn=%X/%X.",
itid,
- P_ISLEAF(topaque) ? "heap" : "index",
+ P_ISLEAF(topaque) ? "min heap" : "index",
htid,
nitid,
- P_ISLEAF(topaque) ? "heap" : "index",
+ P_ISLEAF(topaque) ? "min heap" : "index",
nhtid,
(uint32) (state->targetlsn >> 32),
(uint32) state->targetlsn)));
@@ -1953,10 +2029,10 @@ bt_tuple_present_callback(Relation index, HeapTuple htup, Datum *values,
* verification. In particular, it won't try to normalize opclass-equal
* datums with potentially distinct representations (e.g., btree/numeric_ops
* index datums will not get their display scale normalized-away here).
- * Normalization may need to be expanded to handle more cases in the future,
- * though. For example, it's possible that non-pivot tuples could in the
- * future have alternative logically equivalent representations due to using
- * the INDEX_ALT_TID_MASK bit to implement intelligent deduplication.
+ * Caller does normalization for non-pivot tuples that have a posting list,
+ * since dummy CREATE INDEX callback code generates new tuples with the same
+ * normalized representation. Deduplication is performed opportunistically,
+ * and in general there is no guarantee about how or when it will be applied.
*/
static IndexTuple
bt_normalize_tuple(BtreeCheckState *state, IndexTuple itup)
@@ -1969,6 +2045,9 @@ bt_normalize_tuple(BtreeCheckState *state, IndexTuple itup)
IndexTuple reformed;
int i;
+ /* Caller should only pass "logical" non-pivot tuples here */
+ Assert(!BTreeTupleIsPosting(itup) && !BTreeTupleIsPivot(itup));
+
/* Easy case: It's immediately clear that tuple has no varlena datums */
if (!IndexTupleHasVarwidths(itup))
return itup;
@@ -2032,6 +2111,30 @@ bt_normalize_tuple(BtreeCheckState *state, IndexTuple itup)
}
/*
+ * Produce palloc()'d "logical" tuple for nth posting list entry.
+ *
+ * In general, deduplication is not supposed to change the logical contents of
+ * an index. Multiple logical index tuples are folded together into one
+ * physical posting list index tuple when convenient.
+ *
+ * heapallindexed verification must normalize-away this variation in
+ * representation by converting posting list tuples into two or more "logical"
+ * tuples. Each logical tuple must be fingerprinted separately -- there must
+ * be one logical tuple for each corresponding Bloom filter probe during the
+ * heap scan.
+ *
+ * Note: Caller needs to call bt_normalize_tuple() with returned tuple.
+ */
+static inline IndexTuple
+bt_posting_logical_tuple(IndexTuple itup, int n)
+{
+ Assert(BTreeTupleIsPosting(itup));
+
+ /* Returns non-posting-list tuple */
+ return BTreeFormPostingTuple(itup, BTreeTupleGetPostingN(itup, n), 1);
+}
+
+/*
* Search for itup in index, starting from fast root page. itup must be a
* non-pivot tuple. This is only supported with heapkeyspace indexes, since
* we rely on having fully unique keys to find a match with only a single
@@ -2087,6 +2190,7 @@ bt_rootdescend(BtreeCheckState *state, IndexTuple itup)
insertstate.itup = itup;
insertstate.itemsz = MAXALIGN(IndexTupleSize(itup));
insertstate.itup_key = key;
+ insertstate.postingoff = 0;
insertstate.bounds_valid = false;
insertstate.buf = lbuf;
@@ -2094,7 +2198,9 @@ bt_rootdescend(BtreeCheckState *state, IndexTuple itup)
offnum = _bt_binsrch_insert(state->rel, &insertstate);
/* Compare first >= matching item on leaf page, if any */
page = BufferGetPage(lbuf);
+ /* Should match on first heap TID when tuple has a posting list */
if (offnum <= PageGetMaxOffsetNumber(page) &&
+ insertstate.postingoff <= 0 &&
_bt_compare(state->rel, key, page, offnum) == 0)
exists = true;
_bt_relbuf(state->rel, lbuf);
@@ -2560,14 +2666,18 @@ static inline ItemPointer
BTreeTupleGetHeapTIDCareful(BtreeCheckState *state, IndexTuple itup,
bool nonpivot)
{
- ItemPointer result = BTreeTupleGetHeapTID(itup);
+ ItemPointer result;
BlockNumber targetblock = state->targetblock;
- if (result == NULL && nonpivot)
+ /* Shouldn't be called with a !heapkeyspace index */
+ Assert(state->heapkeyspace);
+ if (BTreeTupleIsPivot(itup) == nonpivot)
ereport(ERROR,
(errcode(ERRCODE_INDEX_CORRUPTED),
errmsg("block %u or its right sibling block or child block in index \"%s\" contains non-pivot tuple that lacks a heap TID",
targetblock, RelationGetRelationName(state->rel))));
+ result = BTreeTupleGetHeapTID(itup);
+
return result;
}
diff --git a/src/backend/access/common/reloptions.c b/src/backend/access/common/reloptions.c
index 20f4ed3..3fdf3a5 100644
--- a/src/backend/access/common/reloptions.c
+++ b/src/backend/access/common/reloptions.c
@@ -158,6 +158,15 @@ static relopt_bool boolRelOpts[] =
},
true
},
+ {
+ {
+ "deduplication",
+ "Enables deduplication on btree index leaf pages",
+ RELOPT_KIND_BTREE,
+ ShareUpdateExclusiveLock
+ },
+ true
+ },
/* list terminator */
{{NULL}}
};
@@ -1407,8 +1416,6 @@ default_reloptions(Datum reloptions, bool validate, relopt_kind kind)
offsetof(StdRdOptions, user_catalog_table)},
{"parallel_workers", RELOPT_TYPE_INT,
offsetof(StdRdOptions, parallel_workers)},
- {"vacuum_cleanup_index_scale_factor", RELOPT_TYPE_REAL,
- offsetof(StdRdOptions, vacuum_cleanup_index_scale_factor)},
{"vacuum_index_cleanup", RELOPT_TYPE_BOOL,
offsetof(StdRdOptions, vacuum_index_cleanup)},
{"vacuum_truncate", RELOPT_TYPE_BOOL,
diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
index 2599b5d..6e1dc59 100644
--- a/src/backend/access/index/genam.c
+++ b/src/backend/access/index/genam.c
@@ -276,6 +276,10 @@ BuildIndexValueDescription(Relation indexRelation,
/*
* Get the latestRemovedXid from the table entries pointed at by the index
* tuples being deleted.
+ *
+ * Note: index access methods that don't consistently use the standard
+ * IndexTuple + heap TID item pointer representation will need to provide
+ * their own version of this function.
*/
TransactionId
index_compute_xid_horizon_for_tuples(Relation irel,
diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index 6db203e..54cb9db 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -432,7 +432,10 @@ because we allow LP_DEAD to be set with only a share lock (it's exactly
like a hint bit for a heap tuple), but physically removing tuples requires
exclusive lock. In the current code we try to remove LP_DEAD tuples when
we are otherwise faced with having to split a page to do an insertion (and
-hence have exclusive lock on it already).
+hence have exclusive lock on it already). Deduplication can also prevent
+a page split, but removing LP_DEAD tuples is the preferred approach.
+(Note that posting list tuples can only have their LP_DEAD bit set when
+every "logical" tuple represented within the posting list is known dead.)
This leaves the index in a state where it has no entry for a dead tuple
that still exists in the heap. This is not a problem for the current
@@ -710,6 +713,75 @@ the fallback strategy assumes that duplicates are mostly inserted in
ascending heap TID order. The page is split in a way that leaves the left
half of the page mostly full, and the right half of the page mostly empty.
+Notes about deduplication
+-------------------------
+
+We deduplicate non-pivot tuples in non-unique indexes to reduce storage
+overhead, and to avoid or at least delay page splits. Deduplication alters
+the physical representation of tuples without changing the logical contents
+of the index, and without adding overhead to read queries. Non-pivot
+tuples are folded together into a single physical tuple with a posting list
+(a simple array of heap TIDs with the standard item pointer format).
+Deduplication is always applied lazily, at the point where it would
+otherwise be necessary to perform a page split. It occurs only after
+LP_DEAD items have been removed, as our last line of defense against
+splitting a leaf page. We can set the LP_DEAD bit with posting list
+tuples, though only when all table tuples are known dead. (Bitmap scans
+cannot perform LP_DEAD bit setting, and are the common case with indexes
+that contain lots of duplicates, so this downside is considered
+acceptable.)
+
+Large groups of logical duplicates tend to appear together on the same leaf
+page due to the special duplicate logic used when choosing a split point.
+This facilitates lazy/dynamic deduplication. Deduplication can reliably
+deduplicate a large localized group of duplicates before it can span
+multiple leaf pages. Posting list tuples are subject to the same 1/3 of a
+page restriction as any other tuple.
+
+Lazy deduplication allows the page space accounting used during page splits
+to have absolutely minimal special case logic for posting lists. A posting
+list can be thought of as extra payload that suffix truncation will
+reliably truncate away as needed during page splits, just like non-key
+columns from an INCLUDE index tuple. An incoming tuple (which might cause
+a page split) can always be thought of as a non-posting-list tuple that
+must be inserted alongside existing items, without needing to consider
+deduplication. Most of the time, that's what actually happens: incoming
+tuples are either not duplicates, or are duplicates with a heap TID that
+doesn't overlap with any existing posting list tuple. When the incoming
+tuple really does overlap with an existing posting list, a posting list
+split is performed. Posting list splits work in a way that more or less
+preserves the illusion that all incoming tuples do not need to be merged
+with any existing posting list tuple.
+
+Posting list splits work by "overriding" the details of the incoming tuple.
+The heap TID of the incoming tuple is altered to make it match the
+rightmost heap TID from the existing/originally overlapping posting list.
+The offset number that the new/incoming tuple is to be inserted at is
+incremented so that it will be inserted to the right of the existing
+posting list. The insertion (or page split) operation that completes the
+insert does one extra step: an in-place update of the posting list. The
+update changes the posting list such that the "true" heap TID from the
+original incoming tuple is now contained in the posting list. We make
+space in the posting list by removing the heap TID that became the new
+item. The size of the posting list won't change, and so the page split
+space accounting does not need to care about posting lists. Also, overall
+space utilization is improved by keeping existing posting lists large.
+
+The representation of posting lists is identical to the posting lists used
+by GIN, so it would be straightforward to apply GIN's varbyte encoding
+compression scheme to individual posting lists. Posting list compression
+would break the assumptions made by posting list splits about page space
+accounting, though, so it's not clear how compression could be integrated
+with nbtree. Besides, posting list compression does not offer a compelling
+trade-off for nbtree, since in general nbtree is optimized for consistent
+performance with many concurrent readers and writers. A major goal of
+nbtree's lazy approach to deduplication is to limit the performance impact
+of deduplication with random updates. Even concurrent append-only inserts
+of the same key value will tend to have inserts of individual index tuples
+in an order that doesn't quite match heap TID order. In general, delaying
+deduplication avoids many unnecessary posting list splits, and minimizes
+page level fragmentation.
+
Notes About Data Representation
-------------------------------
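To make the posting list split described in the README notes above easier to follow, here is a minimal standalone sketch of the TID swap. It is not the patch code: ToyTid, toy_tid_cmp() and toy_posting_split() are hypothetical names, and a plain array stands in for a real posting list tuple. It only shows that the incoming TID takes a slot inside the array, the original rightmost TID is handed back as the new item, and the array length does not change.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-in for ItemPointerData: a (block, offset) pair */
typedef struct ToyTid
{
	unsigned	block;
	unsigned	offset;
} ToyTid;

/* Compare two TIDs in heap order; returns <0, 0 or >0 */
int
toy_tid_cmp(const ToyTid *a, const ToyTid *b)
{
	if (a->block != b->block)
		return (a->block < b->block) ? -1 : 1;
	if (a->offset != b->offset)
		return (a->offset < b->offset) ? -1 : 1;
	return 0;
}

/*
 * Place *newtid at position 'postingoff' of the sorted array 'posting'.
 * Entries from 'postingoff' onward shift one place to the right, the
 * original rightmost TID drops off the end and is returned through
 * *newtid.  The array keeps exactly 'nhtids' entries.
 */
void
toy_posting_split(ToyTid *posting, int nhtids, int postingoff, ToyTid *newtid)
{
	ToyTid		rightmost = posting[nhtids - 1];

	assert(postingoff > 0 && postingoff < nhtids);

	/* Shift [postingoff, nhtids - 2] right by one, losing the last entry */
	memmove(&posting[postingoff + 1], &posting[postingoff],
			(size_t) (nhtids - postingoff - 1) * sizeof(ToyTid));

	/* Fill the gap with the incoming TID */
	posting[postingoff] = *newtid;

	/* The "effective" new item now carries the old rightmost TID */
	*newtid = rightmost;

	/* New maximum of the array must sort before the effective new item */
	assert(toy_tid_cmp(&posting[nhtids - 1], newtid) < 0);
}
```

With a posting array (1,1) (1,3) (1,5) (2,2) and an incoming TID (1,4) at postingoff 2, the array becomes (1,1) (1,3) (1,4) (1,5) and the item actually inserted to the right of it carries (2,2).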
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index b84bf1c..3ef44cd 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -47,21 +47,26 @@ static void _bt_insertonpg(Relation rel, BTScanInsert itup_key,
BTStack stack,
IndexTuple itup,
OffsetNumber newitemoff,
+ int postingoff,
bool split_only_page);
static Buffer _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf,
Buffer cbuf, OffsetNumber newitemoff, Size newitemsz,
- IndexTuple newitem);
+ IndexTuple newitem, IndexTuple orignewitem,
+ IndexTuple nposting, OffsetNumber postingoff);
static void _bt_insert_parent(Relation rel, Buffer buf, Buffer rbuf,
BTStack stack, bool is_root, bool is_only);
static bool _bt_pgaddtup(Page page, Size itemsize, IndexTuple itup,
OffsetNumber itup_off);
static void _bt_vacuum_one_page(Relation rel, Buffer buffer, Relation heapRel);
+static void _bt_dedup_one_page(Relation rel, Buffer buffer, Relation heapRel,
+ Size newitemsz);
/*
* _bt_doinsert() -- Handle insertion of a single index tuple in the tree.
*
* This routine is called by the public interface routine, btinsert.
- * By here, itup is filled in, including the TID.
+ * By here, itup is filled in, including the TID. Caller should be
+ * prepared for us to scribble on 'itup'.
*
* If checkUnique is UNIQUE_CHECK_NO or UNIQUE_CHECK_PARTIAL, this
* will allow duplicates. Otherwise (UNIQUE_CHECK_YES or
@@ -123,6 +128,7 @@ _bt_doinsert(Relation rel, IndexTuple itup,
/* PageAddItem will MAXALIGN(), but be consistent */
insertstate.itemsz = MAXALIGN(IndexTupleSize(itup));
insertstate.itup_key = itup_key;
+ insertstate.postingoff = 0;
insertstate.bounds_valid = false;
insertstate.buf = InvalidBuffer;
@@ -300,7 +306,7 @@ top:
newitemoff = _bt_findinsertloc(rel, &insertstate, checkingunique,
stack, heapRel);
_bt_insertonpg(rel, itup_key, insertstate.buf, InvalidBuffer, stack,
- itup, newitemoff, false);
+ itup, newitemoff, insertstate.postingoff, false);
}
else
{
@@ -428,14 +434,36 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
if (!ItemIdIsDead(curitemid))
{
ItemPointerData htid;
+ bool posting;
bool all_dead;
+ bool posting_all_dead;
+ int npost;
+
if (_bt_compare(rel, itup_key, page, offset) != 0)
break; /* we're past all the equal tuples */
/* okay, we gotta fetch the heap tuple ... */
curitup = (IndexTuple) PageGetItem(page, curitemid);
- htid = curitup->t_tid;
+
+ if (!BTreeTupleIsPosting(curitup))
+ {
+ htid = curitup->t_tid;
+ posting = false;
+ posting_all_dead = true;
+ }
+ else
+ {
+ posting = true;
+ /* Initial assumption */
+ posting_all_dead = true;
+ }
+
+ npost = 0;
+doposttup:
+ if (posting)
+ htid = *BTreeTupleGetPostingN(curitup, npost);
+
/*
* If we are doing a recheck, we expect to find the tuple we
@@ -446,6 +474,9 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
ItemPointerCompare(&htid, &itup->t_tid) == 0)
{
found = true;
+ posting_all_dead = false;
+ if (posting)
+ goto nextpost;
}
/*
@@ -511,8 +542,7 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
* not part of this chain because it had a different index
* entry.
*/
- htid = itup->t_tid;
- if (table_index_fetch_tuple_check(heapRel, &htid,
+ if (table_index_fetch_tuple_check(heapRel, &itup->t_tid,
SnapshotSelf, NULL))
{
/* Normal case --- it's still live */
@@ -570,7 +600,7 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
RelationGetRelationName(rel))));
}
}
- else if (all_dead)
+ else if (all_dead && !posting)
{
/*
* The conflicting tuple (or whole HOT chain) is dead to
@@ -589,6 +619,35 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
else
MarkBufferDirtyHint(insertstate->buf, true);
}
+ else if (posting)
+ {
+nextpost:
+ if (!all_dead)
+ posting_all_dead = false;
+
+ /* Iterate over single posting list tuple */
+ npost++;
+ if (npost < BTreeTupleGetNPosting(curitup))
+ goto doposttup;
+
+ /*
+ * Mark posting tuple dead only if every HOT chain whose root
+ * is contained in the posting tuple consists entirely of dead
+ * tuples
+ */
+ if (posting_all_dead)
+ {
+ ItemIdMarkDead(curitemid);
+ opaque->btpo_flags |= BTP_HAS_GARBAGE;
+
+ if (nbuf != InvalidBuffer)
+ MarkBufferDirtyHint(nbuf, true);
+ else
+ MarkBufferDirtyHint(insertstate->buf, true);
+ }
+
+ /* Move on to next index tuple */
+ }
}
}
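The goto-based loop in _bt_check_unique() above boils down to one rule: a posting list tuple may be marked LP_DEAD only when every heap TID it contains is known dead. A minimal sketch of that rule follows, assuming the per-TID liveness results are already available as an array of flags. toy_posting_all_dead() is a hypothetical helper, not part of the patch.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Decide whether an index tuple may be marked LP_DEAD, given one "dead"
 * flag per heap TID it points to.  A plain (non-posting) tuple is simply
 * the nhtids == 1 case.
 */
bool
toy_posting_all_dead(const bool *tid_is_dead, int nhtids)
{
	assert(nhtids > 0);

	for (int i = 0; i < nhtids; i++)
	{
		/* A single live TID keeps the whole physical tuple alive */
		if (!tid_is_dead[i])
			return false;
	}

	return true;
}
```

This is why setting LP_DEAD becomes less likely as posting lists grow: the chance that all members are dead at the same time shrinks with the number of TIDs.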
@@ -689,6 +748,7 @@ _bt_findinsertloc(Relation rel,
BTScanInsert itup_key = insertstate->itup_key;
Page page = BufferGetPage(insertstate->buf);
BTPageOpaque lpageop;
+ OffsetNumber location;
lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -751,13 +811,25 @@ _bt_findinsertloc(Relation rel,
/*
* If the target page is full, see if we can obtain enough space by
- * erasing LP_DEAD items
+ * erasing LP_DEAD items. If that doesn't work out, and if deduplication
+ * is both possible and enabled for this index, try deduplication.
*/
- if (PageGetFreeSpace(page) < insertstate->itemsz &&
- P_HAS_GARBAGE(lpageop))
+ if (PageGetFreeSpace(page) < insertstate->itemsz)
{
- _bt_vacuum_one_page(rel, insertstate->buf, heapRel);
- insertstate->bounds_valid = false;
+ if (P_HAS_GARBAGE(lpageop))
+ {
+ _bt_vacuum_one_page(rel, insertstate->buf, heapRel);
+ insertstate->bounds_valid = false;
+ }
+
+ if (insertstate->itup_key->dedup_is_possible &&
+ BtreeGetDoDedupOption(rel) &&
+ PageGetFreeSpace(page) < insertstate->itemsz)
+ {
+ _bt_dedup_one_page(rel, insertstate->buf, heapRel,
+ insertstate->itemsz);
+ insertstate->bounds_valid = false; /* paranoia */
+ }
}
}
else
@@ -839,7 +911,37 @@ _bt_findinsertloc(Relation rel,
Assert(P_RIGHTMOST(lpageop) ||
_bt_compare(rel, itup_key, page, P_HIKEY) <= 0);
- return _bt_binsrch_insert(rel, insertstate);
+ location = _bt_binsrch_insert(rel, insertstate);
+
+ /*
+ * Insertion is not prepared for the case where an LP_DEAD posting list
+ * tuple must be split. In the unlikely event that this happens, call
+ * _bt_dedup_one_page() to force it to kill all LP_DEAD items.
+ */
+ if (unlikely(insertstate->postingoff == -1))
+ {
+ Assert(insertstate->itup_key->dedup_is_possible);
+ /*
+ * Don't check if the option is enabled,
+ * since no actual deduplication will be done, just cleanup.
+ * TODO Shouldn't we use _bt_vacuum_one_page() instead?
+ */
+ _bt_dedup_one_page(rel, insertstate->buf, heapRel, 0);
+ Assert(!P_HAS_GARBAGE(lpageop));
+
+ /* Must reset insertstate ahead of new _bt_binsrch_insert() call */
+ insertstate->bounds_valid = false;
+ insertstate->postingoff = 0;
+ location = _bt_binsrch_insert(rel, insertstate);
+
+ /*
+ * Might still have to split some other posting list now, but that
+ * should never be LP_DEAD
+ */
+ Assert(insertstate->postingoff >= 0);
+ }
+
+ return location;
}
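The order of attempts in the _bt_findinsertloc() changes above (LP_DEAD cleanup first, deduplication second, page split only as a last resort) can be modelled with a toy page that tracks byte counts only. This is a sketch under simplifying assumptions: ToyLeafPage, ToyOutcome and toy_make_room() are hypothetical, line pointer overhead is ignored, and the reclaimable sizes are given up front rather than computed from tuples.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum ToyOutcome
{
	TOY_FITS,					/* new item fits, no page split needed */
	TOY_MUST_SPLIT				/* caller has to split the page */
} ToyOutcome;

/* Hypothetical byte-count model of a leaf page */
typedef struct ToyLeafPage
{
	size_t		free_space;		/* bytes currently free */
	size_t		dead_space;		/* bytes held by LP_DEAD items */
	size_t		dedup_space;	/* bytes that merging duplicates would free */
	bool		has_garbage;	/* stand-in for the BTP_HAS_GARBAGE hint */
} ToyLeafPage;

ToyOutcome
toy_make_room(ToyLeafPage *page, size_t itemsz,
			  bool dedup_possible, bool dedup_enabled)
{
	assert(page != NULL);

	if (page->free_space >= itemsz)
		return TOY_FITS;

	/* Step 1: cheap cleanup, attempted only when the hint flag is set */
	if (page->has_garbage)
	{
		page->free_space += page->dead_space;
		page->dead_space = 0;
		page->has_garbage = false;
	}

	/* Step 2: deduplication, only if still short of space */
	if (page->free_space < itemsz && dedup_possible && dedup_enabled)
	{
		/* The dedup pass first removes any LP_DEAD items it still finds */
		page->free_space += page->dead_space;
		page->dead_space = 0;

		/* Merge duplicates only when that cleanup was not enough */
		if (page->free_space < itemsz)
		{
			page->free_space += page->dedup_space;
			page->dedup_space = 0;
		}
	}

	return (page->free_space >= itemsz) ? TOY_FITS : TOY_MUST_SPLIT;
}
```

The model shows why duplicates are left unmerged whenever removing dead items is enough: deduplication stays the last line of defense before a split.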
/*
@@ -900,15 +1002,81 @@ _bt_stepright(Relation rel, BTInsertState insertstate, BTStack stack)
insertstate->bounds_valid = false;
}
+/*
+ * Form a new posting list during a posting split.
+ *
+ * If caller determines that its new tuple 'newitem' is a duplicate with a
+ * heap TID that falls inside the range of an existing posting list tuple
+ * 'oposting', it must generate a new posting tuple to replace the original.
+ * The new posting list is guaranteed to be the same size as the original.
+ * Caller must also change newitem to have the heap TID of the rightmost TID
+ * in the original posting list. Both steps are always handled by calling
+ * here.
+ *
+ * Returns new posting list palloc()'d in caller's context. Also modifies
+ * caller's newitem to contain final/effective heap TID, which is what caller
+ * actually inserts on the page.
+ *
+ * Exported for use by recovery. Note that recovery path must recreate the
+ * same version of newitem that is passed here on the primary, even though
+ * that differs from the final newitem actually added to the page. This
+ * optimization avoids explicit WAL-logging of entire posting lists, which
+ * tend to be rather large.
+ */
+IndexTuple
+_bt_posting_split(IndexTuple newitem, IndexTuple oposting,
+ OffsetNumber postingoff)
+{
+ int nhtids;
+ char *replacepos;
+ char *rightpos;
+ Size nbytes;
+ IndexTuple nposting;
+
+ Assert(BTreeTupleIsPosting(oposting));
+ nhtids = BTreeTupleGetNPosting(oposting);
+ Assert(postingoff < nhtids);
+
+ nposting = CopyIndexTuple(oposting);
+ replacepos = (char *) BTreeTupleGetPostingN(nposting, postingoff);
+ rightpos = replacepos + sizeof(ItemPointerData);
+ nbytes = (nhtids - postingoff - 1) * sizeof(ItemPointerData);
+
+ /*
+ * Move item pointers in posting list to make a gap for the new item's
+ * heap TID (shift TIDs one place to the right, losing original rightmost
+ * TID).
+ */
+ memmove(rightpos, replacepos, nbytes);
+
+ /*
+ * Fill the gap with the TID of the new item.
+ */
+ ItemPointerCopy(&newitem->t_tid, (ItemPointer) replacepos);
+
+ /*
+ * Copy the original posting list's last TID (taken from oposting, not
+ * from nposting) into new item
+ */
+ ItemPointerCopy(BTreeTupleGetPostingN(oposting, nhtids - 1),
+ &newitem->t_tid);
+ Assert(ItemPointerCompare(BTreeTupleGetMaxTID(nposting),
+ BTreeTupleGetHeapTID(newitem)) < 0);
+ Assert(BTreeTupleGetNPosting(nposting) == BTreeTupleGetNPosting(oposting));
+
+ return nposting;
+}
+
/*----------
* _bt_insertonpg() -- Insert a tuple on a particular page in the index.
*
* This recursive procedure does the following things:
*
+ * + if necessary, splits an existing posting list on page.
+ * This is only needed when 'postingoff' is non-zero.
* + if necessary, splits the target page, using 'itup_key' for
* suffix truncation on leaf pages (caller passes NULL for
* non-leaf pages).
- * + inserts the tuple.
+ * + inserts the new tuple (could be from split posting list).
* + if the page was split, pops the parent stack, and finds the
* right place to insert the new child pointer (by walking
* right using information stored in the parent stack).
@@ -918,7 +1086,8 @@ _bt_stepright(Relation rel, BTInsertState insertstate, BTStack stack)
*
* On entry, we must have the correct buffer in which to do the
* insertion, and the buffer must be pinned and write-locked. On return,
- * we will have dropped both the pin and the lock on the buffer.
+ * we will have dropped both the pin and the lock on the buffer. Caller
+ * should be prepared for us to scribble on 'itup'.
*
* This routine only performs retail tuple insertions. 'itup' should
* always be either a non-highkey leaf item, or a downlink (new high
@@ -936,11 +1105,15 @@ _bt_insertonpg(Relation rel,
BTStack stack,
IndexTuple itup,
OffsetNumber newitemoff,
+ int postingoff,
bool split_only_page)
{
Page page;
BTPageOpaque lpageop;
Size itemsz;
+ IndexTuple oposting;
+ IndexTuple origitup = NULL;
+ IndexTuple nposting = NULL;
page = BufferGetPage(buf);
lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -954,6 +1127,8 @@ _bt_insertonpg(Relation rel,
Assert(P_ISLEAF(lpageop) ||
BTreeTupleGetNAtts(itup, rel) <=
IndexRelationGetNumberOfKeyAttributes(rel));
+ /* retail insertions of posting list tuples are disallowed */
+ Assert(!BTreeTupleIsPosting(itup));
/* The caller should've finished any incomplete splits already. */
if (P_INCOMPLETE_SPLIT(lpageop))
@@ -965,6 +1140,46 @@ _bt_insertonpg(Relation rel,
* need to be consistent */
/*
+ * Do we need to split an existing posting list item?
+ */
+ if (postingoff != 0)
+ {
+ ItemId itemid = PageGetItemId(page, newitemoff);
+
+ /*
+ * The new tuple is a duplicate with a heap TID that falls inside the
+ * range of an existing posting list tuple, so split posting list.
+ *
+ * Posting list splits always replace some existing TID in the posting
+ * list with the new item's heap TID (based on a posting list offset
+ * from caller) by removing rightmost heap TID from posting list. The
+ * new item's heap TID is swapped with that rightmost heap TID, almost
+ * as if the tuple inserted never overlapped with a posting list in
+ * the first place. This allows the insertion and page split code to
+ * have minimal special case handling of posting lists.
+ *
+ * The only extra handling required is to overwrite the original
+ * posting list with nposting, which is guaranteed to be the same size
+ * as the original, keeping the page space accounting simple. This
+ * takes place in either the page insert or page split critical
+ * section.
+ */
+ Assert(P_ISLEAF(lpageop));
+ Assert(!ItemIdIsDead(itemid));
+ Assert(postingoff > 0);
+ oposting = (IndexTuple) PageGetItem(page, itemid);
+
+ /* save a copy of itup with unchanged TID to write it into xlog record */
+ origitup = CopyIndexTuple(itup);
+ nposting = _bt_posting_split(itup, oposting, postingoff);
+
+ Assert(BTreeTupleGetNPosting(nposting) ==
+ BTreeTupleGetNPosting(oposting));
+ /* Alter new item offset, since effective new item changed */
+ newitemoff = OffsetNumberNext(newitemoff);
+ }
+
+ /*
* Do we need to split the page to fit the item on it?
*
* Note: PageGetFreeSpace() subtracts sizeof(ItemIdData) from its result,
@@ -996,7 +1211,8 @@ _bt_insertonpg(Relation rel,
BlockNumberIsValid(RelationGetTargetBlock(rel))));
/* split the buffer into left and right halves */
- rbuf = _bt_split(rel, itup_key, buf, cbuf, newitemoff, itemsz, itup);
+ rbuf = _bt_split(rel, itup_key, buf, cbuf, newitemoff, itemsz, itup,
+ origitup, nposting, postingoff);
PredicateLockPageSplit(rel,
BufferGetBlockNumber(buf),
BufferGetBlockNumber(rbuf));
@@ -1075,6 +1291,18 @@ _bt_insertonpg(Relation rel,
elog(PANIC, "failed to add new item to block %u in index \"%s\"",
itup_blkno, RelationGetRelationName(rel));
+ if (nposting)
+ {
+ /*
+ * Posting list split requires an in-place update of the existing
+ * posting list
+ */
+ Assert(P_ISLEAF(lpageop));
+ Assert(MAXALIGN(IndexTupleSize(oposting)) ==
+ MAXALIGN(IndexTupleSize(nposting)));
+ memcpy(oposting, nposting, MAXALIGN(IndexTupleSize(nposting)));
+ }
+
MarkBufferDirty(buf);
if (BufferIsValid(metabuf))
@@ -1116,6 +1344,7 @@ _bt_insertonpg(Relation rel,
XLogRecPtr recptr;
xlrec.offnum = itup_off;
+ xlrec.postingoff = postingoff;
XLogBeginInsert();
XLogRegisterData((char *) &xlrec, SizeOfBtreeInsert);
@@ -1144,6 +1373,7 @@ _bt_insertonpg(Relation rel,
xlmeta.oldest_btpo_xact = metad->btm_oldest_btpo_xact;
xlmeta.last_cleanup_num_heap_tuples =
metad->btm_last_cleanup_num_heap_tuples;
+ xlmeta.btm_dedup_is_possible = metad->btm_dedup_is_possible;
XLogRegisterBuffer(2, metabuf, REGBUF_WILL_INIT | REGBUF_STANDARD);
XLogRegisterBufData(2, (char *) &xlmeta, sizeof(xl_btree_metadata));
@@ -1152,7 +1382,19 @@ _bt_insertonpg(Relation rel,
}
XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
- XLogRegisterBufData(0, (char *) itup, IndexTupleSize(itup));
+
+ /*
+ * We always write newitem to the page, but when there is an
+ * original newitem due to a posting list split then we log the
+ * original item instead. REDO routine must reconstruct the final
+ * newitem at the same time it reconstructs nposting.
+ */
+ if (postingoff == 0)
+ XLogRegisterBufData(0, (char *) itup,
+ IndexTupleSize(itup));
+ else
+ XLogRegisterBufData(0, (char *) origitup,
+ IndexTupleSize(origitup));
recptr = XLogInsert(RM_BTREE_ID, xlinfo);
@@ -1194,6 +1436,13 @@ _bt_insertonpg(Relation rel,
_bt_getrootheight(rel) >= BTREE_FASTPATH_MIN_LEVEL)
RelationSetTargetBlock(rel, cachedBlock);
}
+
+ /* be tidy */
+ if (postingoff != 0)
+ {
+ pfree(nposting);
+ pfree(origitup);
+ }
}
/*
@@ -1209,12 +1458,25 @@ _bt_insertonpg(Relation rel,
* This function will clear the INCOMPLETE_SPLIT flag on it, and
* release the buffer.
*
+ * orignewitem, nposting, and postingoff are needed when an insert of
+ * orignewitem results in both a posting list split and a page split.
+ * newitem and nposting are replacements for orignewitem and the
+ * existing posting list on the page respectively. These extra
+ * posting list split details are used here in the same way as they
+ * are used in the more common case where a posting list split does
+ * not coincide with a page split. We need to deal with posting list
+ * splits directly in order to ensure that everything that follows
+ * from the insert of orignewitem is handled as a single atomic
+ * operation (though caller's insert of a new pivot/downlink into
+ * parent page will still be a separate operation).
+ *
* Returns the new right sibling of buf, pinned and write-locked.
* The pin and lock on buf are maintained.
*/
static Buffer
_bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
- OffsetNumber newitemoff, Size newitemsz, IndexTuple newitem)
+ OffsetNumber newitemoff, Size newitemsz, IndexTuple newitem,
+ IndexTuple orignewitem, IndexTuple nposting, OffsetNumber postingoff)
{
Buffer rbuf;
Page origpage;
@@ -1236,6 +1498,7 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
OffsetNumber firstright;
OffsetNumber maxoff;
OffsetNumber i;
+ OffsetNumber replacepostingoff = InvalidOffsetNumber;
bool newitemonleft,
isleaf;
IndexTuple lefthikey;
@@ -1243,6 +1506,13 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
int indnkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
/*
+ * Determine offset number of existing posting list on page when a split
+ * of a posting list needs to take place as the page is split
+ */
+ if (nposting != NULL)
+ replacepostingoff = OffsetNumberPrev(newitemoff);
+
+ /*
* origpage is the original page to be split. leftpage is a temporary
* buffer that receives the left-sibling data, which will be copied back
* into origpage on success. rightpage is the new page that will receive
@@ -1273,6 +1543,13 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
* newitemoff == firstright. In all other cases it's clear which side of
* the split every tuple goes on from context. newitemonleft is usually
* (but not always) redundant information.
+ *
+ * Note: In theory, the split point choice logic should operate against a
+ * version of the page that already replaced the posting list at offset
+ * replacepostingoff with nposting where applicable. We don't bother with
+ * that, though. Both versions of the posting list must be the same size,
+ * and both will have the same base tuple key values, so split point
+ * choice is never affected.
*/
firstright = _bt_findsplitloc(rel, origpage, newitemoff, newitemsz,
newitem, &newitemonleft);
@@ -1340,6 +1617,9 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
itemid = PageGetItemId(origpage, firstright);
itemsz = ItemIdGetLength(itemid);
item = (IndexTuple) PageGetItem(origpage, itemid);
+ /* Behave as if origpage posting list has already been swapped */
+ if (firstright == replacepostingoff)
+ item = nposting;
}
/*
@@ -1373,6 +1653,9 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
Assert(lastleftoff >= P_FIRSTDATAKEY(oopaque));
itemid = PageGetItemId(origpage, lastleftoff);
lastleft = (IndexTuple) PageGetItem(origpage, itemid);
+ /* Behave as if origpage posting list has already been swapped */
+ if (lastleftoff == replacepostingoff)
+ lastleft = nposting;
}
Assert(lastleft != item);
@@ -1480,8 +1763,23 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
itemsz = ItemIdGetLength(itemid);
item = (IndexTuple) PageGetItem(origpage, itemid);
+ /*
+ * did caller pass new replacement posting list tuple due to posting
+ * list split?
+ */
+ if (i == replacepostingoff)
+ {
+ /*
+ * swap origpage posting list with post-posting-list-split version
+ * from caller
+ */
+ Assert(isleaf);
+ Assert(itemsz == MAXALIGN(IndexTupleSize(nposting)));
+ item = nposting;
+ }
+
/* does new item belong before this one? */
- if (i == newitemoff)
+ else if (i == newitemoff)
{
if (newitemonleft)
{
@@ -1650,8 +1948,12 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
XLogRecPtr recptr;
xlrec.level = ropaque->btpo.level;
+ /* See comments below on newitem, orignewitem, and posting lists */
xlrec.firstright = firstright;
xlrec.newitemoff = newitemoff;
+ xlrec.postingoff = InvalidOffsetNumber;
+ if (replacepostingoff < firstright)
+ xlrec.postingoff = postingoff;
XLogBeginInsert();
XLogRegisterData((char *) &xlrec, SizeOfBtreeSplit);
@@ -1670,11 +1972,46 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
* because it's included with all the other items on the right page.)
* Show the new item as belonging to the left page buffer, so that it
* is not stored if XLogInsert decides it needs a full-page image of
- * the left page. We store the offset anyway, though, to support
- * archive compression of these records.
+ * the left page. We always store newitemoff in record, though.
+ *
+ * The details are often slightly different for page splits that
+ * coincide with a posting list split. If both the replacement
+ * posting list and newitem go on the right page, then we don't need
+ * to log anything extra, just like the simple !newitemonleft
+ * no-posting-split case (postingoff isn't set in the WAL record, so
+ * recovery can't even tell the difference). Otherwise, we set
+ * postingoff and log orignewitem instead of newitem, despite having
+ * actually inserted newitem. Recovery must reconstruct nposting and
+ * newitem by repeating the actions of our caller (i.e. by passing
+ * original posting list and orignewitem to _bt_posting_split()).
+ *
+ * Note: It's possible that our page split point is the point that
+ * makes the posting list lastleft and newitem firstright. This is
+ * the only case where we log orignewitem despite newitem going on the
+ * right page. If XLogInsert decides that it can omit orignewitem due
+ * to logging a full-page image of the left page, everything still
+ * works out, since recovery only needs orignewitem for items
+ * on the left page (just like the regular newitem-logged case).
*/
- if (newitemonleft)
- XLogRegisterBufData(0, (char *) newitem, MAXALIGN(newitemsz));
+ if (newitemonleft || xlrec.postingoff != InvalidOffsetNumber)
+ {
+ if (xlrec.postingoff == InvalidOffsetNumber)
+ {
+ /* Must WAL-log newitem, since it's on left page */
+ Assert(newitemonleft);
+ Assert(orignewitem == NULL && nposting == NULL);
+ XLogRegisterBufData(0, (char *) newitem, MAXALIGN(newitemsz));
+ }
+ else
+ {
+ /* Must WAL-log orignewitem following posting list split */
+ Assert(newitemonleft || firstright == newitemoff);
+ Assert(ItemPointerCompare(&orignewitem->t_tid,
+ &newitem->t_tid) < 0);
+ XLogRegisterBufData(0, (char *) orignewitem,
+ MAXALIGN(IndexTupleSize(orignewitem)));
+ }
+ }
/* Log the left page's new high key */
itemid = PageGetItemId(origpage, P_HIKEY);
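The rule above for which tuple accompanies the left page buffer in the split record can be summarised as a small decision function. This is only a sketch of the logic as I read it from the comment: ToyLogged and toy_split_wal_item() are hypothetical, and offsets are plain unsigned integers.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum ToyLogged
{
	TOY_LOG_NOTHING,			/* new item is rebuilt from right page data */
	TOY_LOG_NEWITEM,			/* ordinary case, new item goes on left page */
	TOY_LOG_ORIGNEWITEM			/* posting list split touches the left page */
} ToyLogged;

/*
 * 'posting_split' says whether the insert also split a posting list;
 * 'replacepostingoff' is the offset of that posting list on the original
 * page and is ignored when posting_split is false.
 */
ToyLogged
toy_split_wal_item(bool newitemonleft, bool posting_split,
				   unsigned replacepostingoff, unsigned firstright)
{
	bool		posting_on_left = posting_split &&
		replacepostingoff < firstright;

	/*
	 * The new item sits immediately to the right of the posting list, so
	 * it cannot be on the left page when the posting list is on the right.
	 */
	assert(!(posting_split && !posting_on_left && newitemonleft));

	if (posting_on_left)
		return TOY_LOG_ORIGNEWITEM;
	if (newitemonleft)
		return TOY_LOG_NEWITEM;
	return TOY_LOG_NOTHING;
}
```

Note the boundary case from the comment: with the posting list at offset 5 and firstright at 6, the new item lands on the right page, yet orignewitem is still logged because the left page's posting list has to be rebuilt during recovery.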
@@ -1834,7 +2171,7 @@ _bt_insert_parent(Relation rel,
/* Recursively insert into the parent */
_bt_insertonpg(rel, NULL, pbuf, buf, stack->bts_parent,
- new_item, stack->bts_offset + 1,
+ new_item, stack->bts_offset + 1, 0,
is_only);
/* be tidy */
@@ -2190,6 +2527,7 @@ _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf)
md.fastlevel = metad->btm_level;
md.oldest_btpo_xact = metad->btm_oldest_btpo_xact;
md.last_cleanup_num_heap_tuples = metad->btm_last_cleanup_num_heap_tuples;
+ md.btm_dedup_is_possible = metad->btm_dedup_is_possible;
XLogRegisterBufData(2, (char *) &md, sizeof(xl_btree_metadata));
@@ -2304,6 +2642,394 @@ _bt_vacuum_one_page(Relation rel, Buffer buffer, Relation heapRel)
* Note: if we didn't find any LP_DEAD items, then the page's
* BTP_HAS_GARBAGE hint bit is falsely set. We do not bother expending a
* separate write to clear it, however. We will clear it when we split
- * the page.
+ * the page (or when deduplication runs).
*/
}
+
+/*
+ * Try to deduplicate items to free some space. If we don't proceed with
+ * deduplication, buffer will contain old state of the page.
+ *
+ * 'itemsz' is the size of the inserter caller's incoming/new tuple, not
+ * including line pointer overhead. This is the amount of space we'll need to
+ * free in order to let caller avoid splitting the page.
+ *
+ * This function should be called after LP_DEAD items were removed by
+ * _bt_vacuum_one_page() to prevent a page split. (It's possible that we'll
+ * have to kill additional LP_DEAD items, but that should be rare.)
+ */
+static void
+_bt_dedup_one_page(Relation rel, Buffer buffer, Relation heapRel,
+ Size newitemsz)
+{
+ OffsetNumber offnum,
+ minoff,
+ maxoff;
+ Page page = BufferGetPage(buffer);
+ BTPageOpaque oopaque;
+ BTDedupState *state = NULL;
+ int natts = IndexRelationGetNumberOfAttributes(rel);
+ OffsetNumber deletable[MaxIndexTuplesPerPage];
+ int ndeletable = 0;
+ Size pagesaving = 0;
+
+ oopaque = (BTPageOpaque) PageGetSpecialPointer(page);
+ /* init deduplication state needed to build posting tuples */
+ state = (BTDedupState *) palloc(sizeof(BTDedupState));
+ state->deduplicate = true;
+
+ state->maxitemsize = BTMaxItemSize(page);
+ /* Metadata about current pending posting list */
+ state->htids = NULL;
+ state->nhtids = 0;
+ state->nitems = 0;
+ state->alltupsize = 0;
+ /* Metadata about base tuple of current pending posting list */
+ state->base = NULL;
+ state->baseoff = InvalidOffsetNumber;
+ state->basetupsize = 0;
+
+ minoff = P_FIRSTDATAKEY(oopaque);
+ maxoff = PageGetMaxOffsetNumber(page);
+
+ /*
+ * Delete dead tuples, if any. We cannot simply skip them in the loop
+ * below, because a special WAL record containing such tuples must be
+ * generated so that latestRemovedXid can be computed on a standby server
+ * later.
+ *
+ * This should not affect performance, since it can only happen in the
+ * rare case where the BTP_HAS_GARBAGE flag was not set and
+ * _bt_vacuum_one_page was not called, or where _bt_vacuum_one_page didn't
+ * remove all dead items.
+ */
+ for (offnum = minoff;
+ offnum <= maxoff;
+ offnum = OffsetNumberNext(offnum))
+ {
+ ItemId itemid = PageGetItemId(page, offnum);
+
+ if (ItemIdIsDead(itemid))
+ deletable[ndeletable++] = offnum;
+ }
+
+ if (ndeletable > 0)
+ {
+ /*
+ * Skip deduplication in rare cases where there were LP_DEAD items
+ * encountered here when that frees sufficient space for caller to
+ * avoid a page split.
+ */
+ _bt_delitems_delete(rel, buffer, deletable, ndeletable, heapRel);
+ if (PageGetFreeSpace(page) >= newitemsz)
+ {
+ pfree(state);
+ return;
+ }
+
+ /* Continue with deduplication */
+ minoff = P_FIRSTDATAKEY(oopaque);
+ maxoff = PageGetMaxOffsetNumber(page);
+ }
+
+ /* Make sure that the page won't have garbage flag set */
+ oopaque->btpo_flags &= ~BTP_HAS_GARBAGE;
+
+ /* Passed-in newitemsz is MAXALIGNED but does not include line pointer */
+ newitemsz += sizeof(ItemIdData);
+ /* Conservatively size array */
+ state->htids = palloc(state->maxitemsize);
+
+ /*
+ * Iterate over tuples on the page, try to deduplicate them into posting
+ * lists, and replace them on the page in place.
+ * NOTE: It's essential to recalculate max offset on each iteration,
+ * since it could have changed if several items were replaced with a
+ * single posting tuple.
+ */
+ offnum = minoff;
+ while (offnum <= PageGetMaxOffsetNumber(page))
+ {
+ ItemId itemid = PageGetItemId(page, offnum);
+ IndexTuple itup = (IndexTuple) PageGetItem(page, itemid);
+
+ Assert(!ItemIdIsDead(itemid));
+
+ if (state->nitems == 0)
+ {
+ /*
+ * No previous/base tuple for the data item -- use the data
+ * item as base tuple of pending posting list
+ */
+ _bt_dedup_start_pending(state, itup, offnum);
+ }
+ else if (state->deduplicate &&
+ _bt_keep_natts_fast(rel, state->base, itup) > natts &&
+ _bt_dedup_save_htid(state, itup))
+ {
+ /*
+ * Tuple is equal to base tuple of pending posting list, and
+ * merging itup into pending posting list won't exceed the
+ * BTMaxItemSize() limit. Heap TID(s) for itup have been saved in
+ * state. The next iteration will also end up here if it's
+ * possible to merge the next tuple into the same pending posting
+ * list.
+ */
+ }
+ else
+ {
+ /*
+ * Tuple is not equal to pending posting list tuple, or
+ * BTMaxItemSize() limit was reached.
+ *
+ * If state contains pending posting list with more than one item,
+ * form new posting tuple, and update the page,
+ * otherwise, just reset the state and move on.
+ */
+ pagesaving += _bt_dedup_finish_pending(buffer, state, RelationNeedsWAL(rel));
+ /*
+ * When we have deduplicated enough to avoid page split, don't
+ * bother merging together existing tuples to create new posting
+ * lists.
+ *
+ * Note: We deliberately add as many heap TIDs as possible to a
+ * pending posting list by performing this check at this point
+ * (just before a new pending posting list is created). It would
+ * be possible to make the final new posting list for each
+ * successful page deduplication operation as small as possible
+ * while still avoiding a page split for caller. We don't want to
+ * repeatedly merge posting lists around the same range of heap
+ * TIDs, though.
+ *
+ * (Besides, the total number of new posting lists created is the
+ * cost that this check is supposed to minimize -- there is no
+ * great reason to be concerned about the absolute number of
+ * existing tuples that can be killed/replaced.)
+ */
+#if 0
+ /* Actually, don't do that */
+ /* TODO: Make a final decision on this */
+ if (pagesaving >= newitemsz)
+ state->deduplicate = false;
+#endif
+
+ /* Continue iteration from base tuple's offnum */
+ offnum = state->baseoff;
+
+ }
+
+ offnum = OffsetNumberNext(offnum);
+ }
+
+ /*
+ * Handle the last item, if pending posting list is not empty.
+ */
+ if (state->nitems != 0)
+ pagesaving += _bt_dedup_finish_pending(buffer, state, RelationNeedsWAL(rel));
+
+ /* be tidy */
+ pfree(state->htids);
+ pfree(state);
+}
+
+/*
+ * Create a new pending posting list tuple based on caller's tuple.
+ *
+ * Every tuple processed by the deduplication routines either becomes the base
+ * tuple for a posting list, or gets its heap TID(s) accepted into a pending
+ * posting list. A tuple that starts out as the base tuple for a posting list
+ * will only actually be rewritten within _bt_dedup_finish_pending() when
+ * there was at least one successful call to _bt_dedup_save_htid().
+ *
+ * Exported for use by nbtsort.c and recovery.
+ */
+void
+_bt_dedup_start_pending(BTDedupState *state, IndexTuple base,
+ OffsetNumber baseoff)
+{
+ Assert(state->nhtids == 0);
+ Assert(state->nitems == 0);
+
+ /*
+ * Copy heap TIDs from new base tuple for new candidate posting list into
+ * htids array. Assume that we'll eventually create a new posting tuple by
+ * merging later tuples with this existing one, though we may not.
+ */
+ if (!BTreeTupleIsPosting(base))
+ {
+ memcpy(state->htids, &base->t_tid, sizeof(ItemPointerData));
+ state->nhtids = 1;
+ /* Save size of tuple without any posting list */
+ state->basetupsize = IndexTupleSize(base);
+ }
+ else
+ {
+ int nposting;
+
+ nposting = BTreeTupleGetNPosting(base);
+ memcpy(state->htids, BTreeTupleGetPosting(base),
+ sizeof(ItemPointerData) * nposting);
+ state->nhtids = nposting;
+ /* Save size of tuple without any posting list */
+ state->basetupsize = BTreeTupleGetPostingOffset(base);
+ }
+
+ /*
+ * Save new base tuple itself -- it'll be needed if we actually create a
+ * new posting list from new pending posting list.
+ *
+ * Must maintain size of all tuples (including line pointer overhead) to
+ * calculate space savings on page within _bt_dedup_finish_pending().
+ * Also, save number of base tuple logical tuples so that we can save
+ * cycles in the common case where an existing posting list can't or won't
+ * be merged with other tuples on the page.
+ */
+ state->nitems = 1;
+ state->base = base;
+ state->baseoff = baseoff;
+ state->alltupsize = MAXALIGN(IndexTupleSize(base)) + sizeof(ItemIdData);
+ /* Also save baseoff in pending state for interval */
+ state->interval.baseoff = state->baseoff;
+}
+
+/*
+ * Save itup heap TID(s) into pending posting list where possible.
+ *
+ * Returns bool indicating if the pending posting list managed by state has
+ * itup's heap TID(s) saved. When this is false, enlarging the pending
+ * posting list by the required amount would exceed the maxitemsize limit, so
+ * caller must finish the pending posting list tuple. (Generally itup becomes
+ * the base tuple of caller's new pending posting list).
+ *
+ * Exported for use by nbtsort.c and recovery.
+ */
+bool
+_bt_dedup_save_htid(BTDedupState *state, IndexTuple itup)
+{
+ int nhtids;
+ ItemPointer htids;
+ Size mergedtupsz;
+
+ if (!BTreeTupleIsPosting(itup))
+ {
+ nhtids = 1;
+ htids = &itup->t_tid;
+ }
+ else
+ {
+ nhtids = BTreeTupleGetNPosting(itup);
+ htids = BTreeTupleGetPosting(itup);
+ }
+
+ /*
+ * Don't append (have caller finish pending posting list as-is) if
+ * appending heap TID(s) from itup would put us over limit
+ */
+ mergedtupsz = MAXALIGN(state->basetupsize +
+ (state->nhtids + nhtids) *
+ sizeof(ItemPointerData));
+
+ if (mergedtupsz > state->maxitemsize)
+ return false;
+
+ /*
+ * Save heap TIDs to pending posting list tuple -- itup can be merged into
+ * pending posting list
+ */
+ state->nitems++;
+ memcpy(state->htids + state->nhtids, htids,
+ sizeof(ItemPointerData) * nhtids);
+ state->nhtids += nhtids;
+ state->alltupsize += MAXALIGN(IndexTupleSize(itup)) + sizeof(ItemIdData);
+
+ return true;
+}
+
+/*
+ * Finalize pending posting list tuple, and add it to the page. Final tuple
+ * is based on saved base tuple, and saved list of heap TIDs.
+ *
+ * Returns space saving from deduplicating to make a new posting list tuple.
+ * Note that this includes line pointer overhead. This is zero in the case
+ * where no deduplication was possible.
+ *
+ * Exported for use by recovery.
+ */
+Size
+_bt_dedup_finish_pending(Buffer buffer, BTDedupState *state, bool need_wal)
+{
+ Size spacesaving = 0;
+ Page page = BufferGetPage(buffer);
+
+ Assert(state->nitems > 0);
+ Assert(state->nitems <= state->nhtids);
+ Assert(state->interval.baseoff == state->baseoff);
+
+ if (state->nitems > 1)
+ {
+ IndexTuple final;
+ Size finalsz;
+ OffsetNumber offnum;
+ OffsetNumber deletable[MaxOffsetNumber];
+ int ndeletable = 0;
+
+ /* find all tuples that will be replaced with this new posting tuple */
+ for (offnum = state->baseoff;
+ offnum < state->baseoff + state->nitems;
+ offnum = OffsetNumberNext(offnum))
+ deletable[ndeletable++] = offnum;
+
+ /* Form a tuple with a posting list */
+ final = BTreeFormPostingTuple(state->base, state->htids,
+ state->nhtids);
+ finalsz = IndexTupleSize(final);
+ spacesaving = state->alltupsize - (finalsz + sizeof(ItemIdData));
+ /* Must have saved some space */
+ Assert(spacesaving > 0 && spacesaving < BLCKSZ);
+
+ /* Save final number of items for posting list */
+ state->interval.nitems = state->nitems;
+
+ Assert(finalsz <= state->maxitemsize);
+ Assert(finalsz == MAXALIGN(IndexTupleSize(final)));
+
+ START_CRIT_SECTION();
+
+ /* Delete items to replace */
+ PageIndexMultiDelete(page, deletable, ndeletable);
+ /* Insert posting tuple */
+ if (PageAddItem(page, (Item) final, finalsz, state->baseoff, false,
+ false) == InvalidOffsetNumber)
+ elog(ERROR, "deduplication failed to add tuple to page");
+
+ MarkBufferDirty(buffer);
+
+ /* Log deduplicated items */
+ if (need_wal)
+ {
+ XLogRecPtr recptr;
+ xl_btree_dedup xlrec_dedup;
+
+ xlrec_dedup.baseoff = state->interval.baseoff;
+ xlrec_dedup.nitems = state->interval.nitems;
+
+ XLogBeginInsert();
+ XLogRegisterBuffer(0, buffer, REGBUF_STANDARD);
+ XLogRegisterData((char *) &xlrec_dedup, SizeOfBtreeDedup);
+
+ recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_DEDUP_PAGE);
+
+ PageSetLSN(page, recptr);
+ }
+
+ END_CRIT_SECTION();
+
+ pfree(final);
+ }
+
+ /* Reset state for next pending posting list */
+ state->nhtids = 0;
+ state->nitems = 0;
+ state->alltupsize = 0;
+
+ return spacesaving;
+}
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 268f869..1b1134c2 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -24,6 +24,7 @@
#include "access/nbtree.h"
#include "access/nbtxlog.h"
+#include "access/tableam.h"
#include "access/transam.h"
#include "access/xlog.h"
#include "access/xloginsert.h"
@@ -42,12 +43,17 @@ static bool _bt_lock_branch_parent(Relation rel, BlockNumber child,
BlockNumber *target, BlockNumber *rightsib);
static void _bt_log_reuse_page(Relation rel, BlockNumber blkno,
TransactionId latestRemovedXid);
+static TransactionId _bt_compute_xid_horizon_for_tuples(Relation rel,
+ Relation heapRel,
+ Buffer buf,
+ OffsetNumber *itemnos,
+ int nitems);
/*
* _bt_initmetapage() -- Fill a page buffer with a correct metapage image
*/
void
-_bt_initmetapage(Page page, BlockNumber rootbknum, uint32 level)
+_bt_initmetapage(Page page, BlockNumber rootbknum, uint32 level, bool dedup_is_possible)
{
BTMetaPageData *metad;
BTPageOpaque metaopaque;
@@ -63,6 +69,7 @@ _bt_initmetapage(Page page, BlockNumber rootbknum, uint32 level)
metad->btm_fastlevel = level;
metad->btm_oldest_btpo_xact = InvalidTransactionId;
metad->btm_last_cleanup_num_heap_tuples = -1.0;
+ metad->btm_dedup_is_possible = dedup_is_possible;
metaopaque = (BTPageOpaque) PageGetSpecialPointer(page);
metaopaque->btpo_flags = BTP_META;
@@ -213,6 +220,7 @@ _bt_update_meta_cleanup_info(Relation rel, TransactionId oldestBtpoXact,
md.fastlevel = metad->btm_fastlevel;
md.oldest_btpo_xact = oldestBtpoXact;
md.last_cleanup_num_heap_tuples = numHeapTuples;
+ md.btm_dedup_is_possible = metad->btm_dedup_is_possible;
XLogRegisterBufData(0, (char *) &md, sizeof(xl_btree_metadata));
@@ -394,6 +402,7 @@ _bt_getroot(Relation rel, int access)
md.fastlevel = 0;
md.oldest_btpo_xact = InvalidTransactionId;
md.last_cleanup_num_heap_tuples = -1.0;
+ md.btm_dedup_is_possible = metad->btm_dedup_is_possible;
XLogRegisterBufData(2, (char *) &md, sizeof(xl_btree_metadata));
@@ -684,6 +693,59 @@ _bt_heapkeyspace(Relation rel)
}
/*
+ * _bt_getdedupispossible() -- is deduplication possible for the index?
+ * get information from metapage
+ */
+bool
+_bt_getdedupispossible(Relation rel)
+{
+ BTMetaPageData *metad;
+
+ if (rel->rd_amcache == NULL)
+ {
+ Buffer metabuf;
+
+ metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_READ);
+ metad = _bt_getmeta(rel, metabuf);
+
+ /*
+ * If there's no root page yet, _bt_getroot() doesn't expect a cache
+ * to be made, so just stop here. (XXX perhaps _bt_getroot() should
+ * be changed to allow this case.)
+ */
+ if (metad->btm_root == P_NONE)
+ {
+ _bt_relbuf(rel, metabuf);
+ return metad->btm_dedup_is_possible;
+ }
+
+ /*
+ * Cache the metapage data for next time
+ *
+ * An on-the-fly version upgrade performed by _bt_upgrademetapage()
+ * can change the nbtree version for an index without invalidating any
+ * local cache. This is okay because it can only happen when moving
+ * from version 2 to version 3, both of which are !heapkeyspace
+ * versions.
+ */
+ rel->rd_amcache = MemoryContextAlloc(rel->rd_indexcxt,
+ sizeof(BTMetaPageData));
+ memcpy(rel->rd_amcache, metad, sizeof(BTMetaPageData));
+ _bt_relbuf(rel, metabuf);
+ }
+
+ /* Get cached page */
+ metad = (BTMetaPageData *) rel->rd_amcache;
+ /* We shouldn't have cached it if any of these fail */
+ Assert(metad->btm_magic == BTREE_MAGIC);
+ Assert(metad->btm_version >= BTREE_MIN_VERSION);
+ Assert(metad->btm_version <= BTREE_VERSION);
+ Assert(metad->btm_fastroot != P_NONE);
+
+ return metad->btm_dedup_is_possible;
+}
+
+/*
* _bt_checkpage() -- Verify that a freshly-read page looks sane.
*/
void
@@ -983,14 +1045,52 @@ _bt_page_recyclable(Page page)
void
_bt_delitems_vacuum(Relation rel, Buffer buf,
OffsetNumber *itemnos, int nitems,
+ OffsetNumber *updateitemnos,
+ IndexTuple *updated, int nupdatable,
BlockNumber lastBlockVacuumed)
{
Page page = BufferGetPage(buf);
BTPageOpaque opaque;
+ Size itemsz;
+ Size updated_sz = 0;
+ char *updated_buf = NULL;
+
+ /* XLOG stuff: build buffer of updated tuples */
+ if (nupdatable > 0 && RelationNeedsWAL(rel))
+ {
+ Size offset = 0;
+
+ for (int i = 0; i < nupdatable; i++)
+ updated_sz += MAXALIGN(IndexTupleSize(updated[i]));
+
+ updated_buf = palloc(updated_sz);
+ for (int i = 0; i < nupdatable; i++)
+ {
+ itemsz = IndexTupleSize(updated[i]);
+ memcpy(updated_buf + offset, (char *) updated[i], itemsz);
+ offset += MAXALIGN(itemsz);
+ }
+ Assert(offset == updated_sz);
+ }
/* No ereport(ERROR) until changes are logged */
START_CRIT_SECTION();
+ /* Handle posting tuples here */
+ for (int i = 0; i < nupdatable; i++)
+ {
+ /* First, delete the old tuple. */
+ PageIndexTupleDelete(page, updateitemnos[i]);
+
+ itemsz = IndexTupleSize(updated[i]);
+ itemsz = MAXALIGN(itemsz);
+
+ /* Add tuple with updated ItemPointers to the page. */
+ if (PageAddItem(page, (Item) updated[i], itemsz, updateitemnos[i],
+ false, false) == InvalidOffsetNumber)
+ elog(ERROR, "failed to rewrite posting list item in index while doing vacuum");
+ }
+
/* Fix the page */
if (nitems > 0)
PageIndexMultiDelete(page, itemnos, nitems);
@@ -1020,6 +1120,8 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
xl_btree_vacuum xlrec_vacuum;
xlrec_vacuum.lastBlockVacuumed = lastBlockVacuumed;
+ xlrec_vacuum.nupdated = nupdatable;
+ xlrec_vacuum.ndeleted = nitems;
XLogBeginInsert();
XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
@@ -1033,6 +1135,19 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
if (nitems > 0)
XLogRegisterBufData(0, (char *) itemnos, nitems * sizeof(OffsetNumber));
+ /*
+ * Here we save the offsets and the updated tuples themselves. It's
+ * important to restore them in the correct order: first the updated
+ * tuples must be handled, and only after that the other deleted items.
+ */
+ if (nupdatable > 0)
+ {
+ Assert(updated_buf != NULL);
+ XLogRegisterBufData(0, (char *) updateitemnos,
+ nupdatable * sizeof(OffsetNumber));
+ XLogRegisterBufData(0, updated_buf, updated_sz);
+ }
+
recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_VACUUM);
PageSetLSN(page, recptr);
@@ -1042,6 +1157,91 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
}
/*
+ * Get the latestRemovedXid from the table entries pointed at by the index
+ * tuples being deleted.
+ *
+ * This is a version of index_compute_xid_horizon_for_tuples() specialized to
+ * nbtree, which can handle posting lists.
+ */
+static TransactionId
+_bt_compute_xid_horizon_for_tuples(Relation rel, Relation heapRel,
+ Buffer buf, OffsetNumber *itemnos,
+ int nitems)
+{
+ ItemPointer htids;
+ TransactionId latestRemovedXid = InvalidTransactionId;
+ Page page = BufferGetPage(buf);
+ int arraynitems;
+ int finalnitems;
+
+ /*
+ * Initial size of array can fit everything when it turns out that there
+ * are no posting lists
+ */
+ arraynitems = nitems;
+ htids = (ItemPointer) palloc(sizeof(ItemPointerData) * arraynitems);
+
+ finalnitems = 0;
+ /* identify what the index tuples about to be deleted point to */
+ for (int i = 0; i < nitems; i++)
+ {
+ ItemId itemid;
+ IndexTuple itup;
+
+ itemid = PageGetItemId(page, itemnos[i]);
+ itup = (IndexTuple) PageGetItem(page, itemid);
+
+ Assert(ItemIdIsDead(itemid));
+
+ if (!BTreeTupleIsPosting(itup))
+ {
+ /* Make sure that we have space for additional heap TID */
+ if (finalnitems + 1 > arraynitems)
+ {
+ arraynitems = arraynitems * 2;
+ htids = (ItemPointer)
+ repalloc(htids, sizeof(ItemPointerData) * arraynitems);
+ }
+
+ Assert(ItemPointerIsValid(&itup->t_tid));
+ ItemPointerCopy(&itup->t_tid, &htids[finalnitems]);
+ finalnitems++;
+ }
+ else
+ {
+ int nposting = BTreeTupleGetNPosting(itup);
+
+ /* Make sure that we have space for additional heap TIDs */
+ if (finalnitems + nposting > arraynitems)
+ {
+ arraynitems = Max(arraynitems * 2, finalnitems + nposting);
+ htids = (ItemPointer)
+ repalloc(htids, sizeof(ItemPointerData) * arraynitems);
+ }
+
+ for (int j = 0; j < nposting; j++)
+ {
+ ItemPointer htid = BTreeTupleGetPostingN(itup, j);
+
+ Assert(ItemPointerIsValid(htid));
+ ItemPointerCopy(htid, &htids[finalnitems]);
+ finalnitems++;
+ }
+ }
+ }
+
+ Assert(finalnitems >= nitems);
+
+ /* determine the actual xid horizon */
+ latestRemovedXid =
+ table_compute_xid_horizon_for_tuples(heapRel, htids, finalnitems);
+
+ pfree(htids);
+
+ return latestRemovedXid;
+}
+
+/*
* Delete item(s) from a btree page during single-page cleanup.
*
* As above, must only be used on leaf pages.
@@ -1067,8 +1267,8 @@ _bt_delitems_delete(Relation rel, Buffer buf,
if (XLogStandbyInfoActive() && RelationNeedsWAL(rel))
latestRemovedXid =
- index_compute_xid_horizon_for_tuples(rel, heapRel, buf,
- itemnos, nitems);
+ _bt_compute_xid_horizon_for_tuples(rel, heapRel, buf,
+ itemnos, nitems);
/* No ereport(ERROR) until changes are logged */
START_CRIT_SECTION();
@@ -2066,6 +2266,7 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, bool *rightsib_empty)
xlmeta.fastlevel = metad->btm_fastlevel;
xlmeta.oldest_btpo_xact = metad->btm_oldest_btpo_xact;
xlmeta.last_cleanup_num_heap_tuples = metad->btm_last_cleanup_num_heap_tuples;
+ xlmeta.btm_dedup_is_possible = metad->btm_dedup_is_possible;
XLogRegisterBufData(4, (char *) &xlmeta, sizeof(xl_btree_metadata));
xlinfo = XLOG_BTREE_UNLINK_PAGE_META;
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 4cfd528..0d89961 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -97,6 +97,8 @@ static void btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
BTCycleId cycleid, TransactionId *oldestBtpoXact);
static void btvacuumpage(BTVacState *vstate, BlockNumber blkno,
BlockNumber orig_blkno);
+static ItemPointer btreevacuumposting(BTVacState *vstate, IndexTuple itup,
+ int *nremaining);
/*
@@ -157,10 +159,11 @@ void
btbuildempty(Relation index)
{
Page metapage;
+ bool dedup_is_possible = _bt_dedup_is_possible(index);
/* Construct metapage. */
metapage = (Page) palloc(BLCKSZ);
- _bt_initmetapage(metapage, P_NONE, 0);
+ _bt_initmetapage(metapage, P_NONE, 0, dedup_is_possible);
/*
* Write the page and log it. It might seem that an immediate sync would
@@ -263,8 +266,8 @@ btgettuple(IndexScanDesc scan, ScanDirection dir)
*/
if (so->killedItems == NULL)
so->killedItems = (int *)
- palloc(MaxIndexTuplesPerPage * sizeof(int));
- if (so->numKilled < MaxIndexTuplesPerPage)
+ palloc(MaxPostingIndexTuplesPerPage * sizeof(int));
+ if (so->numKilled < MaxPostingIndexTuplesPerPage)
so->killedItems[so->numKilled++] = so->currPos.itemIndex;
}
@@ -816,7 +819,7 @@ _bt_vacuum_needs_cleanup(IndexVacuumInfo *info)
}
else
{
- StdRdOptions *relopts;
+ BtreeOptions *relopts;
float8 cleanup_scale_factor;
float8 prev_num_heap_tuples;
@@ -827,7 +830,7 @@ _bt_vacuum_needs_cleanup(IndexVacuumInfo *info)
* tuples exceeds vacuum_cleanup_index_scale_factor fraction of
* original tuples count.
*/
- relopts = (StdRdOptions *) info->index->rd_options;
+ relopts = (BtreeOptions *) info->index->rd_options;
cleanup_scale_factor = (relopts &&
relopts->vacuum_cleanup_index_scale_factor >= 0)
? relopts->vacuum_cleanup_index_scale_factor
@@ -1069,7 +1072,8 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
RBM_NORMAL, info->strategy);
LockBufferForCleanup(buf);
_bt_checkpage(rel, buf);
- _bt_delitems_vacuum(rel, buf, NULL, 0, vstate.lastBlockVacuumed);
+ _bt_delitems_vacuum(rel, buf, NULL, 0, NULL, NULL, 0,
+ vstate.lastBlockVacuumed);
_bt_relbuf(rel, buf);
}
@@ -1188,8 +1192,17 @@ restart:
}
else if (P_ISLEAF(opaque))
{
+ /* Deletable item state */
OffsetNumber deletable[MaxOffsetNumber];
int ndeletable;
+ int nhtidsdead;
+ int nhtidslive;
+
+ /* Updatable item state (for posting lists) */
+ IndexTuple updated[MaxOffsetNumber];
+ OffsetNumber updatable[MaxOffsetNumber];
+ int nupdatable;
+
OffsetNumber offnum,
minoff,
maxoff;
@@ -1229,6 +1242,10 @@ restart:
* callback function.
*/
ndeletable = 0;
+ nupdatable = 0;
+ /* Maintain stats counters for index tuple versions/heap TIDs */
+ nhtidsdead = 0;
+ nhtidslive = 0;
minoff = P_FIRSTDATAKEY(opaque);
maxoff = PageGetMaxOffsetNumber(page);
if (callback)
@@ -1238,11 +1255,9 @@ restart:
offnum = OffsetNumberNext(offnum))
{
IndexTuple itup;
- ItemPointer htup;
itup = (IndexTuple) PageGetItem(page,
PageGetItemId(page, offnum));
- htup = &(itup->t_tid);
/*
* During Hot Standby we currently assume that
@@ -1265,8 +1280,71 @@ restart:
* applies to *any* type of index that marks index tuples as
* killed.
*/
- if (callback(htup, callback_state))
- deletable[ndeletable++] = offnum;
+ if (!BTreeTupleIsPosting(itup))
+ {
+ /* Regular tuple, standard heap TID representation */
+ ItemPointer htid = &(itup->t_tid);
+
+ if (callback(htid, callback_state))
+ {
+ deletable[ndeletable++] = offnum;
+ nhtidsdead++;
+ }
+ else
+ nhtidslive++;
+ }
+ else
+ {
+ ItemPointer newhtids;
+ int nremaining;
+
+ /*
+ * Posting list tuple, a physical tuple that represents
+ * two or more logical tuples, any of which could be an
+ * index row version that must be removed
+ */
+ newhtids = btreevacuumposting(vstate, itup, &nremaining);
+ if (newhtids == NULL)
+ {
+ /*
+ * All TIDs/logical tuples from the posting tuple
+ * remain, so no update or delete required
+ */
+ Assert(nremaining == BTreeTupleGetNPosting(itup));
+ }
+ else if (nremaining > 0)
+ {
+ IndexTuple updatedtuple;
+
+ /*
+ * Form new tuple that contains only remaining TIDs.
+ * Remember this tuple and the offset of the old tuple
+ * for when we update it in place
+ */
+ Assert(nremaining < BTreeTupleGetNPosting(itup));
+ updatedtuple = BTreeFormPostingTuple(itup, newhtids,
+ nremaining);
+ updated[nupdatable] = updatedtuple;
+ updatable[nupdatable++] = offnum;
+ nhtidsdead += BTreeTupleGetNPosting(itup) - nremaining;
+ pfree(newhtids);
+ }
+ else
+ {
+ /*
+ * All TIDs/logical tuples from the posting list must
+ * be deleted. We'll delete the physical tuple
+ * completely.
+ */
+ deletable[ndeletable++] = offnum;
+ nhtidsdead += BTreeTupleGetNPosting(itup);
+
+ /* Free empty array of live items */
+ pfree(newhtids);
+ }
+
+ nhtidslive += nremaining;
+ }
}
}
@@ -1274,7 +1352,7 @@ restart:
* Apply any needed deletes. We issue just one _bt_delitems_vacuum()
* call per page, so as to minimize WAL traffic.
*/
- if (ndeletable > 0)
+ if (ndeletable > 0 || nupdatable > 0)
{
/*
* Notice that the issued XLOG_BTREE_VACUUM WAL record includes
@@ -1290,7 +1368,8 @@ restart:
* doesn't seem worth the amount of bookkeeping it'd take to avoid
* that.
*/
- _bt_delitems_vacuum(rel, buf, deletable, ndeletable,
+ _bt_delitems_vacuum(rel, buf, deletable, ndeletable, updatable,
+ updated, nupdatable,
vstate->lastBlockVacuumed);
/*
@@ -1300,7 +1379,7 @@ restart:
if (blkno > vstate->lastBlockVacuumed)
vstate->lastBlockVacuumed = blkno;
- stats->tuples_removed += ndeletable;
+ stats->tuples_removed += nhtidsdead;
/* must recompute maxoff */
maxoff = PageGetMaxOffsetNumber(page);
}
@@ -1315,6 +1394,7 @@ restart:
* We treat this like a hint-bit update because there's no need to
* WAL-log it.
*/
+ Assert(nhtidsdead == 0);
if (vstate->cycleid != 0 &&
opaque->btpo_cycleid == vstate->cycleid)
{
@@ -1324,15 +1404,16 @@ restart:
}
/*
- * If it's now empty, try to delete; else count the live tuples. We
- * don't delete when recursing, though, to avoid putting entries into
+ * If it's now empty, try to delete; else count the live tuples (live
+ * heap TIDs in posting lists are counted as live tuples). We don't
+ * delete when recursing, though, to avoid putting entries into
* freePages out-of-order (doesn't seem worth any extra code to handle
* the case).
*/
if (minoff > maxoff)
delete_now = (blkno == orig_blkno);
else
- stats->num_index_tuples += maxoff - minoff + 1;
+ stats->num_index_tuples += nhtidslive;
}
if (delete_now)
@@ -1376,6 +1457,68 @@ restart:
}
/*
+ * btreevacuumposting() -- determines which logical tuples must remain when
+ * VACUUMing a posting list tuple.
+ *
+ * Returns new palloc'd array of item pointers needed to build replacement
+ * posting list without the index row versions that are to be deleted.
+ *
+ * Note that returned array is NULL in the common case where there is nothing
+ * to delete in caller's posting list tuple. The number of TIDs that should
+ * remain in the posting list tuple is set for caller in *nremaining. This is
+ * also the size of the returned array (though only when array isn't just
+ * NULL).
+ */
+static ItemPointer
+btreevacuumposting(BTVacState *vstate, IndexTuple itup, int *nremaining)
+{
+ int live = 0;
+ int nitem = BTreeTupleGetNPosting(itup);
+ ItemPointer tmpitems = NULL,
+ items = BTreeTupleGetPosting(itup);
+
+ Assert(BTreeTupleIsPosting(itup));
+
+ /*
+ * Check each tuple in the posting list. Save live tuples into tmpitems,
+ * though try to avoid memory allocation as an optimization.
+ */
+ for (int i = 0; i < nitem; i++)
+ {
+ if (!vstate->callback(items + i, vstate->callback_state))
+ {
+ /*
+ * Live heap TID.
+ *
+ * Only save live TID when we know that we're going to have to
+ * kill at least one TID, and have already allocated memory.
+ */
+ if (tmpitems)
+ tmpitems[live] = items[i];
+ live++;
+ }
+
+ /* Dead heap TID */
+ else if (tmpitems == NULL)
+ {
+ /*
+ * Turns out we need to delete one or more dead heap TIDs, so
+ * start maintaining an array of live TIDs for caller to
+ * reconstruct smaller replacement posting list tuple
+ */
+ tmpitems = palloc(sizeof(ItemPointerData) * nitem);
+
+ /* Copy live heap TIDs from previous loop iterations */
+ if (live > 0)
+ memcpy(tmpitems, items, sizeof(ItemPointerData) * live);
+ }
+ }
+
+ *nremaining = live;
+ return tmpitems;
+}
+
+/*
* btcanreturn() -- Check whether btree indexes support index-only scans.
*
* btrees always do, so this is trivial.
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 8e51246..9022ee6 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -26,10 +26,18 @@
static void _bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp);
static OffsetNumber _bt_binsrch(Relation rel, BTScanInsert key, Buffer buf);
+static int _bt_binsrch_posting(BTScanInsert key, Page page,
+ OffsetNumber offnum);
static bool _bt_readpage(IndexScanDesc scan, ScanDirection dir,
OffsetNumber offnum);
static void _bt_saveitem(BTScanOpaque so, int itemIndex,
OffsetNumber offnum, IndexTuple itup);
+static void _bt_setuppostingitems(BTScanOpaque so, int itemIndex,
+ OffsetNumber offnum, ItemPointer heapTid,
+ IndexTuple itup);
+static inline void _bt_savepostingitem(BTScanOpaque so, int itemIndex,
+ OffsetNumber offnum,
+ ItemPointer heapTid);
static bool _bt_steppage(IndexScanDesc scan, ScanDirection dir);
static bool _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir);
static bool _bt_parallel_readpage(IndexScanDesc scan, BlockNumber blkno,
@@ -434,7 +442,10 @@ _bt_binsrch(Relation rel,
* low) makes bounds invalid.
*
* Caller is responsible for invalidating bounds when it modifies the page
- * before calling here a second time.
+ * before calling here a second time, and for dealing with posting list
+ * tuple matches (callers can use insertstate's postingoff field to
+ * determine which existing heap TID will need to be replaced by their
+ * scantid/new heap TID).
*/
OffsetNumber
_bt_binsrch_insert(Relation rel, BTInsertState insertstate)
@@ -453,6 +464,7 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
Assert(P_ISLEAF(opaque));
Assert(!key->nextkey);
+ Assert(insertstate->postingoff == 0);
if (!insertstate->bounds_valid)
{
@@ -509,6 +521,16 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
if (result != 0)
stricthigh = high;
}
+
+ /*
+ * If tuple at offset located by binary search is a posting list whose
+ * TID range overlaps with caller's scantid, perform posting list
+ * binary search to set postingoff for caller. Caller must split the
+ * posting list when postingoff is set. This should happen
+ * infrequently.
+ */
+ if (unlikely(result == 0 && key->scantid != NULL))
+ insertstate->postingoff = _bt_binsrch_posting(key, page, mid);
}
/*
@@ -529,6 +551,68 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
}
/*----------
+ * _bt_binsrch_posting() -- posting list binary search.
+ *
+ * Returns offset into posting list where caller's scantid belongs.
+ *----------
+ */
+static int
+_bt_binsrch_posting(BTScanInsert key, Page page, OffsetNumber offnum)
+{
+ IndexTuple itup;
+ ItemId itemid;
+ int low,
+ high,
+ mid,
+ res;
+
+ /*
+ * If this isn't a posting tuple, then the index must be corrupt (if it is
+ * an ordinary non-pivot tuple then there must be an existing tuple with a
+ * heap TID that equals inserter's new heap TID/scantid). Defensively
+ * check that tuple is a posting list tuple whose posting list range
+ * includes caller's scantid.
+ *
+ * (This is also needed because contrib/amcheck's rootdescend option needs
+ * to be able to relocate a non-pivot tuple using _bt_binsrch_insert().)
+ */
+ Assert(P_ISLEAF((BTPageOpaque) PageGetSpecialPointer(page)));
+ Assert(!key->nextkey);
+ Assert(key->scantid != NULL);
+ itemid = PageGetItemId(page, offnum);
+ itup = (IndexTuple) PageGetItem(page, itemid);
+ if (!BTreeTupleIsPosting(itup))
+ return 0;
+
+ /*
+ * In the unlikely event that posting list tuple has LP_DEAD bit set,
+ * signal to caller that it should kill the item and restart its binary
+ * search.
+ */
+ if (ItemIdIsDead(itemid))
+ return -1;
+
+ /* "high" is past end of posting list for loop invariant */
+ low = 0;
+ high = BTreeTupleGetNPosting(itup);
+ Assert(high >= 2);
+
+ while (high > low)
+ {
+ mid = low + ((high - low) / 2);
+ res = ItemPointerCompare(key->scantid,
+ BTreeTupleGetPostingN(itup, mid));
+
+ if (res >= 1)
+ low = mid + 1;
+ else
+ high = mid;
+ }
+
+ return low;
+}
+
+/*----------
* _bt_compare() -- Compare insertion-type scankey to tuple on a page.
*
* page/offnum: location of btree item to be compared to.
@@ -537,9 +621,18 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
* <0 if scankey < tuple at offnum;
* 0 if scankey == tuple at offnum;
* >0 if scankey > tuple at offnum.
- * NULLs in the keys are treated as sortable values. Therefore
- * "equality" does not necessarily mean that the item should be
- * returned to the caller as a matching key!
+ *
+ * NULLs in the keys are treated as sortable values. Therefore
+ * "equality" does not necessarily mean that the item should be returned
+ * to the caller as a matching key. Similarly, an insertion scankey
+ * with its scantid set is treated as equal to a posting tuple whose TID
+ * range overlaps with its scantid. There generally won't be a
+ * matching TID in the posting tuple, which the caller must handle
+ * itself (e.g., by splitting the posting list tuple).
+ *
+ * It is generally guaranteed that any possible scankey with scantid set
+ * will have zero or one tuples in the index that are considered equal
+ * here.
*
* CRUCIAL NOTE: on a non-leaf page, the first data key is assumed to be
* "minus infinity": this routine will always claim it is less than the
@@ -563,6 +656,7 @@ _bt_compare(Relation rel,
ScanKey scankey;
int ncmpkey;
int ntupatts;
+ int32 result;
Assert(_bt_check_natts(rel, key->heapkeyspace, page, offnum));
Assert(key->keysz <= IndexRelationGetNumberOfKeyAttributes(rel));
@@ -597,7 +691,6 @@ _bt_compare(Relation rel,
{
Datum datum;
bool isNull;
- int32 result;
datum = index_getattr(itup, scankey->sk_attno, itupdesc, &isNull);
@@ -713,8 +806,24 @@ _bt_compare(Relation rel,
if (heapTid == NULL)
return 1;
+ /*
+ * scankey must be treated as equal to a posting list tuple if its scantid
+ * value falls within the range of the posting list. In all other cases
+ * there can only be a single heap TID value, which is compared directly
+ * as a simple scalar value.
+ */
Assert(ntupatts >= IndexRelationGetNumberOfKeyAttributes(rel));
- return ItemPointerCompare(key->scantid, heapTid);
+ result = ItemPointerCompare(key->scantid, heapTid);
+ if (!BTreeTupleIsPosting(itup) || result <= 0)
+ return result;
+ else
+ {
+ result = ItemPointerCompare(key->scantid, BTreeTupleGetMaxTID(itup));
+ if (result > 0)
+ return 1;
+ }
+
+ return 0;
}
/*
@@ -1451,6 +1560,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
/* initialize tuple workspace to empty */
so->currPos.nextTupleOffset = 0;
+ so->currPos.postingTupleOffset = 0;
/*
* Now that the current page has been made consistent, the macro should be
@@ -1485,8 +1595,29 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
if (_bt_checkkeys(scan, itup, indnatts, dir, &continuescan))
{
/* tuple passes all scan key conditions, so remember it */
- _bt_saveitem(so, itemIndex, offnum, itup);
- itemIndex++;
+ if (!BTreeTupleIsPosting(itup))
+ {
+ _bt_saveitem(so, itemIndex, offnum, itup);
+ itemIndex++;
+ }
+ else
+ {
+ /*
+ * Set up state to return posting list, and save first
+ * "logical" tuple
+ */
+ _bt_setuppostingitems(so, itemIndex, offnum,
+ BTreeTupleGetPostingN(itup, 0),
+ itup);
+ itemIndex++;
+ /* Save additional posting list "logical" tuples */
+ for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
+ {
+ _bt_savepostingitem(so, itemIndex, offnum,
+ BTreeTupleGetPostingN(itup, i));
+ itemIndex++;
+ }
+ }
}
/* When !continuescan, there can't be any more matches, so stop */
if (!continuescan)
@@ -1519,7 +1650,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
if (!continuescan)
so->currPos.moreRight = false;
- Assert(itemIndex <= MaxIndexTuplesPerPage);
+ Assert(itemIndex <= MaxPostingIndexTuplesPerPage);
so->currPos.firstItem = 0;
so->currPos.lastItem = itemIndex - 1;
so->currPos.itemIndex = 0;
@@ -1527,7 +1658,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
else
{
/* load items[] in descending order */
- itemIndex = MaxIndexTuplesPerPage;
+ itemIndex = MaxPostingIndexTuplesPerPage;
offnum = Min(offnum, maxoff);
@@ -1569,8 +1700,36 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
if (passes_quals && tuple_alive)
{
/* tuple passes all scan key conditions, so remember it */
- itemIndex--;
- _bt_saveitem(so, itemIndex, offnum, itup);
+ if (!BTreeTupleIsPosting(itup))
+ {
+ itemIndex--;
+ _bt_saveitem(so, itemIndex, offnum, itup);
+ }
+ else
+ {
+ int i = BTreeTupleGetNPosting(itup) - 1;
+
+ /*
+ * Set up state to return posting list, and save last
+ * "logical" tuple from posting list (since it's the first
+ * that will be returned to scan).
+ */
+ itemIndex--;
+ _bt_setuppostingitems(so, itemIndex, offnum,
+ BTreeTupleGetPostingN(itup, i--),
+ itup);
+
+ /*
+ * Return posting list "logical" tuples -- do this in
+ * descending order, to match overall scan order
+ */
+ for (; i >= 0; i--)
+ {
+ itemIndex--;
+ _bt_savepostingitem(so, itemIndex, offnum,
+ BTreeTupleGetPostingN(itup, i));
+ }
+ }
}
if (!continuescan)
{
@@ -1584,8 +1743,8 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
Assert(itemIndex >= 0);
so->currPos.firstItem = itemIndex;
- so->currPos.lastItem = MaxIndexTuplesPerPage - 1;
- so->currPos.itemIndex = MaxIndexTuplesPerPage - 1;
+ so->currPos.lastItem = MaxPostingIndexTuplesPerPage - 1;
+ so->currPos.itemIndex = MaxPostingIndexTuplesPerPage - 1;
}
return (so->currPos.firstItem <= so->currPos.lastItem);
@@ -1598,6 +1757,8 @@ _bt_saveitem(BTScanOpaque so, int itemIndex,
{
BTScanPosItem *currItem = &so->currPos.items[itemIndex];
+ Assert(!BTreeTupleIsPosting(itup));
+
currItem->heapTid = itup->t_tid;
currItem->indexOffset = offnum;
if (so->currTuples)
@@ -1611,6 +1772,59 @@ _bt_saveitem(BTScanOpaque so, int itemIndex,
}
/*
+ * Set up state to save posting items from a single posting list tuple. Saves
+ * the logical tuple that will be returned to the scan first in passing.
+ *
+ * Saves an index item into so->currPos.items[itemIndex] for the logical tuple
+ * that is returned to the scan first. The second and subsequent heap TIDs of
+ * the posting list should be saved by calling _bt_savepostingitem().
+ */
+static void
+_bt_setuppostingitems(BTScanOpaque so, int itemIndex, OffsetNumber offnum,
+ ItemPointer heapTid, IndexTuple itup)
+{
+ BTScanPosItem *currItem = &so->currPos.items[itemIndex];
+
+ currItem->heapTid = *heapTid;
+ currItem->indexOffset = offnum;
+
+ if (so->currTuples)
+ {
+ /* Save a base version of the IndexTuple */
+ Size itupsz = BTreeTupleGetPostingOffset(itup);
+
+ itupsz = MAXALIGN(itupsz);
+ currItem->tupleOffset = so->currPos.nextTupleOffset;
+ memcpy(so->currTuples + so->currPos.nextTupleOffset, itup, itupsz);
+ so->currPos.nextTupleOffset += itupsz;
+ so->currPos.postingTupleOffset = currItem->tupleOffset;
+ }
+}
+
+/*
+ * Save an index item into so->currPos.items[itemIndex] for posting tuple.
+ *
+ * Assumes that _bt_setuppostingitems() has already been called for current
+ * posting list tuple.
+ */
+static inline void
+_bt_savepostingitem(BTScanOpaque so, int itemIndex, OffsetNumber offnum,
+ ItemPointer heapTid)
+{
+ BTScanPosItem *currItem = &so->currPos.items[itemIndex];
+
+ currItem->heapTid = *heapTid;
+ currItem->indexOffset = offnum;
+
+ /*
+ * Have index-only scans return the same base IndexTuple for every logical
+ * tuple that originates from the same posting list
+ */
+ if (so->currTuples)
+ currItem->tupleOffset = so->currPos.postingTupleOffset;
+}
+
+/*
* _bt_steppage() -- Step to next page containing valid data for scan
*
* On entry, if so->currPos.buf is valid the buffer is pinned but not locked;
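(Reviewer aside, not part of the patch.) The two posting-list lookups in the nbtsearch.c changes above can be sketched in isolation. This is only an illustration under simplifying assumptions: DemoTid, demo_tid_cmp(), demo_binsrch_posting() and demo_compare_posting() are hypothetical stand-ins for ItemPointerData, ItemPointerCompare(), _bt_binsrch_posting() and the heap TID tail of _bt_compare(); the posting list is a plain sorted array rather than part of an index tuple, and the LP_DEAD handling is left out.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for ItemPointerData: a (block, offset) heap TID */
typedef struct DemoTid
{
	uint32_t	block;
	uint16_t	offset;
} DemoTid;

/* Three-way comparison in the style of ItemPointerCompare(): -1, 0 or 1 */
static int
demo_tid_cmp(const DemoTid *a, const DemoTid *b)
{
	if (a->block != b->block)
		return (a->block < b->block) ? -1 : 1;
	if (a->offset != b->offset)
		return (a->offset < b->offset) ? -1 : 1;
	return 0;
}

/*
 * Lower-bound binary search over a sorted TID array.  Returns the offset
 * of the first TID that is >= scantid, or ntids when scantid is greater
 * than every TID.  "high" starts one past the end, as in the patch.
 */
static int
demo_binsrch_posting(const DemoTid *tids, int ntids, const DemoTid *scantid)
{
	int			low = 0;
	int			high = ntids;

	assert(ntids >= 2);			/* a posting list holds at least two TIDs */

	while (high > low)
	{
		int			mid = low + (high - low) / 2;

		if (demo_tid_cmp(scantid, &tids[mid]) > 0)
			low = mid + 1;
		else
			high = mid;
	}

	return low;
}

/*
 * Range comparison: scantid is "equal" to the posting list whenever it
 * falls within [min TID, max TID], even if no TID matches exactly.
 */
static int
demo_compare_posting(const DemoTid *tids, int ntids, const DemoTid *scantid)
{
	int			result = demo_tid_cmp(scantid, &tids[0]);

	if (result <= 0)
		return result;			/* below, or exactly at, the lowest TID */
	if (demo_tid_cmp(scantid, &tids[ntids - 1]) > 0)
		return 1;				/* above the highest TID */
	return 0;					/* inside the range */
}
```

With a list holding (1,1), (1,5), (2,3), (4,1), a scantid of (1,7) compares as equal to the list and the search yields offset 2, which is where an inserter would have to split the posting list.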
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index ab19692..cff252b 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -287,6 +287,9 @@ static void _bt_sortaddtup(Page page, Size itemsize,
IndexTuple itup, OffsetNumber itup_off);
static void _bt_buildadd(BTWriteState *wstate, BTPageState *state,
IndexTuple itup);
+static void _bt_sort_dedup_finish_pending(BTWriteState *wstate,
+ BTPageState *state,
+ BTDedupState *dstate);
static void _bt_uppershutdown(BTWriteState *wstate, BTPageState *state);
static void _bt_load(BTWriteState *wstate,
BTSpool *btspool, BTSpool *btspool2);
@@ -725,7 +728,7 @@ _bt_pagestate(BTWriteState *wstate, uint32 level)
if (level > 0)
state->btps_full = (BLCKSZ * (100 - BTREE_NONLEAF_FILLFACTOR) / 100);
else
- state->btps_full = RelationGetTargetPageFreeSpace(wstate->index,
+ state->btps_full = BtreeGetTargetPageFreeSpace(wstate->index,
BTREE_DEFAULT_FILLFACTOR);
/* no parent level, yet */
state->btps_next = NULL;
@@ -799,7 +802,8 @@ _bt_sortaddtup(Page page,
}
/*----------
- * Add an item to a disk page from the sort output.
+ * Add an item to a disk page from the sort output (or add a posting list
+ * item formed from the sort output).
*
* We must be careful to observe the page layout conventions of nbtsearch.c:
* - rightmost pages start data items at P_HIKEY instead of at P_FIRSTKEY.
@@ -1002,6 +1006,7 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
* the minimum key for the new page.
*/
state->btps_minkey = CopyIndexTuple(oitup);
+ Assert(BTreeTupleIsPivot(state->btps_minkey));
/*
* Set the sibling links for both pages.
@@ -1043,6 +1048,7 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
Assert(state->btps_minkey == NULL);
state->btps_minkey = CopyIndexTuple(itup);
/* _bt_sortaddtup() will perform full truncation later */
+ BTreeTupleClearBtIsPosting(state->btps_minkey);
BTreeTupleSetNAtts(state->btps_minkey, 0);
}
@@ -1058,6 +1064,42 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
}
/*
+ * Finalize pending posting list tuple, and add it to the index. Final tuple
+ * is based on saved base tuple, and saved list of heap TIDs.
+ *
+ * This is almost like nbtinsert.c's _bt_dedup_finish_pending(), but it adds a
+ * new tuple using _bt_buildadd() and does not maintain the intervals array.
+ */
+static void
+_bt_sort_dedup_finish_pending(BTWriteState *wstate, BTPageState *state,
+ BTDedupState *dstate)
+{
+ IndexTuple final;
+
+ Assert(dstate->nitems > 0);
+ if (dstate->nitems == 1)
+ final = dstate->base;
+ else
+ {
+ IndexTuple postingtuple;
+
+ /* form a tuple with a posting list */
+ postingtuple = BTreeFormPostingTuple(dstate->base,
+ dstate->htids,
+ dstate->nhtids);
+ final = postingtuple;
+ }
+
+ _bt_buildadd(wstate, state, final);
+
+ if (dstate->nitems > 1)
+ pfree(final);
+ /* Don't maintain dedup_intervals array, or alltupsize */
+ dstate->nhtids = 0;
+ dstate->nitems = 0;
+}
+
+/*
* Finish writing out the completed btree.
*/
static void
@@ -1123,7 +1165,8 @@ _bt_uppershutdown(BTWriteState *wstate, BTPageState *state)
* by filling in a valid magic number in the metapage.
*/
metapage = (Page) palloc(BLCKSZ);
- _bt_initmetapage(metapage, rootblkno, rootlevel);
+
+ _bt_initmetapage(metapage, rootblkno, rootlevel, wstate->inskey->dedup_is_possible);
_bt_blwritepage(wstate, metapage, BTREE_METAPAGE);
}
@@ -1144,6 +1187,10 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
keysz = IndexRelationGetNumberOfKeyAttributes(wstate->index);
SortSupport sortKeys;
int64 tuples_done = 0;
+ bool deduplicate;
+
+ deduplicate = wstate->inskey->dedup_is_possible
+ && BtreeGetDoDedupOption(wstate->index);
if (merge)
{
@@ -1152,6 +1199,13 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
* btspool and btspool2.
*/
+ /*
+ * Unique indexes may support deduplication, but it does not seem
+ * worthwhile in this case.
+ * TODO: We can probably just delete the assertion.
+ */
+ deduplicate = false;
+ Assert(!deduplicate);
/* the preparation of merge */
itup = tuplesort_getindextuple(btspool->sortstate, true);
itup2 = tuplesort_getindextuple(btspool2->sortstate, true);
@@ -1255,9 +1309,94 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
}
pfree(sortKeys);
}
+ else if (deduplicate)
+ {
+ /* merge is unnecessary, deduplicate into posting lists */
+ BTDedupState *dstate;
+ IndexTuple newbase;
+
+ dstate = (BTDedupState *) palloc(sizeof(BTDedupState));
+ dstate->deduplicate = true; /* unused */
+ dstate->maxitemsize = 0; /* set later */
+ /* Metadata about current pending posting list */
+ dstate->htids = NULL;
+ dstate->nhtids = 0;
+ dstate->nitems = 0;
+ dstate->alltupsize = 0; /* unused */
+ /* Metadata about base tuple of current pending posting list */
+ dstate->base = NULL;
+ dstate->baseoff = InvalidOffsetNumber; /* unused */
+ dstate->basetupsize = 0;
+
+ while ((itup = tuplesort_getindextuple(btspool->sortstate,
+ true)) != NULL)
+ {
+ /* When we see first tuple, create first index page */
+ if (state == NULL)
+ {
+ state = _bt_pagestate(wstate, 0);
+ dstate->maxitemsize = BTMaxItemSize(state->btps_page);
+ /* Conservatively size array */
+ dstate->htids = palloc(dstate->maxitemsize);
+
+ /*
+ * No previous/base tuple, since itup is the first item
+ * returned by the tuplesort -- use itup as base tuple of
+ * first pending posting list for entire index build
+ */
+ newbase = CopyIndexTuple(itup);
+ _bt_dedup_start_pending(dstate, newbase, InvalidOffsetNumber);
+ }
+ else if (_bt_keep_natts_fast(wstate->index, dstate->base,
+ itup) > keysz &&
+ _bt_dedup_save_htid(dstate, itup))
+ {
+ /*
+ * Tuple is equal to base tuple of pending posting list, and
+ * merging itup into pending posting list won't exceed the
+ * BTMaxItemSize() limit. Heap TID(s) for itup have been
+ * saved in state. The next iteration will also end up here
+ * if it's possible to merge the next tuple into the same
+ * pending posting list.
+ */
+ }
+ else
+ {
+ /*
+ * Tuple is not equal to pending posting list tuple, or
+ * BTMaxItemSize() limit was reached
+ */
+ _bt_sort_dedup_finish_pending(wstate, state, dstate);
+ /* Base tuple is always a copy */
+ pfree(dstate->base);
+
+ /* itup starts new pending posting list */
+ newbase = CopyIndexTuple(itup);
+ _bt_dedup_start_pending(dstate, newbase, InvalidOffsetNumber);
+ }
+
+ /* Report progress */
+ pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
+ ++tuples_done);
+ }
+
+ /*
+ * Handle the last item (there must be a last item when the tuplesort
+ * returned one or more tuples)
+ */
+ if (state)
+ {
+ _bt_sort_dedup_finish_pending(wstate, state, dstate);
+ /* Base tuple is always a copy */
+ pfree(dstate->base);
+ pfree(dstate->htids);
+ }
+
+ pfree(dstate);
+ }
else
{
- /* merge is unnecessary */
+ /* merging and deduplication are both unnecessary */
while ((itup = tuplesort_getindextuple(btspool->sortstate,
true)) != NULL)
{
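(Reviewer aside, not part of the patch.) The control flow of the deduplicating branch of _bt_load() above is: read tuples in sorted order, keep extending the pending posting list while the key is unchanged and the size limit allows, otherwise finish it and start a new one. A rough model of that loop follows. All names are hypothetical: DemoEntry and DemoPosting stand in for index tuples and finished posting tuples, integer key equality stands in for _bt_keep_natts_fast(), and a simple TID count (maxtids) stands in for the BTMaxItemSize() byte limit enforced by _bt_dedup_save_htid().

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sorted input: one (key, heap TID) pair per index tuple */
typedef struct DemoEntry
{
	int			key;
	int			tid;
} DemoEntry;

/* Hypothetical output: a run of input entries folded into one tuple */
typedef struct DemoPosting
{
	int			key;
	size_t		first;			/* index of first input entry in the run */
	int			ntids;			/* number of heap TIDs in the run */
} DemoPosting;

/*
 * Fold runs of equal keys into posting lists of at most maxtids entries.
 * "out" must have room for nin elements.  Returns the number of output
 * tuples written.
 */
static size_t
demo_dedup_sorted(const DemoEntry *in, size_t nin, int maxtids,
				  DemoPosting *out)
{
	size_t		nout = 0;

	assert(maxtids >= 1);

	for (size_t i = 0; i < nin; i++)
	{
		if (nout > 0 &&
			out[nout - 1].key == in[i].key &&
			out[nout - 1].ntids < maxtids)
		{
			/* Same key and still room: extend the pending posting list */
			out[nout - 1].ntids++;
		}
		else
		{
			/* Key changed or limit reached: start a new pending list */
			out[nout].key = in[i].key;
			out[nout].first = i;
			out[nout].ntids = 1;
			nout++;
		}
	}

	return nout;
}
```

For keys 1,1,1,2,3,3 with maxtids = 2 this produces four output tuples: two for key 1 (the limit splits the run), one for key 2, and one for key 3.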
diff --git a/src/backend/access/nbtree/nbtsplitloc.c b/src/backend/access/nbtree/nbtsplitloc.c
index 1c1029b..df976d4 100644
--- a/src/backend/access/nbtree/nbtsplitloc.c
+++ b/src/backend/access/nbtree/nbtsplitloc.c
@@ -167,7 +167,7 @@ _bt_findsplitloc(Relation rel,
/* Count up total space in data items before actually scanning 'em */
olddataitemstotal = rightspace - (int) PageGetExactFreeSpace(page);
- leaffillfactor = RelationGetFillFactor(rel, BTREE_DEFAULT_FILLFACTOR);
+ leaffillfactor = BtreeGetFillFactor(rel, BTREE_DEFAULT_FILLFACTOR);
/* Passed-in newitemsz is MAXALIGNED but does not include line pointer */
newitemsz += sizeof(ItemIdData);
@@ -183,6 +183,9 @@ _bt_findsplitloc(Relation rel,
state.minfirstrightsz = SIZE_MAX;
state.newitemoff = newitemoff;
+ /* newitem cannot be a posting list item */
+ Assert(!BTreeTupleIsPosting(newitem));
+
/*
* maxsplits should never exceed maxoff because there will be at most as
* many candidate split points as there are points _between_ tuples, once
@@ -459,17 +462,52 @@ _bt_recsplitloc(FindSplitData *state,
int16 leftfree,
rightfree;
Size firstrightitemsz;
+ Size postingsubhikey = 0;
bool newitemisfirstonright;
/* Is the new item going to be the first item on the right page? */
newitemisfirstonright = (firstoldonright == state->newitemoff
&& !newitemonleft);
+ /*
+ * FIXME: Accessing every single tuple like this adds cycles to cases that
+ * cannot possibly benefit (i.e. cases where we know that there cannot be
+ * posting lists). Maybe we should add a way to not bother when we are
+ * certain that this is the case.
+ *
+ * We could either have _bt_split() pass us a flag, or invent a page flag
+ * that indicates that the page might have posting lists, as an
+ * optimization. There is no shortage of btpo_flags bits for stuff like
+ * this.
+ */
if (newitemisfirstonright)
+ {
firstrightitemsz = state->newitemsz;
+
+ /* Calculate posting list overhead, if any */
+ if (state->is_leaf && BTreeTupleIsPosting(state->newitem))
+ postingsubhikey = IndexTupleSize(state->newitem) -
+ BTreeTupleGetPostingOffset(state->newitem);
+ }
else
+ {
firstrightitemsz = firstoldonrightsz;
+ /* Calculate posting list overhead, if any */
+ if (state->is_leaf)
+ {
+ ItemId itemid;
+ IndexTuple newhighkey;
+
+ itemid = PageGetItemId(state->page, firstoldonright);
+ newhighkey = (IndexTuple) PageGetItem(state->page, itemid);
+
+ if (BTreeTupleIsPosting(newhighkey))
+ postingsubhikey = IndexTupleSize(newhighkey) -
+ BTreeTupleGetPostingOffset(newhighkey);
+ }
+ }
+
/* Account for all the old tuples */
leftfree = state->leftspace - olddataitemstoleft;
rightfree = state->rightspace -
@@ -492,9 +530,13 @@ _bt_recsplitloc(FindSplitData *state,
* adding a heap TID to the left half's new high key when splitting at the
* leaf level. In practice the new high key will often be smaller and
* will rarely be larger, but conservatively assume the worst case.
+ * Truncation always truncates away any posting list that appears in the
+ * first right tuple, though, so it's safe to subtract that overhead
+ * (while still conservatively assuming that truncation might have to add
+ * back a single heap TID using the pivot tuple heap TID representation).
*/
if (state->is_leaf)
- leftfree -= (int16) (firstrightitemsz +
+ leftfree -= (int16) ((firstrightitemsz - postingsubhikey) +
MAXALIGN(sizeof(ItemPointerData)));
else
leftfree -= (int16) firstrightitemsz;
@@ -691,7 +733,8 @@ _bt_afternewitemoff(FindSplitData *state, OffsetNumber maxoff,
itemid = PageGetItemId(state->page, OffsetNumberPrev(state->newitemoff));
tup = (IndexTuple) PageGetItem(state->page, itemid);
/* Do cheaper test first */
- if (!_bt_adjacenthtid(&tup->t_tid, &state->newitem->t_tid))
+ if (BTreeTupleIsPosting(tup) ||
+ !_bt_adjacenthtid(&tup->t_tid, &state->newitem->t_tid))
return false;
/* Check same conditions as rightmost item case, too */
keepnatts = _bt_keep_natts_fast(state->rel, tup, state->newitem);
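(Reviewer aside, not part of the patch.) The space accounting change in _bt_recsplitloc() above comes down to simple arithmetic: the new high key is charged at the size of the first right tuple, minus any posting list it carries, plus room for one heap TID in the pivot representation. A small sketch of that calculation, assuming 8-byte MAXALIGN and a 6-byte heap TID; demo_hikey_reserve() and the DEMO_ macros are hypothetical names, and postingoff == 0 is used here to mean "not a posting list tuple".

```c
#include <assert.h>
#include <stddef.h>

/* Assumes 8-byte maximum alignment, as on most 64-bit platforms */
#define DEMO_MAXALIGN(len)	(((size_t) (len) + 7) & ~((size_t) 7))
/* Size of a heap TID (block number plus offset number) in bytes */
#define DEMO_TID_SIZE		6

/*
 * Worst-case space to reserve on the left page for the new high key.
 *
 * firstrightsz: size of the tuple that will be first on the right page
 * postingoff:   offset of its posting list, or 0 if it has none
 * is_leaf:      nonzero when splitting a leaf page
 */
static size_t
demo_hikey_reserve(size_t firstrightsz, size_t postingoff, int is_leaf)
{
	size_t		postingsubhikey = 0;

	/* Internal pages: the high key is charged at full size, no TID added */
	if (!is_leaf)
		return firstrightsz;

	/* Truncation always removes the posting list, so don't charge for it */
	if (postingoff > 0)
	{
		assert(postingoff <= firstrightsz);
		postingsubhikey = firstrightsz - postingoff;
	}

	/* Still assume truncation may need to add back a single heap TID */
	return (firstrightsz - postingsubhikey) + DEMO_MAXALIGN(DEMO_TID_SIZE);
}
```

For a 136-byte posting tuple whose posting list starts at offset 16, the reservation drops from 144 bytes to 24 bytes, which is why candidate split points next to large posting lists no longer look artificially expensive.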
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index bc855dd..e6a64f8 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -97,8 +97,6 @@ _bt_mkscankey(Relation rel, IndexTuple itup)
indoption = rel->rd_indoption;
tupnatts = itup ? BTreeTupleGetNAtts(itup, rel) : 0;
- Assert(tupnatts <= IndexRelationGetNumberOfAttributes(rel));
-
/*
* We'll execute search using scan key constructed on key columns.
* Truncated attributes and non-key attributes are omitted from the final
@@ -110,9 +108,23 @@ _bt_mkscankey(Relation rel, IndexTuple itup)
key->anynullkeys = false; /* initial assumption */
key->nextkey = false;
key->pivotsearch = false;
+ key->scantid = NULL;
key->keysz = Min(indnkeyatts, tupnatts);
- key->scantid = key->heapkeyspace && itup ?
- BTreeTupleGetHeapTID(itup) : NULL;
+ /* get information from relation info or from btree metapage */
+ key->dedup_is_possible = (itup == NULL) ? _bt_dedup_is_possible(rel) :
+ _bt_getdedupispossible(rel);
+
+ Assert(tupnatts <= IndexRelationGetNumberOfAttributes(rel));
+ Assert(!itup || !BTreeTupleIsPosting(itup) || key->heapkeyspace);
+
+ /*
+ * When caller passes a tuple with a heap TID, use it to set scantid. Note
+ * that this handles posting list tuples by setting scantid to the lowest
+ * heap TID in the posting list.
+ */
+ if (itup && key->heapkeyspace)
+ key->scantid = BTreeTupleGetHeapTID(itup);
+
skey = key->scankeys;
for (i = 0; i < indnkeyatts; i++)
{
@@ -1386,6 +1398,7 @@ _bt_checkkeys(IndexScanDesc scan, IndexTuple tuple, int tupnatts,
* attribute passes the qual.
*/
Assert(ScanDirectionIsForward(dir));
+ Assert(BTreeTupleIsPivot(tuple));
continue;
}
@@ -1547,6 +1560,7 @@ _bt_check_rowcompare(ScanKey skey, IndexTuple tuple, int tupnatts,
* attribute passes the qual.
*/
Assert(ScanDirectionIsForward(dir));
+ Assert(BTreeTupleIsPivot(tuple));
cmpresult = 0;
if (subkey->sk_flags & SK_ROW_END)
break;
@@ -1786,10 +1800,35 @@ _bt_killitems(IndexScanDesc scan)
{
ItemId iid = PageGetItemId(page, offnum);
IndexTuple ituple = (IndexTuple) PageGetItem(page, iid);
+ bool killtuple = false;
- if (ItemPointerEquals(&ituple->t_tid, &kitem->heapTid))
+ if (BTreeTupleIsPosting(ituple))
{
- /* found the item */
+ int pi = i + 1;
+ int nposting = BTreeTupleGetNPosting(ituple);
+ int j;
+
+ for (j = 0; j < nposting; j++)
+ {
+ ItemPointer item = BTreeTupleGetPostingN(ituple, j);
+
+ if (!ItemPointerEquals(item, &kitem->heapTid))
+ break; /* out of posting list loop */
+
+ /* Read-ahead to later kitems */
+ if (pi < numKilled)
+ kitem = &so->currPos.items[so->killedItems[pi++]];
+ }
+
+ if (j == nposting)
+ killtuple = true;
+ }
+ else if (ItemPointerEquals(&ituple->t_tid, &kitem->heapTid))
+ killtuple = true;
+
+ if (killtuple)
+ {
+ /* found the item/all posting list items */
ItemIdMarkDead(iid);
killedsomething = true;
break; /* out of inner search loop */
@@ -2027,7 +2066,30 @@ BTreeShmemInit(void)
bytea *
btoptions(Datum reloptions, bool validate)
{
- return default_reloptions(reloptions, validate, RELOPT_KIND_BTREE);
+ relopt_value *options;
+ BtreeOptions *rdopts;
+ int numoptions;
+ static const relopt_parse_elt tab[] = {
+ {"fillfactor", RELOPT_TYPE_INT, offsetof(BtreeOptions, fillfactor)},
+ {"vacuum_cleanup_index_scale_factor", RELOPT_TYPE_REAL,
+ offsetof(BtreeOptions, vacuum_cleanup_index_scale_factor)},
+ {"deduplication", RELOPT_TYPE_BOOL, offsetof(BtreeOptions, do_deduplication)}
+ };
+
+ options = parseRelOptions(reloptions, validate, RELOPT_KIND_BTREE,
+ &numoptions);
+
+ /* if none set, we're done */
+ if (numoptions == 0)
+ return NULL;
+
+ rdopts = allocateReloptStruct(sizeof(BtreeOptions), options, numoptions);
+
+ fillRelOptions((void *) rdopts, sizeof(BtreeOptions), options, numoptions,
+ validate, tab, lengthof(tab));
+
+ pfree(options);
+ return (bytea *) rdopts;
}
/*
@@ -2140,6 +2202,24 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
pivot = index_truncate_tuple(itupdesc, firstright, keepnatts);
+ if (BTreeTupleIsPosting(firstright))
+ {
+ BTreeTupleClearBtIsPosting(pivot);
+ BTreeTupleSetNAtts(pivot, keepnatts);
+ if (keepnatts == natts)
+ {
+ /*
+ * index_truncate_tuple() just returned a copy of the
+ * original, so make sure that the size of the new pivot tuple
+ * doesn't have posting list overhead
+ */
+ pivot->t_info &= ~INDEX_SIZE_MASK;
+ pivot->t_info |= MAXALIGN(BTreeTupleGetPostingOffset(firstright));
+ }
+ }
+
+ Assert(!BTreeTupleIsPosting(pivot));
+
/*
* If there is a distinguishing key attribute within new pivot tuple,
* there is no need to add an explicit heap TID attribute
@@ -2156,6 +2236,8 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
* attribute to the new pivot tuple.
*/
Assert(natts != nkeyatts);
+ Assert(!BTreeTupleIsPosting(lastleft) &&
+ !BTreeTupleIsPosting(firstright));
newsize = IndexTupleSize(pivot) + MAXALIGN(sizeof(ItemPointerData));
tidpivot = palloc0(newsize);
memcpy(tidpivot, pivot, IndexTupleSize(pivot));
@@ -2163,6 +2245,24 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
pfree(pivot);
pivot = tidpivot;
}
+ else if (BTreeTupleIsPosting(firstright))
+ {
+ /*
+ * No truncation was possible, since key attributes are all equal. We
+ * can always truncate away a posting list, though.
+ *
+ * It's necessary to add a heap TID attribute to the new pivot tuple.
+ */
+ newsize = MAXALIGN(BTreeTupleGetPostingOffset(firstright)) +
+ MAXALIGN(sizeof(ItemPointerData));
+ pivot = palloc0(newsize);
+ memcpy(pivot, firstright, BTreeTupleGetPostingOffset(firstright));
+
+ pivot->t_info &= ~INDEX_SIZE_MASK;
+ pivot->t_info |= newsize;
+ BTreeTupleClearBtIsPosting(pivot);
+ BTreeTupleSetAltHeapTID(pivot);
+ }
else
{
/*
@@ -2170,7 +2270,8 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
* It's necessary to add a heap TID attribute to the new pivot tuple.
*/
Assert(natts == nkeyatts);
- newsize = IndexTupleSize(firstright) + MAXALIGN(sizeof(ItemPointerData));
+ newsize = MAXALIGN(IndexTupleSize(firstright)) +
+ MAXALIGN(sizeof(ItemPointerData));
pivot = palloc0(newsize);
memcpy(pivot, firstright, IndexTupleSize(firstright));
}
@@ -2188,6 +2289,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
* nbtree (e.g., there is no pg_attribute entry).
*/
Assert(itup_key->heapkeyspace);
+ Assert(!BTreeTupleIsPosting(pivot));
pivot->t_info &= ~INDEX_SIZE_MASK;
pivot->t_info |= newsize;
@@ -2200,7 +2302,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
*/
pivotheaptid = (ItemPointer) ((char *) pivot + newsize -
sizeof(ItemPointerData));
- ItemPointerCopy(&lastleft->t_tid, pivotheaptid);
+ ItemPointerCopy(BTreeTupleGetMaxTID(lastleft), pivotheaptid);
/*
* Lehman and Yao require that the downlink to the right page, which is to
@@ -2211,9 +2313,12 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
* tiebreaker.
*/
#ifndef DEBUG_NO_TRUNCATE
- Assert(ItemPointerCompare(&lastleft->t_tid, &firstright->t_tid) < 0);
- Assert(ItemPointerCompare(pivotheaptid, &lastleft->t_tid) >= 0);
- Assert(ItemPointerCompare(pivotheaptid, &firstright->t_tid) < 0);
+ Assert(ItemPointerCompare(BTreeTupleGetMaxTID(lastleft),
+ BTreeTupleGetHeapTID(firstright)) < 0);
+ Assert(ItemPointerCompare(pivotheaptid,
+ BTreeTupleGetHeapTID(lastleft)) >= 0);
+ Assert(ItemPointerCompare(pivotheaptid,
+ BTreeTupleGetHeapTID(firstright)) < 0);
#else
/*
@@ -2226,7 +2331,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
* attribute values along with lastleft's heap TID value when lastleft's
* TID happens to be greater than firstright's TID.
*/
- ItemPointerCopy(&firstright->t_tid, pivotheaptid);
+ ItemPointerCopy(BTreeTupleGetHeapTID(firstright), pivotheaptid);
/*
* Pivot heap TID should never be fully equal to firstright. Note that
@@ -2235,7 +2340,8 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
*/
ItemPointerSetOffsetNumber(pivotheaptid,
OffsetNumberPrev(ItemPointerGetOffsetNumber(pivotheaptid)));
- Assert(ItemPointerCompare(pivotheaptid, &firstright->t_tid) < 0);
+ Assert(ItemPointerCompare(pivotheaptid,
+ BTreeTupleGetHeapTID(firstright)) < 0);
#endif
BTreeTupleSetNAtts(pivot, nkeyatts);
@@ -2316,15 +2422,25 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
* The approach taken here usually provides the same answer as _bt_keep_natts
* will (for the same pair of tuples from a heapkeyspace index), since the
* majority of btree opclasses can never indicate that two datums are equal
- * unless they're bitwise equal (once detoasted). Similarly, result may
- * differ from the _bt_keep_natts result when either tuple has TOASTed datums,
- * though this is barely possible in practice.
+ * unless they're bitwise equal after detoasting.
*
* These issues must be acceptable to callers, typically because they're only
* concerned about making suffix truncation as effective as possible without
* leaving excessive amounts of free space on either side of page split.
* Callers can rely on the fact that attributes considered equal here are
* definitely also equal according to _bt_keep_natts.
+ *
+ * When an index only uses opclasses where equality is "precise", this
+ * function is guaranteed to give the same result as _bt_keep_natts(). This
+ * makes it safe to use this function to determine whether or not two tuples
+ * can be folded together into a single posting tuple. Posting list
+ * deduplication cannot be used with nondeterministic collations for this
+ * reason.
+ *
+ * FIXME: Actually invent the needed "equality-is-precise" opclass
+ * infrastructure. See dedicated -hackers thread:
+ *
+ * https://postgr.es/m/CAH2-Wzn3Ee49Gmxb7V1VJ3-AC8fWn-Fr8pfWQebHe8rYRxt5OQ@mail.gmail.com
*/
int
_bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
@@ -2349,8 +2465,38 @@ _bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
if (isNull1 != isNull2)
break;
+ /*
+ * XXX: The ideal outcome from the point of view of the posting list
+ * patch is that the definition of an opclass with "precise equality"
+ * becomes: "equality operator function must give exactly the same
+ * answer as datum_image_eq() would, provided that we aren't using a
+ * nondeterministic collation". (Nondeterministic collations are
+ * clearly not compatible with deduplication.)
+ *
+ * This will be a lot faster than actually using the authoritative
+ * insertion scankey in some cases. This approach also seems more
+ * elegant, since suffix truncation gets to follow exactly the same
+ * definition of "equal" as posting list deduplication -- there is a
+ * subtle interplay between deduplication and suffix truncation, and
+ * it would be nice to know for sure that they have exactly the same
+ * idea about what equality is.
+ *
+ * This ideal outcome still avoids problems with TOAST. We cannot
+ * repeat bugs like the amcheck bug that was fixed in bugfix commit
+ * eba775345d23d2c999bbb412ae658b6dab36e3e8. datum_image_eq()
+ * considers binary equality, though only _after_ each datum is
+ * decompressed.
+ *
+ * If this ideal solution isn't possible, then we can fall back on
+ * defining "precise equality" as: "type's output function must
+ * produce identical textual output for any two datums that compare
+ * equal when using a safe/equality-is-precise operator class (unless
+ * using a nondeterministic collation)". That would mean that we'd
+ * have to make deduplication call _bt_keep_natts() instead (or some
+ * other function that uses authoritative insertion scankey).
+ */
if (!isNull1 &&
- !datumIsEqual(datum1, datum2, att->attbyval, att->attlen))
+ !datum_image_eq(datum1, datum2, att->attbyval, att->attlen))
break;
keepnatts++;
@@ -2402,22 +2548,30 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
tupnatts = BTreeTupleGetNAtts(itup, rel);
+ /* !heapkeyspace indexes do not support deduplication */
+ if (!heapkeyspace && BTreeTupleIsPosting(itup))
+ return false;
+
+ /* INCLUDE indexes do not support deduplication */
+ if (natts != nkeyatts && BTreeTupleIsPosting(itup))
+ return false;
+
if (P_ISLEAF(opaque))
{
if (offnum >= P_FIRSTDATAKEY(opaque))
{
/*
- * Non-pivot tuples currently never use alternative heap TID
- * representation -- even those within heapkeyspace indexes
+ * Non-pivot tuple should never be explicitly marked as a pivot
+ * tuple
*/
- if ((itup->t_info & INDEX_ALT_TID_MASK) != 0)
+ if (BTreeTupleIsPivot(itup))
return false;
/*
* Leaf tuples that are not the page high key (non-pivot tuples)
* should never be truncated. (Note that tupnatts must have been
- * inferred, rather than coming from an explicit on-disk
- * representation.)
+ * inferred, even with a posting list tuple, because only pivot
+ * tuples store tupnatts directly.)
*/
return tupnatts == natts;
}
@@ -2461,12 +2615,12 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
* non-zero, or when there is no explicit representation and the
* tuple is evidently not a pre-pg_upgrade tuple.
*
- * Prior to v11, downlinks always had P_HIKEY as their offset. Use
- * that to decide if the tuple is a pre-v11 tuple.
+ * Prior to v11, downlinks always had P_HIKEY as their offset.
+ * Accept that as an alternative indication of a valid
+ * !heapkeyspace negative infinity tuple.
*/
return tupnatts == 0 ||
- ((itup->t_info & INDEX_ALT_TID_MASK) == 0 &&
- ItemPointerGetOffsetNumber(&(itup->t_tid)) == P_HIKEY);
+ ItemPointerGetOffsetNumber(&(itup->t_tid)) == P_HIKEY;
}
else
{
@@ -2492,7 +2646,11 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
* heapkeyspace index pivot tuples, regardless of whether or not there are
* non-key attributes.
*/
- if ((itup->t_info & INDEX_ALT_TID_MASK) == 0)
+ if (!BTreeTupleIsPivot(itup))
+ return false;
+
+ /* Pivot tuple should not use posting list representation (redundant) */
+ if (BTreeTupleIsPosting(itup))
return false;
/*
@@ -2562,11 +2720,115 @@ _bt_check_third_page(Relation rel, Relation heap, bool needheaptidspace,
BTMaxItemSizeNoHeapTid(page),
RelationGetRelationName(rel)),
errdetail("Index row references tuple (%u,%u) in relation \"%s\".",
- ItemPointerGetBlockNumber(&newtup->t_tid),
- ItemPointerGetOffsetNumber(&newtup->t_tid),
+ ItemPointerGetBlockNumber(BTreeTupleGetHeapTID(newtup)),
+ ItemPointerGetOffsetNumber(BTreeTupleGetHeapTID(newtup)),
RelationGetRelationName(heap)),
errhint("Values larger than 1/3 of a buffer page cannot be indexed.\n"
"Consider a function index of an MD5 hash of the value, "
"or use full text indexing."),
errtableconstraint(heap, RelationGetRelationName(rel))));
}
+
+/*
+ * Given a base tuple that contains key datum and posting list, build a
+ * posting tuple. Caller's "htids" array must be sorted in ascending order.
+ *
+ * Base tuple can be a posting tuple, but we only use the key part of it; all
+ * ItemPointers must be passed via htids.
+ *
+ * If nhtids == 1, just build a non-posting tuple. This is necessary to avoid
+ * storage overhead after a posting tuple has been vacuumed.
+ */
+IndexTuple
+BTreeFormPostingTuple(IndexTuple tuple, ItemPointer htids, int nhtids)
+{
+ uint32 keysize,
+ newsize = 0;
+ IndexTuple itup;
+
+ /* We only need key part of the tuple */
+ if (BTreeTupleIsPosting(tuple))
+ keysize = BTreeTupleGetPostingOffset(tuple);
+ else
+ keysize = IndexTupleSize(tuple);
+
+ Assert(nhtids > 0);
+
+ /* Add space needed for posting list */
+ if (nhtids > 1)
+ newsize = SHORTALIGN(keysize) + sizeof(ItemPointerData) * nhtids;
+ else
+ newsize = keysize;
+
+ newsize = MAXALIGN(newsize);
+ itup = palloc0(newsize);
+ memcpy(itup, tuple, keysize);
+ itup->t_info &= ~INDEX_SIZE_MASK;
+ itup->t_info |= newsize;
+
+ if (nhtids > 1)
+ {
+ /* Form posting tuple, fill posting fields */
+
+ itup->t_info |= INDEX_ALT_TID_MASK;
+ BTreeSetPostingMeta(itup, nhtids, SHORTALIGN(keysize));
+ /* Copy posting list into the posting tuple */
+ memcpy(BTreeTupleGetPosting(itup), htids,
+ sizeof(ItemPointerData) * nhtids);
+
+#ifdef USE_ASSERT_CHECKING
+ {
+ /* Assert that htid array is sorted and has unique TIDs */
+ ItemPointerData last;
+ ItemPointer current;
+
+ ItemPointerCopy(BTreeTupleGetHeapTID(itup), &last);
+
+ for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
+ {
+ current = BTreeTupleGetPostingN(itup, i);
+ Assert(ItemPointerCompare(current, &last) > 0);
+ ItemPointerCopy(current, &last);
+ }
+ }
+#endif
+ }
+ else
+ {
+ /* To finish building of a non-posting tuple, copy TID from htids */
+ itup->t_info &= ~INDEX_ALT_TID_MASK;
+ ItemPointerCopy(htids, &itup->t_tid);
+ }
+
+ return itup;
+}
+
+bool
+_bt_dedup_is_possible(Relation index)
+{
+ bool dedup_is_possible = false;
+
+ if (IndexRelationGetNumberOfAttributes(index)
+ == IndexRelationGetNumberOfKeyAttributes(index))
+ {
+ int i;
+
+ dedup_is_possible = true;
+
+ for (i = 0; i < IndexRelationGetNumberOfKeyAttributes(index); i++)
+ {
+ Oid opfamily = index->rd_opfamily[i];
+ Oid collation = index->rd_indcollation[i];
+
+ /* TODO: add adequate check of opclasses and collations */
+ elog(DEBUG4, "index %s column i %d opfamilyOid %u collationOid %u",
+ RelationGetRelationName(index), i, opfamily, collation);
+ if (opfamily == 1988) /* NUMERIC BTREE OPFAMILY */
+ {
+ return false;
+ }
+ }
+ }
+
+ return dedup_is_possible;
+}
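For anyone who wants to sanity-check the size arithmetic in BTreeFormPostingTuple() without building the tree, here is a minimal standalone sketch. It assumes 8-byte MAXALIGN, 2-byte SHORTALIGN and a 6-byte ItemPointerData, which are typical but platform-dependent. The sketch_* and SKETCH_* names are hypothetical and not part of the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed platform constants (not taken from the patch) */
#define SKETCH_MAXALIGN(len)   (((size_t) (len) + 7) & ~(size_t) 7)
#define SKETCH_SHORTALIGN(len) (((size_t) (len) + 1) & ~(size_t) 1)
#define SKETCH_TID_SIZE        ((size_t) 6)	/* assumed sizeof(ItemPointerData) */

/*
 * Mirror of the size computation in BTreeFormPostingTuple(): a posting
 * tuple is the SHORTALIGNed key part followed by nhtids heap TIDs, and
 * the whole tuple is MAXALIGNed.  With a single TID no posting list is
 * stored, so only the key part counts.
 */
static size_t
sketch_posting_tuple_size(size_t keysize, int nhtids)
{
	size_t		newsize;

	assert(nhtids > 0);

	if (nhtids > 1)
		newsize = SKETCH_SHORTALIGN(keysize) + SKETCH_TID_SIZE * (size_t) nhtids;
	else
		newsize = keysize;

	return SKETCH_MAXALIGN(newsize);
}
```

Under those assumptions, a 16-byte key part with three heap TIDs needs one 40-byte posting tuple, where three separate tuples would need 16 bytes each plus a line pointer each.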
diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index dd5315c..3489cf2 100644
--- a/src/backend/access/nbtree/nbtxlog.c
+++ b/src/backend/access/nbtree/nbtxlog.c
@@ -21,8 +21,11 @@
#include "access/xlog.h"
#include "access/xlogutils.h"
#include "storage/procarray.h"
+#include "utils/memutils.h"
#include "miscadmin.h"
+static MemoryContext opCtx; /* working memory for operations */
+
/*
* _bt_restore_page -- re-enter all the index tuples on a page
*
@@ -111,6 +114,7 @@ _bt_restore_meta(XLogReaderState *record, uint8 block_id)
Assert(md->btm_version >= BTREE_NOVAC_VERSION);
md->btm_oldest_btpo_xact = xlrec->oldest_btpo_xact;
md->btm_last_cleanup_num_heap_tuples = xlrec->last_cleanup_num_heap_tuples;
+ md->btm_dedup_is_possible = xlrec->btm_dedup_is_possible;
pageop = (BTPageOpaque) PageGetSpecialPointer(metapg);
pageop->btpo_flags = BTP_META;
@@ -181,9 +185,46 @@ btree_xlog_insert(bool isleaf, bool ismeta, XLogReaderState *record)
page = BufferGetPage(buffer);
- if (PageAddItem(page, (Item) datapos, datalen, xlrec->offnum,
- false, false) == InvalidOffsetNumber)
- elog(PANIC, "btree_xlog_insert: failed to add item");
+ if (xlrec->postingoff == InvalidOffsetNumber)
+ {
+ /* Simple retail insertion */
+ if (PageAddItem(page, (Item) datapos, datalen, xlrec->offnum,
+ false, false) == InvalidOffsetNumber)
+ elog(PANIC, "btree_xlog_insert: failed to add item");
+ }
+ else
+ {
+ ItemId itemid;
+ IndexTuple oposting,
+ newitem,
+ nposting;
+
+ /*
+ * A posting list split occurred during insertion.
+ *
+ * Use _bt_posting_split() to repeat posting list split steps from
+ * primary. Note that newitem from WAL record is 'orignewitem',
+ * not the final version of newitem that is actually inserted on
+ * page.
+ */
+ Assert(isleaf);
+ itemid = PageGetItemId(page, OffsetNumberPrev(xlrec->offnum));
+ oposting = (IndexTuple) PageGetItem(page, itemid);
+
+ /* newitem must be mutable copy for _bt_posting_split() */
+ newitem = CopyIndexTuple((IndexTuple) datapos);
+ nposting = _bt_posting_split(newitem, oposting,
+ xlrec->postingoff);
+
+ /* Replace existing posting list with post-split version */
+ memcpy(oposting, nposting, MAXALIGN(IndexTupleSize(nposting)));
+
+ /* insert new item */
+ Assert(IndexTupleSize(newitem) == datalen);
+ if (PageAddItem(page, (Item) newitem, datalen, xlrec->offnum,
+ false, false) == InvalidOffsetNumber)
+ elog(PANIC, "btree_xlog_insert: failed to add posting split new item");
+ }
PageSetLSN(page, lsn);
MarkBufferDirty(buffer);
@@ -265,20 +306,42 @@ btree_xlog_split(bool onleft, XLogReaderState *record)
BTPageOpaque lopaque = (BTPageOpaque) PageGetSpecialPointer(lpage);
OffsetNumber off;
IndexTuple newitem = NULL,
- left_hikey = NULL;
+ left_hikey = NULL,
+ nposting = NULL;
Size newitemsz = 0,
left_hikeysz = 0;
Page newlpage;
- OffsetNumber leftoff;
+ OffsetNumber leftoff,
+ replacepostingoff = InvalidOffsetNumber;
datapos = XLogRecGetBlockData(record, 0, &datalen);
- if (onleft)
+ if (onleft || xlrec->postingoff != 0)
{
newitem = (IndexTuple) datapos;
newitemsz = MAXALIGN(IndexTupleSize(newitem));
datapos += newitemsz;
datalen -= newitemsz;
+
+ if (xlrec->postingoff != 0)
+ {
+ /*
+ * Use _bt_posting_split() to repeat posting list split steps
+ * from primary
+ */
+ ItemId itemid;
+ IndexTuple oposting;
+
+ /* Posting list must be at offset number before new item's */
+ replacepostingoff = OffsetNumberPrev(xlrec->newitemoff);
+
+ /* newitem must be mutable copy for _bt_posting_split() */
+ newitem = CopyIndexTuple(newitem);
+ itemid = PageGetItemId(lpage, replacepostingoff);
+ oposting = (IndexTuple) PageGetItem(lpage, itemid);
+ nposting = _bt_posting_split(newitem, oposting,
+ xlrec->postingoff);
+ }
}
/* Extract left hikey and its size (assuming 16-bit alignment) */
@@ -304,8 +367,20 @@ btree_xlog_split(bool onleft, XLogReaderState *record)
Size itemsz;
IndexTuple item;
+ /* Add replacement posting list when required */
+ if (off == replacepostingoff)
+ {
+ Assert(onleft || xlrec->firstright == xlrec->newitemoff);
+ if (PageAddItem(newlpage, (Item) nposting,
+ MAXALIGN(IndexTupleSize(nposting)), leftoff,
+ false, false) == InvalidOffsetNumber)
+ elog(ERROR, "failed to add new posting list item to left page after split");
+ leftoff = OffsetNumberNext(leftoff);
+ continue;
+ }
+
/* add the new item if it was inserted on left page */
- if (onleft && off == xlrec->newitemoff)
+ else if (onleft && off == xlrec->newitemoff)
{
if (PageAddItem(newlpage, (Item) newitem, newitemsz, leftoff,
false, false) == InvalidOffsetNumber)
@@ -380,14 +455,89 @@ btree_xlog_split(bool onleft, XLogReaderState *record)
}
static void
+btree_xlog_dedup(XLogReaderState *record)
+{
+ XLogRecPtr lsn = record->EndRecPtr;
+ Buffer buf;
+ xl_btree_dedup *xlrec = (xl_btree_dedup *) XLogRecGetData(record);
+
+ if (XLogReadBufferForRedo(record, 0, &buf) == BLK_NEEDS_REDO)
+ {
+ /*
+ * Replay deduplication of the interval of items described by the
+ * WAL record (xlrec->baseoff, xlrec->nitems).
+ */
+ Page page = (Page) BufferGetPage(buf);
+ OffsetNumber offnum;
+ BTDedupState *state;
+
+ state = (BTDedupState *) palloc(sizeof(BTDedupState));
+
+ state->deduplicate = true; /* unused */
+ state->maxitemsize = BTMaxItemSize(page);
+ /* Metadata about current pending posting list */
+ state->htids = NULL;
+ state->nhtids = 0;
+ state->nitems = 0;
+ state->alltupsize = 0;
+ /* Metadata about base tuple of current pending posting list */
+ state->base = NULL;
+ state->baseoff = InvalidOffsetNumber;
+ state->basetupsize = 0;
+
+ /* Conservatively size array */
+ state->htids = palloc(state->maxitemsize);
+
+ /*
+ * Iterate over tuples on the page belonging to the interval
+ * to deduplicate them into a posting list.
+ */
+ for (offnum = xlrec->baseoff;
+ offnum < xlrec->baseoff + xlrec->nitems;
+ offnum = OffsetNumberNext(offnum))
+ {
+ ItemId itemid = PageGetItemId(page, offnum);
+ IndexTuple itup = (IndexTuple) PageGetItem(page, itemid);
+
+ Assert(!ItemIdIsDead(itemid));
+
+ if (offnum == xlrec->baseoff)
+ {
+ /*
+ * No previous/base tuple for first data item -- use first
+ * data item as base tuple of first pending posting list
+ */
+ _bt_dedup_start_pending(state, itup, offnum);
+ }
+ else
+ {
+ /* Heap TID(s) for itup will be saved in state */
+ if (!_bt_dedup_save_htid(state, itup))
+ elog(ERROR, "could not add heap tid to pending posting list");
+ }
+ }
+
+ Assert(state->nitems == xlrec->nitems);
+ /* Handle the last item */
+ _bt_dedup_finish_pending(buf, state, false);
+
+ PageSetLSN(page, lsn);
+ MarkBufferDirty(buf);
+ }
+
+ if (BufferIsValid(buf))
+ UnlockReleaseBuffer(buf);
+}
+
+static void
btree_xlog_vacuum(XLogReaderState *record)
{
XLogRecPtr lsn = record->EndRecPtr;
Buffer buffer;
Page page;
BTPageOpaque opaque;
-#ifdef UNUSED
xl_btree_vacuum *xlrec = (xl_btree_vacuum *) XLogRecGetData(record);
+#ifdef UNUSED
/*
* This section of code is thought to be no longer needed, after analysis
@@ -478,14 +628,34 @@ btree_xlog_vacuum(XLogReaderState *record)
if (len > 0)
{
- OffsetNumber *unused;
- OffsetNumber *unend;
+ if (xlrec->nupdated > 0)
+ {
+ OffsetNumber *updatedoffsets;
+ IndexTuple updated;
+ Size itemsz;
+
+ updatedoffsets = (OffsetNumber *)
+ (ptr + xlrec->ndeleted * sizeof(OffsetNumber));
+ updated = (IndexTuple) ((char *) updatedoffsets +
+ xlrec->nupdated * sizeof(OffsetNumber));
- unused = (OffsetNumber *) ptr;
- unend = (OffsetNumber *) ((char *) ptr + len);
+ /* Handle posting tuples */
+ for (int i = 0; i < xlrec->nupdated; i++)
+ {
+ PageIndexTupleDelete(page, updatedoffsets[i]);
+
+ itemsz = MAXALIGN(IndexTupleSize(updated));
+
+ if (PageAddItem(page, (Item) updated, itemsz, updatedoffsets[i],
+ false, false) == InvalidOffsetNumber)
+ elog(PANIC, "btree_xlog_vacuum: failed to add updated posting list item");
+
+ updated = (IndexTuple) ((char *) updated + itemsz);
+ }
+ }
- if ((unend - unused) > 0)
- PageIndexMultiDelete(page, unused, unend - unused);
+ if (xlrec->ndeleted)
+ PageIndexMultiDelete(page, (OffsetNumber *) ptr, xlrec->ndeleted);
}
/*
@@ -820,7 +990,9 @@ void
btree_redo(XLogReaderState *record)
{
uint8 info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+ MemoryContext oldCtx;
+ oldCtx = MemoryContextSwitchTo(opCtx);
switch (info)
{
case XLOG_BTREE_INSERT_LEAF:
@@ -838,6 +1010,9 @@ btree_redo(XLogReaderState *record)
case XLOG_BTREE_SPLIT_R:
btree_xlog_split(false, record);
break;
+ case XLOG_BTREE_DEDUP_PAGE:
+ btree_xlog_dedup(record);
+ break;
case XLOG_BTREE_VACUUM:
btree_xlog_vacuum(record);
break;
@@ -863,6 +1038,23 @@ btree_redo(XLogReaderState *record)
default:
elog(PANIC, "btree_redo: unknown op code %u", info);
}
+ MemoryContextSwitchTo(oldCtx);
+ MemoryContextReset(opCtx);
+}
+
+void
+btree_xlog_startup(void)
+{
+ opCtx = AllocSetContextCreate(CurrentMemoryContext,
+ "Btree recovery temporary context",
+ ALLOCSET_DEFAULT_SIZES);
+}
+
+void
+btree_xlog_cleanup(void)
+{
+ MemoryContextDelete(opCtx);
+ opCtx = NULL;
}
/*
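The pointer arithmetic in btree_xlog_vacuum() implies a three-part payload: deleted offsets, then updated offsets, then the updated posting tuples. A small standalone sketch of that layout follows. It assumes OffsetNumber is a 2-byte integer; the SketchVacuumLayout type and sketch_vacuum_layout() are hypothetical names used only for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Byte offsets of the three regions inside the vacuum record payload */
typedef struct SketchVacuumLayout
{
	size_t		deleted_off;			/* ndeleted offset numbers */
	size_t		updated_offsets_off;	/* nupdated offset numbers */
	size_t		updated_tuples_off;		/* nupdated replacement tuples */
} SketchVacuumLayout;

/*
 * Compute where each region starts, following the same arithmetic as the
 * redo routine: each offset number is assumed to occupy sizeof(uint16_t).
 */
static SketchVacuumLayout
sketch_vacuum_layout(uint16_t ndeleted, uint16_t nupdated)
{
	SketchVacuumLayout layout;

	layout.deleted_off = 0;
	layout.updated_offsets_off = (size_t) ndeleted * sizeof(uint16_t);
	layout.updated_tuples_off = layout.updated_offsets_off +
		(size_t) nupdated * sizeof(uint16_t);

	return layout;
}
```

With three deleted items and two updated posting tuples, the updated offsets would start at byte 6 and the replacement tuples at byte 10 of the payload.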
diff --git a/src/backend/access/rmgrdesc/nbtdesc.c b/src/backend/access/rmgrdesc/nbtdesc.c
index 4ee6d04..1dde2da 100644
--- a/src/backend/access/rmgrdesc/nbtdesc.c
+++ b/src/backend/access/rmgrdesc/nbtdesc.c
@@ -30,7 +30,8 @@ btree_desc(StringInfo buf, XLogReaderState *record)
{
xl_btree_insert *xlrec = (xl_btree_insert *) rec;
- appendStringInfo(buf, "off %u", xlrec->offnum);
+ appendStringInfo(buf, "off %u; postingoff %u",
+ xlrec->offnum, xlrec->postingoff);
break;
}
case XLOG_BTREE_SPLIT_L:
@@ -38,16 +39,30 @@ btree_desc(StringInfo buf, XLogReaderState *record)
{
xl_btree_split *xlrec = (xl_btree_split *) rec;
- appendStringInfo(buf, "level %u, firstright %d, newitemoff %d",
- xlrec->level, xlrec->firstright, xlrec->newitemoff);
+ appendStringInfo(buf, "level %u, firstright %d, newitemoff %d, postingoff %d",
+ xlrec->level,
+ xlrec->firstright,
+ xlrec->newitemoff,
+ xlrec->postingoff);
+ break;
+ }
+ case XLOG_BTREE_DEDUP_PAGE:
+ {
+ xl_btree_dedup *xlrec = (xl_btree_dedup *) rec;
+
+ appendStringInfo(buf, "baseoff %u; nitems %u",
+ xlrec->baseoff,
+ xlrec->nitems);
break;
}
case XLOG_BTREE_VACUUM:
{
xl_btree_vacuum *xlrec = (xl_btree_vacuum *) rec;
- appendStringInfo(buf, "lastBlockVacuumed %u",
- xlrec->lastBlockVacuumed);
+ appendStringInfo(buf, "lastBlockVacuumed %u; nupdated %u; ndeleted %u",
+ xlrec->lastBlockVacuumed,
+ xlrec->nupdated,
+ xlrec->ndeleted);
break;
}
case XLOG_BTREE_DELETE:
@@ -131,6 +146,9 @@ btree_identify(uint8 info)
case XLOG_BTREE_SPLIT_R:
id = "SPLIT_R";
break;
+ case XLOG_BTREE_DEDUP_PAGE:
+ id = "DEDUPLICATE";
+ break;
case XLOG_BTREE_VACUUM:
id = "VACUUM";
break;
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 4a80e84..3ef752c 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -107,11 +107,40 @@ typedef struct BTMetaPageData
* pages */
float8 btm_last_cleanup_num_heap_tuples; /* number of heap tuples
* during last cleanup */
+ bool btm_dedup_is_possible; /* whether the deduplication
+ * can be applied to the index */
} BTMetaPageData;
#define BTPageGetMeta(p) \
((BTMetaPageData *) PageGetContents(p))
+/* Storage type for Btree's reloptions */
+typedef struct BtreeOptions
+{
+ int32 vl_len_; /* varlena header (do not touch directly!) */
+ int fillfactor;
+ double vacuum_cleanup_index_scale_factor;
+ bool do_deduplication;
+} BtreeOptions;
+
+/*
+ * By default deduplication is enabled for non-unique indexes
+ * and disabled for unique ones
+ */
+#define BtreeDefaultDoDedup(relation) \
+ (!(relation)->rd_index->indisunique)
+
+#define BtreeGetDoDedupOption(relation) \
+ ((relation)->rd_options ? \
+ ((BtreeOptions *) (relation)->rd_options)->do_deduplication : BtreeDefaultDoDedup(relation))
+
+#define BtreeGetFillFactor(relation, defaultff) \
+ ((relation)->rd_options ? \
+ ((BtreeOptions *) (relation)->rd_options)->fillfactor : (defaultff))
+
+#define BtreeGetTargetPageFreeSpace(relation, defaultff) \
+ (BLCKSZ * (100 - BtreeGetFillFactor(relation, defaultff)) / 100)
+
/*
* The current Btree version is 4. That's what you'll get when you create
* a new index.
@@ -234,8 +263,7 @@ typedef struct BTMetaPageData
* t_tid | t_info | key values | INCLUDE columns, if any
*
* t_tid points to the heap TID, which is a tiebreaker key column as of
- * BTREE_VERSION 4. Currently, the INDEX_ALT_TID_MASK status bit is never
- * set for non-pivot tuples.
+ * BTREE_VERSION 4.
*
* All other types of index tuples ("pivot" tuples) only have key columns,
* since pivot tuples only exist to represent how the key space is
@@ -252,6 +280,38 @@ typedef struct BTMetaPageData
* omitted rather than truncated, since its representation is different to
* the non-pivot representation.)
*
+ * Non-pivot posting tuple format:
+ * t_tid | t_info | key values | INCLUDE columns, if any | posting_list[]
+ *
+ * To store duplicate keys more efficiently, we use a special tuple format:
+ * posting tuples. posting_list is an array of ItemPointerData.
+ *
+ * Deduplication never applies to unique indexes or indexes with INCLUDEd
+ * columns.
+ *
+ * To distinguish posting tuples we use the INDEX_ALT_TID_MASK flag in t_info
+ * and the BT_IS_POSTING flag in t_tid.
+ * These flags redefine the content of the posting tuple's t_tid:
+ * - t_tid.ip_blkid contains the offset of the posting list.
+ * - t_tid offset field contains the number of posting items in the tuple.
+ *
+ * The 12 least significant offset bits from t_tid are used to represent
+ * the number of posting items in posting tuples, leaving 4 status
+ * bits (BT_RESERVED_OFFSET_MASK bits), 3 of which are reserved for
+ * future use.
+ * BT_N_POSTING_OFFSET_MASK is large enough to store any number of posting
+ * items, which is constrained by BTMaxItemSize.
+ *
+ * If a page contains so many duplicates that they do not fit into one posting
+ * tuple (bounded by BTMaxItemSize and ), the page may contain several posting
+ * tuples with the same key.
+ * A page can also contain both posting and non-posting tuples with the same key.
+ * Currently, posting tuples always contain at least two TIDs in the posting
+ * list.
+ *
+ * Posting tuples always have the same number of attributes as the index has
+ * generally.
+ *
* Pivot tuple format:
*
* t_tid | t_info | key values | [heap TID]
@@ -281,23 +341,149 @@ typedef struct BTMetaPageData
* bits (BT_RESERVED_OFFSET_MASK bits), 3 of which that are reserved for
* future use. BT_N_KEYS_OFFSET_MASK should be large enough to store any
* number of columns/attributes <= INDEX_MAX_KEYS.
+ * BT_IS_POSTING bit must be unset for pivot tuples, since we use it
+ * to distinguish posting tuples from pivot tuples.
*
* Note well: The macros that deal with the number of attributes in tuples
- * assume that a tuple with INDEX_ALT_TID_MASK set must be a pivot tuple,
- * and that a tuple without INDEX_ALT_TID_MASK set must be a non-pivot
- * tuple (or must have the same number of attributes as the index has
- * generally in the case of !heapkeyspace indexes). They will need to be
- * updated if non-pivot tuples ever get taught to use INDEX_ALT_TID_MASK
- * for something else.
+ * assume that a tuple with INDEX_ALT_TID_MASK set must be a pivot tuple or
+ * non-pivot posting tuple, and that a tuple without INDEX_ALT_TID_MASK set
+ * must be a non-pivot tuple (or must have the same number of attributes as
+ * the index has generally in the case of !heapkeyspace indexes).
*/
#define INDEX_ALT_TID_MASK INDEX_AM_RESERVED_BIT
/* Item pointer offset bits */
#define BT_RESERVED_OFFSET_MASK 0xF000
#define BT_N_KEYS_OFFSET_MASK 0x0FFF
+#define BT_N_POSTING_OFFSET_MASK 0x0FFF
#define BT_HEAP_TID_ATTR 0x1000
+#define BT_IS_POSTING 0x2000
+
+/*
+ * MaxPostingIndexTuplesPerPage is an upper bound on the number of tuples
+ * that can fit on one btree leaf page.
+ *
+ * Btree leaf pages may contain posting tuples, which store duplicates
+ * in a more effective way, so MaxPostingIndexTuplesPerPage is larger than
+ * MaxIndexTuplesPerPage.
+ *
+ * Each leaf page must contain at least three items, so estimate it as
+ * if we have three posting tuples with minimal size keys.
+ */
+#define MaxPostingIndexTuplesPerPage \
+ ((int) ((BLCKSZ - SizeOfPageHeaderData - \
+ 3*((MAXALIGN(sizeof(IndexTupleData) + 1) + sizeof(ItemIdData))) )) / \
+ (sizeof(ItemPointerData)))
+
+/*
+ * State used to represent a pending posting list during deduplication.
+ *
+ * Each entry represents a group of consecutive items from the page, starting
+ * from page offset number 'baseoff', which is the offset number of the "base"
+ * tuple on the page undergoing deduplication. 'nitems' is the total number
+ * of items from the page that will be merged to make a new posting tuple.
+ *
+ * Note: 'nitems' means the number of physical index tuples/line pointers on
+ * the page, starting with and including the item at offset number 'baseoff'
+ * (so nitems should be at least 2 when interval is used). These existing
+ * tuples may be posting list tuples or regular tuples.
+ */
+typedef struct BTDedupInterval
+{
+ OffsetNumber baseoff;
+ OffsetNumber nitems;
+} BTDedupInterval;
+
+/*
+ * Btree-private state needed to build posting tuples. htids is an array of
+ * ItemPointers for pending posting list.
+ *
+ * Iterating over tuples during index build or applying deduplication to a
+ * single page, we remember a "base" tuple, then compare the next one with it.
+ * If tuples are equal, save their TIDs in the posting list.
+ */
+typedef struct BTDedupState
+{
+ /* Deduplication status info for entire page/operation */
+ bool deduplicate; /* Still deduplicating page? */
+ Size maxitemsize; /* BTMaxItemSize() limit for page */
+
+ /* Metadata about current pending posting list */
+ ItemPointer htids; /* Heap TIDs in pending posting list */
+ int nhtids; /* # valid heap TIDs in htids array */
+ int nitems; /* See BTDedupInterval definition */
+ Size alltupsize; /* Includes line pointer overhead */
+
+ /* Metadata about base tuple of current pending posting list */
+ IndexTuple base; /* Use to form new posting list */
+ OffsetNumber baseoff; /* original page offset of base */
+ Size basetupsize; /* base size without posting list */
+
+ /*
+ * Pending posting list. Contains information about a group of
+ * consecutive items that will be deduplicated by creating a new posting
+ * list tuple.
+ */
+ BTDedupInterval interval;
+} BTDedupState;
+
+/*
+ * N.B.: BTreeTupleIsPivot() should only be used in code that deals with
+ * heapkeyspace indexes specifically. BTreeTupleIsPosting() works with all
+ * nbtree indexes, though.
+ */
+#define BTreeTupleIsPivot(itup) \
+ ( \
+ ((itup)->t_info & INDEX_ALT_TID_MASK && \
+ ((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) == 0))\
+ )
+#define BTreeTupleIsPosting(itup) \
+ ( \
+ ((itup)->t_info & INDEX_ALT_TID_MASK && \
+ ((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) != 0))\
+ )
+
+#define BTreeTupleClearBtIsPosting(itup) \
+ do { \
+ ItemPointerSetOffsetNumber(&(itup)->t_tid, \
+ ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & ~BT_IS_POSTING); \
+ } while(0)
+
+#define BTreeTupleGetNPosting(itup) \
+ ( \
+ AssertMacro(BTreeTupleIsPosting(itup)), \
+ ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_N_POSTING_OFFSET_MASK \
+ )
+#define BTreeTupleSetNPosting(itup, n) \
+ do { \
+ ItemPointerSetOffsetNumber(&(itup)->t_tid, (n) & BT_N_POSTING_OFFSET_MASK); \
+ Assert((itup)->t_info & INDEX_ALT_TID_MASK); \
+ Assert(!((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) != 0)); \
+ ItemPointerSetOffsetNumber(&(itup)->t_tid, \
+ ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) | BT_IS_POSTING); \
+ } while(0)
-/* Get/set downlink block number */
+/*
+ * If tuple is posting, t_tid.ip_blkid contains offset of the posting list
+ */
+#define BTreeTupleGetPostingOffset(itup) \
+ ( \
+ AssertMacro(BTreeTupleIsPosting(itup)), \
+ ItemPointerGetBlockNumberNoCheck(&((itup)->t_tid)) \
+ )
+#define BTreeSetPostingMeta(itup, nposting, off) \
+ do { \
+ BTreeTupleSetNPosting(itup, nposting); \
+ Assert(BTreeTupleIsPosting(itup)); \
+ ItemPointerSetBlockNumber(&((itup)->t_tid), (off)); \
+ } while(0)
+
+#define BTreeTupleGetPosting(itup) \
+ ((ItemPointer) ((char *) (itup) + BTreeTupleGetPostingOffset(itup)))
+#define BTreeTupleGetPostingN(itup,n) \
+ (BTreeTupleGetPosting(itup) + (n))
+
+/* Get/set downlink block number */
#define BTreeInnerTupleGetDownLink(itup) \
ItemPointerGetBlockNumberNoCheck(&((itup)->t_tid))
#define BTreeInnerTupleSetDownLink(itup, blkno) \
@@ -326,40 +512,73 @@ typedef struct BTMetaPageData
*/
#define BTreeTupleGetNAtts(itup, rel) \
( \
- (itup)->t_info & INDEX_ALT_TID_MASK ? \
+ ((itup)->t_info & INDEX_ALT_TID_MASK && \
+ ((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) == 0)) ? \
( \
ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_N_KEYS_OFFSET_MASK \
) \
: \
IndexRelationGetNumberOfAttributes(rel) \
)
-#define BTreeTupleSetNAtts(itup, n) \
- do { \
- (itup)->t_info |= INDEX_ALT_TID_MASK; \
- ItemPointerSetOffsetNumber(&(itup)->t_tid, (n) & BT_N_KEYS_OFFSET_MASK); \
- } while(0)
+
+static inline void
+BTreeTupleSetNAtts(IndexTuple itup, int n)
+{
+ Assert(!BTreeTupleIsPosting(itup));
+ itup->t_info |= INDEX_ALT_TID_MASK;
+ ItemPointerSetOffsetNumber(&itup->t_tid, n & BT_N_KEYS_OFFSET_MASK);
+}
/*
- * Get tiebreaker heap TID attribute, if any. Macro works with both pivot
- * and non-pivot tuples, despite differences in how heap TID is represented.
+ * Get tiebreaker heap TID attribute, if any. Works with both pivot and
+ * non-pivot tuples, despite differences in how heap TID is represented.
+ *
+ * This returns the first/lowest heap TID in the case of a posting list tuple.
*/
-#define BTreeTupleGetHeapTID(itup) \
- ( \
- (itup)->t_info & INDEX_ALT_TID_MASK && \
- (ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_HEAP_TID_ATTR) != 0 ? \
- ( \
- (ItemPointer) (((char *) (itup) + IndexTupleSize(itup)) - \
- sizeof(ItemPointerData)) \
- ) \
- : (itup)->t_info & INDEX_ALT_TID_MASK ? NULL : (ItemPointer) &((itup)->t_tid) \
- )
+static inline ItemPointer
+BTreeTupleGetHeapTID(IndexTuple itup)
+{
+ if (BTreeTupleIsPivot(itup))
+ {
+ /* Pivot tuple heap TID representation? */
+ if ((ItemPointerGetOffsetNumberNoCheck(&itup->t_tid) &
+ BT_HEAP_TID_ATTR) != 0)
+ return (ItemPointer) ((char *) itup + IndexTupleSize(itup) -
+ sizeof(ItemPointerData));
+
+ /* Heap TID attribute was truncated */
+ return NULL;
+ }
+ else if (BTreeTupleIsPosting(itup))
+ return BTreeTupleGetPosting(itup);
+
+ return &(itup->t_tid);
+}
+
+/*
+ * Get maximum heap TID attribute, which could be the only TID in the case of
+ * a non-pivot tuple that does not have a posting list. Works with
+ * non-pivot tuples only.
+ */
+static inline ItemPointer
+BTreeTupleGetMaxTID(IndexTuple itup)
+{
+ Assert(!BTreeTupleIsPivot(itup));
+
+ if (BTreeTupleIsPosting(itup))
+ return (ItemPointer) (BTreeTupleGetPosting(itup) +
+ (BTreeTupleGetNPosting(itup) - 1));
+
+ return &(itup->t_tid);
+}
+
/*
* Set the heap TID attribute for a tuple that uses the INDEX_ALT_TID_MASK
- * representation (currently limited to pivot tuples)
+ * representation
*/
#define BTreeTupleSetAltHeapTID(itup) \
do { \
- Assert((itup)->t_info & INDEX_ALT_TID_MASK); \
+ Assert(BTreeTupleIsPivot(itup)); \
ItemPointerSetOffsetNumber(&(itup)->t_tid, \
ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) | BT_HEAP_TID_ATTR); \
} while(0)
@@ -472,6 +691,7 @@ typedef struct BTScanInsertData
bool anynullkeys;
bool nextkey;
bool pivotsearch;
+ bool dedup_is_possible;
ItemPointer scantid; /* tiebreaker for scankeys */
int keysz; /* Size of scankeys array */
ScanKeyData scankeys[INDEX_MAX_KEYS]; /* Must appear last */
@@ -500,6 +720,13 @@ typedef struct BTInsertStateData
Buffer buf;
/*
+ * If _bt_binsrch_insert() found the location inside existing posting
+ * list, save the position inside the list. This will be -1 in rare cases
+ * where the overlapping posting list is LP_DEAD.
+ */
+ int postingoff;
+
+ /*
* Cache of bounds within the current buffer. Only used for insertions
* where _bt_check_unique is called. See _bt_binsrch_insert and
* _bt_findinsertloc for details.
@@ -534,7 +761,9 @@ typedef BTInsertStateData *BTInsertState;
* If we are doing an index-only scan, we save the entire IndexTuple for each
* matched item, otherwise only its heap TID and offset. The IndexTuples go
* into a separate workspace array; each BTScanPosItem stores its tuple's
- * offset within that array.
+ * offset within that array. Posting list tuples store a version of the
+ * tuple that does not include the posting list, allowing the same key to be
+ * returned for each logical tuple associated with the posting list.
*/
typedef struct BTScanPosItem /* what we remember about each match */
@@ -563,9 +792,13 @@ typedef struct BTScanPosData
/*
* If we are doing an index-only scan, nextTupleOffset is the first free
- * location in the associated tuple storage workspace.
+ * location in the associated tuple storage workspace. Posting list
+ * tuples need postingTupleOffset to store the current location of the
+ * tuple that is returned multiple times (once per heap TID in posting
+ * list).
*/
int nextTupleOffset;
+ int postingTupleOffset;
/*
* The items array is always ordered in index order (ie, increasing
@@ -578,7 +811,7 @@ typedef struct BTScanPosData
int lastItem; /* last valid index in items[] */
int itemIndex; /* current index in items[] */
- BTScanPosItem items[MaxIndexTuplesPerPage]; /* MUST BE LAST */
+ BTScanPosItem items[MaxPostingIndexTuplesPerPage]; /* MUST BE LAST */
} BTScanPosData;
typedef BTScanPosData *BTScanPos;
@@ -730,8 +963,14 @@ extern void _bt_parallel_advance_array_keys(IndexScanDesc scan);
*/
extern bool _bt_doinsert(Relation rel, IndexTuple itup,
IndexUniqueCheck checkUnique, Relation heapRel);
+extern IndexTuple _bt_posting_split(IndexTuple newitem, IndexTuple oposting,
+ OffsetNumber postingoff);
extern void _bt_finish_split(Relation rel, Buffer bbuf, BTStack stack);
extern Buffer _bt_getstackbuf(Relation rel, BTStack stack, BlockNumber child);
+extern void _bt_dedup_start_pending(BTDedupState *state, IndexTuple base,
+ OffsetNumber base_off);
+extern bool _bt_dedup_save_htid(BTDedupState *state, IndexTuple itup);
+extern Size _bt_dedup_finish_pending(Buffer buffer, BTDedupState *state, bool need_wal);
/*
* prototypes for functions in nbtsplitloc.c
@@ -743,7 +982,8 @@ extern OffsetNumber _bt_findsplitloc(Relation rel, Page page,
/*
* prototypes for functions in nbtpage.c
*/
-extern void _bt_initmetapage(Page page, BlockNumber rootbknum, uint32 level);
+extern void _bt_initmetapage(Page page, BlockNumber rootbknum, uint32 level,
+ bool dedup_is_possible);
extern void _bt_update_meta_cleanup_info(Relation rel,
TransactionId oldestBtpoXact, float8 numHeapTuples);
extern void _bt_upgrademetapage(Page page);
@@ -751,6 +991,7 @@ extern Buffer _bt_getroot(Relation rel, int access);
extern Buffer _bt_gettrueroot(Relation rel);
extern int _bt_getrootheight(Relation rel);
extern bool _bt_heapkeyspace(Relation rel);
+extern bool _bt_getdedupispossible(Relation rel);
extern void _bt_checkpage(Relation rel, Buffer buf);
extern Buffer _bt_getbuf(Relation rel, BlockNumber blkno, int access);
extern Buffer _bt_relandgetbuf(Relation rel, Buffer obuf,
@@ -762,6 +1003,8 @@ extern void _bt_delitems_delete(Relation rel, Buffer buf,
OffsetNumber *itemnos, int nitems, Relation heapRel);
extern void _bt_delitems_vacuum(Relation rel, Buffer buf,
OffsetNumber *itemnos, int nitems,
+ OffsetNumber *updateitemnos,
+ IndexTuple *updated, int nupdateable,
BlockNumber lastBlockVacuumed);
extern int _bt_pagedel(Relation rel, Buffer buf);
@@ -812,6 +1055,9 @@ extern bool _bt_check_natts(Relation rel, bool heapkeyspace, Page page,
OffsetNumber offnum);
extern void _bt_check_third_page(Relation rel, Relation heap,
bool needheaptidspace, Page page, IndexTuple newtup);
+extern IndexTuple BTreeFormPostingTuple(IndexTuple tuple, ItemPointer htids,
+ int nhtids);
+extern bool _bt_dedup_is_possible(Relation index);
/*
* prototypes for functions in nbtvalidate.c
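To make the t_tid offset encoding and the MaxPostingIndexTuplesPerPage bound easier to check, here is a minimal standalone sketch. The two mask values are copied from the nbtree.h hunk above. Everything else is an assumption for illustration: a 24-byte page header, a 16-byte minimal MAXALIGNed tuple, a 4-byte line pointer and a 6-byte heap TID. The sketch_* names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Values copied from the patch */
#define SKETCH_N_POSTING_MASK 0x0FFF
#define SKETCH_IS_POSTING     0x2000

/* Pack the number of posting items plus the BT_IS_POSTING status bit */
static uint16_t
sketch_pack_nposting(int nposting)
{
	assert(nposting > 1 && nposting <= SKETCH_N_POSTING_MASK);
	return (uint16_t) ((nposting & SKETCH_N_POSTING_MASK) | SKETCH_IS_POSTING);
}

/* Test the status bit, as BTreeTupleIsPosting() does for the offset field */
static int
sketch_is_posting(uint16_t offset_field)
{
	return (offset_field & SKETCH_IS_POSTING) != 0;
}

/* Extract the item count, as BTreeTupleGetNPosting() does */
static int
sketch_nposting(uint16_t offset_field)
{
	return offset_field & SKETCH_N_POSTING_MASK;
}

/*
 * MaxPostingIndexTuplesPerPage with the assumed sizes: page header 24,
 * three minimal tuples of 16 bytes plus a 4-byte line pointer each, and
 * 6 bytes per heap TID.
 */
static int
sketch_max_posting_tids(int blcksz)
{
	return (blcksz - 24 - 3 * (16 + 4)) / 6;
}
```

With 8192-byte pages these assumptions give 1351 heap TIDs per leaf page, which fits in the 12 bits covered by BT_N_POSTING_OFFSET_MASK.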
diff --git a/src/include/access/nbtxlog.h b/src/include/access/nbtxlog.h
index 91b9ee0..71f6568 100644
--- a/src/include/access/nbtxlog.h
+++ b/src/include/access/nbtxlog.h
@@ -28,7 +28,8 @@
#define XLOG_BTREE_INSERT_META 0x20 /* same, plus update metapage */
#define XLOG_BTREE_SPLIT_L 0x30 /* add index tuple with split */
#define XLOG_BTREE_SPLIT_R 0x40 /* as above, new item on right */
-/* 0x50 and 0x60 are unused */
+#define XLOG_BTREE_DEDUP_PAGE 0x50 /* deduplicate tuples on leaf page */
+/* 0x60 is unused */
#define XLOG_BTREE_DELETE 0x70 /* delete leaf index tuples for a page */
#define XLOG_BTREE_UNLINK_PAGE 0x80 /* delete a half-dead page */
#define XLOG_BTREE_UNLINK_PAGE_META 0x90 /* same, and update metapage */
@@ -53,6 +54,7 @@ typedef struct xl_btree_metadata
uint32 fastlevel;
TransactionId oldest_btpo_xact;
float8 last_cleanup_num_heap_tuples;
+ bool btm_dedup_is_possible;
} xl_btree_metadata;
/*
@@ -61,16 +63,21 @@ typedef struct xl_btree_metadata
* This data record is used for INSERT_LEAF, INSERT_UPPER, INSERT_META.
* Note that INSERT_META implies it's not a leaf page.
*
- * Backup Blk 0: original page (data contains the inserted tuple)
+ * Backup Blk 0: original page (data contains the inserted tuple);
+ * if postingoff is set, this started out as an insertion
+ * into an existing posting tuple at the offset before
+ * offnum (i.e. it's a posting list split). (REDO will
+ * have to update split posting list, too.)
* Backup Blk 1: child's left sibling, if INSERT_UPPER or INSERT_META
* Backup Blk 2: xl_btree_metadata, if INSERT_META
*/
typedef struct xl_btree_insert
{
OffsetNumber offnum;
+ OffsetNumber postingoff;
} xl_btree_insert;
-#define SizeOfBtreeInsert (offsetof(xl_btree_insert, offnum) + sizeof(OffsetNumber))
+#define SizeOfBtreeInsert (offsetof(xl_btree_insert, postingoff) + sizeof(OffsetNumber))
/*
* On insert with split, we save all the items going into the right sibling
@@ -91,9 +98,19 @@ typedef struct xl_btree_insert
*
* Backup Blk 0: original page / new left page
*
- * The left page's data portion contains the new item, if it's the _L variant.
- * An IndexTuple representing the high key of the left page must follow with
- * either variant.
+ * The left page's data portion contains the new item, if it's the _L variant
+ * (though _R variant page split records with a posting list split sometimes
+ * need to include newitem). An IndexTuple representing the high key of the
+ * left page must follow in all cases.
+ *
+ * The newitem is actually an "original" newitem when a posting list split
+ * occurs that requires that the original posting list be updated in passing.
+ * Recovery recognizes this case when postingoff is set, and must use the
+ * posting offset to do an in-place update of the existing posting list that
+ * was actually split, and change the newitem to the "final" newitem. This
+ * corresponds to the xl_btree_insert postingoff-is-set case. postingoff
+ * won't be set when a posting list split occurs where both original posting
+ * list and newitem go on the right page.
*
* Backup Blk 1: new right page
*
@@ -111,10 +128,26 @@ typedef struct xl_btree_split
{
uint32 level; /* tree level of page being split */
OffsetNumber firstright; /* first item moved to right page */
- OffsetNumber newitemoff; /* new item's offset (useful for _L variant) */
+ OffsetNumber newitemoff; /* new item's offset */
+ OffsetNumber postingoff; /* offset inside orig posting tuple */
} xl_btree_split;
-#define SizeOfBtreeSplit (offsetof(xl_btree_split, newitemoff) + sizeof(OffsetNumber))
+#define SizeOfBtreeSplit (offsetof(xl_btree_split, postingoff) + sizeof(OffsetNumber))
+
+/*
+ * When page is deduplicated, consecutive groups of tuples with equal keys are
+ * merged together into posting list tuples.
+ *
+ * The WAL record represents the interval that describes the posting tuple
+ * that should be added to the page.
+ */
+typedef struct xl_btree_dedup
+{
+ OffsetNumber baseoff;
+ OffsetNumber nitems;
+} xl_btree_dedup;
+
+#define SizeOfBtreeDedup (offsetof(xl_btree_dedup, nitems) + sizeof(OffsetNumber))
/*
* This is what we need to know about delete of individual leaf index tuples.
@@ -166,16 +199,27 @@ typedef struct xl_btree_reuse_page
* block numbers aren't given.
*
* Note that the *last* WAL record in any vacuum of an index is allowed to
- * have a zero length array of offsets. Earlier records must have at least one.
+ * have a zero length array of target offsets (i.e. no deletes or updates).
+ * Earlier records must have at least one.
*/
typedef struct xl_btree_vacuum
{
BlockNumber lastBlockVacuumed;
- /* TARGET OFFSET NUMBERS FOLLOW */
+ /*
+ * This field helps us to find beginning of the updated versions of tuples
+ * which follow array of offset numbers, needed when a posting list is
+ * vacuumed without killing all of its logical tuples.
+ */
+ uint32 nupdated;
+ uint32 ndeleted;
+
+ /* UPDATED TARGET OFFSET NUMBERS FOLLOW (if any) */
+ /* UPDATED TUPLES TO ADD BACK FOLLOW (if any) */
+ /* DELETED TARGET OFFSET NUMBERS FOLLOW (if any) */
} xl_btree_vacuum;
-#define SizeOfBtreeVacuum (offsetof(xl_btree_vacuum, lastBlockVacuumed) + sizeof(BlockNumber))
+#define SizeOfBtreeVacuum (offsetof(xl_btree_vacuum, ndeleted) + sizeof(BlockNumber))
/*
* This is what we need to know about marking an empty branch for deletion.
@@ -256,6 +300,8 @@ typedef struct xl_btree_newroot
extern void btree_redo(XLogReaderState *record);
extern void btree_desc(StringInfo buf, XLogReaderState *record);
extern const char *btree_identify(uint8 info);
+extern void btree_xlog_startup(void);
+extern void btree_xlog_cleanup(void);
extern void btree_mask(char *pagedata, BlockNumber blkno);
#endif /* NBTXLOG_H */
diff --git a/src/include/access/rmgrlist.h b/src/include/access/rmgrlist.h
index 3c0db2c..2b8c6c7 100644
--- a/src/include/access/rmgrlist.h
+++ b/src/include/access/rmgrlist.h
@@ -36,7 +36,7 @@ PG_RMGR(RM_RELMAP_ID, "RelMap", relmap_redo, relmap_desc, relmap_identify, NULL,
PG_RMGR(RM_STANDBY_ID, "Standby", standby_redo, standby_desc, standby_identify, NULL, NULL, NULL)
PG_RMGR(RM_HEAP2_ID, "Heap2", heap2_redo, heap2_desc, heap2_identify, NULL, NULL, heap_mask)
PG_RMGR(RM_HEAP_ID, "Heap", heap_redo, heap_desc, heap_identify, NULL, NULL, heap_mask)
-PG_RMGR(RM_BTREE_ID, "Btree", btree_redo, btree_desc, btree_identify, NULL, NULL, btree_mask)
+PG_RMGR(RM_BTREE_ID, "Btree", btree_redo, btree_desc, btree_identify, btree_xlog_startup, btree_xlog_cleanup, btree_mask)
PG_RMGR(RM_HASH_ID, "Hash", hash_redo, hash_desc, hash_identify, NULL, NULL, hash_mask)
PG_RMGR(RM_GIN_ID, "Gin", gin_redo, gin_desc, gin_identify, gin_xlog_startup, gin_xlog_cleanup, gin_mask)
PG_RMGR(RM_GIST_ID, "Gist", gist_redo, gist_desc, gist_identify, gist_xlog_startup, gist_xlog_cleanup, gist_mask)
diff --git a/src/tools/valgrind.supp b/src/tools/valgrind.supp
index ec47a22..71a03e3 100644
--- a/src/tools/valgrind.supp
+++ b/src/tools/valgrind.supp
@@ -212,3 +212,24 @@
Memcheck:Cond
fun:PyObject_Realloc
}
+
+# Temporarily work around bug in datum_image_eq's handling of the cstring
+# (typLen == -2) case. datumIsEqual() is not affected, but also doesn't handle
+# TOAST'ed values correctly.
+#
+# FIXME: Remove both suppressions when bug is fixed on master branch
+{
+ temporary_workaround_1
+ Memcheck:Addr1
+ fun:bcmp
+ fun:datum_image_eq
+ fun:_bt_keep_natts_fast
+}
+
+{
+ temporary_workaround_8
+ Memcheck:Addr8
+ fun:bcmp
+ fun:datum_image_eq
+ fun:_bt_keep_natts_fast
+}
On Fri, Sep 27, 2019 at 9:43 AM Anastasia Lubennikova
<a.lubennikova@postgrespro.ru> wrote:
Attached is v19.
Cool.
* By default deduplication is on for non-unique indexes and off for
unique ones.
I think that it makes sense to enable deduplication by default -- even
with unique indexes. It looks like deduplication can be very helpful
with non-HOT updates. I have been benchmarking this using more or less
standard pgbench at scale 200, with one big difference -- I also
create an index on "pgbench_accounts (abalance)". This is a low
cardinality index, which ends up about 3x smaller with the patch, as
expected. It also makes all updates non-HOT updates, greatly
increasing index bloat in the primary key of the accounts table --
this is what I found really interesting about this workload.
The theory behind deduplication within unique indexes seems quite
different to the cases we've focussed on so far -- that's why my
working copy of the patch makes a few small changes to how
_bt_dedup_one_page() works with unique indexes specifically (more on
that later). With unique indexes, deduplication doesn't help by
creating space -- it helps by creating *time* for garbage collection
to run before the real "damage" is done -- it delays page splits. This
is only truly valuable when page splits caused by non-HOT updates are
delayed by so much that they're actually prevented entirely, typically
because the _bt_vacuum_one_page() stuff can now happen before pages
split, not after. In general, these page splits are bad because they
degrade the B-Tree structure, more or less permanently (it's certainly
permanent with this workload). Having a huge number of page splits
*purely* because of non-HOT updates is particularly bad -- it's just
awful. I believe that this is the single biggest problem with the
Postgres approach to versioned storage (we know that other DB systems
have no primary key page splits with this kind of workload).
Anyway, if you run this pgbench workload without rate-limiting, so
that a patched Postgres does as much work as physically possible, the
accounts table primary key (pgbench_accounts_pkey) at least grows at a
slower rate -- the patch clearly beats master at the start of the
benchmark/test (as measured by index size). As the clients are ramped
up by my testing script, and as time goes on, eventually the size of
the pgbench_accounts_pkey index "catches up" with master. The patch
delays page splits, but ultimately the system as a whole cannot
prevent the page splits altogether, since the server is being
absolutely hammered by pgbench. Actually, the index is *exactly* the
same size for both the master case and the patch case when we reach
this "bloat saturation point". We can delay the problem, but we cannot
prevent it. But what about a more realistic workload, with
rate-limiting?
When I add some rate limiting, so that the TPS/throughput is at about
50% of what I got the first time around (i.e. 50% of what is
possible), or 15k TPS, it's very different. _bt_dedup_one_page() can
now effectively cooperate with _bt_vacuum_one_page(). Now
deduplication is able to "soak up all the extra garbage tuples" for
long enough to delay and ultimately *prevent* almost all page splits.
pgbench_accounts_pkey starts off at 428 MB for both master and patch
(CREATE INDEX makes it that size). After about an hour, the index is
447 MB with the patch. The master case ends up with a
pgbench_accounts_pkey size of 854 MB, though (this is very close to
857 MB, the "saturation point" index size from before).
This is a very significant improvement, obviously -- the patch has an
index that is ~52% of the size seen for the same index with the master
branch!
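As a quick sanity check of the ~52% figure, here is a minimal sketch in C. The helper name size_ratio_percent is hypothetical (it is not part of the patch); the inputs are simply the 447 MB and 854 MB sizes quoted above.

```c
#include <assert.h>

/*
 * Hypothetical helper, for illustration only: size of the patched index
 * expressed as a percentage of the size of the same index on master.
 */
double
size_ratio_percent(double patched_mb, double master_mb)
{
	assert(master_mb > 0.0);
	return 100.0 * patched_mb / master_mb;
}
```

Calling size_ratio_percent(447.0, 854.0) gives a value a little above 52, which matches the rounded figure above.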
Here is how I changed _bt_dedup_one_page() for unique indexes to get
this result:
* We limit the size of posting lists to 5 heap TIDs in the
checkingunique case. Right now, we will actually accept a
checkingunique page split before we'll merge together items that
result in a posting list with more heap TIDs than that (not sure about
these details at all, though).
* Avoid creating a new posting list that caller will have to split
immediately anyway (this is based on details of _bt_dedup_one_page()
caller's newitem tuple).
(Not sure how much this customization contributes to this favorable
test result -- maybe it doesn't make that much difference.)
The goal here is for duplicates that are close together in both time
and space to get "clumped together" into their own distinct, small-ish
posting list tuples with no more than 5 TIDs. This is intended to help
_bt_vacuum_one_page(), which is known to be a very important mechanism
for indexes like our pgbench_accounts_pkey index (LP_DEAD bits are set
very frequently within _bt_check_unique()). The general idea is to
balance deduplication against LP_DEAD killing, and to increase
spatial/temporal locality within these smaller posting lists. If we
have one huge posting list for each value, then we can't set the
LP_DEAD bit on anything at all, which is very bad. If we have a few
posting lists that are not so big for each distinct value, we can
often kill most of them within _bt_vacuum_one_page(), which is very
good, and has minimal downside (i.e. we still get most of the benefits
of aggressive deduplication).
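The trade-off can be sketched roughly as follows. This is a simplified model and not the _bt_dedup_one_page() code from the patch: a sorted array of integer keys stands in for one leaf page, and the function counts how many physical tuples remain when each run of equal keys is packed into posting lists capped at a fixed number of heap TIDs. The names count_tuples_after_dedup and max_tids are hypothetical, and packing each run into ceil(run / max_tids) posting lists is an assumption made for the sake of the example.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Illustrative model only (not nbtree code): count the physical tuples left
 * on a "page" after merging runs of equal keys into posting lists that hold
 * at most max_tids heap TIDs each.  keys[] must be sorted.
 */
size_t
count_tuples_after_dedup(const int *keys, size_t nkeys, size_t max_tids)
{
	size_t		ntuples = 0;
	size_t		i = 0;

	assert(max_tids >= 1);

	while (i < nkeys)
	{
		size_t		run = 1;

		/* Measure the run of duplicates that starts at position i */
		while (i + run < nkeys && keys[i + run] == keys[i])
			run++;

		/* Assumed layout: each run needs ceil(run / max_tids) tuples */
		ntuples += (run + max_tids - 1) / max_tids;
		i += run;
	}

	return ntuples;
}
```

With the keys {1,1,1,1,1,1,1,2,2,3} and a cap of 5, the seven duplicates of 1 end up in two small posting lists rather than one big one, so the page holds 4 tuples instead of the original 10. A very large cap would get that down to 3, at the cost of a single big posting list for key 1 that is harder to kill via LP_DEAD.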
Interestingly, these non-HOT page splits all seem to "come in waves".
I noticed this because I carefully monitored the benchmark/test case
over time. The patch doesn't prevent the "waves of page splits"
pattern, but it does make it much much less noticeable.
* New function _bt_dedup_is_possible() is intended to be a single place
to perform all the checks. Now it's just a stub to ensure that it works.Is there a way to extract this from existing opclass information,
or we need to add new opclass field? Have you already started this work?
I recall there was another thread, but didn't manage to find it.
The thread is here:
/messages/by-id/CAH2-Wzn3Ee49Gmxb7V1VJ3-AC8fWn-Fr8pfWQebHe8rYRxt5OQ@mail.gmail.com
--
Peter Geoghegan
On Fri, Sep 27, 2019 at 7:02 PM Peter Geoghegan <pg@bowt.ie> wrote:
I think that it makes sense to enable deduplication by default -- even
with unique indexes. It looks like deduplication can be very helpful
with non-HOT updates.
Attached is v20, which adds a custom strategy for the checkingunique
(unique index) case to _bt_dedup_one_page(). It also makes
deduplication the default for both unique and non-unique indexes. I
simply altered your new BtreeDefaultDoDedup() macro from v19 to make
nbtree use deduplication wherever it is safe to do so. This default
may not be the best one in the end, though deduplication in unique
indexes now looks very compelling.
The new checkingunique heuristics added to _bt_dedup_one_page() were
developed experimentally, based on pgbench tests. The general idea
with the new checkingunique stuff is to make deduplication *extremely*
lazy. We want to avoid making _bt_vacuum_one_page() garbage collection
less effective by being too aggressive with deduplication -- workloads
with lots of non-HOT-updates into unique indexes are greatly dependent
on the LP_DEAD bit setting in _bt_check_unique(). At the same time,
_bt_dedup_one_page() can be just as effective at delaying page splits
as it is with non-unique indexes.
I've found that my "regular pgbench, but with a low cardinality index
on pgbench_accounts(abalance)" benchmark works best with the specific
heuristics used in the patch, especially over many hours. I spent
nearly 24 hours running the test at full speed (no throttling this
time), at scale 500, and with very very aggressive autovacuum settings
(autovacuum_vacuum_cost_delay=0ms,
autovacuum_vacuum_scale_factor=0.02). Each run lasted one hour, with
alternating runs of 4, 8, and 16 clients. Towards the end, the patch
had about 5% greater throughput at lower client counts, and never
seemed to be significantly slower (it was very slightly slower once or
twice, but I think that that was just noise).
More importantly, the indexes looked like this on master:
bloated_abalance: 3017 MB
pgbench_accounts_pkey: 2142 MB
pgbench_branches_pkey: 1352 kB
pgbench_tellers_pkey: 3416 kB
And like this with the patch:
bloated_abalance: 1015 MB
pgbench_accounts_pkey: 1745 MB
pgbench_branches_pkey: 296 kB
pgbench_tellers_pkey: 888 kB
* bloated_abalance is about 3x smaller here, as usual -- no surprises there.
* pgbench_accounts_pkey is the most interesting case.
You might think that it isn't that great that pgbench_accounts_pkey is
1745 MB with the patch, since it starts out at only 1071 MB (and would
go back down to 1071 MB again if we were to do a REINDEX). However,
you have to bear in mind that it takes a long time for it to get that
big -- the growth over time is very important here. Even after the
first run with 16 clients, it only reached 1160 MB -- that's an
increase of ~8%. The master case had already reached 2142 MB ("bloat
saturation point") by then, though. I could easily have stopped the
benchmark there, or used rate-limiting, or excluded the 16 client case
-- that would have allowed me to claim that the growth was under 10%
for a workload where the master case has an index that doubles in
size. On the other hand, if autovacuum wasn't configured to run very
frequently, then the patch wouldn't look nearly this good.
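The growth figures in this paragraph can be reproduced with a small sketch in C. The helper name index_growth_percent is hypothetical and only illustrates the arithmetic; the sizes are the ones quoted in this message (1071 MB after CREATE INDEX, 1160 MB and 1745 MB with the patch, 2142 MB on master).

```c
#include <assert.h>

/*
 * Hypothetical helper, for illustration only: percentage growth of an index
 * from size_before_mb to size_after_mb.
 */
double
index_growth_percent(double size_before_mb, double size_after_mb)
{
	assert(size_before_mb > 0.0);
	return 100.0 * (size_after_mb - size_before_mb) / size_before_mb;
}
```

index_growth_percent(1071.0, 1160.0) comes out at a little over 8, index_growth_percent(1071.0, 2142.0) at 100 (the index doubled on master), and index_growth_percent(1071.0, 1745.0) at roughly 63 for the final size with the patch.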
Deduplication helped autovacuum by "soaking up" the "recently dead"
index tuples that cannot be killed right away. In short, the patch
ameliorates weaknesses of the existing garbage collection mechanisms
without changing them. The patch smoothed out the growth of
pgbench_accounts_pkey over many hours. As I said, it was only 1160 MB
after the first 3 hours/first 16 client run. It was 1356 MB after the
second 16 client run (i.e. after running another round of one hour
4/8/16 client runs), finally finishing up at 1745 MB. So the growth in
the size of pgbench_accounts_pkey for the patch was significantly
improved, and the *rate* of growth over time was also improved.
The master branch had terribly jerky growth in the size of
pgbench_accounts_pkey. The master branch did mostly keep up at first
(i.e. the size of pgbench_accounts_pkey wasn't too different at
first). But once we got to 16 clients for the first time, after a
couple of hours, pgbench_accounts_pkey almost doubled in size over a
period of only 10 or 20 minutes! The index size *exploded* in a very
short period of time, starting only a few hours into the benchmark.
(Maybe we don't see anything like this with the patch because
with the patch backends are more concerned about helping VACUUM, and
less concerned about creating a mess that VACUUM must clean up. Not
sure.)
* We also manage to make the small pgbench indexes
(pgbench_branches_pkey and pgbench_tellers_pkey) roughly 4x smaller
here, about 4.6x and 3.8x respectively (without doing anything to force
more non-HOT updates on the underlying tables).
This result for the two small indexes looks good, but I should point
out that we still only fit ~15 or so tuples on each leaf page with the
patch when everything is over -- far far less than the number that
CREATE INDEX stored on the leaf pages immediately (it leaves 366 items
on each leaf page). This is kind of an extreme case, because there is
so much contention, but space utilization with the patch is actually
very bad here. The master branch is very very very bad, though, so
we're at least down to only a single "very" here. Progress.
Any thoughts on the approach taken for unique indexes within
_bt_dedup_one_page() in v20? Obviously that stuff needs to be examined
critically -- it's possible that it wouldn't do as well as it could or
should with other workloads that I haven't thought about. Please take
a look at the details.
--
Peter Geoghegan
Attachments:
v20-0002-DEBUG-Add-pageinspect-instrumentation.patch (application/octet-stream)
From 5662b08800e0caef28ebe8d27c3512a993d40130 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Mon, 10 Sep 2018 19:53:51 -0700
Subject: [PATCH v20 2/2] DEBUG: Add pageinspect instrumentation.
Have pageinspect display user-visible attribute values, heap TID, max
heap TID, and the number of TIDs in a tuple (can be > 1 in the case of
posting list tuples). Also adds a column that shows whether or not the
LP_DEAD bit has been set.
This patch is not proposed for inclusion in PostgreSQL; it's included
for the convenience of reviewers.
The following query can be used with this hacked pageinspect, which
visualizes the internal pages:
"""
with recursive index_details as (
    select
        'my_test_index'::text idx
),
size_in_pages_index as (
    select
        (pg_relation_size(idx::regclass) / (2^13))::int4 size_pages
    from
        index_details
),
page_stats as (
    select
        index_details.*,
        stats.*
    from
        index_details,
        size_in_pages_index,
        lateral (select i from generate_series(1, size_pages - 1) i) series,
        lateral (select * from bt_page_stats(idx, i)) stats),
internal_page_stats as (
    select
        *
    from
        page_stats
    where
        type != 'l'),
meta_stats as (
    select
        *
    from
        index_details s,
        lateral (select * from bt_metap(s.idx)) meta),
internal_items as (
    select
        *
    from
        internal_page_stats
    order by
        btpo desc),
-- XXX: Note ordering dependency within this CTE, on internal_items
ordered_internal_items(item, blk, level) as (
    select
        1,
        blkno,
        btpo
    from
        internal_items
    where
        btpo_prev = 0
        and btpo = (select level from meta_stats)
    union
    select
        case when level = btpo then o.item + 1 else 1 end,
        blkno,
        btpo
    from
        internal_items i,
        ordered_internal_items o
    where
        i.btpo_prev = o.blk or (btpo_prev = 0 and btpo = o.level - 1)
)
select
    --idx,
    btpo as level,
    item as l_item,
    blkno,
    --btpo_prev,
    --btpo_next,
    btpo_flags,
    type,
    live_items,
    dead_items,
    avg_item_size,
    page_size,
    free_size,
    -- Only non-rightmost pages have high key. Show heap TID for both pivot and non-pivot tuples here.
    case when btpo_next != 0 then (select data || coalesce(', (htid)=(''' || htid || ''')', '')
        from bt_page_items(idx, blkno) where itemoffset = 1) end as highkey
from
    ordered_internal_items o
    join internal_items i on o.blk = i.blkno
order by btpo desc, item;
"""
---
contrib/pageinspect/btreefuncs.c | 92 ++++++++++++++++---
contrib/pageinspect/expected/btree.out | 6 +-
contrib/pageinspect/pageinspect--1.6--1.7.sql | 25 +++++
3 files changed, 109 insertions(+), 14 deletions(-)
diff --git a/contrib/pageinspect/btreefuncs.c b/contrib/pageinspect/btreefuncs.c
index 8d27c9b0f6..e88875107f 100644
--- a/contrib/pageinspect/btreefuncs.c
+++ b/contrib/pageinspect/btreefuncs.c
@@ -29,6 +29,7 @@
#include "pageinspect.h"
+#include "access/genam.h"
#include "access/nbtree.h"
#include "access/relation.h"
#include "catalog/namespace.h"
@@ -243,6 +244,7 @@ bt_page_stats(PG_FUNCTION_ARGS)
*/
struct user_args
{
+ Relation rel;
Page page;
OffsetNumber offset;
};
@@ -254,9 +256,9 @@ struct user_args
* ------------------------------------------------------
*/
static Datum
-bt_page_print_tuples(FuncCallContext *fctx, Page page, OffsetNumber offset)
+bt_page_print_tuples(FuncCallContext *fctx, Page page, OffsetNumber offset, Relation rel)
{
- char *values[6];
+ char *values[10];
HeapTuple tuple;
ItemId id;
IndexTuple itup;
@@ -265,6 +267,8 @@ bt_page_print_tuples(FuncCallContext *fctx, Page page, OffsetNumber offset)
int dlen;
char *dump;
char *ptr;
+ ItemPointer min_htid,
+ max_htid;
id = PageGetItemId(page, offset);
@@ -283,16 +287,77 @@ bt_page_print_tuples(FuncCallContext *fctx, Page page, OffsetNumber offset)
values[j++] = psprintf("%c", IndexTupleHasVarwidths(itup) ? 't' : 'f');
ptr = (char *) itup + IndexInfoFindDataOffset(itup->t_info);
- dlen = IndexTupleSize(itup) - IndexInfoFindDataOffset(itup->t_info);
- dump = palloc0(dlen * 3 + 1);
- values[j] = dump;
- for (off = 0; off < dlen; off++)
+ if (rel)
{
- if (off > 0)
- *dump++ = ' ';
- sprintf(dump, "%02x", *(ptr + off) & 0xff);
- dump += 2;
+ TupleDesc itupdesc = RelationGetDescr(rel);
+ Datum datvalues[INDEX_MAX_KEYS];
+ bool isnull[INDEX_MAX_KEYS];
+ int natts;
+ int indnkeyatts = rel->rd_index->indnkeyatts;
+
+ natts = BTreeTupleGetNAtts(itup, rel);
+
+ itupdesc->natts = Min(indnkeyatts, natts);
+ memset(&isnull, 0xFF, sizeof(isnull));
+ index_deform_tuple(itup, itupdesc, datvalues, isnull);
+ rel->rd_index->indnkeyatts = natts;
+ values[j++] = BuildIndexValueDescription(rel, datvalues, isnull);
+ itupdesc->natts = IndexRelationGetNumberOfAttributes(rel);
+ rel->rd_index->indnkeyatts = indnkeyatts;
}
+ else
+ {
+ dlen = IndexTupleSize(itup) - IndexInfoFindDataOffset(itup->t_info);
+ dump = palloc0(dlen * 3 + 1);
+ values[j++] = dump;
+ for (off = 0; off < dlen; off++)
+ {
+ if (off > 0)
+ *dump++ = ' ';
+ sprintf(dump, "%02x", *(ptr + off) & 0xff);
+ dump += 2;
+ }
+ }
+
+ if (rel && !_bt_heapkeyspace(rel))
+ {
+ min_htid = NULL;
+ max_htid = NULL;
+ }
+ else
+ {
+ min_htid = BTreeTupleGetHeapTID(itup);
+ if (BTreeTupleIsPosting(itup))
+ max_htid = BTreeTupleGetMaxHeapTID(itup);
+ else
+ max_htid = NULL;
+ }
+
+ if (min_htid)
+ values[j++] = psprintf("(%u,%u)",
+ ItemPointerGetBlockNumberNoCheck(min_htid),
+ ItemPointerGetOffsetNumberNoCheck(min_htid));
+ else
+ values[j++] = NULL;
+
+ if (max_htid)
+ values[j++] = psprintf("(%u,%u)",
+ ItemPointerGetBlockNumberNoCheck(max_htid),
+ ItemPointerGetOffsetNumberNoCheck(max_htid));
+ else
+ values[j++] = NULL;
+
+ if (min_htid == NULL)
+ values[j++] = psprintf("0");
+ else if (!BTreeTupleIsPosting(itup))
+ values[j++] = psprintf("1");
+ else
+ values[j++] = psprintf("%d", (int) BTreeTupleGetNPosting(itup));
+
+ if (!ItemIdIsDead(id))
+ values[j++] = psprintf("f");
+ else
+ values[j++] = psprintf("t");
tuple = BuildTupleFromCStrings(fctx->attinmeta, values);
@@ -366,11 +431,11 @@ bt_page_items(PG_FUNCTION_ARGS)
uargs = palloc(sizeof(struct user_args));
+ uargs->rel = rel;
uargs->page = palloc(BLCKSZ);
memcpy(uargs->page, BufferGetPage(buffer), BLCKSZ);
UnlockReleaseBuffer(buffer);
- relation_close(rel, AccessShareLock);
uargs->offset = FirstOffsetNumber;
@@ -397,12 +462,13 @@ bt_page_items(PG_FUNCTION_ARGS)
if (fctx->call_cntr < fctx->max_calls)
{
- result = bt_page_print_tuples(fctx, uargs->page, uargs->offset);
+ result = bt_page_print_tuples(fctx, uargs->page, uargs->offset, uargs->rel);
uargs->offset++;
SRF_RETURN_NEXT(fctx, result);
}
else
{
+ relation_close(uargs->rel, AccessShareLock);
pfree(uargs->page);
pfree(uargs);
SRF_RETURN_DONE(fctx);
@@ -482,7 +548,7 @@ bt_page_items_bytea(PG_FUNCTION_ARGS)
if (fctx->call_cntr < fctx->max_calls)
{
- result = bt_page_print_tuples(fctx, uargs->page, uargs->offset);
+ result = bt_page_print_tuples(fctx, uargs->page, uargs->offset, NULL);
uargs->offset++;
SRF_RETURN_NEXT(fctx, result);
}
diff --git a/contrib/pageinspect/expected/btree.out b/contrib/pageinspect/expected/btree.out
index 07c2dcd771..0f6dccaadc 100644
--- a/contrib/pageinspect/expected/btree.out
+++ b/contrib/pageinspect/expected/btree.out
@@ -40,7 +40,11 @@ ctid | (0,1)
itemlen | 16
nulls | f
vars | f
-data | 01 00 00 00 00 00 00 01
+data | (a)=(72057594037927937)
+htid | (0,1)
+max_htid |
+nheap_tids | 1
+isdead | f
SELECT * FROM bt_page_items('test1_a_idx', 2);
ERROR: block number out of range
diff --git a/contrib/pageinspect/pageinspect--1.6--1.7.sql b/contrib/pageinspect/pageinspect--1.6--1.7.sql
index 2433a21af2..00473da938 100644
--- a/contrib/pageinspect/pageinspect--1.6--1.7.sql
+++ b/contrib/pageinspect/pageinspect--1.6--1.7.sql
@@ -24,3 +24,28 @@ CREATE FUNCTION bt_metap(IN relname text,
OUT last_cleanup_num_tuples real)
AS 'MODULE_PATHNAME', 'bt_metap'
LANGUAGE C STRICT PARALLEL SAFE;
+
+--
+-- bt_page_items()
+--
+DROP FUNCTION bt_page_items(IN relname text, IN blkno int4,
+ OUT itemoffset smallint,
+ OUT ctid tid,
+ OUT itemlen smallint,
+ OUT nulls bool,
+ OUT vars bool,
+ OUT data text);
+CREATE FUNCTION bt_page_items(IN relname text, IN blkno int4,
+ OUT itemoffset smallint,
+ OUT ctid tid,
+ OUT itemlen smallint,
+ OUT nulls bool,
+ OUT vars bool,
+ OUT data text,
+ OUT htid tid,
+ OUT max_htid tid,
+ OUT nheap_tids int4,
+ OUT isdead boolean)
+RETURNS SETOF record
+AS 'MODULE_PATHNAME', 'bt_page_items'
+LANGUAGE C STRICT PARALLEL SAFE;
--
2.17.1
v20-0001-Add-deduplication-to-nbtree.patch (application/octet-stream)
From 4fd6fa5c21b79f56f5d3f8f8881778a3d8fb82c5 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Wed, 25 Sep 2019 10:08:53 -0700
Subject: [PATCH v20 1/2] Add deduplication to nbtree
---
contrib/amcheck/verify_nbtree.c | 164 ++++-
src/backend/access/common/reloptions.c | 11 +-
src/backend/access/index/genam.c | 4 +
src/backend/access/nbtree/README | 74 +-
src/backend/access/nbtree/nbtinsert.c | 860 +++++++++++++++++++++++-
src/backend/access/nbtree/nbtpage.c | 211 +++++-
src/backend/access/nbtree/nbtree.c | 175 ++++-
src/backend/access/nbtree/nbtsearch.c | 244 ++++++-
src/backend/access/nbtree/nbtsort.c | 144 +++-
src/backend/access/nbtree/nbtsplitloc.c | 49 +-
src/backend/access/nbtree/nbtutils.c | 326 ++++++++-
src/backend/access/nbtree/nbtxlog.c | 222 +++++-
src/backend/access/rmgrdesc/nbtdesc.c | 28 +-
src/include/access/nbtree.h | 319 ++++++++-
src/include/access/nbtxlog.h | 68 +-
src/include/access/rmgrlist.h | 2 +-
src/tools/valgrind.supp | 21 +
17 files changed, 2732 insertions(+), 190 deletions(-)
diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index 05e7d678ed..bdb0ede577 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -145,6 +145,7 @@ static void bt_tuple_present_callback(Relation index, HeapTuple htup,
bool tupleIsAlive, void *checkstate);
static IndexTuple bt_normalize_tuple(BtreeCheckState *state,
IndexTuple itup);
+static inline IndexTuple bt_posting_logical_tuple(IndexTuple itup, int n);
static bool bt_rootdescend(BtreeCheckState *state, IndexTuple itup);
static inline bool offset_is_negative_infinity(BTPageOpaque opaque,
OffsetNumber offset);
@@ -419,12 +420,13 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
/*
* Size Bloom filter based on estimated number of tuples in index,
* while conservatively assuming that each block must contain at least
- * MaxIndexTuplesPerPage / 5 non-pivot tuples. (Non-leaf pages cannot
- * contain non-pivot tuples. That's okay because they generally make
- * up no more than about 1% of all pages in the index.)
+ * MaxPostingIndexTuplesPerPage / 3 "logical" tuples. heapallindexed
+ * verification fingerprints posting list heap TIDs as plain non-pivot
+ * tuples, complete with index keys. This allows its heap scan to
+ * behave as if posting lists do not exist.
*/
total_pages = RelationGetNumberOfBlocks(rel);
- total_elems = Max(total_pages * (MaxIndexTuplesPerPage / 5),
+ total_elems = Max(total_pages * (MaxPostingIndexTuplesPerPage / 3),
(int64) state->rel->rd_rel->reltuples);
/* Random seed relies on backend srandom() call to avoid repetition */
seed = random();
@@ -924,6 +926,7 @@ bt_target_page_check(BtreeCheckState *state)
size_t tupsize;
BTScanInsert skey;
bool lowersizelimit;
+ ItemPointer scantid;
CHECK_FOR_INTERRUPTS();
@@ -994,29 +997,73 @@ bt_target_page_check(BtreeCheckState *state)
/*
* Readonly callers may optionally verify that non-pivot tuples can
- * each be found by an independent search that starts from the root
+ * each be found by an independent search that starts from the root.
+ * Note that we deliberately don't do individual searches for each
+ * "logical" posting list tuple, since the posting list itself is
+ * validated by other checks.
*/
if (state->rootdescend && P_ISLEAF(topaque) &&
!bt_rootdescend(state, itup))
{
char *itid,
*htid;
+ ItemPointer tid = BTreeTupleGetHeapTID(itup);
itid = psprintf("(%u,%u)", state->targetblock, offset);
htid = psprintf("(%u,%u)",
- ItemPointerGetBlockNumber(&(itup->t_tid)),
- ItemPointerGetOffsetNumber(&(itup->t_tid)));
+ ItemPointerGetBlockNumber(tid),
+ ItemPointerGetOffsetNumber(tid));
ereport(ERROR,
(errcode(ERRCODE_INDEX_CORRUPTED),
errmsg("could not find tuple using search from root page in index \"%s\"",
RelationGetRelationName(state->rel)),
- errdetail_internal("Index tid=%s points to heap tid=%s page lsn=%X/%X.",
+ errdetail_internal("Index tid=%s min heap tid=%s page lsn=%X/%X.",
itid, htid,
(uint32) (state->targetlsn >> 32),
(uint32) state->targetlsn)));
}
+ /*
+ * If tuple is actually a posting list, make sure posting list TIDs
+ * are in order.
+ */
+ if (BTreeTupleIsPosting(itup))
+ {
+ ItemPointerData last;
+ ItemPointer current;
+
+ ItemPointerCopy(BTreeTupleGetHeapTID(itup), &last);
+
+ for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
+ {
+
+ current = BTreeTupleGetPostingN(itup, i);
+
+ if (ItemPointerCompare(current, &last) <= 0)
+ {
+ char *itid,
+ *htid;
+
+ itid = psprintf("(%u,%u)", state->targetblock, offset);
+ htid = psprintf("(%u,%u)",
+ ItemPointerGetBlockNumberNoCheck(current),
+ ItemPointerGetOffsetNumberNoCheck(current));
+
+ ereport(ERROR,
+ (errcode(ERRCODE_INDEX_CORRUPTED),
+ errmsg("posting list heap TIDs out of order in index \"%s\"",
+ RelationGetRelationName(state->rel)),
+ errdetail_internal("Index tid=%s posting list heap tid=%s page lsn=%X/%X.",
+ itid, htid,
+ (uint32) (state->targetlsn >> 32),
+ (uint32) state->targetlsn)));
+ }
+
+ ItemPointerCopy(current, &last);
+ }
+ }
+
/* Build insertion scankey for current page offset */
skey = bt_mkscankey_pivotsearch(state->rel, itup);
@@ -1074,12 +1121,32 @@ bt_target_page_check(BtreeCheckState *state)
{
IndexTuple norm;
- norm = bt_normalize_tuple(state, itup);
- bloom_add_element(state->filter, (unsigned char *) norm,
- IndexTupleSize(norm));
- /* Be tidy */
- if (norm != itup)
- pfree(norm);
+ if (BTreeTupleIsPosting(itup))
+ {
+ /* Fingerprint all elements as distinct "logical" tuples */
+ for (int i = 0; i < BTreeTupleGetNPosting(itup); i++)
+ {
+ IndexTuple logtuple;
+
+ logtuple = bt_posting_logical_tuple(itup, i);
+ norm = bt_normalize_tuple(state, logtuple);
+ bloom_add_element(state->filter, (unsigned char *) norm,
+ IndexTupleSize(norm));
+ /* Be tidy */
+ if (norm != logtuple)
+ pfree(norm);
+ pfree(logtuple);
+ }
+ }
+ else
+ {
+ norm = bt_normalize_tuple(state, itup);
+ bloom_add_element(state->filter, (unsigned char *) norm,
+ IndexTupleSize(norm));
+ /* Be tidy */
+ if (norm != itup)
+ pfree(norm);
+ }
}
/*
@@ -1087,7 +1154,8 @@ bt_target_page_check(BtreeCheckState *state)
*
* If there is a high key (if this is not the rightmost page on its
* entire level), check that high key actually is upper bound on all
- * page items.
+ * page items. If this is a posting list tuple, we'll need to set
+ * scantid to be highest TID in posting list.
*
* We prefer to check all items against high key rather than checking
* just the last and trusting that the operator class obeys the
@@ -1127,6 +1195,9 @@ bt_target_page_check(BtreeCheckState *state)
* tuple. (See also: "Notes About Data Representation" in the nbtree
* README.)
*/
+ scantid = skey->scantid;
+ if (state->heapkeyspace && !BTreeTupleIsPivot(itup))
+ skey->scantid = BTreeTupleGetMaxHeapTID(itup);
if (!P_RIGHTMOST(topaque) &&
!(P_ISLEAF(topaque) ? invariant_leq_offset(state, skey, P_HIKEY) :
invariant_l_offset(state, skey, P_HIKEY)))
@@ -1150,6 +1221,7 @@ bt_target_page_check(BtreeCheckState *state)
(uint32) (state->targetlsn >> 32),
(uint32) state->targetlsn)));
}
+ skey->scantid = scantid;
/*
* * Item order check *
@@ -1164,11 +1236,13 @@ bt_target_page_check(BtreeCheckState *state)
*htid,
*nitid,
*nhtid;
+ ItemPointer tid;
itid = psprintf("(%u,%u)", state->targetblock, offset);
+ tid = BTreeTupleGetHeapTID(itup);
htid = psprintf("(%u,%u)",
- ItemPointerGetBlockNumberNoCheck(&(itup->t_tid)),
- ItemPointerGetOffsetNumberNoCheck(&(itup->t_tid)));
+ ItemPointerGetBlockNumberNoCheck(tid),
+ ItemPointerGetOffsetNumberNoCheck(tid));
nitid = psprintf("(%u,%u)", state->targetblock,
OffsetNumberNext(offset));
@@ -1177,9 +1251,11 @@ bt_target_page_check(BtreeCheckState *state)
state->target,
OffsetNumberNext(offset));
itup = (IndexTuple) PageGetItem(state->target, itemid);
+
+ tid = BTreeTupleGetHeapTID(itup);
nhtid = psprintf("(%u,%u)",
- ItemPointerGetBlockNumberNoCheck(&(itup->t_tid)),
- ItemPointerGetOffsetNumberNoCheck(&(itup->t_tid)));
+ ItemPointerGetBlockNumberNoCheck(tid),
+ ItemPointerGetOffsetNumberNoCheck(tid));
ereport(ERROR,
(errcode(ERRCODE_INDEX_CORRUPTED),
@@ -1189,10 +1265,10 @@ bt_target_page_check(BtreeCheckState *state)
"higher index tid=%s (points to %s tid=%s) "
"page lsn=%X/%X.",
itid,
- P_ISLEAF(topaque) ? "heap" : "index",
+ P_ISLEAF(topaque) ? "min heap" : "index",
htid,
nitid,
- P_ISLEAF(topaque) ? "heap" : "index",
+ P_ISLEAF(topaque) ? "min heap" : "index",
nhtid,
(uint32) (state->targetlsn >> 32),
(uint32) state->targetlsn)));
@@ -1953,10 +2029,10 @@ bt_tuple_present_callback(Relation index, HeapTuple htup, Datum *values,
* verification. In particular, it won't try to normalize opclass-equal
* datums with potentially distinct representations (e.g., btree/numeric_ops
* index datums will not get their display scale normalized-away here).
- * Normalization may need to be expanded to handle more cases in the future,
- * though. For example, it's possible that non-pivot tuples could in the
- * future have alternative logically equivalent representations due to using
- * the INDEX_ALT_TID_MASK bit to implement intelligent deduplication.
+ * Caller does normalization for non-pivot tuples that have a posting list,
+ * since dummy CREATE INDEX callback code generates new tuples with the same
+ * normalized representation. Deduplication is performed opportunistically,
+ * and in general there is no guarantee about how or when it will be applied.
*/
static IndexTuple
bt_normalize_tuple(BtreeCheckState *state, IndexTuple itup)
@@ -1969,6 +2045,9 @@ bt_normalize_tuple(BtreeCheckState *state, IndexTuple itup)
IndexTuple reformed;
int i;
+ /* Caller should only pass "logical" non-pivot tuples here */
+ Assert(!BTreeTupleIsPosting(itup) && !BTreeTupleIsPivot(itup));
+
/* Easy case: It's immediately clear that tuple has no varlena datums */
if (!IndexTupleHasVarwidths(itup))
return itup;
@@ -2031,6 +2110,30 @@ bt_normalize_tuple(BtreeCheckState *state, IndexTuple itup)
return reformed;
}
+/*
+ * Produce palloc()'d "logical" tuple for nth posting list entry.
+ *
+ * In general, deduplication is not supposed to change the logical contents of
+ * an index. Multiple logical index tuples are folded together into one
+ * physical posting list index tuple when convenient.
+ *
+ * heapallindexed verification must normalize-away this variation in
+ * representation by converting posting list tuples into two or more "logical"
+ * tuples. Each logical tuple must be fingerprinted separately -- there must
+ * be one logical tuple for each corresponding Bloom filter probe during the
+ * heap scan.
+ *
+ * Note: Caller needs to call bt_normalize_tuple() with returned tuple.
+ */
+static inline IndexTuple
+bt_posting_logical_tuple(IndexTuple itup, int n)
+{
+ Assert(BTreeTupleIsPosting(itup));
+
+ /* Returns non-posting-list tuple */
+ return BTreeFormPostingTuple(itup, BTreeTupleGetPostingN(itup, n), 1);
+}
+
/*
* Search for itup in index, starting from fast root page. itup must be a
* non-pivot tuple. This is only supported with heapkeyspace indexes, since
@@ -2087,6 +2190,7 @@ bt_rootdescend(BtreeCheckState *state, IndexTuple itup)
insertstate.itup = itup;
insertstate.itemsz = MAXALIGN(IndexTupleSize(itup));
insertstate.itup_key = key;
+ insertstate.postingoff = 0;
insertstate.bounds_valid = false;
insertstate.buf = lbuf;
@@ -2094,7 +2198,9 @@ bt_rootdescend(BtreeCheckState *state, IndexTuple itup)
offnum = _bt_binsrch_insert(state->rel, &insertstate);
/* Compare first >= matching item on leaf page, if any */
page = BufferGetPage(lbuf);
+ /* Should match on first heap TID when tuple has a posting list */
if (offnum <= PageGetMaxOffsetNumber(page) &&
+ insertstate.postingoff <= 0 &&
_bt_compare(state->rel, key, page, offnum) == 0)
exists = true;
_bt_relbuf(state->rel, lbuf);
@@ -2560,14 +2666,18 @@ static inline ItemPointer
BTreeTupleGetHeapTIDCareful(BtreeCheckState *state, IndexTuple itup,
bool nonpivot)
{
- ItemPointer result = BTreeTupleGetHeapTID(itup);
+ ItemPointer result;
BlockNumber targetblock = state->targetblock;
- if (result == NULL && nonpivot)
+ /* Shouldn't be called with a !heapkeyspace index */
+ Assert(state->heapkeyspace);
+ if (BTreeTupleIsPivot(itup) == nonpivot)
ereport(ERROR,
(errcode(ERRCODE_INDEX_CORRUPTED),
errmsg("block %u or its right sibling block or child block in index \"%s\" contains non-pivot tuple that lacks a heap TID",
targetblock, RelationGetRelationName(state->rel))));
+ result = BTreeTupleGetHeapTID(itup);
+
return result;
}
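
The posting list ordering check that this patch adds to bt_target_page_check() can be tried outside the backend. The following is only a minimal sketch under stated assumptions: DemoTid is a hypothetical stand-in for ItemPointerData, and demo_tid_compare() mimics the block-then-offset ordering of ItemPointerCompare(). Neither name exists in PostgreSQL. As in the patch, a TID that is equal to its predecessor counts as corruption, since a posting list must be strictly ascending.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for ItemPointerData: a (block, offset) pair */
typedef struct DemoTid
{
	uint32_t	block;
	uint16_t	offset;
} DemoTid;

/* Three-way comparison: block number first, then offset number */
static int
demo_tid_compare(const DemoTid *a, const DemoTid *b)
{
	if (a->block != b->block)
		return (a->block < b->block) ? -1 : 1;
	if (a->offset != b->offset)
		return (a->offset < b->offset) ? -1 : 1;
	return 0;
}

/*
 * Return the array index of the first TID that is not strictly greater
 * than its predecessor, or -1 when the whole list is strictly ascending.
 */
static int
demo_posting_first_misordered(const DemoTid *tids, int ntids)
{
	assert(ntids >= 0);

	for (int i = 1; i < ntids; i++)
	{
		if (demo_tid_compare(&tids[i], &tids[i - 1]) <= 0)
			return i;
	}

	return -1;
}
```

A caller would report corruption whenever the function returns something other than -1, using the returned position to identify the offending posting list entry.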
diff --git a/src/backend/access/common/reloptions.c b/src/backend/access/common/reloptions.c
index b5072c00fe..e6448e4a86 100644
--- a/src/backend/access/common/reloptions.c
+++ b/src/backend/access/common/reloptions.c
@@ -158,6 +158,15 @@ static relopt_bool boolRelOpts[] =
},
true
},
+ {
+ {
+ "deduplication",
+ "Enables deduplication on btree index leaf pages",
+ RELOPT_KIND_BTREE,
+ ShareUpdateExclusiveLock
+ },
+ true
+ },
/* list terminator */
{{NULL}}
};
@@ -1513,8 +1522,6 @@ default_reloptions(Datum reloptions, bool validate, relopt_kind kind)
offsetof(StdRdOptions, user_catalog_table)},
{"parallel_workers", RELOPT_TYPE_INT,
offsetof(StdRdOptions, parallel_workers)},
- {"vacuum_cleanup_index_scale_factor", RELOPT_TYPE_REAL,
- offsetof(StdRdOptions, vacuum_cleanup_index_scale_factor)},
{"vacuum_index_cleanup", RELOPT_TYPE_BOOL,
offsetof(StdRdOptions, vacuum_index_cleanup)},
{"vacuum_truncate", RELOPT_TYPE_BOOL,
diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
index 2599b5d342..6e1dc596e1 100644
--- a/src/backend/access/index/genam.c
+++ b/src/backend/access/index/genam.c
@@ -276,6 +276,10 @@ BuildIndexValueDescription(Relation indexRelation,
/*
* Get the latestRemovedXid from the table entries pointed at by the index
* tuples being deleted.
+ *
+ * Note: index access methods that don't consistently use the standard
+ * IndexTuple + heap TID item pointer representation will need to provide
+ * their own version of this function.
*/
TransactionId
index_compute_xid_horizon_for_tuples(Relation irel,
diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index 6db203e75c..54cb9db49d 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -432,7 +432,10 @@ because we allow LP_DEAD to be set with only a share lock (it's exactly
like a hint bit for a heap tuple), but physically removing tuples requires
exclusive lock. In the current code we try to remove LP_DEAD tuples when
we are otherwise faced with having to split a page to do an insertion (and
-hence have exclusive lock on it already).
+hence have exclusive lock on it already). Deduplication can also prevent
+a page split, but removing LP_DEAD tuples is the preferred approach.
+(Note that posting list tuples can only have their LP_DEAD bit set when
+every "logical" tuple represented within the posting list is known dead.)
This leaves the index in a state where it has no entry for a dead tuple
that still exists in the heap. This is not a problem for the current
@@ -710,6 +713,75 @@ the fallback strategy assumes that duplicates are mostly inserted in
ascending heap TID order. The page is split in a way that leaves the left
half of the page mostly full, and the right half of the page mostly empty.
+Notes about deduplication
+-------------------------
+
+We deduplicate non-pivot tuples in non-unique indexes to reduce storage
+overhead, and to avoid or at least delay page splits. Deduplication alters
+the physical representation of tuples without changing the logical contents
+of the index, and without adding overhead to read queries. Non-pivot
+tuples are folded together into a single physical tuple with a posting list
+(a simple array of heap TIDs with the standard item pointer format).
+Deduplication is always applied lazily, at the point where it would
+otherwise be necessary to perform a page split. It occurs only after
+any LP_DEAD items have been removed, as our last line of defense against
+splitting a leaf page. We can set the LP_DEAD bit with posting list
+tuples, though only when all table tuples are known dead. (Bitmap scans
+cannot perform LP_DEAD bit setting, and are the common case with indexes
+that contain lots of duplicates, so this downside is considered
+acceptable.)
+
+Large groups of logical duplicates tend to appear together on the same leaf
+page due to the special duplicate logic used when choosing a split point.
+This facilitates lazy/dynamic deduplication. Deduplication can reliably
+deduplicate a large localized group of duplicates before it can span
+multiple leaf pages. Posting list tuples are subject to the same 1/3 of a
+page restriction as any other tuple.
+
+Lazy deduplication allows the page space accounting used during page splits
+to have absolutely minimal special case logic for posting lists. A posting
+list can be thought of as extra payload that suffix truncation will
+reliably truncate away as needed during page splits, just like non-key
+columns from an INCLUDE index tuple. An incoming tuple (which might cause
+a page split) can always be thought of as a non-posting-list tuple that
+must be inserted alongside existing items, without needing to consider
+deduplication. Most of the time, that's what actually happens: incoming
+tuples are either not duplicates, or are duplicates with a heap TID that
+doesn't overlap with any existing posting list tuple. When the incoming
+tuple really does overlap with an existing posting list, a posting list
+split is performed. Posting list splits work in a way that more or less
+preserves the illusion that all incoming tuples do not need to be merged
+with any existing posting list tuple.
+
+Posting list splits work by "overriding" the details of the incoming tuple.
+The heap TID of the incoming tuple is altered to make it match the
+rightmost heap TID from the existing/originally overlapping posting list.
+The offset number that the new/incoming tuple is to be inserted at is
+incremented so that it will be inserted to the right of the existing
+posting list. The insertion (or page split) operation that completes the
+insert does one extra step: an in-place update of the posting list. The
+update changes the posting list such that the "true" heap TID from the
+original incoming tuple is now contained in the posting list. We make
+space in the posting list by removing the heap TID that became the new
+item. The size of the posting list won't change, and so the page split
+space accounting does not need to care about posting lists. Also, overall
+space utilization is improved by keeping existing posting lists large.
+
+The representation of posting lists is identical to the posting lists used
+by GIN, so it would be straightforward to apply GIN's varbyte encoding
+compression scheme to individual posting lists. Posting list compression
+would break the assumptions made by posting list splits about page space
+accounting, though, so it's not clear how compression could be integrated
+with nbtree. Besides, posting list compression does not offer a compelling
+trade-off for nbtree, since in general nbtree is optimized for consistent
+performance with many concurrent readers and writers. A major goal of
+nbtree's lazy approach to deduplication is to limit the performance impact
+of deduplication with random updates. Even concurrent append-only inserts
+of the same key value will tend to have inserts of individual index tuples
+in an order that doesn't quite match heap TID order. In general, delaying
+deduplication avoids many unnecessary posting list splits, and minimizes
+page level fragmentation.
+
Notes About Data Representation
-------------------------------
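
The posting list split described in the README text above may be easier to follow on a plain array. This is a rough sketch, not the patch's code: SketchTid and sketch_posting_split() are hypothetical names, and the sketch updates the array in place, whereas _bt_posting_split() in the patch works on a palloc()'d copy of the posting tuple. The steps are the same: shift the tail one slot to the right, drop the original rightmost TID, store the incoming TID in the gap, and hand the dropped TID back as the new item's heap TID.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for ItemPointerData: a (block, offset) pair */
typedef struct SketchTid
{
	uint32_t	block;
	uint16_t	offset;
} SketchTid;

/*
 * Split a posting list, represented as an ascending array of ntids TIDs.
 *
 * postingoff is the array position where *newtid belongs.  It must be
 * greater than zero and less than ntids, since the incoming TID falls
 * strictly inside the range covered by the posting list.
 *
 * On return the array has the same length as before and contains the
 * incoming TID, while *newtid holds the original rightmost TID.
 */
static void
sketch_posting_split(SketchTid *tids, int ntids, int postingoff,
					 SketchTid *newtid)
{
	SketchTid	rightmost;

	assert(ntids > 1);
	assert(postingoff > 0 && postingoff < ntids);

	/* Remember the TID that is about to be pushed out of the list */
	rightmost = tids[ntids - 1];

	/* Shift the tail one place to the right, losing the rightmost TID */
	memmove(&tids[postingoff + 1], &tids[postingoff],
			(size_t) (ntids - postingoff - 1) * sizeof(SketchTid));

	/* Fill the gap with the incoming TID */
	tids[postingoff] = *newtid;

	/* The incoming tuple now carries the old rightmost TID */
	*newtid = rightmost;
}
```

Because the array length never changes, the caller can treat the incoming tuple as an ordinary non-posting-list tuple that goes immediately to the right of the posting list. That is why the page split space accounting does not need to know about posting lists.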
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index b84bf1c3df..3d213dfd2d 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -47,21 +47,27 @@ static void _bt_insertonpg(Relation rel, BTScanInsert itup_key,
BTStack stack,
IndexTuple itup,
OffsetNumber newitemoff,
+ int postingoff,
bool split_only_page);
static Buffer _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf,
Buffer cbuf, OffsetNumber newitemoff, Size newitemsz,
- IndexTuple newitem);
+ IndexTuple newitem, IndexTuple orignewitem,
+ IndexTuple nposting, OffsetNumber postingoff);
static void _bt_insert_parent(Relation rel, Buffer buf, Buffer rbuf,
BTStack stack, bool is_root, bool is_only);
static bool _bt_pgaddtup(Page page, Size itemsize, IndexTuple itup,
OffsetNumber itup_off);
static void _bt_vacuum_one_page(Relation rel, Buffer buffer, Relation heapRel);
+static void _bt_dedup_one_page(Relation rel, Buffer buffer, Relation heapRel,
+ IndexTuple newitem, Size newitemsz,
+ bool checkingunique);
/*
* _bt_doinsert() -- Handle insertion of a single index tuple in the tree.
*
* This routine is called by the public interface routine, btinsert.
- * By here, itup is filled in, including the TID.
+ * By here, itup is filled in, including the TID. Caller should be
+ * prepared for us to scribble on 'itup'.
*
* If checkUnique is UNIQUE_CHECK_NO or UNIQUE_CHECK_PARTIAL, this
* will allow duplicates. Otherwise (UNIQUE_CHECK_YES or
@@ -123,6 +129,7 @@ _bt_doinsert(Relation rel, IndexTuple itup,
/* PageAddItem will MAXALIGN(), but be consistent */
insertstate.itemsz = MAXALIGN(IndexTupleSize(itup));
insertstate.itup_key = itup_key;
+ insertstate.postingoff = 0;
insertstate.bounds_valid = false;
insertstate.buf = InvalidBuffer;
@@ -300,7 +307,7 @@ top:
newitemoff = _bt_findinsertloc(rel, &insertstate, checkingunique,
stack, heapRel);
_bt_insertonpg(rel, itup_key, insertstate.buf, InvalidBuffer, stack,
- itup, newitemoff, false);
+ itup, newitemoff, insertstate.postingoff, false);
}
else
{
@@ -428,14 +435,36 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
if (!ItemIdIsDead(curitemid))
{
ItemPointerData htid;
+ bool posting;
bool all_dead;
+ bool posting_all_dead;
+ int npost;
+
if (_bt_compare(rel, itup_key, page, offset) != 0)
break; /* we're past all the equal tuples */
/* okay, we gotta fetch the heap tuple ... */
curitup = (IndexTuple) PageGetItem(page, curitemid);
- htid = curitup->t_tid;
+
+ if (!BTreeTupleIsPosting(curitup))
+ {
+ htid = curitup->t_tid;
+ posting = false;
+ posting_all_dead = true;
+ }
+ else
+ {
+ posting = true;
+ /* Initial assumption */
+ posting_all_dead = true;
+ }
+
+ npost = 0;
+ doposttup:
+ if (posting)
+ htid = *BTreeTupleGetPostingN(curitup, npost);
+
/*
* If we are doing a recheck, we expect to find the tuple we
@@ -446,6 +475,9 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
ItemPointerCompare(&htid, &itup->t_tid) == 0)
{
found = true;
+ posting_all_dead = false;
+ if (posting)
+ goto nextpost;
}
/*
@@ -511,8 +543,7 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
* not part of this chain because it had a different index
* entry.
*/
- htid = itup->t_tid;
- if (table_index_fetch_tuple_check(heapRel, &htid,
+ if (table_index_fetch_tuple_check(heapRel, &itup->t_tid,
SnapshotSelf, NULL))
{
/* Normal case --- it's still live */
@@ -570,7 +601,7 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
RelationGetRelationName(rel))));
}
}
- else if (all_dead)
+ else if (all_dead && !posting)
{
/*
* The conflicting tuple (or whole HOT chain) is dead to
@@ -589,6 +620,35 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
else
MarkBufferDirtyHint(insertstate->buf, true);
}
+ else if (posting)
+ {
+ nextpost:
+ if (!all_dead)
+ posting_all_dead = false;
+
+ /* Iterate over single posting list tuple */
+ npost++;
+ if (npost < BTreeTupleGetNPosting(curitup))
+ goto doposttup;
+
+ /*
+ * Mark posting list tuple dead only if every HOT chain whose
+ * root TID is contained in the posting list consists entirely
+ * of dead tuples
+ */
+ if (posting_all_dead)
+ {
+ ItemIdMarkDead(curitemid);
+ opaque->btpo_flags |= BTP_HAS_GARBAGE;
+
+ if (nbuf != InvalidBuffer)
+ MarkBufferDirtyHint(nbuf, true);
+ else
+ MarkBufferDirtyHint(insertstate->buf, true);
+ }
+
+ /* Move on to next index tuple */
+ }
}
}
@@ -689,6 +749,7 @@ _bt_findinsertloc(Relation rel,
BTScanInsert itup_key = insertstate->itup_key;
Page page = BufferGetPage(insertstate->buf);
BTPageOpaque lpageop;
+ OffsetNumber location;
lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -751,13 +812,26 @@ _bt_findinsertloc(Relation rel,
/*
* If the target page is full, see if we can obtain enough space by
- * erasing LP_DEAD items
+ * erasing LP_DEAD items. If that doesn't free enough space, and
+ * deduplication is both possible and enabled for this index, try it.
*/
- if (PageGetFreeSpace(page) < insertstate->itemsz &&
- P_HAS_GARBAGE(lpageop))
+ if (PageGetFreeSpace(page) < insertstate->itemsz)
{
- _bt_vacuum_one_page(rel, insertstate->buf, heapRel);
- insertstate->bounds_valid = false;
+ if (P_HAS_GARBAGE(lpageop))
+ {
+ _bt_vacuum_one_page(rel, insertstate->buf, heapRel);
+ insertstate->bounds_valid = false;
+ }
+
+ if (insertstate->itup_key->dedup_is_possible &&
+ BtreeGetDoDedupOption(rel) &&
+ PageGetFreeSpace(page) < insertstate->itemsz)
+ {
+ _bt_dedup_one_page(rel, insertstate->buf, heapRel,
+ insertstate->itup, insertstate->itemsz,
+ checkingunique);
+ insertstate->bounds_valid = false;
+ }
}
}
else
@@ -839,7 +913,38 @@ _bt_findinsertloc(Relation rel,
Assert(P_RIGHTMOST(lpageop) ||
_bt_compare(rel, itup_key, page, P_HIKEY) <= 0);
- return _bt_binsrch_insert(rel, insertstate);
+ location = _bt_binsrch_insert(rel, insertstate);
+
+ /*
+ * Insertion is not prepared for the case where an LP_DEAD posting list
+ * tuple must be split. In the unlikely event that this happens, call
+ * _bt_dedup_one_page() to force it to kill all LP_DEAD items.
+ */
+ if (unlikely(insertstate->postingoff == -1))
+ {
+ Assert(insertstate->itup_key->dedup_is_possible);
+
+ /*
+ * Don't check if the option is enabled, since no actual deduplication
+ * will be done, just cleanup.
+ */
+ _bt_dedup_one_page(rel, insertstate->buf, heapRel, insertstate->itup,
+ 0, checkingunique);
+ Assert(!P_HAS_GARBAGE(lpageop));
+
+ /* Must reset insertstate ahead of new _bt_binsrch_insert() call */
+ insertstate->bounds_valid = false;
+ insertstate->postingoff = 0;
+ location = _bt_binsrch_insert(rel, insertstate);
+
+ /*
+ * Might still have to split some other posting list now, but that
+ * should never be LP_DEAD
+ */
+ Assert(insertstate->postingoff >= 0);
+ }
+
+ return location;
}
/*
@@ -900,15 +1005,81 @@ _bt_stepright(Relation rel, BTInsertState insertstate, BTStack stack)
insertstate->bounds_valid = false;
}
+/*
+ * Form a new posting list during a posting split.
+ *
+ * If caller determines that its new tuple 'newitem' is a duplicate with a
+ * heap TID that falls inside the range of an existing posting list tuple
+ * 'oposting', it must generate a new posting tuple to replace the original.
+ * The new posting list is guaranteed to be the same size as the original.
+ * Caller must also change newitem to have the heap TID of the rightmost TID
+ * in the original posting list. Both steps are always handled by calling
+ * here.
+ *
+ * Returns new posting list palloc()'d in caller's context. Also modifies
+ * caller's newitem to contain final/effective heap TID, which is what caller
+ * actually inserts on the page.
+ *
+ * Exported for use by recovery. Note that recovery path must recreate the
+ * same version of newitem that is passed here on the primary, even though
+ * that differs from the final newitem actually added to the page. This
+ * optimization avoids explicit WAL-logging of entire posting lists, which
+ * tend to be rather large.
+ */
+IndexTuple
+_bt_posting_split(IndexTuple newitem, IndexTuple oposting,
+ OffsetNumber postingoff)
+{
+ int nhtids;
+ char *replacepos;
+ char *rightpos;
+ Size nbytes;
+ IndexTuple nposting;
+
+ Assert(BTreeTupleIsPosting(oposting));
+ nhtids = BTreeTupleGetNPosting(oposting);
+ Assert(postingoff < nhtids);
+
+ nposting = CopyIndexTuple(oposting);
+ replacepos = (char *) BTreeTupleGetPostingN(nposting, postingoff);
+ rightpos = replacepos + sizeof(ItemPointerData);
+ nbytes = (nhtids - postingoff - 1) * sizeof(ItemPointerData);
+
+ /*
+ * Move item pointers in posting list to make a gap for the new item's
+ * heap TID (shift TIDs one place to the right, losing original rightmost
+ * TID).
+ */
+ memmove(rightpos, replacepos, nbytes);
+
+ /*
+ * Fill the gap with the TID of the new item.
+ */
+ ItemPointerCopy(&newitem->t_tid, (ItemPointer) replacepos);
+
+ /*
+ * Copy the original posting list's rightmost TID into new item (taken
+ * from oposting, since nposting no longer contains it)
+ */
+ ItemPointerCopy(BTreeTupleGetPostingN(oposting, nhtids - 1),
+ &newitem->t_tid);
+ Assert(ItemPointerCompare(BTreeTupleGetMaxHeapTID(nposting),
+ BTreeTupleGetHeapTID(newitem)) < 0);
+ Assert(BTreeTupleGetNPosting(nposting) == BTreeTupleGetNPosting(oposting));
+
+ return nposting;
+}
+
/*----------
* _bt_insertonpg() -- Insert a tuple on a particular page in the index.
*
* This recursive procedure does the following things:
*
+ * + if necessary, splits an existing posting list on page.
+ * This is only needed when 'postingoff' is non-zero.
* + if necessary, splits the target page, using 'itup_key' for
* suffix truncation on leaf pages (caller passes NULL for
* non-leaf pages).
- * + inserts the tuple.
+ * + inserts the new tuple (could be from split posting list).
* + if the page was split, pops the parent stack, and finds the
* right place to insert the new child pointer (by walking
* right using information stored in the parent stack).
@@ -918,7 +1089,8 @@ _bt_stepright(Relation rel, BTInsertState insertstate, BTStack stack)
*
* On entry, we must have the correct buffer in which to do the
* insertion, and the buffer must be pinned and write-locked. On return,
- * we will have dropped both the pin and the lock on the buffer.
+ * we will have dropped both the pin and the lock on the buffer. Caller
+ * should be prepared for us to scribble on 'itup'.
*
* This routine only performs retail tuple insertions. 'itup' should
* always be either a non-highkey leaf item, or a downlink (new high
@@ -936,11 +1108,15 @@ _bt_insertonpg(Relation rel,
BTStack stack,
IndexTuple itup,
OffsetNumber newitemoff,
+ int postingoff,
bool split_only_page)
{
Page page;
BTPageOpaque lpageop;
Size itemsz;
+ IndexTuple oposting;
+ IndexTuple origitup = NULL;
+ IndexTuple nposting = NULL;
page = BufferGetPage(buf);
lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -954,6 +1130,8 @@ _bt_insertonpg(Relation rel,
Assert(P_ISLEAF(lpageop) ||
BTreeTupleGetNAtts(itup, rel) <=
IndexRelationGetNumberOfKeyAttributes(rel));
+ /* retail insertions of posting list tuples are disallowed */
+ Assert(!BTreeTupleIsPosting(itup));
/* The caller should've finished any incomplete splits already. */
if (P_INCOMPLETE_SPLIT(lpageop))
@@ -964,6 +1142,46 @@ _bt_insertonpg(Relation rel,
itemsz = MAXALIGN(itemsz); /* be safe, PageAddItem will do this but we
* need to be consistent */
+ /*
+ * Do we need to split an existing posting list item?
+ */
+ if (postingoff != 0)
+ {
+ ItemId itemid = PageGetItemId(page, newitemoff);
+
+ /*
+ * The new tuple is a duplicate with a heap TID that falls inside the
+ * range of an existing posting list tuple, so split posting list.
+ *
+ * Posting list splits always replace some existing TID in the posting
+ * list with the new item's heap TID (based on a posting list offset
+ * from caller) by removing rightmost heap TID from posting list. The
+ * new item's heap TID is swapped with that rightmost heap TID, almost
+ * as if the tuple inserted never overlapped with a posting list in
+ * the first place. This allows the insertion and page split code to
+ * have minimal special case handling of posting lists.
+ *
+ * The only extra handling required is to overwrite the original
+ * posting list with nposting, which is guaranteed to be the same size
+ * as the original, keeping the page space accounting simple. This
+ * takes place in either the page insert or page split critical
+ * section.
+ */
+ Assert(P_ISLEAF(lpageop));
+ Assert(!ItemIdIsDead(itemid));
+ Assert(postingoff > 0);
+ oposting = (IndexTuple) PageGetItem(page, itemid);
+
+ /* Save a copy of itup with its original TID; that copy is what gets WAL-logged */
+ origitup = CopyIndexTuple(itup);
+ nposting = _bt_posting_split(itup, oposting, postingoff);
+
+ Assert(BTreeTupleGetNPosting(nposting) ==
+ BTreeTupleGetNPosting(oposting));
+ /* Alter new item offset, since effective new item changed */
+ newitemoff = OffsetNumberNext(newitemoff);
+ }
+
/*
* Do we need to split the page to fit the item on it?
*
@@ -996,7 +1214,8 @@ _bt_insertonpg(Relation rel,
BlockNumberIsValid(RelationGetTargetBlock(rel))));
/* split the buffer into left and right halves */
- rbuf = _bt_split(rel, itup_key, buf, cbuf, newitemoff, itemsz, itup);
+ rbuf = _bt_split(rel, itup_key, buf, cbuf, newitemoff, itemsz, itup,
+ origitup, nposting, postingoff);
PredicateLockPageSplit(rel,
BufferGetBlockNumber(buf),
BufferGetBlockNumber(rbuf));
@@ -1075,6 +1294,18 @@ _bt_insertonpg(Relation rel,
elog(PANIC, "failed to add new item to block %u in index \"%s\"",
itup_blkno, RelationGetRelationName(rel));
+ if (nposting)
+ {
+ /*
+ * Posting list split requires an in-place update of the existing
+ * posting list
+ */
+ Assert(P_ISLEAF(lpageop));
+ Assert(MAXALIGN(IndexTupleSize(oposting)) ==
+ MAXALIGN(IndexTupleSize(nposting)));
+ memcpy(oposting, nposting, MAXALIGN(IndexTupleSize(nposting)));
+ }
+
MarkBufferDirty(buf);
if (BufferIsValid(metabuf))
@@ -1116,6 +1347,7 @@ _bt_insertonpg(Relation rel,
XLogRecPtr recptr;
xlrec.offnum = itup_off;
+ xlrec.postingoff = postingoff;
XLogBeginInsert();
XLogRegisterData((char *) &xlrec, SizeOfBtreeInsert);
@@ -1144,6 +1376,7 @@ _bt_insertonpg(Relation rel,
xlmeta.oldest_btpo_xact = metad->btm_oldest_btpo_xact;
xlmeta.last_cleanup_num_heap_tuples =
metad->btm_last_cleanup_num_heap_tuples;
+ xlmeta.btm_dedup_is_possible = metad->btm_dedup_is_possible;
XLogRegisterBuffer(2, metabuf, REGBUF_WILL_INIT | REGBUF_STANDARD);
XLogRegisterBufData(2, (char *) &xlmeta, sizeof(xl_btree_metadata));
@@ -1152,7 +1385,19 @@ _bt_insertonpg(Relation rel,
}
XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
- XLogRegisterBufData(0, (char *) itup, IndexTupleSize(itup));
+
+ /*
+ * We always write newitem to the page, but when there is an
+ * original newitem due to a posting list split then we log the
+ * original item instead. REDO routine must reconstruct the final
+ * newitem at the same time it reconstructs nposting.
+ */
+ if (postingoff == 0)
+ XLogRegisterBufData(0, (char *) itup,
+ IndexTupleSize(itup));
+ else
+ XLogRegisterBufData(0, (char *) origitup,
+ IndexTupleSize(origitup));
recptr = XLogInsert(RM_BTREE_ID, xlinfo);
@@ -1194,6 +1439,13 @@ _bt_insertonpg(Relation rel,
_bt_getrootheight(rel) >= BTREE_FASTPATH_MIN_LEVEL)
RelationSetTargetBlock(rel, cachedBlock);
}
+
+ /* be tidy */
+ if (postingoff != 0)
+ {
+ pfree(nposting);
+ pfree(origitup);
+ }
}
/*
@@ -1209,12 +1461,25 @@ _bt_insertonpg(Relation rel,
* This function will clear the INCOMPLETE_SPLIT flag on it, and
* release the buffer.
*
+ * orignewitem, nposting, and postingoff are needed when an insert of
+ * orignewitem results in both a posting list split and a page split.
+ * newitem and nposting are replacements for orignewitem and the
+ * existing posting list on the page respectively. These extra
+ * posting list split details are used here in the same way as they
+ * are used in the more common case where a posting list split does
+ * not coincide with a page split. We need to deal with posting list
+ * splits directly in order to ensure that everything that follows
+ * from the insert of orignewitem is handled as a single atomic
+ * operation (though caller's insert of a new pivot/downlink into
+ * parent page will still be a separate operation).
+ *
* Returns the new right sibling of buf, pinned and write-locked.
* The pin and lock on buf are maintained.
*/
static Buffer
_bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
- OffsetNumber newitemoff, Size newitemsz, IndexTuple newitem)
+ OffsetNumber newitemoff, Size newitemsz, IndexTuple newitem,
+ IndexTuple orignewitem, IndexTuple nposting, OffsetNumber postingoff)
{
Buffer rbuf;
Page origpage;
@@ -1236,12 +1501,23 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
OffsetNumber firstright;
OffsetNumber maxoff;
OffsetNumber i;
+ OffsetNumber replacepostingoff = InvalidOffsetNumber;
bool newitemonleft,
isleaf;
IndexTuple lefthikey;
int indnatts = IndexRelationGetNumberOfAttributes(rel);
int indnkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
+ /*
+ * Determine offset number of existing posting list on page when a split
+ * of a posting list needs to take place as the page is split
+ */
+ if (nposting != NULL)
+ {
+ Assert(itup_key->heapkeyspace);
+ replacepostingoff = OffsetNumberPrev(newitemoff);
+ }
+
/*
* origpage is the original page to be split. leftpage is a temporary
* buffer that receives the left-sibling data, which will be copied back
@@ -1273,6 +1549,13 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
* newitemoff == firstright. In all other cases it's clear which side of
* the split every tuple goes on from context. newitemonleft is usually
* (but not always) redundant information.
+ *
+ * Note: In theory, the split point choice logic should operate against a
+ * version of the page that already replaced the posting list at offset
+ * replacepostingoff with nposting where applicable. We don't bother with
+ * that, though. Both versions of the posting list must be the same size,
+ * and both will have the same base tuple key values, so split point
+ * choice is never affected.
*/
firstright = _bt_findsplitloc(rel, origpage, newitemoff, newitemsz,
newitem, &newitemonleft);
@@ -1340,6 +1623,9 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
itemid = PageGetItemId(origpage, firstright);
itemsz = ItemIdGetLength(itemid);
item = (IndexTuple) PageGetItem(origpage, itemid);
+ /* Behave as if origpage posting list has already been swapped */
+ if (firstright == replacepostingoff)
+ item = nposting;
}
/*
@@ -1373,6 +1659,9 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
Assert(lastleftoff >= P_FIRSTDATAKEY(oopaque));
itemid = PageGetItemId(origpage, lastleftoff);
lastleft = (IndexTuple) PageGetItem(origpage, itemid);
+ /* Behave as if origpage posting list has already been swapped */
+ if (lastleftoff == replacepostingoff)
+ lastleft = nposting;
}
Assert(lastleft != item);
@@ -1480,8 +1769,23 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
itemsz = ItemIdGetLength(itemid);
item = (IndexTuple) PageGetItem(origpage, itemid);
+ /*
+ * did caller pass new replacement posting list tuple due to posting
+ * list split?
+ */
+ if (i == replacepostingoff)
+ {
+ /*
+ * swap origpage posting list with post-posting-list-split version
+ * from caller
+ */
+ Assert(isleaf);
+ Assert(itemsz == MAXALIGN(IndexTupleSize(nposting)));
+ item = nposting;
+ }
+
/* does new item belong before this one? */
- if (i == newitemoff)
+ else if (i == newitemoff)
{
if (newitemonleft)
{
@@ -1650,8 +1954,12 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
XLogRecPtr recptr;
xlrec.level = ropaque->btpo.level;
+ /* See comments below on newitem, orignewitem, and posting lists */
xlrec.firstright = firstright;
xlrec.newitemoff = newitemoff;
+ xlrec.postingoff = InvalidOffsetNumber;
+ if (replacepostingoff < firstright)
+ xlrec.postingoff = postingoff;
XLogBeginInsert();
XLogRegisterData((char *) &xlrec, SizeOfBtreeSplit);
@@ -1670,11 +1978,46 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
* because it's included with all the other items on the right page.)
* Show the new item as belonging to the left page buffer, so that it
* is not stored if XLogInsert decides it needs a full-page image of
- * the left page. We store the offset anyway, though, to support
- * archive compression of these records.
+ * the left page. We always store newitemoff in record, though.
+ *
+ * The details are often slightly different for page splits that
+ * coincide with a posting list split. If both the replacement
+ * posting list and newitem go on the right page, then we don't need
+ * to log anything extra, just like the simple !newitemonleft
+ * no-posting-split case (postingoff isn't set in the WAL record, so
+ * recovery can't even tell the difference). Otherwise, we set
+ * postingoff and log orignewitem instead of newitem, despite having
+ * actually inserted newitem. Recovery must reconstruct nposting and
+ * newitem by repeating the actions of our caller (i.e. by passing
+ * original posting list and orignewitem to _bt_posting_split()).
+ *
+ * Note: It's possible that our page split point is the point that
+ * makes the posting list lastleft and newitem firstright. This is
+ * the only case where we log orignewitem despite newitem going on the
+ * right page. If XLogInsert decides that it can omit orignewitem due
+ * to logging a full-page image of the left page, everything still
+ * works out, since recovery only needs orignewitem for items
+ * on the left page (just like the regular newitem-logged case).
*/
- if (newitemonleft)
- XLogRegisterBufData(0, (char *) newitem, MAXALIGN(newitemsz));
+ if (newitemonleft || xlrec.postingoff != InvalidOffsetNumber)
+ {
+ if (xlrec.postingoff == InvalidOffsetNumber)
+ {
+ /* Must WAL-log newitem, since it's on left page */
+ Assert(newitemonleft);
+ Assert(orignewitem == NULL && nposting == NULL);
+ XLogRegisterBufData(0, (char *) newitem, MAXALIGN(newitemsz));
+ }
+ else
+ {
+ /* Must WAL-log orignewitem following posting list split */
+ Assert(newitemonleft || firstright == newitemoff);
+ Assert(ItemPointerCompare(&orignewitem->t_tid,
+ &newitem->t_tid) < 0);
+ XLogRegisterBufData(0, (char *) orignewitem,
+ MAXALIGN(IndexTupleSize(orignewitem)));
+ }
+ }
/* Log the left page's new high key */
itemid = PageGetItemId(origpage, P_HIKEY);
@@ -1834,7 +2177,7 @@ _bt_insert_parent(Relation rel,
/* Recursively insert into the parent */
_bt_insertonpg(rel, NULL, pbuf, buf, stack->bts_parent,
- new_item, stack->bts_offset + 1,
+ new_item, stack->bts_offset + 1, 0,
is_only);
/* be tidy */
@@ -2190,6 +2533,7 @@ _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf)
md.fastlevel = metad->btm_level;
md.oldest_btpo_xact = metad->btm_oldest_btpo_xact;
md.last_cleanup_num_heap_tuples = metad->btm_last_cleanup_num_heap_tuples;
+ md.btm_dedup_is_possible = metad->btm_dedup_is_possible;
XLogRegisterBufData(2, (char *) &md, sizeof(xl_btree_metadata));
@@ -2304,6 +2648,472 @@ _bt_vacuum_one_page(Relation rel, Buffer buffer, Relation heapRel)
* Note: if we didn't find any LP_DEAD items, then the page's
* BTP_HAS_GARBAGE hint bit is falsely set. We do not bother expending a
* separate write to clear it, however. We will clear it when we split
- * the page.
+ * the page (or when deduplication runs).
*/
}
+
+/*
+ * Try to deduplicate items to free at least enough space to avoid a page
+ * split. This function should be called after LP_DEAD items were removed by
+ * _bt_vacuum_one_page() to prevent a page split. (We'll have to kill LP_DEAD
+ * items here when the page's BTP_HAS_GARBAGE hint was not set, but that
+ * should be rare.)
+ *
+ * The strategy for !checkingunique callers is to perform as much
+ * deduplication as possible to free as much space as possible now, since
+ * making it harder to set LP_DEAD bits is considered an acceptable price for
+ * not having to deduplicate the same page many times. It is unlikely that
+ * the items on the page will have their LP_DEAD bit set in the future, since
+ * that hasn't happened before now (besides, entire posting lists can still
+ * have their LP_DEAD bit set).
+ *
+ * The strategy for checkingunique callers is rather different, since the
+ * overall goal is different. Deduplication cooperates with and enhances
+ * garbage collection, especially the LP_DEAD bit setting that takes place in
+ * _bt_check_unique(). Deduplication does as little as possible while still
+ * preventing a page split for caller, since it's less likely that posting
+ * lists will have their LP_DEAD bit set. Deduplication avoids creating new
+ * posting lists with only two heap TIDs, and also avoids creating new posting
+ * lists from an existing posting list. Deduplication is only useful when it
+ * delays a page split long enough for garbage collection to prevent the page
+ * split altogether. checkingunique deduplication can make all the difference
+ * in cases where VACUUM keeps up with dead index tuples, but "recently dead"
+ * index tuples are still numerous enough to cause page splits that are truly
+ * unnecessary.
+ *
+ * Note: If newitem contains NULL values in key attributes, caller will be
+ * !checkingunique even when rel is a unique index. The page in question will
+ * usually have many existing items with NULLs.
+ */
+static void
+_bt_dedup_one_page(Relation rel, Buffer buffer, Relation heapRel,
+ IndexTuple newitem, Size newitemsz, bool checkingunique)
+{
+ OffsetNumber offnum,
+ minoff,
+ maxoff;
+ Page page = BufferGetPage(buffer);
+ BTPageOpaque oopaque;
+ BTDedupState *state = NULL;
+ int natts = IndexRelationGetNumberOfAttributes(rel);
+ OffsetNumber deletable[MaxIndexTuplesPerPage];
+ bool minimal = checkingunique;
+ int ndeletable = 0;
+ Size pagesaving = 0;
+
+ oopaque = (BTPageOpaque) PageGetSpecialPointer(page);
+ /* init deduplication state needed to build posting tuples */
+ state = (BTDedupState *) palloc(sizeof(BTDedupState));
+ state->rel = rel;
+
+ state->maxitemsize = BTMaxItemSize(page);
+ state->newitem = newitem;
+ state->checkingunique = checkingunique;
+ /* Metadata about current pending posting list */
+ state->htids = NULL;
+ state->nhtids = 0;
+ state->nitems = 0;
+ state->alltupsize = 0;
+ state->overlap = false;
+ /* Metadata about base tuple of current pending posting list */
+ state->base = NULL;
+ state->baseoff = InvalidOffsetNumber;
+ state->basetupsize = 0;
+
+ minoff = P_FIRSTDATAKEY(oopaque);
+ maxoff = PageGetMaxOffsetNumber(page);
+
+ /*
+ * Delete dead tuples, if any. We cannot simply skip them in the loop
+ * below, because it's necessary to generate a special WAL record
+ * containing such tuples to compute latestRemovedXid on a standby server
+ * later.
+ *
+ * This should not affect performance, since it only can happen in a rare
+ * situation when BTP_HAS_GARBAGE flag was not set and _bt_vacuum_one_page
+ * was not called, or _bt_vacuum_one_page didn't remove all dead items.
+ */
+ for (offnum = minoff;
+ offnum <= maxoff;
+ offnum = OffsetNumberNext(offnum))
+ {
+ ItemId itemid = PageGetItemId(page, offnum);
+
+ if (ItemIdIsDead(itemid))
+ deletable[ndeletable++] = offnum;
+ }
+
+ if (ndeletable > 0)
+ {
+ /*
+ * Skip deduplication in rare cases where there were LP_DEAD items
+ * encountered here when that frees sufficient space for caller to
+ * avoid a page split
+ */
+ _bt_delitems_delete(rel, buffer, deletable, ndeletable, heapRel);
+ if (PageGetFreeSpace(page) >= newitemsz)
+ {
+ pfree(state);
+ return;
+ }
+
+ /* Continue with deduplication */
+ minoff = P_FIRSTDATAKEY(oopaque);
+ maxoff = PageGetMaxOffsetNumber(page);
+ }
+
+ /* Make sure that new page won't have garbage flag set */
+ oopaque->btpo_flags &= ~BTP_HAS_GARBAGE;
+
+ /* Passed-in newitemsz is MAXALIGNED but does not include line pointer */
+ newitemsz += sizeof(ItemIdData);
+ /* Conservatively size array */
+ state->htids = palloc(state->maxitemsize);
+
+ /*
+ * Iterate over tuples on the page, try to deduplicate them into posting
+ * lists and replace them on the page in place. NOTE: It's essential to reassess the
+ * max offset on each iteration, since it will change as items are
+ * deduplicated.
+ */
+retry:
+ offnum = minoff;
+ while (offnum <= PageGetMaxOffsetNumber(page))
+ {
+ ItemId itemid = PageGetItemId(page, offnum);
+ IndexTuple itup = (IndexTuple) PageGetItem(page, itemid);
+
+ Assert(!ItemIdIsDead(itemid));
+
+ if (state->nitems == 0)
+ {
+ /*
+ * No previous/base tuple for the data item -- use the data item
+ * as base tuple of pending posting list
+ */
+ _bt_dedup_start_pending(state, itup, offnum);
+ }
+ else if (_bt_keep_natts_fast(rel, state->base, itup) > natts &&
+ _bt_dedup_save_htid(state, itup))
+ {
+ /*
+ * Tuple is equal to base tuple of pending posting list, and
+ * merging itup into pending posting list won't exceed the
+ * BTMaxItemSize() limit. Heap TID(s) for itup have been saved in
+ * state. The next iteration will also end up here if it's
+ * possible to merge the next tuple into the same pending posting
+ * list.
+ */
+ }
+ else
+ {
+ /*
+ * Tuple is not equal to pending posting list tuple, or
+ * BTMaxItemSize() limit was reached.
+ *
+ * If state contains pending posting list with more than one item,
+ * form new posting tuple, and update the page, otherwise, just
+ * reset the state and move on.
+ */
+ pagesaving += _bt_dedup_finish_pending(buffer, state,
+ RelationNeedsWAL(rel));
+
+ /*
+ * When caller is a checkingunique caller and we have deduplicated
+ * enough to avoid a page split, do minimal deduplication. Don't
+ * prematurely deduplicate items that could still have their
+ * LP_DEAD bits set.
+ */
+ if (minimal && pagesaving >= newitemsz)
+ break;
+
+ /* Continue iteration from base tuple's offnum */
+ offnum = state->baseoff;
+ }
+
+ offnum = OffsetNumberNext(offnum);
+ }
+
+ /* Handle the last item when pending posting list is not empty */
+ if (state->nitems != 0)
+ pagesaving += _bt_dedup_finish_pending(buffer, state,
+ RelationNeedsWAL(rel));
+
+ if (state->checkingunique && pagesaving < newitemsz)
+ {
+ /*
+ * Try again. The second pass over the page may deduplicate items
+ * that were passed over the first time due to concerns about limiting
+ * the effectiveness of LP_DEAD bit setting within _bt_check_unique().
+ * Note that we will still stop deduplicating as soon as enough space
+ * has been freed to avoid caller's page split.
+ *
+ * FIXME: Don't bother with this when it's clearly a total waste of
+ * time. Maybe don't do any checkingunique deduplication for the
+ * rightmost page, either.
+ */
+ state->checkingunique = false;
+ state->alltupsize = 0;
+ state->nitems = 0;
+ state->base = NULL;
+ state->baseoff = InvalidOffsetNumber;
+ state->basetupsize = 0;
+ goto retry;
+ }
+
+ /* be tidy */
+ pfree(state->htids);
+ pfree(state);
+}
+
+/*
+ * Create a new pending posting list tuple based on caller's tuple.
+ *
+ * Every tuple processed by the deduplication routines either becomes the base
+ * tuple for a posting list, or gets its heap TID(s) accepted into a pending
+ * posting list. A tuple that starts out as the base tuple for a posting list
+ * will only actually be rewritten within _bt_dedup_finish_pending() when
+ * there was at least one successful call to _bt_dedup_save_htid().
+ *
+ * Exported for use by nbtsort.c and recovery.
+ */
+void
+_bt_dedup_start_pending(BTDedupState *state, IndexTuple base,
+ OffsetNumber baseoff)
+{
+ Assert(state->nhtids == 0);
+ Assert(state->nitems == 0);
+
+ /*
+ * Copy heap TIDs from new base tuple for new candidate posting list into
+ * htids array. Assume that we'll eventually create a new posting tuple by
+ * merging later tuples with this existing one, though we may not.
+ */
+ if (!BTreeTupleIsPosting(base))
+ {
+ memcpy(state->htids, &base->t_tid, sizeof(ItemPointerData));
+ state->nhtids = 1;
+ /* Save size of tuple without any posting list */
+ state->basetupsize = IndexTupleSize(base);
+ }
+ else
+ {
+ int nposting;
+
+ nposting = BTreeTupleGetNPosting(base);
+ memcpy(state->htids, BTreeTupleGetPosting(base),
+ sizeof(ItemPointerData) * nposting);
+ state->nhtids = nposting;
+ /* Save size of tuple without any posting list */
+ state->basetupsize = BTreeTupleGetPostingOffset(base);
+ }
+
+ /*
+ * Save new base tuple itself -- it'll be needed if we actually create a
+ * new posting list from new pending posting list.
+ *
+ * Must maintain size of all tuples (including line pointer overhead) to
+ * calculate space savings on page within _bt_dedup_finish_pending().
+ * Also, save number of base tuple logical tuples so that we can save
+ * cycles in the common case where an existing posting list can't or won't
+ * be merged with other tuples on the page.
+ */
+ state->nitems = 1;
+ state->base = base;
+ state->baseoff = baseoff;
+ state->alltupsize = MAXALIGN(IndexTupleSize(base)) + sizeof(ItemIdData);
+ /* Also save baseoff in pending state for interval */
+ state->interval.baseoff = state->baseoff;
+ state->overlap = false;
+ if (state->newitem)
+ {
+ /* Might overlap with new item -- mark it as possible if it is */
+ if (ItemPointerCompare(BTreeTupleGetHeapTID(base),
+ BTreeTupleGetHeapTID(state->newitem)) < 0)
+ state->overlap = true;
+ }
+}
+
+/*
+ * Save itup heap TID(s) into pending posting list where possible.
+ *
+ * Returns bool indicating if the pending posting list managed by state has
+ * itup's heap TID(s) saved. When this is false, enlarging the pending
+ * posting list by the required amount would exceed the maxitemsize limit, so
+ * caller must finish the pending posting list tuple. (Generally itup becomes
+ * the base tuple of caller's new pending posting list).
+ *
+ * Exported for use by nbtsort.c and recovery.
+ */
+bool
+_bt_dedup_save_htid(BTDedupState *state, IndexTuple itup)
+{
+ int nhtids;
+ ItemPointer htids;
+ Size mergedtupsz;
+
+ if (!BTreeTupleIsPosting(itup))
+ {
+ nhtids = 1;
+ htids = &itup->t_tid;
+ }
+ else
+ {
+ nhtids = BTreeTupleGetNPosting(itup);
+ htids = BTreeTupleGetPosting(itup);
+ }
+
+ /*
+ * Don't append (have caller finish pending posting list as-is) if
+ * appending heap TID(s) from itup would put us over limit
+ */
+ mergedtupsz = MAXALIGN(state->basetupsize +
+ (state->nhtids + nhtids) *
+ sizeof(ItemPointerData));
+
+ if (mergedtupsz > state->maxitemsize)
+ return false;
+
+ /* Don't merge existing posting lists with checkingunique */
+ if (state->checkingunique && BTreeTupleIsPosting(state->base))
+ return false;
+ if (state->checkingunique && nhtids > 1)
+ return false;
+
+ if (state->overlap)
+ {
+ if (ItemPointerCompare(BTreeTupleGetMaxHeapTID(itup),
+ BTreeTupleGetHeapTID(state->newitem)) > 0)
+ {
+ /*
+ * newitem has heap TID in the range of the would-be new posting
+ * list. Avoid an immediate posting list split for caller.
+ */
+ if (_bt_keep_natts_fast(state->rel, state->newitem, itup) >
+ IndexRelationGetNumberOfAttributes(state->rel))
+ {
+ state->newitem = NULL; /* avoid unnecessary comparisons */
+ return false;
+ }
+ }
+ }
+
+ /*
+ * Save heap TIDs to pending posting list tuple -- itup can be merged into
+ * pending posting list
+ */
+ state->nitems++;
+ memcpy(state->htids + state->nhtids, htids,
+ sizeof(ItemPointerData) * nhtids);
+ state->nhtids += nhtids;
+ state->alltupsize += MAXALIGN(IndexTupleSize(itup)) + sizeof(ItemIdData);
+
+ return true;
+}
+
+/*
+ * Finalize pending posting list tuple, and add it to the page. Final tuple
+ * is based on saved base tuple, and saved list of heap TIDs.
+ *
+ * Returns space saving from deduplicating to make a new posting list tuple.
+ * Note that this includes line pointer overhead. This is zero in the case
+ * where no deduplication was possible.
+ *
+ * Exported for use by recovery.
+ */
+Size
+_bt_dedup_finish_pending(Buffer buffer, BTDedupState *state, bool need_wal)
+{
+ Size spacesaving = 0;
+ Page page = BufferGetPage(buffer);
+ int minimum = 2;
+
+ Assert(state->nitems > 0);
+ Assert(state->nitems <= state->nhtids);
+ Assert(state->interval.baseoff == state->baseoff);
+
+ /*
+ * Only create a posting list when at least 3 heap TIDs will appear in the
+ * checkingunique case (checkingunique strategy won't merge existing
+ * posting list tuples, so we know that the number of items here must also
+ * be the total number of heap TIDs). Creating a new posting list with
+ * only two heap TIDs won't even save enough space to fit another
+ * duplicate with the same key as the posting list. This is a bad
+ * trade-off if there is a chance that the LP_DEAD bit can be set for
+ * either existing tuple by putting off deduplication.
+ *
+ * (Note that a second pass over the page can deduplicate the item if that
+ * is truly the only way to avoid a page split for checkingunique caller)
+ */
+ Assert(!state->checkingunique ||
+ state->nitems == 1 || state->nhtids == state->nitems);
+ if (state->checkingunique)
+ minimum = 3;
+
+ if (state->nitems >= minimum)
+ {
+ IndexTuple final;
+ Size finalsz;
+ OffsetNumber offnum;
+ OffsetNumber deletable[MaxOffsetNumber];
+ int ndeletable = 0;
+
+ /* find all tuples that will be replaced with this new posting tuple */
+ for (offnum = state->baseoff;
+ offnum < state->baseoff + state->nitems;
+ offnum = OffsetNumberNext(offnum))
+ deletable[ndeletable++] = offnum;
+
+ /* Form a tuple with a posting list */
+ final = BTreeFormPostingTuple(state->base, state->htids,
+ state->nhtids);
+ finalsz = IndexTupleSize(final);
+ spacesaving = state->alltupsize - (finalsz + sizeof(ItemIdData));
+ /* Must have saved some space */
+ Assert(spacesaving > 0 && spacesaving < BLCKSZ);
+
+ /* Save final number of items for posting list */
+ state->interval.nitems = state->nitems;
+
+ Assert(finalsz <= state->maxitemsize);
+ Assert(finalsz == MAXALIGN(IndexTupleSize(final)));
+
+ START_CRIT_SECTION();
+
+ /* Delete items to replace */
+ PageIndexMultiDelete(page, deletable, ndeletable);
+ /* Insert posting tuple */
+ if (PageAddItem(page, (Item) final, finalsz, state->baseoff, false,
+ false) == InvalidOffsetNumber)
+ elog(ERROR, "deduplication failed to add tuple to page");
+
+ MarkBufferDirty(buffer);
+
+ /* Log deduplicated items */
+ if (need_wal)
+ {
+ XLogRecPtr recptr;
+ xl_btree_dedup xlrec_dedup;
+
+ xlrec_dedup.baseoff = state->interval.baseoff;
+ xlrec_dedup.nitems = state->interval.nitems;
+
+ XLogBeginInsert();
+ XLogRegisterBuffer(0, buffer, REGBUF_STANDARD);
+ XLogRegisterData((char *) &xlrec_dedup, SizeOfBtreeDedup);
+
+ recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_DEDUP_PAGE);
+
+ PageSetLSN(page, recptr);
+ }
+
+ END_CRIT_SECTION();
+
+ pfree(final);
+ }
+
+ /* Reset state for next pending posting list */
+ state->nhtids = 0;
+ state->nitems = 0;
+ state->alltupsize = 0;
+
+ return spacesaving;
+}
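A rough, standalone illustration of the space accounting in _bt_dedup_finish_pending() above, and of why the checkingunique strategy refuses to build two-TID posting lists. This is only a sketch, not code from the patch. It assumes 8-byte MAXALIGN, a 4-byte line pointer, 6-byte heap TIDs, and that every merged tuple is a plain (non-posting) tuple of the same size. posting_space_saving() and the SKETCH_* macros are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed constants: 8-byte MAXALIGN, 4-byte ItemIdData, 6-byte heap TID */
#define SKETCH_MAXALIGN(len) (((size_t) (len) + 7) & ~((size_t) 7))
#define SKETCH_LINE_POINTER_SZ 4
#define SKETCH_HEAP_TID_SZ 6

/*
 * Bytes freed on the page when ntids equally sized plain tuples are
 * replaced by one posting list tuple.  The result can be negative when
 * merging would not help at all.
 */
static long
posting_space_saving(size_t basetupsize, int ntids)
{
    size_t      before;
    size_t      after;

    assert(ntids >= 1);

    /* every original tuple pays for its own line pointer */
    before = (size_t) ntids *
        (SKETCH_MAXALIGN(basetupsize) + SKETCH_LINE_POINTER_SZ);
    /* the posting tuple is the base tuple plus an array of heap TIDs */
    after = SKETCH_MAXALIGN(basetupsize +
                            (size_t) ntids * SKETCH_HEAP_TID_SZ) +
        SKETCH_LINE_POINTER_SZ;

    return (long) before - (long) after;
}
```

Under these assumptions, a 16-byte base tuple gives a saving of 4 bytes for 2 TIDs, 16 bytes for 3 TIDs and 36 bytes for 4 TIDs, while one more duplicate needs 16 + 4 = 20 bytes. So a two-TID posting list indeed cannot make room for another duplicate.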
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 268f869a36..c08f850595 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -24,6 +24,7 @@
#include "access/nbtree.h"
#include "access/nbtxlog.h"
+#include "access/tableam.h"
#include "access/transam.h"
#include "access/xlog.h"
#include "access/xloginsert.h"
@@ -42,12 +43,17 @@ static bool _bt_lock_branch_parent(Relation rel, BlockNumber child,
BlockNumber *target, BlockNumber *rightsib);
static void _bt_log_reuse_page(Relation rel, BlockNumber blkno,
TransactionId latestRemovedXid);
+static TransactionId _bt_compute_xid_horizon_for_tuples(Relation rel,
+ Relation heapRel,
+ Buffer buf,
+ OffsetNumber *itemnos,
+ int nitems);
/*
* _bt_initmetapage() -- Fill a page buffer with a correct metapage image
*/
void
-_bt_initmetapage(Page page, BlockNumber rootbknum, uint32 level)
+_bt_initmetapage(Page page, BlockNumber rootbknum, uint32 level, bool dedup_is_possible)
{
BTMetaPageData *metad;
BTPageOpaque metaopaque;
@@ -63,6 +69,7 @@ _bt_initmetapage(Page page, BlockNumber rootbknum, uint32 level)
metad->btm_fastlevel = level;
metad->btm_oldest_btpo_xact = InvalidTransactionId;
metad->btm_last_cleanup_num_heap_tuples = -1.0;
+ metad->btm_dedup_is_possible = dedup_is_possible;
metaopaque = (BTPageOpaque) PageGetSpecialPointer(page);
metaopaque->btpo_flags = BTP_META;
@@ -213,6 +220,7 @@ _bt_update_meta_cleanup_info(Relation rel, TransactionId oldestBtpoXact,
md.fastlevel = metad->btm_fastlevel;
md.oldest_btpo_xact = oldestBtpoXact;
md.last_cleanup_num_heap_tuples = numHeapTuples;
+ md.btm_dedup_is_possible = metad->btm_dedup_is_possible;
XLogRegisterBufData(0, (char *) &md, sizeof(xl_btree_metadata));
@@ -394,6 +402,7 @@ _bt_getroot(Relation rel, int access)
md.fastlevel = 0;
md.oldest_btpo_xact = InvalidTransactionId;
md.last_cleanup_num_heap_tuples = -1.0;
+ md.btm_dedup_is_possible = metad->btm_dedup_is_possible;
XLogRegisterBufData(2, (char *) &md, sizeof(xl_btree_metadata));
@@ -683,6 +692,63 @@ _bt_heapkeyspace(Relation rel)
return metad->btm_version > BTREE_NOVAC_VERSION;
}
+/*
+ * _bt_getdedupispossible() -- is deduplication possible for the index?
+ * get information from metapage
+ */
+bool
+_bt_getdedupispossible(Relation rel)
+{
+ BTMetaPageData *metad;
+
+ if (rel->rd_amcache == NULL)
+ {
+ Buffer metabuf;
+
+ metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_READ);
+ metad = _bt_getmeta(rel, metabuf);
+
+ /*
+ * If there's no root page yet, _bt_getroot() doesn't expect a cache
+ * to be made, so just stop here. (XXX perhaps _bt_getroot() should
+ * be changed to allow this case.)
+ *
+ * FIXME: Think some more about pg_upgrade'd !heapkeyspace indexes
+ * here, and the need for a version bump to go with new metapage
+ * field.
+ */
+ if (metad->btm_root == P_NONE)
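+ {
+ /* Read the field before releasing the buffer that metad points into */
+ bool dedup_is_possible = metad->btm_dedup_is_possible;
+
+ _bt_relbuf(rel, metabuf);
+ return dedup_is_possible;
+ }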
+
+ /*
+ * Cache the metapage data for next time
+ *
+ * An on-the-fly version upgrade performed by _bt_upgrademetapage()
+ * can change the nbtree version for an index without invalidating any
+ * local cache. This is okay because it can only happen when moving
+ * from version 2 to version 3, both of which are !heapkeyspace
+ * versions.
+ */
+ rel->rd_amcache = MemoryContextAlloc(rel->rd_indexcxt,
+ sizeof(BTMetaPageData));
+ memcpy(rel->rd_amcache, metad, sizeof(BTMetaPageData));
+ _bt_relbuf(rel, metabuf);
+ }
+
+ /* Get cached page */
+ metad = (BTMetaPageData *) rel->rd_amcache;
+ /* We shouldn't have cached it if any of these fail */
+ Assert(metad->btm_magic == BTREE_MAGIC);
+ Assert(metad->btm_version >= BTREE_MIN_VERSION);
+ Assert(metad->btm_version <= BTREE_VERSION);
+ Assert(metad->btm_fastroot != P_NONE);
+
+ return metad->btm_dedup_is_possible;
+}
+
/*
* _bt_checkpage() -- Verify that a freshly-read page looks sane.
*/
@@ -983,14 +1049,52 @@ _bt_page_recyclable(Page page)
void
_bt_delitems_vacuum(Relation rel, Buffer buf,
OffsetNumber *itemnos, int nitems,
+ OffsetNumber *updateitemnos,
+ IndexTuple *updated, int nupdatable,
BlockNumber lastBlockVacuumed)
{
Page page = BufferGetPage(buf);
BTPageOpaque opaque;
+ Size itemsz;
+ Size updated_sz = 0;
+ char *updated_buf = NULL;
+
+ /* XLOG stuff, buffer for updated tuples */
+ if (nupdatable > 0 && RelationNeedsWAL(rel))
+ {
+ Size offset = 0;
+
+ for (int i = 0; i < nupdatable; i++)
+ updated_sz += MAXALIGN(IndexTupleSize(updated[i]));
+
+ updated_buf = palloc(updated_sz);
+ for (int i = 0; i < nupdatable; i++)
+ {
+ itemsz = IndexTupleSize(updated[i]);
+ memcpy(updated_buf + offset, (char *) updated[i], itemsz);
+ offset += MAXALIGN(itemsz);
+ }
+ Assert(offset == updated_sz);
+ }
/* No ereport(ERROR) until changes are logged */
START_CRIT_SECTION();
+ /* Handle posting tuples here */
+ for (int i = 0; i < nupdatable; i++)
+ {
+ /* First, delete the old tuple. */
+ PageIndexTupleDelete(page, updateitemnos[i]);
+
+ itemsz = IndexTupleSize(updated[i]);
+ itemsz = MAXALIGN(itemsz);
+
+ /* Add tuple with updated ItemPointers to the page. */
+ if (PageAddItem(page, (Item) updated[i], itemsz, updateitemnos[i],
+ false, false) == InvalidOffsetNumber)
+ elog(ERROR, "failed to rewrite posting list item in index while doing vacuum");
+ }
+
/* Fix the page */
if (nitems > 0)
PageIndexMultiDelete(page, itemnos, nitems);
@@ -1020,6 +1124,8 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
xl_btree_vacuum xlrec_vacuum;
xlrec_vacuum.lastBlockVacuumed = lastBlockVacuumed;
+ xlrec_vacuum.nupdated = nupdatable;
+ xlrec_vacuum.ndeleted = nitems;
XLogBeginInsert();
XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
@@ -1033,6 +1139,19 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
if (nitems > 0)
XLogRegisterBufData(0, (char *) itemnos, nitems * sizeof(OffsetNumber));
+ /*
+ * Save the offset numbers and the updated tuples themselves. It's
+ * important to restore them in the correct order: updated tuples must
+ * be handled first, and only after that the other deleted items.
+ */
+ if (nupdatable > 0)
+ {
+ Assert(updated_buf != NULL);
+ XLogRegisterBufData(0, (char *) updateitemnos,
+ nupdatable * sizeof(OffsetNumber));
+ XLogRegisterBufData(0, updated_buf, updated_sz);
+ }
+
recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_VACUUM);
PageSetLSN(page, recptr);
@@ -1041,6 +1160,91 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
END_CRIT_SECTION();
}
+/*
+ * Get the latestRemovedXid from the table entries pointed at by the index
+ * tuples being deleted.
+ *
+ * This is a version of index_compute_xid_horizon_for_tuples() specialized to
+ * nbtree, which can handle posting lists.
+ */
+static TransactionId
+_bt_compute_xid_horizon_for_tuples(Relation rel, Relation heapRel,
+ Buffer buf, OffsetNumber *itemnos,
+ int nitems)
+{
+ ItemPointer htids;
+ TransactionId latestRemovedXid = InvalidTransactionId;
+ Page page = BufferGetPage(buf);
+ int arraynitems;
+ int finalnitems;
+
+ /*
+ * Initial size of array can fit everything when it turns out that there are no
+ * posting lists
+ */
+ arraynitems = nitems;
+ htids = (ItemPointer) palloc(sizeof(ItemPointerData) * arraynitems);
+
+ finalnitems = 0;
+ /* identify what the index tuples about to be deleted point to */
+ for (int i = 0; i < nitems; i++)
+ {
+ ItemId itemid;
+ IndexTuple itup;
+
+ itemid = PageGetItemId(page, itemnos[i]);
+ itup = (IndexTuple) PageGetItem(page, itemid);
+
+ Assert(ItemIdIsDead(itemid));
+
+ if (!BTreeTupleIsPosting(itup))
+ {
+ /* Make sure that we have space for additional heap TID */
+ if (finalnitems + 1 > arraynitems)
+ {
+ arraynitems = arraynitems * 2;
+ htids = (ItemPointer)
+ repalloc(htids, sizeof(ItemPointerData) * arraynitems);
+ }
+
+ Assert(ItemPointerIsValid(&itup->t_tid));
+ ItemPointerCopy(&itup->t_tid, &htids[finalnitems]);
+ finalnitems++;
+ }
+ else
+ {
+ int nposting = BTreeTupleGetNPosting(itup);
+
+ /* Make sure that we have space for additional heap TIDs */
+ if (finalnitems + nposting > arraynitems)
+ {
+ arraynitems = Max(arraynitems * 2, finalnitems + nposting);
+ htids = (ItemPointer)
+ repalloc(htids, sizeof(ItemPointerData) * arraynitems);
+ }
+
+ for (int j = 0; j < nposting; j++)
+ {
+ ItemPointer htid = BTreeTupleGetPostingN(itup, j);
+
+ Assert(ItemPointerIsValid(htid));
+ ItemPointerCopy(htid, &htids[finalnitems]);
+ finalnitems++;
+ }
+ }
+ }
+
+ Assert(finalnitems >= nitems);
+
+ /* determine the actual xid horizon */
+ latestRemovedXid =
+ table_compute_xid_horizon_for_tuples(heapRel, htids, finalnitems);
+
+ pfree(htids);
+
+ return latestRemovedXid;
+}
+
/*
* Delete item(s) from a btree page during single-page cleanup.
*
@@ -1067,8 +1271,8 @@ _bt_delitems_delete(Relation rel, Buffer buf,
if (XLogStandbyInfoActive() && RelationNeedsWAL(rel))
latestRemovedXid =
- index_compute_xid_horizon_for_tuples(rel, heapRel, buf,
- itemnos, nitems);
+ _bt_compute_xid_horizon_for_tuples(rel, heapRel, buf,
+ itemnos, nitems);
/* No ereport(ERROR) until changes are logged */
START_CRIT_SECTION();
@@ -2066,6 +2270,7 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, bool *rightsib_empty)
xlmeta.fastlevel = metad->btm_fastlevel;
xlmeta.oldest_btpo_xact = metad->btm_oldest_btpo_xact;
xlmeta.last_cleanup_num_heap_tuples = metad->btm_last_cleanup_num_heap_tuples;
+ xlmeta.btm_dedup_is_possible = metad->btm_dedup_is_possible;
XLogRegisterBufData(4, (char *) &xlmeta, sizeof(xl_btree_metadata));
xlinfo = XLOG_BTREE_UNLINK_PAGE_META;
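For reference, the array growth strategy used by _bt_compute_xid_horizon_for_tuples() earlier in this file (start with room for one heap TID per index tuple, then grow to at least double the size when a posting list doesn't fit) can be sketched outside of the backend roughly as follows. This is not code from the patch: SketchTid, TidArray and the tidarray_* functions are made-up names, and malloc/realloc stand in for palloc/repalloc.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for ItemPointerData */
typedef struct SketchTid
{
    unsigned int block;
    unsigned short offset;
} SketchTid;

typedef struct TidArray
{
    SketchTid  *tids;
    int         ntids;          /* entries in use */
    int         capacity;       /* entries allocated */
} TidArray;

static void
tidarray_init(TidArray *a, int initial)
{
    assert(initial > 0);
    a->tids = malloc(sizeof(SketchTid) * (size_t) initial);
    if (a->tids == NULL)
        abort();
    a->ntids = 0;
    a->capacity = initial;
}

/* Append n TIDs, growing to max(2 * capacity, needed) when out of room */
static void
tidarray_append(TidArray *a, const SketchTid *src, int n)
{
    if (a->ntids + n > a->capacity)
    {
        int         newcap = a->capacity * 2;
        SketchTid  *p;

        if (newcap < a->ntids + n)
            newcap = a->ntids + n;
        p = realloc(a->tids, sizeof(SketchTid) * (size_t) newcap);
        if (p == NULL)
            abort();
        a->tids = p;
        a->capacity = newcap;
    }
    memcpy(a->tids + a->ntids, src, sizeof(SketchTid) * (size_t) n);
    a->ntids += n;
}
```

A plain tuple corresponds to tidarray_append() with n = 1, and a posting list tuple to a call with n = BTreeTupleGetNPosting(itup). Doubling keeps the number of reallocations small even when most tuples on the page are posting lists.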
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 4cfd5289ad..d70607e71a 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -97,6 +97,8 @@ static void btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
BTCycleId cycleid, TransactionId *oldestBtpoXact);
static void btvacuumpage(BTVacState *vstate, BlockNumber blkno,
BlockNumber orig_blkno);
+static ItemPointer btreevacuumposting(BTVacState *vstate, IndexTuple itup,
+ int *nremaining);
/*
@@ -157,10 +159,11 @@ void
btbuildempty(Relation index)
{
Page metapage;
+ bool dedup_is_possible = _bt_dedup_is_possible(index);
/* Construct metapage. */
metapage = (Page) palloc(BLCKSZ);
- _bt_initmetapage(metapage, P_NONE, 0);
+ _bt_initmetapage(metapage, P_NONE, 0, dedup_is_possible);
/*
* Write the page and log it. It might seem that an immediate sync would
@@ -263,8 +266,8 @@ btgettuple(IndexScanDesc scan, ScanDirection dir)
*/
if (so->killedItems == NULL)
so->killedItems = (int *)
- palloc(MaxIndexTuplesPerPage * sizeof(int));
- if (so->numKilled < MaxIndexTuplesPerPage)
+ palloc(MaxPostingIndexTuplesPerPage * sizeof(int));
+ if (so->numKilled < MaxPostingIndexTuplesPerPage)
so->killedItems[so->numKilled++] = so->currPos.itemIndex;
}
@@ -816,7 +819,7 @@ _bt_vacuum_needs_cleanup(IndexVacuumInfo *info)
}
else
{
- StdRdOptions *relopts;
+ BtreeOptions *relopts;
float8 cleanup_scale_factor;
float8 prev_num_heap_tuples;
@@ -827,7 +830,7 @@ _bt_vacuum_needs_cleanup(IndexVacuumInfo *info)
* tuples exceeds vacuum_cleanup_index_scale_factor fraction of
* original tuples count.
*/
- relopts = (StdRdOptions *) info->index->rd_options;
+ relopts = (BtreeOptions *) info->index->rd_options;
cleanup_scale_factor = (relopts &&
relopts->vacuum_cleanup_index_scale_factor >= 0)
? relopts->vacuum_cleanup_index_scale_factor
@@ -1069,7 +1072,8 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
RBM_NORMAL, info->strategy);
LockBufferForCleanup(buf);
_bt_checkpage(rel, buf);
- _bt_delitems_vacuum(rel, buf, NULL, 0, vstate.lastBlockVacuumed);
+ _bt_delitems_vacuum(rel, buf, NULL, 0, NULL, NULL, 0,
+ vstate.lastBlockVacuumed);
_bt_relbuf(rel, buf);
}
@@ -1188,8 +1192,17 @@ restart:
}
else if (P_ISLEAF(opaque))
{
+ /* Deletable item state */
OffsetNumber deletable[MaxOffsetNumber];
int ndeletable;
+ int nhtidsdead;
+ int nhtidslive;
+
+ /* Updatable item state (for posting lists) */
+ IndexTuple updated[MaxOffsetNumber];
+ OffsetNumber updatable[MaxOffsetNumber];
+ int nupdatable;
+
OffsetNumber offnum,
minoff,
maxoff;
@@ -1229,6 +1242,10 @@ restart:
* callback function.
*/
ndeletable = 0;
+ nupdatable = 0;
+ /* Maintain stats counters for index tuple versions/heap TIDs */
+ nhtidsdead = 0;
+ nhtidslive = 0;
minoff = P_FIRSTDATAKEY(opaque);
maxoff = PageGetMaxOffsetNumber(page);
if (callback)
@@ -1238,11 +1255,9 @@ restart:
offnum = OffsetNumberNext(offnum))
{
IndexTuple itup;
- ItemPointer htup;
itup = (IndexTuple) PageGetItem(page,
PageGetItemId(page, offnum));
- htup = &(itup->t_tid);
/*
* During Hot Standby we currently assume that
@@ -1265,8 +1280,71 @@ restart:
* applies to *any* type of index that marks index tuples as
* killed.
*/
- if (callback(htup, callback_state))
- deletable[ndeletable++] = offnum;
+ if (!BTreeTupleIsPosting(itup))
+ {
+ /* Regular tuple, standard heap TID representation */
+ ItemPointer htid = &(itup->t_tid);
+
+ if (callback(htid, callback_state))
+ {
+ deletable[ndeletable++] = offnum;
+ nhtidsdead++;
+ }
+ else
+ nhtidslive++;
+ }
+ else
+ {
+ ItemPointer newhtids;
+ int nremaining;
+
+ /*
+ * Posting list tuple, a physical tuple that represents
+ * two or more logical tuples, any of which could be an
+ * index row version that must be removed
+ */
+ newhtids = btreevacuumposting(vstate, itup, &nremaining);
+ if (newhtids == NULL)
+ {
+ /*
+ * All TIDs/logical tuples from the posting tuple
+ * remain, so no update or delete required
+ */
+ Assert(nremaining == BTreeTupleGetNPosting(itup));
+ }
+ else if (nremaining > 0)
+ {
+ IndexTuple updatedtuple;
+
+ /*
+ * Form new tuple that contains only remaining TIDs.
+ * Remember this tuple and the offset of the old tuple
+ * for when we update it in place
+ */
+ Assert(nremaining < BTreeTupleGetNPosting(itup));
+ updatedtuple = BTreeFormPostingTuple(itup, newhtids,
+ nremaining);
+ updated[nupdatable] = updatedtuple;
+ updatable[nupdatable++] = offnum;
+ nhtidsdead += BTreeTupleGetNPosting(itup) - nremaining;
+ pfree(newhtids);
+ }
+ else
+ {
+ /*
+ * All TIDs/logical tuples from the posting list must
+ * be deleted. We'll delete the physical tuple
+ * completely.
+ */
+ deletable[ndeletable++] = offnum;
+ nhtidsdead += BTreeTupleGetNPosting(itup);
+
+ /* Free empty array of live items */
+ pfree(newhtids);
+ }
+
+ nhtidslive += nremaining;
+ }
}
}
@@ -1274,7 +1352,7 @@ restart:
* Apply any needed deletes. We issue just one _bt_delitems_vacuum()
* call per page, so as to minimize WAL traffic.
*/
- if (ndeletable > 0)
+ if (ndeletable > 0 || nupdatable > 0)
{
/*
* Notice that the issued XLOG_BTREE_VACUUM WAL record includes
@@ -1290,7 +1368,8 @@ restart:
* doesn't seem worth the amount of bookkeeping it'd take to avoid
* that.
*/
- _bt_delitems_vacuum(rel, buf, deletable, ndeletable,
+ _bt_delitems_vacuum(rel, buf, deletable, ndeletable, updatable,
+ updated, nupdatable,
vstate->lastBlockVacuumed);
/*
@@ -1300,7 +1379,7 @@ restart:
if (blkno > vstate->lastBlockVacuumed)
vstate->lastBlockVacuumed = blkno;
- stats->tuples_removed += ndeletable;
+ stats->tuples_removed += nhtidsdead;
/* must recompute maxoff */
maxoff = PageGetMaxOffsetNumber(page);
}
@@ -1315,6 +1394,7 @@ restart:
* We treat this like a hint-bit update because there's no need to
* WAL-log it.
*/
+ Assert(nhtidsdead == 0);
if (vstate->cycleid != 0 &&
opaque->btpo_cycleid == vstate->cycleid)
{
@@ -1324,15 +1404,16 @@ restart:
}
/*
- * If it's now empty, try to delete; else count the live tuples. We
- * don't delete when recursing, though, to avoid putting entries into
+ * If it's now empty, try to delete; else count the live tuples (live
+ * heap TIDs in posting lists are counted as live tuples). We don't
+ * delete when recursing, though, to avoid putting entries into
* freePages out-of-order (doesn't seem worth any extra code to handle
* the case).
*/
if (minoff > maxoff)
delete_now = (blkno == orig_blkno);
else
- stats->num_index_tuples += maxoff - minoff + 1;
+ stats->num_index_tuples += nhtidslive;
}
if (delete_now)
@@ -1375,6 +1456,68 @@ restart:
}
}
+/*
+ * btreevacuumposting() -- determines which logical tuples must remain when
+ * VACUUMing a posting list tuple.
+ *
+ * Returns new palloc'd array of item pointers needed to build replacement
+ * posting list without the index row versions that are to be deleted.
+ *
+ * Note that returned array is NULL in the common case where there is nothing
+ * to delete in caller's posting list tuple. The number of TIDs that should
+ * remain in the posting list tuple is set for caller in *nremaining. This is
+ * also the size of the returned array (though only when array isn't just
+ * NULL).
+ */
+static ItemPointer
+btreevacuumposting(BTVacState *vstate, IndexTuple itup, int *nremaining)
+{
+ int live = 0;
+ int nitem = BTreeTupleGetNPosting(itup);
+ ItemPointer tmpitems = NULL,
+ items = BTreeTupleGetPosting(itup);
+
+ Assert(BTreeTupleIsPosting(itup));
+
+ /*
+ * Check each tuple in the posting list. Save live tuples into tmpitems,
+ * though try to avoid memory allocation as an optimization.
+ */
+ for (int i = 0; i < nitem; i++)
+ {
+ if (!vstate->callback(items + i, vstate->callback_state))
+ {
+ /*
+ * Live heap TID.
+ *
+ * Only save live TID when we know that we're going to have to
+ * kill at least one TID, and have already allocated memory.
+ */
+ if (tmpitems)
+ tmpitems[live] = items[i];
+ live++;
+ }
+
+ /* Dead heap TID */
+ else if (tmpitems == NULL)
+ {
+ /*
+ * Turns out we need to delete one or more dead heap TIDs, so
+ * start maintaining an array of live TIDs for caller to
+ * reconstruct smaller replacement posting list tuple
+ */
+ tmpitems = palloc(sizeof(ItemPointerData) * nitem);
+
+ /* Copy live heap TIDs from previous loop iterations */
+ if (live > 0)
+ memcpy(tmpitems, items, sizeof(ItemPointerData) * live);
+ }
+ }
+
+ *nremaining = live;
+ return tmpitems;
+}
+
/*
* btcanreturn() -- Check whether btree indexes support index-only scans.
*
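Reviewer's illustration of the lazy-allocation pattern used by btreevacuumposting() above, written as a minimal standalone sketch rather than the patch's actual code. It assumes a simplified TID struct (DemoTid) and a plain callback type (DemoIsDead); both names are hypothetical stand-ins for ItemPointerData and the VACUUM callback, and malloc() stands in for palloc(). The point it shows: no array is allocated until the first dead TID is found, at which point the live prefix seen so far is copied over in one memcpy().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for ItemPointerData: a (block, offset) heap TID. */
typedef struct DemoTid
{
	unsigned	block;
	unsigned short offset;
} DemoTid;

/* Hypothetical callback type: returns true when the TID is dead. */
typedef bool (*DemoIsDead) (const DemoTid *tid, void *state);

/*
 * Returns NULL when no TID is dead (all nitem TIDs remain).  Otherwise
 * returns a malloc'd array whose first *nremaining entries are the live
 * TIDs, in their original order.  Caller frees the array.
 */
static DemoTid *
demo_vacuum_posting(const DemoTid *items, int nitem,
					DemoIsDead is_dead, void *state, int *nremaining)
{
	DemoTid    *tmpitems = NULL;
	int			live = 0;

	assert(nitem > 0);

	for (int i = 0; i < nitem; i++)
	{
		if (!is_dead(&items[i], state))
		{
			/* Live TID: only copy once an array is known to be needed */
			if (tmpitems)
				tmpitems[live] = items[i];
			live++;
		}
		else if (tmpitems == NULL)
		{
			/* First dead TID: allocate, then back-fill the live prefix */
			tmpitems = malloc(sizeof(DemoTid) * nitem);
			if (tmpitems == NULL)
				abort();
			if (live > 0)
				memcpy(tmpitems, items, sizeof(DemoTid) * live);
		}
	}

	*nremaining = live;
	return tmpitems;
}

/* Example callback: every TID on the block number in *state is dead. */
static bool
demo_block_is_dead(const DemoTid *tid, void *state)
{
	return tid->block == *(const unsigned *) state;
}
```

The memcpy() of the prefix is valid because every entry before the first dead TID is live by construction, so items[0..live-1] is exactly the set of live TIDs seen so far.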
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 8e512461a0..9db73d070d 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -26,10 +26,18 @@
static void _bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp);
static OffsetNumber _bt_binsrch(Relation rel, BTScanInsert key, Buffer buf);
+static int _bt_binsrch_posting(BTScanInsert key, Page page,
+ OffsetNumber offnum);
static bool _bt_readpage(IndexScanDesc scan, ScanDirection dir,
OffsetNumber offnum);
static void _bt_saveitem(BTScanOpaque so, int itemIndex,
OffsetNumber offnum, IndexTuple itup);
+static void _bt_setuppostingitems(BTScanOpaque so, int itemIndex,
+ OffsetNumber offnum, ItemPointer heapTid,
+ IndexTuple itup);
+static inline void _bt_savepostingitem(BTScanOpaque so, int itemIndex,
+ OffsetNumber offnum,
+ ItemPointer heapTid);
static bool _bt_steppage(IndexScanDesc scan, ScanDirection dir);
static bool _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir);
static bool _bt_parallel_readpage(IndexScanDesc scan, BlockNumber blkno,
@@ -434,7 +442,10 @@ _bt_binsrch(Relation rel,
* low) makes bounds invalid.
*
* Caller is responsible for invalidating bounds when it modifies the page
- * before calling here a second time.
+ * before calling here a second time, and for dealing with posting list
+ * tuple matches (callers can use insertstate's postingoff field to
+ * determine which existing heap TID will need to be replaced by their
+ * scantid/new heap TID).
*/
OffsetNumber
_bt_binsrch_insert(Relation rel, BTInsertState insertstate)
@@ -453,6 +464,7 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
Assert(P_ISLEAF(opaque));
Assert(!key->nextkey);
+ Assert(insertstate->postingoff == 0);
if (!insertstate->bounds_valid)
{
@@ -509,6 +521,16 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
if (result != 0)
stricthigh = high;
}
+
+ /*
+ * If tuple at offset located by binary search is a posting list whose
+ * TID range overlaps with caller's scantid, perform posting list
+ * binary search to set postingoff for caller. Caller must split the
+ * posting list when postingoff is set. This should happen
+ * infrequently.
+ */
+ if (unlikely(result == 0 && key->scantid != NULL))
+ insertstate->postingoff = _bt_binsrch_posting(key, page, mid);
}
/*
@@ -528,6 +550,68 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
return low;
}
+/*----------
+ * _bt_binsrch_posting() -- posting list binary search.
+ *
+ * Returns offset into posting list where caller's scantid belongs.
+ *----------
+ */
+static int
+_bt_binsrch_posting(BTScanInsert key, Page page, OffsetNumber offnum)
+{
+ IndexTuple itup;
+ ItemId itemid;
+ int low,
+ high,
+ mid,
+ res;
+
+ /*
+ * If this isn't a posting tuple, then the index must be corrupt (if it is
+ * an ordinary non-pivot tuple then there must be an existing tuple with a
+ * heap TID that equals inserter's new heap TID/scantid). Defensively
+ * check that tuple is a posting list tuple whose posting list range
+ * includes caller's scantid.
+ *
+ * (This is also needed because contrib/amcheck's rootdescend option needs
+ * to be able to relocate a non-pivot tuple using _bt_binsrch_insert().)
+ */
+ Assert(P_ISLEAF((BTPageOpaque) PageGetSpecialPointer(page)));
+ Assert(!key->nextkey);
+ Assert(key->scantid != NULL);
+ itemid = PageGetItemId(page, offnum);
+ itup = (IndexTuple) PageGetItem(page, itemid);
+ if (!BTreeTupleIsPosting(itup))
+ return 0;
+
+ /*
+ * In the unlikely event that posting list tuple has LP_DEAD bit set,
+ * signal to caller that it should kill the item and restart its binary
+ * search.
+ */
+ if (ItemIdIsDead(itemid))
+ return -1;
+
+ /* "high" is past end of posting list for loop invariant */
+ low = 0;
+ high = BTreeTupleGetNPosting(itup);
+ Assert(high >= 2);
+
+ while (high > low)
+ {
+ mid = low + ((high - low) / 2);
+ res = ItemPointerCompare(key->scantid,
+ BTreeTupleGetPostingN(itup, mid));
+
+ if (res >= 1)
+ low = mid + 1;
+ else
+ high = mid;
+ }
+
+ return low;
+}
+
/*----------
* _bt_compare() -- Compare insertion-type scankey to tuple on a page.
*
@@ -537,9 +621,18 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
* <0 if scankey < tuple at offnum;
* 0 if scankey == tuple at offnum;
* >0 if scankey > tuple at offnum.
- * NULLs in the keys are treated as sortable values. Therefore
- * "equality" does not necessarily mean that the item should be
- * returned to the caller as a matching key!
+ *
+ * NULLs in the keys are treated as sortable values. Therefore
+ * "equality" does not necessarily mean that the item should be returned
+ * to the caller as a matching key. Similarly, an insertion scankey
+ * with its scantid set is treated as equal to a posting tuple whose TID
+ * range overlaps with that scantid. There generally won't be a
+ * matching TID in the posting tuple, which the caller must handle
+ * itself (e.g., by splitting the posting list tuple).
+ *
+ * It is generally guaranteed that any possible scankey with scantid set
+ * will have zero or one tuples in the index that are considered equal
+ * here.
*
* CRUCIAL NOTE: on a non-leaf page, the first data key is assumed to be
* "minus infinity": this routine will always claim it is less than the
@@ -563,6 +656,7 @@ _bt_compare(Relation rel,
ScanKey scankey;
int ncmpkey;
int ntupatts;
+ int32 result;
Assert(_bt_check_natts(rel, key->heapkeyspace, page, offnum));
Assert(key->keysz <= IndexRelationGetNumberOfKeyAttributes(rel));
@@ -597,7 +691,6 @@ _bt_compare(Relation rel,
{
Datum datum;
bool isNull;
- int32 result;
datum = index_getattr(itup, scankey->sk_attno, itupdesc, &isNull);
@@ -713,8 +806,25 @@ _bt_compare(Relation rel,
if (heapTid == NULL)
return 1;
+ /*
+ * scankey must be treated as equal to a posting list tuple if its scantid
+ * value falls within the range of the posting list. In all other cases
+ * there can only be a single heap TID value, which is compared directly
+ * as a simple scalar value.
+ */
Assert(ntupatts >= IndexRelationGetNumberOfKeyAttributes(rel));
- return ItemPointerCompare(key->scantid, heapTid);
+ result = ItemPointerCompare(key->scantid, heapTid);
+ if (!BTreeTupleIsPosting(itup) || result <= 0)
+ return result;
+ else
+ {
+ result = ItemPointerCompare(key->scantid,
+ BTreeTupleGetMaxHeapTID(itup));
+ if (result > 0)
+ return 1;
+ }
+
+ return 0;
}
/*
@@ -1233,6 +1343,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
inskey.anynullkeys = false; /* unused */
inskey.nextkey = nextkey;
inskey.pivotsearch = false;
+ inskey.dedup_is_possible = false;
inskey.scantid = NULL;
inskey.keysz = keysCount;
@@ -1451,6 +1562,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
/* initialize tuple workspace to empty */
so->currPos.nextTupleOffset = 0;
+ so->currPos.postingTupleOffset = 0;
/*
* Now that the current page has been made consistent, the macro should be
@@ -1485,8 +1597,29 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
if (_bt_checkkeys(scan, itup, indnatts, dir, &continuescan))
{
/* tuple passes all scan key conditions, so remember it */
- _bt_saveitem(so, itemIndex, offnum, itup);
- itemIndex++;
+ if (!BTreeTupleIsPosting(itup))
+ {
+ _bt_saveitem(so, itemIndex, offnum, itup);
+ itemIndex++;
+ }
+ else
+ {
+ /*
+ * Set up state to return posting list, and save first
+ * "logical" tuple
+ */
+ _bt_setuppostingitems(so, itemIndex, offnum,
+ BTreeTupleGetPostingN(itup, 0),
+ itup);
+ itemIndex++;
+ /* Save additional posting list "logical" tuples */
+ for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
+ {
+ _bt_savepostingitem(so, itemIndex, offnum,
+ BTreeTupleGetPostingN(itup, i));
+ itemIndex++;
+ }
+ }
}
/* When !continuescan, there can't be any more matches, so stop */
if (!continuescan)
@@ -1519,7 +1652,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
if (!continuescan)
so->currPos.moreRight = false;
- Assert(itemIndex <= MaxIndexTuplesPerPage);
+ Assert(itemIndex <= MaxPostingIndexTuplesPerPage);
so->currPos.firstItem = 0;
so->currPos.lastItem = itemIndex - 1;
so->currPos.itemIndex = 0;
@@ -1527,7 +1660,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
else
{
/* load items[] in descending order */
- itemIndex = MaxIndexTuplesPerPage;
+ itemIndex = MaxPostingIndexTuplesPerPage;
offnum = Min(offnum, maxoff);
@@ -1569,8 +1702,36 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
if (passes_quals && tuple_alive)
{
/* tuple passes all scan key conditions, so remember it */
- itemIndex--;
- _bt_saveitem(so, itemIndex, offnum, itup);
+ if (!BTreeTupleIsPosting(itup))
+ {
+ itemIndex--;
+ _bt_saveitem(so, itemIndex, offnum, itup);
+ }
+ else
+ {
+ int i = BTreeTupleGetNPosting(itup) - 1;
+
+ /*
+ * Set up state to return posting list, and save last
+ * "logical" tuple from posting list (since it's the first
+ * that will be returned to scan).
+ */
+ itemIndex--;
+ _bt_setuppostingitems(so, itemIndex, offnum,
+ BTreeTupleGetPostingN(itup, i--),
+ itup);
+
+ /*
+ * Return posting list "logical" tuples -- do this in
+ * descending order, to match overall scan order
+ */
+ for (; i >= 0; i--)
+ {
+ itemIndex--;
+ _bt_savepostingitem(so, itemIndex, offnum,
+ BTreeTupleGetPostingN(itup, i));
+ }
+ }
}
if (!continuescan)
{
@@ -1584,8 +1745,8 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
Assert(itemIndex >= 0);
so->currPos.firstItem = itemIndex;
- so->currPos.lastItem = MaxIndexTuplesPerPage - 1;
- so->currPos.itemIndex = MaxIndexTuplesPerPage - 1;
+ so->currPos.lastItem = MaxPostingIndexTuplesPerPage - 1;
+ so->currPos.itemIndex = MaxPostingIndexTuplesPerPage - 1;
}
return (so->currPos.firstItem <= so->currPos.lastItem);
@@ -1598,6 +1759,8 @@ _bt_saveitem(BTScanOpaque so, int itemIndex,
{
BTScanPosItem *currItem = &so->currPos.items[itemIndex];
+ Assert(!BTreeTupleIsPosting(itup));
+
currItem->heapTid = itup->t_tid;
currItem->indexOffset = offnum;
if (so->currTuples)
@@ -1610,6 +1773,59 @@ _bt_saveitem(BTScanOpaque so, int itemIndex,
}
}
+/*
+ * Set up state to save posting items from a single posting list tuple. Saves
+ * the logical tuple that will be returned to scan first in passing.
+ *
+ * Saves an index item into so->currPos.items[itemIndex] for logical tuple
+ * that is returned to scan first. Second or subsequent heap TID for posting
+ * list should be saved by calling _bt_savepostingitem().
+ */
+static void
+_bt_setuppostingitems(BTScanOpaque so, int itemIndex, OffsetNumber offnum,
+ ItemPointer heapTid, IndexTuple itup)
+{
+ BTScanPosItem *currItem = &so->currPos.items[itemIndex];
+
+ currItem->heapTid = *heapTid;
+ currItem->indexOffset = offnum;
+
+ if (so->currTuples)
+ {
+ /* Save a base version of the IndexTuple */
+ Size itupsz = BTreeTupleGetPostingOffset(itup);
+
+ itupsz = MAXALIGN(itupsz);
+ currItem->tupleOffset = so->currPos.nextTupleOffset;
+ memcpy(so->currTuples + so->currPos.nextTupleOffset, itup, itupsz);
+ so->currPos.nextTupleOffset += itupsz;
+ so->currPos.postingTupleOffset = currItem->tupleOffset;
+ }
+}
+
+/*
+ * Save an index item into so->currPos.items[itemIndex] for posting tuple.
+ *
+ * Assumes that _bt_setuppostingitems() has already been called for current
+ * posting list tuple.
+ */
+static inline void
+_bt_savepostingitem(BTScanOpaque so, int itemIndex, OffsetNumber offnum,
+ ItemPointer heapTid)
+{
+ BTScanPosItem *currItem = &so->currPos.items[itemIndex];
+
+ currItem->heapTid = *heapTid;
+ currItem->indexOffset = offnum;
+
+ /*
+ * Have index-only scans return the same base IndexTuple for every logical
+ * tuple that originates from the same posting list
+ */
+ if (so->currTuples)
+ currItem->tupleOffset = so->currPos.postingTupleOffset;
+}
+
/*
* _bt_steppage() -- Step to next page containing valid data for scan
*
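Reviewer's illustration of how the two search steps in this file fit together, as a minimal standalone sketch and not the patch's code. It assumes a simplified TID type (PostTid) and helper names (post_tid_compare, post_range_compare, post_binsrch) that are all hypothetical. The first helper mirrors the range rule added to _bt_compare(): a scantid is "equal" to a posting list when it falls between the list's lowest and highest TID. The second mirrors _bt_binsrch_posting(): a lower-bound binary search that returns the offset where the scantid belongs.

```c
#include <assert.h>

/* Hypothetical stand-in for ItemPointerData. */
typedef struct PostTid
{
	unsigned	block;
	unsigned short offset;
} PostTid;

/* Three-way comparison in (block, offset) order: returns -1, 0 or 1. */
static int
post_tid_compare(const PostTid *a, const PostTid *b)
{
	if (a->block != b->block)
		return (a->block < b->block) ? -1 : 1;
	if (a->offset != b->offset)
		return (a->offset < b->offset) ? -1 : 1;
	return 0;
}

/*
 * Range comparison against a sorted posting list: <0 when scantid sorts
 * before the lowest TID, >0 when it sorts after the highest TID, and 0
 * when it falls anywhere inside the range (even without an exact match).
 */
static int
post_range_compare(const PostTid *scantid, const PostTid *posting,
				   int nposting)
{
	int			result;

	assert(nposting >= 2);
	result = post_tid_compare(scantid, &posting[0]);
	if (result <= 0)
		return result;
	if (post_tid_compare(scantid, &posting[nposting - 1]) > 0)
		return 1;
	return 0;
}

/*
 * Lower-bound binary search: returns the first offset whose TID is not
 * less than scantid.  "high" starts one past the end of the list.
 */
static int
post_binsrch(const PostTid *scantid, const PostTid *posting, int nposting)
{
	int			low = 0;
	int			high = nposting;

	assert(nposting >= 2);
	while (high > low)
	{
		int			mid = low + (high - low) / 2;

		if (post_tid_compare(scantid, &posting[mid]) > 0)
			low = mid + 1;
		else
			high = mid;
	}
	return low;
}
```

In this sketch an inserter would call post_binsrch() only after post_range_compare() returned 0, which corresponds to the case where the patch requires the caller to split the posting list.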
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index ab19692006..a138fafeb1 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -287,6 +287,9 @@ static void _bt_sortaddtup(Page page, Size itemsize,
IndexTuple itup, OffsetNumber itup_off);
static void _bt_buildadd(BTWriteState *wstate, BTPageState *state,
IndexTuple itup);
+static void _bt_sort_dedup_finish_pending(BTWriteState *wstate,
+ BTPageState *state,
+ BTDedupState *dstate);
static void _bt_uppershutdown(BTWriteState *wstate, BTPageState *state);
static void _bt_load(BTWriteState *wstate,
BTSpool *btspool, BTSpool *btspool2);
@@ -725,8 +728,8 @@ _bt_pagestate(BTWriteState *wstate, uint32 level)
if (level > 0)
state->btps_full = (BLCKSZ * (100 - BTREE_NONLEAF_FILLFACTOR) / 100);
else
- state->btps_full = RelationGetTargetPageFreeSpace(wstate->index,
- BTREE_DEFAULT_FILLFACTOR);
+ state->btps_full = BtreeGetTargetPageFreeSpace(wstate->index,
+ BTREE_DEFAULT_FILLFACTOR);
/* no parent level, yet */
state->btps_next = NULL;
@@ -799,7 +802,8 @@ _bt_sortaddtup(Page page,
}
/*----------
- * Add an item to a disk page from the sort output.
+ * Add an item to a disk page from the sort output (or add a posting list
+ * item formed from the sort output).
*
* We must be careful to observe the page layout conventions of nbtsearch.c:
* - rightmost pages start data items at P_HIKEY instead of at P_FIRSTKEY.
@@ -1002,6 +1006,7 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
* the minimum key for the new page.
*/
state->btps_minkey = CopyIndexTuple(oitup);
+ Assert(BTreeTupleIsPivot(state->btps_minkey));
/*
* Set the sibling links for both pages.
@@ -1043,6 +1048,7 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
Assert(state->btps_minkey == NULL);
state->btps_minkey = CopyIndexTuple(itup);
/* _bt_sortaddtup() will perform full truncation later */
+ BTreeTupleClearBtIsPosting(state->btps_minkey);
BTreeTupleSetNAtts(state->btps_minkey, 0);
}
@@ -1057,6 +1063,42 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
state->btps_lastoff = last_off;
}
+/*
+ * Finalize pending posting list tuple, and add it to the index. Final tuple
+ * is based on saved base tuple, and saved list of heap TIDs.
+ *
+ * This is almost like nbtinsert.c's _bt_dedup_finish_pending(), but it adds a
+ * new tuple using _bt_buildadd() and does not maintain the intervals array.
+ */
+static void
+_bt_sort_dedup_finish_pending(BTWriteState *wstate, BTPageState *state,
+ BTDedupState *dstate)
+{
+ IndexTuple final;
+
+ Assert(dstate->nitems > 0);
+ if (dstate->nitems == 1)
+ final = dstate->base;
+ else
+ {
+ IndexTuple postingtuple;
+
+ /* form a tuple with a posting list */
+ postingtuple = BTreeFormPostingTuple(dstate->base,
+ dstate->htids,
+ dstate->nhtids);
+ final = postingtuple;
+ }
+
+ _bt_buildadd(wstate, state, final);
+
+ if (dstate->nitems > 1)
+ pfree(final);
+ /* Don't maintain dedup_intervals array, or alltupsize */
+ dstate->nhtids = 0;
+ dstate->nitems = 0;
+}
+
/*
* Finish writing out the completed btree.
*/
@@ -1123,7 +1165,8 @@ _bt_uppershutdown(BTWriteState *wstate, BTPageState *state)
* by filling in a valid magic number in the metapage.
*/
metapage = (Page) palloc(BLCKSZ);
- _bt_initmetapage(metapage, rootblkno, rootlevel);
+
+ _bt_initmetapage(metapage, rootblkno, rootlevel, wstate->inskey->dedup_is_possible);
_bt_blwritepage(wstate, metapage, BTREE_METAPAGE);
}
@@ -1144,6 +1187,10 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
keysz = IndexRelationGetNumberOfKeyAttributes(wstate->index);
SortSupport sortKeys;
int64 tuples_done = 0;
+ bool deduplicate;
+
+ deduplicate = wstate->inskey->dedup_is_possible &&
+ BtreeGetDoDedupOption(wstate->index);
if (merge)
{
@@ -1255,9 +1302,96 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
}
pfree(sortKeys);
}
+ else if (deduplicate)
+ {
+ /* merge is unnecessary, deduplicate into posting lists */
+ BTDedupState *dstate;
+ IndexTuple newbase;
+
+ dstate = (BTDedupState *) palloc(sizeof(BTDedupState));
+ dstate->maxitemsize = 0; /* set later */
+ dstate->checkingunique = false; /* unused */
+ dstate->newitem = NULL;
+ /* Metadata about current pending posting list */
+ dstate->htids = NULL;
+ dstate->nhtids = 0;
+ dstate->nitems = 0;
+ dstate->overlap = false;
+ dstate->alltupsize = 0; /* unused */
+ /* Metadata about base tuple of current pending posting list */
+ dstate->base = NULL;
+ dstate->baseoff = InvalidOffsetNumber; /* unused */
+ dstate->basetupsize = 0;
+
+ while ((itup = tuplesort_getindextuple(btspool->sortstate,
+ true)) != NULL)
+ {
+ /* When we see first tuple, create first index page */
+ if (state == NULL)
+ {
+ state = _bt_pagestate(wstate, 0);
+ dstate->maxitemsize = BTMaxItemSize(state->btps_page);
+ /* Conservatively size array */
+ dstate->htids = palloc(dstate->maxitemsize);
+
+ /*
+ * No previous/base tuple, since itup is the first item
+ * returned by the tuplesort -- use itup as base tuple of
+ * first pending posting list for entire index build
+ */
+ newbase = CopyIndexTuple(itup);
+ _bt_dedup_start_pending(dstate, newbase, InvalidOffsetNumber);
+ }
+ else if (_bt_keep_natts_fast(wstate->index, dstate->base,
+ itup) > keysz &&
+ _bt_dedup_save_htid(dstate, itup))
+ {
+ /*
+ * Tuple is equal to base tuple of pending posting list, and
+ * merging itup into pending posting list won't exceed the
+ * BTMaxItemSize() limit. Heap TID(s) for itup have been
+ * saved in state. The next iteration will also end up here
+ * if it's possible to merge the next tuple into the same
+ * pending posting list.
+ */
+ }
+ else
+ {
+ /*
+ * Tuple is not equal to pending posting list tuple, or
+ * BTMaxItemSize() limit was reached
+ */
+ _bt_sort_dedup_finish_pending(wstate, state, dstate);
+ /* Base tuple is always a copy */
+ pfree(dstate->base);
+
+ /* itup starts new pending posting list */
+ newbase = CopyIndexTuple(itup);
+ _bt_dedup_start_pending(dstate, newbase, InvalidOffsetNumber);
+ }
+
+ /* Report progress */
+ pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
+ ++tuples_done);
+ }
+
+ /*
+ * Handle the last item (there must be a last item when the tuplesort
+ * returned one or more tuples)
+ */
+ if (state)
+ {
+ _bt_sort_dedup_finish_pending(wstate, state, dstate);
+ /* Base tuple is always a copy */
+ pfree(dstate->base);
+ pfree(dstate->htids);
+ }
+
+ pfree(dstate);
+ }
else
{
- /* merge is unnecessary */
+ /* merging and deduplication are both unnecessary */
while ((itup = tuplesort_getindextuple(btspool->sortstate,
true)) != NULL)
{
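Reviewer's illustration of the deduplicating load loop added to _bt_load() above, as a minimal standalone sketch and not the patch's code. It assumes integer keys, a fixed cap of DEMO_MAX_HTIDS TIDs per posting list (a hypothetical stand-in for the BTMaxItemSize() limit), and hypothetical type names (DemoSorted, DemoPending). The shape is the same as in the patch: the first tuple starts a pending list, an equal key is merged while there is room, and anything else flushes the pending list and starts a new one, with a final flush after the input is exhausted.

```c
#include <assert.h>

/* Hypothetical cap on TIDs per posting list (stands in for BTMaxItemSize). */
#define DEMO_MAX_HTIDS 4

/* One sorted input tuple: a key plus a single heap TID (simplified). */
typedef struct DemoSorted
{
	int			key;
	unsigned	tid;
} DemoSorted;

/* A pending or finished posting list. */
typedef struct DemoPending
{
	int			key;
	unsigned	htids[DEMO_MAX_HTIDS];
	int			nhtids;
} DemoPending;

/*
 * Collapse sorted input into posting lists.  "out" must have room for nin
 * entries (the worst case, when no two keys are equal).  Returns the number
 * of entries written to "out".
 */
static int
demo_dedup_load(const DemoSorted *in, int nin, DemoPending *out)
{
	DemoPending pending = {0};
	int			nout = 0;

	for (int i = 0; i < nin; i++)
	{
		if (i > 0 && in[i].key == pending.key &&
			pending.nhtids < DEMO_MAX_HTIDS)
		{
			/* Equal to the base tuple and still fits: merge the TID */
			pending.htids[pending.nhtids++] = in[i].tid;
			continue;
		}

		/* Key changed or the list is full: flush whatever is pending */
		if (i > 0)
			out[nout++] = pending;

		/* Current tuple becomes the base of a new pending list */
		pending.key = in[i].key;
		pending.htids[0] = in[i].tid;
		pending.nhtids = 1;
	}

	/* Handle the last pending list, if there was any input at all */
	if (nin > 0)
		out[nout++] = pending;

	return nout;
}
```

Note that a full list is flushed even when the next key is equal, so a heavily duplicated key ends up as several posting lists in a row.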
diff --git a/src/backend/access/nbtree/nbtsplitloc.c b/src/backend/access/nbtree/nbtsplitloc.c
index 1c1029b6c4..df976d4b7a 100644
--- a/src/backend/access/nbtree/nbtsplitloc.c
+++ b/src/backend/access/nbtree/nbtsplitloc.c
@@ -167,7 +167,7 @@ _bt_findsplitloc(Relation rel,
/* Count up total space in data items before actually scanning 'em */
olddataitemstotal = rightspace - (int) PageGetExactFreeSpace(page);
- leaffillfactor = RelationGetFillFactor(rel, BTREE_DEFAULT_FILLFACTOR);
+ leaffillfactor = BtreeGetFillFactor(rel, BTREE_DEFAULT_FILLFACTOR);
/* Passed-in newitemsz is MAXALIGNED but does not include line pointer */
newitemsz += sizeof(ItemIdData);
@@ -183,6 +183,9 @@ _bt_findsplitloc(Relation rel,
state.minfirstrightsz = SIZE_MAX;
state.newitemoff = newitemoff;
+ /* newitem cannot be a posting list item */
+ Assert(!BTreeTupleIsPosting(newitem));
+
/*
* maxsplits should never exceed maxoff because there will be at most as
* many candidate split points as there are points _between_ tuples, once
@@ -459,17 +462,52 @@ _bt_recsplitloc(FindSplitData *state,
int16 leftfree,
rightfree;
Size firstrightitemsz;
+ Size postingsubhikey = 0;
bool newitemisfirstonright;
/* Is the new item going to be the first item on the right page? */
newitemisfirstonright = (firstoldonright == state->newitemoff
&& !newitemonleft);
+ /*
+ * FIXME: Accessing every single tuple like this adds cycles to cases that
+ * cannot possibly benefit (i.e. cases where we know that there cannot be
+ * posting lists). Maybe we should add a way to not bother when we are
+ * certain that this is the case.
+ *
+ * We could either have _bt_split() pass us a flag, or invent a page flag
+ * that indicates that the page might have posting lists, as an
+ * optimization. There is no shortage of btpo_flags bits for stuff like
+ * this.
+ */
if (newitemisfirstonright)
+ {
firstrightitemsz = state->newitemsz;
+
+ /* Calculate posting list overhead, if any */
+ if (state->is_leaf && BTreeTupleIsPosting(state->newitem))
+ postingsubhikey = IndexTupleSize(state->newitem) -
+ BTreeTupleGetPostingOffset(state->newitem);
+ }
else
+ {
firstrightitemsz = firstoldonrightsz;
+ /* Calculate posting list overhead, if any */
+ if (state->is_leaf)
+ {
+ ItemId itemid;
+ IndexTuple newhighkey;
+
+ itemid = PageGetItemId(state->page, firstoldonright);
+ newhighkey = (IndexTuple) PageGetItem(state->page, itemid);
+
+ if (BTreeTupleIsPosting(newhighkey))
+ postingsubhikey = IndexTupleSize(newhighkey) -
+ BTreeTupleGetPostingOffset(newhighkey);
+ }
+ }
+
/* Account for all the old tuples */
leftfree = state->leftspace - olddataitemstoleft;
rightfree = state->rightspace -
@@ -492,9 +530,13 @@ _bt_recsplitloc(FindSplitData *state,
* adding a heap TID to the left half's new high key when splitting at the
* leaf level. In practice the new high key will often be smaller and
* will rarely be larger, but conservatively assume the worst case.
+ * Truncation always truncates away any posting list that appears in the
+ * first right tuple, though, so it's safe to subtract that overhead
+ * (while still conservatively assuming that truncation might have to add
+ * back a single heap TID using the pivot tuple heap TID representation).
*/
if (state->is_leaf)
- leftfree -= (int16) (firstrightitemsz +
+ leftfree -= (int16) ((firstrightitemsz - postingsubhikey) +
MAXALIGN(sizeof(ItemPointerData)));
else
leftfree -= (int16) firstrightitemsz;
@@ -691,7 +733,8 @@ _bt_afternewitemoff(FindSplitData *state, OffsetNumber maxoff,
itemid = PageGetItemId(state->page, OffsetNumberPrev(state->newitemoff));
tup = (IndexTuple) PageGetItem(state->page, itemid);
/* Do cheaper test first */
- if (!_bt_adjacenthtid(&tup->t_tid, &state->newitem->t_tid))
+ if (BTreeTupleIsPosting(tup) ||
+ !_bt_adjacenthtid(&tup->t_tid, &state->newitem->t_tid))
return false;
/* Check same conditions as rightmost item case, too */
keepnatts = _bt_keep_natts_fast(state->rel, tup, state->newitem);
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index bc855dd25d..6fdd776ea5 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -97,8 +97,6 @@ _bt_mkscankey(Relation rel, IndexTuple itup)
indoption = rel->rd_indoption;
tupnatts = itup ? BTreeTupleGetNAtts(itup, rel) : 0;
- Assert(tupnatts <= IndexRelationGetNumberOfAttributes(rel));
-
/*
* We'll execute search using scan key constructed on key columns.
* Truncated attributes and non-key attributes are omitted from the final
@@ -110,9 +108,23 @@ _bt_mkscankey(Relation rel, IndexTuple itup)
key->anynullkeys = false; /* initial assumption */
key->nextkey = false;
key->pivotsearch = false;
+ key->scantid = NULL;
key->keysz = Min(indnkeyatts, tupnatts);
- key->scantid = key->heapkeyspace && itup ?
- BTreeTupleGetHeapTID(itup) : NULL;
+ /* get information from relation info or from btree metapage */
+ key->dedup_is_possible = (itup == NULL) ? _bt_dedup_is_possible(rel) :
+ _bt_getdedupispossible(rel);
+
+ Assert(tupnatts <= IndexRelationGetNumberOfAttributes(rel));
+ Assert(!itup || !BTreeTupleIsPosting(itup) || key->heapkeyspace);
+
+ /*
+ * When caller passes a tuple with a heap TID, use it to set scantid. Note
+ * that this handles posting list tuples by setting scantid to the lowest
+ * heap TID in the posting list.
+ */
+ if (itup && key->heapkeyspace)
+ key->scantid = BTreeTupleGetHeapTID(itup);
+
skey = key->scankeys;
for (i = 0; i < indnkeyatts; i++)
{
@@ -1386,6 +1398,7 @@ _bt_checkkeys(IndexScanDesc scan, IndexTuple tuple, int tupnatts,
* attribute passes the qual.
*/
Assert(ScanDirectionIsForward(dir));
+ Assert(BTreeTupleIsPivot(tuple));
continue;
}
@@ -1547,6 +1560,7 @@ _bt_check_rowcompare(ScanKey skey, IndexTuple tuple, int tupnatts,
* attribute passes the qual.
*/
Assert(ScanDirectionIsForward(dir));
+ Assert(BTreeTupleIsPivot(tuple));
cmpresult = 0;
if (subkey->sk_flags & SK_ROW_END)
break;
@@ -1786,10 +1800,35 @@ _bt_killitems(IndexScanDesc scan)
{
ItemId iid = PageGetItemId(page, offnum);
IndexTuple ituple = (IndexTuple) PageGetItem(page, iid);
+ bool killtuple = false;
- if (ItemPointerEquals(&ituple->t_tid, &kitem->heapTid))
+ if (BTreeTupleIsPosting(ituple))
{
- /* found the item */
+ int pi = i + 1;
+ int nposting = BTreeTupleGetNPosting(ituple);
+ int j;
+
+ for (j = 0; j < nposting; j++)
+ {
+ ItemPointer item = BTreeTupleGetPostingN(ituple, j);
+
+ if (!ItemPointerEquals(item, &kitem->heapTid))
+ break; /* out of posting list loop */
+
+ /* Read-ahead to later kitems */
+ if (pi < numKilled)
+ kitem = &so->currPos.items[so->killedItems[pi++]];
+ }
+
+ if (j == nposting)
+ killtuple = true;
+ }
+ else if (ItemPointerEquals(&ituple->t_tid, &kitem->heapTid))
+ killtuple = true;
+
+ if (killtuple)
+ {
+ /* found the item/all posting list items */
ItemIdMarkDead(iid);
killedsomething = true;
break; /* out of inner search loop */
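
The posting-list branch above marks an index tuple dead only when every heap TID in its posting list matches a consecutive run of killed items. Below is a minimal standalone sketch of that rule. The `Tid` struct and the `posting_all_killed()` helper are hypothetical stand-ins for `ItemPointerData` and the scan state, not part of the patch; the bounds check plays the role of the patch's `pi < numKilled` read-ahead guard.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for ItemPointerData: a (block, offset) pair. */
typedef struct Tid
{
    unsigned       block;
    unsigned short offset;
} Tid;

bool
tid_equals(const Tid *a, const Tid *b)
{
    return a->block == b->block && a->offset == b->offset;
}

/*
 * Return true when every TID in posting[0..nposting) equals the killed
 * TIDs starting at killed[start], in the same order.  A single mismatch,
 * or running out of killed items, means the tuple must stay alive.
 */
bool
posting_all_killed(const Tid *posting, size_t nposting,
                   const Tid *killed, size_t nkilled, size_t start)
{
    for (size_t j = 0; j < nposting; j++)
    {
        if (start + j >= nkilled)
            return false;       /* ran out of killed items */
        if (!tid_equals(&posting[j], &killed[start + j]))
            return false;       /* this TID is still live */
    }
    return true;
}
```

A posting list holding (1,1) and (1,2) is killable only if both TIDs appear, in that order, in the killed array; a partial match leaves the line pointer untouched.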
@@ -2027,7 +2066,30 @@ BTreeShmemInit(void)
bytea *
btoptions(Datum reloptions, bool validate)
{
- return default_reloptions(reloptions, validate, RELOPT_KIND_BTREE);
+ relopt_value *options;
+ BtreeOptions *rdopts;
+ int numoptions;
+ static const relopt_parse_elt tab[] = {
+ {"fillfactor", RELOPT_TYPE_INT, offsetof(BtreeOptions, fillfactor)},
+ {"vacuum_cleanup_index_scale_factor", RELOPT_TYPE_REAL,
+ offsetof(BtreeOptions, vacuum_cleanup_index_scale_factor)},
+ {"deduplication", RELOPT_TYPE_BOOL, offsetof(BtreeOptions, do_deduplication)}
+ };
+
+ options = parseRelOptions(reloptions, validate, RELOPT_KIND_BTREE,
+ &numoptions);
+
+ /* if none set, we're done */
+ if (numoptions == 0)
+ return NULL;
+
+ rdopts = allocateReloptStruct(sizeof(BtreeOptions), options, numoptions);
+
+ fillRelOptions((void *) rdopts, sizeof(BtreeOptions), options, numoptions,
+ validate, tab, lengthof(tab));
+
+ pfree(options);
+ return (bytea *) rdopts;
}
/*
@@ -2140,6 +2202,24 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
pivot = index_truncate_tuple(itupdesc, firstright, keepnatts);
+ if (BTreeTupleIsPosting(firstright))
+ {
+ BTreeTupleClearBtIsPosting(pivot);
+ BTreeTupleSetNAtts(pivot, keepnatts);
+ if (keepnatts == natts)
+ {
+ /*
+ * index_truncate_tuple() just returned a copy of the
+ * original, so make sure that the size of the new pivot tuple
+ * doesn't have posting list overhead
+ */
+ pivot->t_info &= ~INDEX_SIZE_MASK;
+ pivot->t_info |= MAXALIGN(BTreeTupleGetPostingOffset(firstright));
+ }
+ }
+
+ Assert(!BTreeTupleIsPosting(pivot));
+
/*
* If there is a distinguishing key attribute within new pivot tuple,
* there is no need to add an explicit heap TID attribute
@@ -2156,6 +2236,8 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
* attribute to the new pivot tuple.
*/
Assert(natts != nkeyatts);
+ Assert(!BTreeTupleIsPosting(lastleft) &&
+ !BTreeTupleIsPosting(firstright));
newsize = IndexTupleSize(pivot) + MAXALIGN(sizeof(ItemPointerData));
tidpivot = palloc0(newsize);
memcpy(tidpivot, pivot, IndexTupleSize(pivot));
@@ -2163,6 +2245,24 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
pfree(pivot);
pivot = tidpivot;
}
+ else if (BTreeTupleIsPosting(firstright))
+ {
+ /*
+ * No truncation was possible, since key attributes are all equal. We
+ * can always truncate away a posting list, though.
+ *
+ * It's necessary to add a heap TID attribute to the new pivot tuple.
+ */
+ newsize = MAXALIGN(BTreeTupleGetPostingOffset(firstright)) +
+ MAXALIGN(sizeof(ItemPointerData));
+ pivot = palloc0(newsize);
+ memcpy(pivot, firstright, BTreeTupleGetPostingOffset(firstright));
+
+ pivot->t_info &= ~INDEX_SIZE_MASK;
+ pivot->t_info |= newsize;
+ BTreeTupleClearBtIsPosting(pivot);
+ BTreeTupleSetAltHeapTID(pivot);
+ }
else
{
/*
@@ -2170,7 +2270,8 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
* It's necessary to add a heap TID attribute to the new pivot tuple.
*/
Assert(natts == nkeyatts);
- newsize = IndexTupleSize(firstright) + MAXALIGN(sizeof(ItemPointerData));
+ newsize = MAXALIGN(IndexTupleSize(firstright)) +
+ MAXALIGN(sizeof(ItemPointerData));
pivot = palloc0(newsize);
memcpy(pivot, firstright, IndexTupleSize(firstright));
}
@@ -2188,6 +2289,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
* nbtree (e.g., there is no pg_attribute entry).
*/
Assert(itup_key->heapkeyspace);
+ Assert(!BTreeTupleIsPosting(pivot));
pivot->t_info &= ~INDEX_SIZE_MASK;
pivot->t_info |= newsize;
@@ -2200,7 +2302,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
*/
pivotheaptid = (ItemPointer) ((char *) pivot + newsize -
sizeof(ItemPointerData));
- ItemPointerCopy(&lastleft->t_tid, pivotheaptid);
+ ItemPointerCopy(BTreeTupleGetMaxHeapTID(lastleft), pivotheaptid);
/*
* Lehman and Yao require that the downlink to the right page, which is to
@@ -2211,9 +2313,12 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
* tiebreaker.
*/
#ifndef DEBUG_NO_TRUNCATE
- Assert(ItemPointerCompare(&lastleft->t_tid, &firstright->t_tid) < 0);
- Assert(ItemPointerCompare(pivotheaptid, &lastleft->t_tid) >= 0);
- Assert(ItemPointerCompare(pivotheaptid, &firstright->t_tid) < 0);
+ Assert(ItemPointerCompare(BTreeTupleGetMaxHeapTID(lastleft),
+ BTreeTupleGetHeapTID(firstright)) < 0);
+ Assert(ItemPointerCompare(pivotheaptid,
+ BTreeTupleGetHeapTID(lastleft)) >= 0);
+ Assert(ItemPointerCompare(pivotheaptid,
+ BTreeTupleGetHeapTID(firstright)) < 0);
#else
/*
@@ -2226,7 +2331,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
* attribute values along with lastleft's heap TID value when lastleft's
* TID happens to be greater than firstright's TID.
*/
- ItemPointerCopy(&firstright->t_tid, pivotheaptid);
+ ItemPointerCopy(BTreeTupleGetHeapTID(firstright), pivotheaptid);
/*
* Pivot heap TID should never be fully equal to firstright. Note that
@@ -2235,7 +2340,8 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
*/
ItemPointerSetOffsetNumber(pivotheaptid,
OffsetNumberPrev(ItemPointerGetOffsetNumber(pivotheaptid)));
- Assert(ItemPointerCompare(pivotheaptid, &firstright->t_tid) < 0);
+ Assert(ItemPointerCompare(pivotheaptid,
+ BTreeTupleGetHeapTID(firstright)) < 0);
#endif
BTreeTupleSetNAtts(pivot, nkeyatts);
@@ -2316,15 +2422,25 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
* The approach taken here usually provides the same answer as _bt_keep_natts
* will (for the same pair of tuples from a heapkeyspace index), since the
* majority of btree opclasses can never indicate that two datums are equal
- * unless they're bitwise equal (once detoasted). Similarly, result may
- * differ from the _bt_keep_natts result when either tuple has TOASTed datums,
- * though this is barely possible in practice.
+ * unless they're bitwise equal after detoasting.
*
* These issues must be acceptable to callers, typically because they're only
* concerned about making suffix truncation as effective as possible without
* leaving excessive amounts of free space on either side of page split.
* Callers can rely on the fact that attributes considered equal here are
* definitely also equal according to _bt_keep_natts.
+ *
+ * When an index only uses opclasses where equality is "precise", this
+ * function is guaranteed to give the same result as _bt_keep_natts(). This
+ * makes it safe to use this function to determine whether or not two tuples
+ * can be folded together into a single posting tuple. Posting list
+ * deduplication cannot be used with nondeterministic collations for this
+ * reason.
+ *
+ * FIXME: Actually invent the needed "equality-is-precise" opclass
+ * infrastructure. See dedicated -hackers thread:
+ *
+ * https://postgr.es/m/CAH2-Wzn3Ee49Gmxb7V1VJ3-AC8fWn-Fr8pfWQebHe8rYRxt5OQ@mail.gmail.com
*/
int
_bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
@@ -2349,8 +2465,38 @@ _bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
if (isNull1 != isNull2)
break;
+ /*
+ * XXX: The ideal outcome from the point of view of the posting list
+ * patch is that the definition of an opclass with "precise equality"
+ * becomes: "equality operator function must give exactly the same
+ * answer as datum_image_eq() would, provided that we aren't using a
+ * nondeterministic collation". (Nondeterministic collations are
+ * clearly not compatible with deduplication.)
+ *
+ * This will be a lot faster than actually using the authoritative
+ * insertion scankey in some cases. This approach also seems more
+ * elegant, since suffix truncation gets to follow exactly the same
+ * definition of "equal" as posting list deduplication -- there is a
+ * subtle interplay between deduplication and suffix truncation, and
+ * it would be nice to know for sure that they have exactly the same
+ * idea about what equality is.
+ *
+ * This ideal outcome still avoids problems with TOAST. We cannot
+ * repeat bugs like the amcheck bug that was fixed in bugfix commit
+ * eba775345d23d2c999bbb412ae658b6dab36e3e8. datum_image_eq()
+ * considers binary equality, though only _after_ each datum is
+ * decompressed.
+ *
+ * If this ideal solution isn't possible, then we can fall back on
+ * defining "precise equality" as: "type's output function must
+ * produce identical textual output for any two datums that compare
+ * equal when using a safe/equality-is-precise operator class (unless
+ * using a nondeterministic collation)". That would mean that we'd
+ * have to make deduplication call _bt_keep_natts() instead (or some
+ * other function that uses authoritative insertion scankey).
+ */
if (!isNull1 &&
- !datumIsEqual(datum1, datum2, att->attbyval, att->attlen))
+ !datum_image_eq(datum1, datum2, att->attbyval, att->attlen))
break;
keepnatts++;
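
The comment above turns on the difference between "equal according to the opclass operator" and "equal as stored bytes". A small standalone illustration follows, using plain C doubles rather than PostgreSQL datums. The helper names `op_equal()` and `image_equal()` are made up for this sketch, and `image_equal()` is only loosely analogous to `datum_image_eq()`. It assumes IEEE 754 doubles.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Equality as a comparison operator would see it. */
bool
op_equal(double a, double b)
{
    return a == b;
}

/* Equality of the stored bytes (a rough analogue of image equality). */
bool
image_equal(double a, double b)
{
    return memcmp(&a, &b, sizeof(double)) == 0;
}
```

With IEEE 754 doubles, `op_equal(0.0, -0.0)` is true while `image_equal(0.0, -0.0)` is false, because the two zeros differ in the sign bit. Folding such values into one posting tuple would silently drop one of the two representations, which is the kind of hazard the "equality is precise" requirement is meant to rule out.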
@@ -2402,22 +2548,30 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
tupnatts = BTreeTupleGetNAtts(itup, rel);
+ /* !heapkeyspace indexes do not support deduplication */
+ if (!heapkeyspace && BTreeTupleIsPosting(itup))
+ return false;
+
+ /* INCLUDE indexes do not support deduplication */
+ if (natts != nkeyatts && BTreeTupleIsPosting(itup))
+ return false;
+
if (P_ISLEAF(opaque))
{
if (offnum >= P_FIRSTDATAKEY(opaque))
{
/*
- * Non-pivot tuples currently never use alternative heap TID
- * representation -- even those within heapkeyspace indexes
+ * Non-pivot tuple should never be explicitly marked as a pivot
+ * tuple
*/
- if ((itup->t_info & INDEX_ALT_TID_MASK) != 0)
+ if (BTreeTupleIsPivot(itup))
return false;
/*
* Leaf tuples that are not the page high key (non-pivot tuples)
* should never be truncated. (Note that tupnatts must have been
- * inferred, rather than coming from an explicit on-disk
- * representation.)
+ * inferred, even with a posting list tuple, because only pivot
+ * tuples store tupnatts directly.)
*/
return tupnatts == natts;
}
@@ -2461,12 +2615,12 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
* non-zero, or when there is no explicit representation and the
* tuple is evidently not a pre-pg_upgrade tuple.
*
- * Prior to v11, downlinks always had P_HIKEY as their offset. Use
- * that to decide if the tuple is a pre-v11 tuple.
+ * Prior to v11, downlinks always had P_HIKEY as their offset.
+ * Accept that as an alternative indication of a valid
+ * !heapkeyspace negative infinity tuple.
*/
return tupnatts == 0 ||
- ((itup->t_info & INDEX_ALT_TID_MASK) == 0 &&
- ItemPointerGetOffsetNumber(&(itup->t_tid)) == P_HIKEY);
+ ItemPointerGetOffsetNumber(&(itup->t_tid)) == P_HIKEY;
}
else
{
@@ -2492,7 +2646,11 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
* heapkeyspace index pivot tuples, regardless of whether or not there are
* non-key attributes.
*/
- if ((itup->t_info & INDEX_ALT_TID_MASK) == 0)
+ if (!BTreeTupleIsPivot(itup))
+ return false;
+
+ /* Pivot tuple should not use posting list representation (redundant) */
+ if (BTreeTupleIsPosting(itup))
return false;
/*
@@ -2562,11 +2720,119 @@ _bt_check_third_page(Relation rel, Relation heap, bool needheaptidspace,
BTMaxItemSizeNoHeapTid(page),
RelationGetRelationName(rel)),
errdetail("Index row references tuple (%u,%u) in relation \"%s\".",
- ItemPointerGetBlockNumber(&newtup->t_tid),
- ItemPointerGetOffsetNumber(&newtup->t_tid),
+ ItemPointerGetBlockNumber(BTreeTupleGetHeapTID(newtup)),
+ ItemPointerGetOffsetNumber(BTreeTupleGetHeapTID(newtup)),
RelationGetRelationName(heap)),
errhint("Values larger than 1/3 of a buffer page cannot be indexed.\n"
"Consider a function index of an MD5 hash of the value, "
"or use full text indexing."),
errtableconstraint(heap, RelationGetRelationName(rel))));
}
+
+/*
+ * Given a basic tuple that contains key datum and posting list, build a
+ * posting tuple. Caller's "htids" array must be sorted in ascending order.
+ *
+ * The basic tuple can itself be a posting tuple, but only its key part is
+ * used; all ItemPointers must be passed via htids.
+ *
+ * If nhtids == 1, just build a non-posting tuple. This avoids the storage
+ * overhead of a posting list after a posting tuple has been vacuumed.
+ */
+IndexTuple
+BTreeFormPostingTuple(IndexTuple tuple, ItemPointer htids, int nhtids)
+{
+ uint32 keysize,
+ newsize = 0;
+ IndexTuple itup;
+
+ /* We only need key part of the tuple */
+ if (BTreeTupleIsPosting(tuple))
+ keysize = BTreeTupleGetPostingOffset(tuple);
+ else
+ keysize = IndexTupleSize(tuple);
+
+ Assert(nhtids > 0);
+
+ /* Add space needed for posting list */
+ if (nhtids > 1)
+ newsize = SHORTALIGN(keysize) + sizeof(ItemPointerData) * nhtids;
+ else
+ newsize = keysize;
+
+ newsize = MAXALIGN(newsize);
+ itup = palloc0(newsize);
+ memcpy(itup, tuple, keysize);
+ itup->t_info &= ~INDEX_SIZE_MASK;
+ itup->t_info |= newsize;
+
+ if (nhtids > 1)
+ {
+ /* Form posting tuple, fill posting fields */
+
+ itup->t_info |= INDEX_ALT_TID_MASK;
+ BTreeSetPostingMeta(itup, nhtids, SHORTALIGN(keysize));
+ /* Copy posting list into the posting tuple */
+ memcpy(BTreeTupleGetPosting(itup), htids,
+ sizeof(ItemPointerData) * nhtids);
+
+#ifdef USE_ASSERT_CHECKING
+ {
+ /* Assert that htid array is sorted and has unique TIDs */
+ ItemPointerData last;
+ ItemPointer current;
+
+ ItemPointerCopy(BTreeTupleGetHeapTID(itup), &last);
+
+ for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
+ {
+ current = BTreeTupleGetPostingN(itup, i);
+ Assert(ItemPointerCompare(current, &last) > 0);
+ ItemPointerCopy(current, &last);
+ }
+ }
+#endif
+ }
+ else
+ {
+ /* To finish building of a non-posting tuple, copy TID from htids */
+ itup->t_info &= ~INDEX_ALT_TID_MASK;
+ ItemPointerCopy(htids, &itup->t_tid);
+ }
+
+ return itup;
+}
+
+/*
+ * Note: This does not account for pg_upgrade'd !heapkeyspace indexes
+ */
+bool
+_bt_dedup_is_possible(Relation index)
+{
+ bool dedup_is_possible = false;
+
+ if (IndexRelationGetNumberOfAttributes(index) ==
+ IndexRelationGetNumberOfKeyAttributes(index))
+ {
+ int i;
+
+ dedup_is_possible = true;
+
+ for (i = 0; i < IndexRelationGetNumberOfKeyAttributes(index); i++)
+ {
+ Oid opfamily = index->rd_opfamily[i];
+ Oid collation = index->rd_indcollation[i];
+
+ /* TODO add adequate check of opclasses and collations */
+ elog(DEBUG4, "index %s column i %d opfamilyOid %u collationOid %u",
+ RelationGetRelationName(index), i, opfamily, collation);
+ /* NUMERIC BTREE OPFAMILY OID is 1988 */
+ if (opfamily == 1988)
+ {
+ return false;
+ }
+ }
+ }
+
+ return dedup_is_possible;
+}
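
The size arithmetic in BTreeFormPostingTuple() above can be checked in isolation. The sketch below repeats only the size computation, under stated assumptions: 8-byte MAXALIGN, 2-byte SHORTALIGN and a 6-byte heap TID, which are typical values but depend on the platform. The `SKETCH_*` macros and `posting_tuple_size()` are names invented for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed constants: 8-byte MAXALIGN, 2-byte SHORTALIGN, 6-byte heap TID. */
#define SKETCH_MAXALIGN(len)   (((size_t) (len) + 7) & ~(size_t) 7)
#define SKETCH_SHORTALIGN(len) (((size_t) (len) + 1) & ~(size_t) 1)
#define SKETCH_TID_SIZE        6

/*
 * Size of the tuple built for a key part of 'keysize' bytes and 'nhtids'
 * heap TIDs, following the same branches as the patch: a posting list is
 * appended only when there are at least two TIDs.
 */
size_t
posting_tuple_size(size_t keysize, int nhtids)
{
    size_t newsize;

    assert(nhtids > 0);

    if (nhtids > 1)
        newsize = SKETCH_SHORTALIGN(keysize) +
                  SKETCH_TID_SIZE * (size_t) nhtids;
    else
        newsize = keysize;

    return SKETCH_MAXALIGN(newsize);
}
```

Under these assumptions, three duplicates with a 16-byte key part cost 3 * (16 + 4) = 60 bytes as separate tuples (counting a 4-byte line pointer each), but 40 + 4 = 44 bytes as one posting tuple.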
diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index dd5315c1aa..747ab4235c 100644
--- a/src/backend/access/nbtree/nbtxlog.c
+++ b/src/backend/access/nbtree/nbtxlog.c
@@ -21,8 +21,11 @@
#include "access/xlog.h"
#include "access/xlogutils.h"
#include "storage/procarray.h"
+#include "utils/memutils.h"
#include "miscadmin.h"
+static MemoryContext opCtx; /* working memory for operations */
+
/*
* _bt_restore_page -- re-enter all the index tuples on a page
*
@@ -111,6 +114,7 @@ _bt_restore_meta(XLogReaderState *record, uint8 block_id)
Assert(md->btm_version >= BTREE_NOVAC_VERSION);
md->btm_oldest_btpo_xact = xlrec->oldest_btpo_xact;
md->btm_last_cleanup_num_heap_tuples = xlrec->last_cleanup_num_heap_tuples;
+ md->btm_dedup_is_possible = xlrec->btm_dedup_is_possible;
pageop = (BTPageOpaque) PageGetSpecialPointer(metapg);
pageop->btpo_flags = BTP_META;
@@ -181,9 +185,46 @@ btree_xlog_insert(bool isleaf, bool ismeta, XLogReaderState *record)
page = BufferGetPage(buffer);
- if (PageAddItem(page, (Item) datapos, datalen, xlrec->offnum,
- false, false) == InvalidOffsetNumber)
- elog(PANIC, "btree_xlog_insert: failed to add item");
+ if (xlrec->postingoff == InvalidOffsetNumber)
+ {
+ /* Simple retail insertion */
+ if (PageAddItem(page, (Item) datapos, datalen, xlrec->offnum,
+ false, false) == InvalidOffsetNumber)
+ elog(PANIC, "btree_xlog_insert: failed to add item");
+ }
+ else
+ {
+ ItemId itemid;
+ IndexTuple oposting,
+ newitem,
+ nposting;
+
+ /*
+ * A posting list split occurred during insertion.
+ *
+ * Use _bt_posting_split() to repeat posting list split steps from
+ * primary. Note that newitem from WAL record is 'orignewitem',
+ * not the final version of newitem that is actually inserted on
+ * page.
+ */
+ Assert(isleaf);
+ itemid = PageGetItemId(page, OffsetNumberPrev(xlrec->offnum));
+ oposting = (IndexTuple) PageGetItem(page, itemid);
+
+ /* newitem must be mutable copy for _bt_posting_split() */
+ newitem = CopyIndexTuple((IndexTuple) datapos);
+ nposting = _bt_posting_split(newitem, oposting,
+ xlrec->postingoff);
+
+ /* Replace existing posting list with post-split version */
+ memcpy(oposting, nposting, MAXALIGN(IndexTupleSize(nposting)));
+
+ /* insert new item */
+ Assert(IndexTupleSize(newitem) == datalen);
+ if (PageAddItem(page, (Item) newitem, datalen, xlrec->offnum,
+ false, false) == InvalidOffsetNumber)
+ elog(PANIC, "btree_xlog_insert: failed to add posting split new item");
+ }
PageSetLSN(page, lsn);
MarkBufferDirty(buffer);
@@ -265,20 +306,42 @@ btree_xlog_split(bool onleft, XLogReaderState *record)
BTPageOpaque lopaque = (BTPageOpaque) PageGetSpecialPointer(lpage);
OffsetNumber off;
IndexTuple newitem = NULL,
- left_hikey = NULL;
+ left_hikey = NULL,
+ nposting = NULL;
Size newitemsz = 0,
left_hikeysz = 0;
Page newlpage;
- OffsetNumber leftoff;
+ OffsetNumber leftoff,
+ replacepostingoff = InvalidOffsetNumber;
datapos = XLogRecGetBlockData(record, 0, &datalen);
- if (onleft)
+ if (onleft || xlrec->postingoff != 0)
{
newitem = (IndexTuple) datapos;
newitemsz = MAXALIGN(IndexTupleSize(newitem));
datapos += newitemsz;
datalen -= newitemsz;
+
+ if (xlrec->postingoff != 0)
+ {
+ /*
+ * Use _bt_posting_split() to repeat posting list split steps
+ * from primary
+ */
+ ItemId itemid;
+ IndexTuple oposting;
+
+ /* Posting list must be at offset number before new item's */
+ replacepostingoff = OffsetNumberPrev(xlrec->newitemoff);
+
+ /* newitem must be mutable copy for _bt_posting_split() */
+ newitem = CopyIndexTuple(newitem);
+ itemid = PageGetItemId(lpage, replacepostingoff);
+ oposting = (IndexTuple) PageGetItem(lpage, itemid);
+ nposting = _bt_posting_split(newitem, oposting,
+ xlrec->postingoff);
+ }
}
/* Extract left hikey and its size (assuming 16-bit alignment) */
@@ -304,8 +367,20 @@ btree_xlog_split(bool onleft, XLogReaderState *record)
Size itemsz;
IndexTuple item;
+ /* Add replacement posting list when required */
+ if (off == replacepostingoff)
+ {
+ Assert(onleft || xlrec->firstright == xlrec->newitemoff);
+ if (PageAddItem(newlpage, (Item) nposting,
+ MAXALIGN(IndexTupleSize(nposting)), leftoff,
+ false, false) == InvalidOffsetNumber)
+ elog(ERROR, "failed to add new posting list item to left page after split");
+ leftoff = OffsetNumberNext(leftoff);
+ continue;
+ }
+
/* add the new item if it was inserted on left page */
- if (onleft && off == xlrec->newitemoff)
+ else if (onleft && off == xlrec->newitemoff)
{
if (PageAddItem(newlpage, (Item) newitem, newitemsz, leftoff,
false, false) == InvalidOffsetNumber)
@@ -379,6 +454,83 @@ btree_xlog_split(bool onleft, XLogReaderState *record)
}
}
+static void
+btree_xlog_dedup(XLogReaderState *record)
+{
+ XLogRecPtr lsn = record->EndRecPtr;
+ Buffer buf;
+ xl_btree_dedup *xlrec = (xl_btree_dedup *) XLogRecGetData(record);
+
+ if (XLogReadBufferForRedo(record, 0, &buf) == BLK_NEEDS_REDO)
+ {
+ /*
+ * Initialize a temporary empty page and copy all the items to that in
+ * item number order.
+ */
+ Page page = (Page) BufferGetPage(buf);
+ OffsetNumber offnum;
+ BTDedupState *state;
+
+ state = (BTDedupState *) palloc(sizeof(BTDedupState));
+
+ state->maxitemsize = BTMaxItemSize(page);
+ state->checkingunique = false; /* unused */
+ state->newitem = NULL;
+ /* Metadata about current pending posting list */
+ state->htids = NULL;
+ state->nhtids = 0;
+ state->nitems = 0;
+ state->alltupsize = 0;
+ state->overlap = false;
+ /* Metadata about base tuple of current pending posting list */
+ state->base = NULL;
+ state->baseoff = InvalidOffsetNumber;
+ state->basetupsize = 0;
+
+ /* Conservatively size array */
+ state->htids = palloc(state->maxitemsize);
+
+ /*
+ * Iterate over tuples on the page belonging to the interval to
+ * deduplicate them into a posting list.
+ */
+ for (offnum = xlrec->baseoff;
+ offnum < xlrec->baseoff + xlrec->nitems;
+ offnum = OffsetNumberNext(offnum))
+ {
+ ItemId itemid = PageGetItemId(page, offnum);
+ IndexTuple itup = (IndexTuple) PageGetItem(page, itemid);
+
+ Assert(!ItemIdIsDead(itemid));
+
+ if (offnum == xlrec->baseoff)
+ {
+ /*
+ * No previous/base tuple for first data item -- use first
+ * data item as base tuple of first pending posting list
+ */
+ _bt_dedup_start_pending(state, itup, offnum);
+ }
+ else
+ {
+ /* Heap TID(s) for itup will be saved in state */
+ if (!_bt_dedup_save_htid(state, itup))
+ elog(ERROR, "could not add heap tid to pending posting list");
+ }
+ }
+
+ Assert(state->nitems == xlrec->nitems);
+ /* Handle the last item */
+ _bt_dedup_finish_pending(buf, state, false);
+
+ PageSetLSN(page, lsn);
+ MarkBufferDirty(buf);
+ }
+
+ if (BufferIsValid(buf))
+ UnlockReleaseBuffer(buf);
+}
+
static void
btree_xlog_vacuum(XLogReaderState *record)
{
@@ -386,8 +538,8 @@ btree_xlog_vacuum(XLogReaderState *record)
Buffer buffer;
Page page;
BTPageOpaque opaque;
-#ifdef UNUSED
xl_btree_vacuum *xlrec = (xl_btree_vacuum *) XLogRecGetData(record);
+#ifdef UNUSED
/*
* This section of code is thought to be no longer needed, after analysis
@@ -478,14 +630,34 @@ btree_xlog_vacuum(XLogReaderState *record)
if (len > 0)
{
- OffsetNumber *unused;
- OffsetNumber *unend;
+ if (xlrec->nupdated > 0)
+ {
+ OffsetNumber *updatedoffsets;
+ IndexTuple updated;
+ Size itemsz;
- unused = (OffsetNumber *) ptr;
- unend = (OffsetNumber *) ((char *) ptr + len);
+ updatedoffsets = (OffsetNumber *)
+ (ptr + xlrec->ndeleted * sizeof(OffsetNumber));
+ updated = (IndexTuple) ((char *) updatedoffsets +
+ xlrec->nupdated * sizeof(OffsetNumber));
- if ((unend - unused) > 0)
- PageIndexMultiDelete(page, unused, unend - unused);
+ /* Handle posting tuples */
+ for (int i = 0; i < xlrec->nupdated; i++)
+ {
+ PageIndexTupleDelete(page, updatedoffsets[i]);
+
+ itemsz = MAXALIGN(IndexTupleSize(updated));
+
+ if (PageAddItem(page, (Item) updated, itemsz, updatedoffsets[i],
+ false, false) == InvalidOffsetNumber)
+ elog(PANIC, "btree_xlog_vacuum: failed to add updated posting list item");
+
+ updated = (IndexTuple) ((char *) updated + itemsz);
+ }
+ }
+
+ if (xlrec->ndeleted)
+ PageIndexMultiDelete(page, (OffsetNumber *) ptr, xlrec->ndeleted);
}
/*
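
The vacuum redo hunk above reads the block data as three consecutive regions: the deleted offsets, then the updated offsets, then the replacement posting tuples. The sketch below computes where each region starts. `SketchOffsetNumber`, `VacuumPayloadLayout` and `vacuum_payload_layout()` are hypothetical names for this illustration, and the 2-byte offset number is an assumption matching PostgreSQL's usual `OffsetNumber`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint16_t SketchOffsetNumber;    /* stand-in for OffsetNumber */

/* Byte offsets of each region, relative to the start of the block data. */
typedef struct VacuumPayloadLayout
{
    size_t deleted_start;           /* ndeleted offset numbers */
    size_t updated_offsets_start;   /* nupdated offset numbers */
    size_t updated_tuples_start;    /* replacement tuples, back to back */
} VacuumPayloadLayout;

VacuumPayloadLayout
vacuum_payload_layout(unsigned ndeleted, unsigned nupdated)
{
    VacuumPayloadLayout layout;

    layout.deleted_start = 0;
    layout.updated_offsets_start =
        (size_t) ndeleted * sizeof(SketchOffsetNumber);
    layout.updated_tuples_start = layout.updated_offsets_start +
        (size_t) nupdated * sizeof(SketchOffsetNumber);

    return layout;
}
```

For example, with 3 deleted and 2 updated items, the updated offsets begin at byte 6 and the first replacement tuple at byte 10.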
@@ -820,7 +992,9 @@ void
btree_redo(XLogReaderState *record)
{
uint8 info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+ MemoryContext oldCtx;
+ oldCtx = MemoryContextSwitchTo(opCtx);
switch (info)
{
case XLOG_BTREE_INSERT_LEAF:
@@ -838,6 +1012,9 @@ btree_redo(XLogReaderState *record)
case XLOG_BTREE_SPLIT_R:
btree_xlog_split(false, record);
break;
+ case XLOG_BTREE_DEDUP_PAGE:
+ btree_xlog_dedup(record);
+ break;
case XLOG_BTREE_VACUUM:
btree_xlog_vacuum(record);
break;
@@ -863,6 +1040,23 @@ btree_redo(XLogReaderState *record)
default:
elog(PANIC, "btree_redo: unknown op code %u", info);
}
+ MemoryContextSwitchTo(oldCtx);
+ MemoryContextReset(opCtx);
+}
+
+void
+btree_xlog_startup(void)
+{
+ opCtx = AllocSetContextCreate(CurrentMemoryContext,
+ "Btree recovery temporary context",
+ ALLOCSET_DEFAULT_SIZES);
+}
+
+void
+btree_xlog_cleanup(void)
+{
+ MemoryContextDelete(opCtx);
+ opCtx = NULL;
}
/*
diff --git a/src/backend/access/rmgrdesc/nbtdesc.c b/src/backend/access/rmgrdesc/nbtdesc.c
index 4ee6d04a68..1dde2da285 100644
--- a/src/backend/access/rmgrdesc/nbtdesc.c
+++ b/src/backend/access/rmgrdesc/nbtdesc.c
@@ -30,7 +30,8 @@ btree_desc(StringInfo buf, XLogReaderState *record)
{
xl_btree_insert *xlrec = (xl_btree_insert *) rec;
- appendStringInfo(buf, "off %u", xlrec->offnum);
+ appendStringInfo(buf, "off %u; postingoff %u",
+ xlrec->offnum, xlrec->postingoff);
break;
}
case XLOG_BTREE_SPLIT_L:
@@ -38,16 +39,30 @@ btree_desc(StringInfo buf, XLogReaderState *record)
{
xl_btree_split *xlrec = (xl_btree_split *) rec;
- appendStringInfo(buf, "level %u, firstright %d, newitemoff %d",
- xlrec->level, xlrec->firstright, xlrec->newitemoff);
+ appendStringInfo(buf, "level %u, firstright %d, newitemoff %d, postingoff %d",
+ xlrec->level,
+ xlrec->firstright,
+ xlrec->newitemoff,
+ xlrec->postingoff);
+ break;
+ }
+ case XLOG_BTREE_DEDUP_PAGE:
+ {
+ xl_btree_dedup *xlrec = (xl_btree_dedup *) rec;
+
+ appendStringInfo(buf, "baseoff %u; nitems %u",
+ xlrec->baseoff,
+ xlrec->nitems);
break;
}
case XLOG_BTREE_VACUUM:
{
xl_btree_vacuum *xlrec = (xl_btree_vacuum *) rec;
- appendStringInfo(buf, "lastBlockVacuumed %u",
- xlrec->lastBlockVacuumed);
+ appendStringInfo(buf, "lastBlockVacuumed %u; nupdated %u; ndeleted %u",
+ xlrec->lastBlockVacuumed,
+ xlrec->nupdated,
+ xlrec->ndeleted);
break;
}
case XLOG_BTREE_DELETE:
@@ -131,6 +146,9 @@ btree_identify(uint8 info)
case XLOG_BTREE_SPLIT_R:
id = "SPLIT_R";
break;
+ case XLOG_BTREE_DEDUP_PAGE:
+ id = "DEDUPLICATE";
+ break;
case XLOG_BTREE_VACUUM:
id = "VACUUM";
break;
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 4a80e84aa7..593f74c26e 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -107,11 +107,43 @@ typedef struct BTMetaPageData
* pages */
float8 btm_last_cleanup_num_heap_tuples; /* number of heap tuples
* during last cleanup */
+ bool btm_dedup_is_possible; /* whether the deduplication can be
+ * applied to the index */
} BTMetaPageData;
#define BTPageGetMeta(p) \
((BTMetaPageData *) PageGetContents(p))
+/* Storage type for Btree's reloptions */
+typedef struct BtreeOptions
+{
+ int32 vl_len_; /* varlena header (do not touch directly!) */
+ int fillfactor;
+ double vacuum_cleanup_index_scale_factor;
+ bool do_deduplication;
+} BtreeOptions;
+
+/*
+ * By default deduplication is enabled for non unique indexes
+ * and disabled for unique ones
+ *
+ * XXX: Actually, we use deduplication everywhere for now. Re-review this
+ * decision later on.
+ */
+#define BtreeDefaultDoDedup(relation) \
+ (relation->rd_index->indisunique ? true : true)
+
+#define BtreeGetDoDedupOption(relation) \
+ ((relation)->rd_options ? \
+ ((BtreeOptions *) (relation)->rd_options)->do_deduplication : BtreeDefaultDoDedup(relation))
+
+#define BtreeGetFillFactor(relation, defaultff) \
+ ((relation)->rd_options ? \
+ ((BtreeOptions *) (relation)->rd_options)->fillfactor : (defaultff))
+
+#define BtreeGetTargetPageFreeSpace(relation, defaultff) \
+ (BLCKSZ * (100 - BtreeGetFillFactor(relation, defaultff)) / 100)
+
/*
* The current Btree version is 4. That's what you'll get when you create
* a new index.
@@ -234,8 +266,7 @@ typedef struct BTMetaPageData
* t_tid | t_info | key values | INCLUDE columns, if any
*
* t_tid points to the heap TID, which is a tiebreaker key column as of
- * BTREE_VERSION 4. Currently, the INDEX_ALT_TID_MASK status bit is never
- * set for non-pivot tuples.
+ * BTREE_VERSION 4.
*
* All other types of index tuples ("pivot" tuples) only have key columns,
* since pivot tuples only exist to represent how the key space is
@@ -252,6 +283,38 @@ typedef struct BTMetaPageData
* omitted rather than truncated, since its representation is different to
* the non-pivot representation.)
*
+ * Non-pivot posting tuple format:
+ * t_tid | t_info | key values | INCLUDE columns, if any | posting_list[]
+ *
+ * To store duplicate keys more compactly, we use a special tuple format:
+ * posting tuples. posting_list is an array of ItemPointerData.
+ *
+ * Deduplication never applies to unique indexes or indexes with INCLUDEd
+ * columns.
+ *
+ * To distinguish posting tuples, we use the INDEX_ALT_TID_MASK flag in
+ * t_info and the BT_IS_POSTING flag in t_tid.
+ * These flags redefine the content of the posting tuple's t_tid:
+ * - t_tid.ip_blkid contains the offset of the posting list.
+ * - the t_tid offset field contains the number of posting items in the tuple.
+ *
+ * The 12 least significant offset bits from t_tid are used to represent
+ * the number of posting items in posting tuples, leaving 4 status
+ * bits (BT_RESERVED_OFFSET_MASK bits), 3 of which are reserved for
+ * future use.
+ * BT_N_POSTING_OFFSET_MASK is large enough to store any number of posting
+ * items, which is constrained by BTMaxItemSize.
+ *
+ * If a page contains so many duplicates that they do not fit into one posting
+ * tuple (bounded by BTMaxItemSize and ), the page may contain several posting
+ * tuples with the same key.
+ * A page can also contain both posting and non-posting tuples with the same key.
+ * Currently, posting tuples always contain at least two TIDs in the posting
+ * list.
+ *
+ * Posting tuples always have the same number of attributes as the index has
+ * generally.
+ *
* Pivot tuple format:
*
* t_tid | t_info | key values | [heap TID]
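
The posting tuple description above packs two things into the 16-bit offset field of t_tid: a status bit and a 12-bit item count. The sketch below shows one way that packing could look, using the mask values this patch defines in nbtree.h. The helper functions are hypothetical; the patch's own BTreeSetPostingMeta() is not shown in this part of the diff and may differ in detail.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Mask values as defined by the patch in nbtree.h. */
#define BT_N_POSTING_OFFSET_MASK 0x0FFF
#define BT_IS_POSTING            0x2000

/* Pack a posting item count together with the BT_IS_POSTING status bit. */
uint16_t
encode_posting_offset(unsigned nposting)
{
    assert(nposting <= BT_N_POSTING_OFFSET_MASK);
    return (uint16_t) (nposting | BT_IS_POSTING);
}

/* Does this offset field carry the posting status bit? */
bool
offset_is_posting(uint16_t off)
{
    return (off & BT_IS_POSTING) != 0;
}

/* Extract the posting item count from the low 12 bits. */
unsigned
decode_nposting(uint16_t off)
{
    return off & BT_N_POSTING_OFFSET_MASK;
}
```

Twelve bits allow up to 4095 items. Since a single tuple is limited to roughly a third of a page, a default 8 kB page with 6-byte TIDs stays well under that count, which is consistent with the claim that the mask is large enough.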
@@ -281,23 +344,152 @@ typedef struct BTMetaPageData
* bits (BT_RESERVED_OFFSET_MASK bits), 3 of which that are reserved for
* future use. BT_N_KEYS_OFFSET_MASK should be large enough to store any
* number of columns/attributes <= INDEX_MAX_KEYS.
+ * BT_IS_POSTING bit must be unset for pivot tuples, since we use it
+ * to distinguish posting tuples from pivot tuples.
*
* Note well: The macros that deal with the number of attributes in tuples
- * assume that a tuple with INDEX_ALT_TID_MASK set must be a pivot tuple,
- * and that a tuple without INDEX_ALT_TID_MASK set must be a non-pivot
- * tuple (or must have the same number of attributes as the index has
- * generally in the case of !heapkeyspace indexes). They will need to be
- * updated if non-pivot tuples ever get taught to use INDEX_ALT_TID_MASK
- * for something else.
+ * assume that a tuple with INDEX_ALT_TID_MASK set must be a pivot tuple or
+ * non-pivot posting tuple, and that a tuple without INDEX_ALT_TID_MASK set
+ * must be a non-pivot tuple (or must have the same number of attributes as
+ * the index has generally in the case of !heapkeyspace indexes).
*/
#define INDEX_ALT_TID_MASK INDEX_AM_RESERVED_BIT
/* Item pointer offset bits */
#define BT_RESERVED_OFFSET_MASK 0xF000
#define BT_N_KEYS_OFFSET_MASK 0x0FFF
+#define BT_N_POSTING_OFFSET_MASK 0x0FFF
#define BT_HEAP_TID_ATTR 0x1000
+#define BT_IS_POSTING 0x2000
-/* Get/set downlink block number */
+/*
+ * MaxPostingIndexTuplesPerPage is an upper bound on the number of tuples
+ * that can fit on one btree leaf page.
+ *
+ * Btree leaf pages may contain posting tuples, which store duplicates
+ * more compactly, so MaxPostingIndexTuplesPerPage is larger than
+ * MaxIndexTuplesPerPage.
+ *
+ * Each leaf page must contain at least three items, so estimate it as
+ * if we have three posting tuples with minimal size keys.
+ */
+#define MaxPostingIndexTuplesPerPage \
+ ((int) ((BLCKSZ - SizeOfPageHeaderData - \
+ 3*((MAXALIGN(sizeof(IndexTupleData) + 1) + sizeof(ItemIdData))) )) / \
+ (sizeof(ItemPointerData)))
+
+/*
+ * State used to represent a pending posting list during deduplication.
+ *
+ * Each entry represents a group of consecutive items from the page, starting
+ * from page offset number 'baseoff', which is the offset number of the "base"
+ * tuple on the page undergoing deduplication. 'nitems' is the total number
+ * of items from the page that will be merged to make a new posting tuple.
+ *
+ * Note: 'nitems' means the number of physical index tuples/line pointers on
+ * the page, starting with and including the item at offset number 'baseoff'
+ * (so nitems should be at least 2 when interval is used). These existing
+ * tuples may be posting list tuples or regular tuples.
+ */
+typedef struct BTDedupInterval
+{
+ OffsetNumber baseoff;
+ OffsetNumber nitems;
+} BTDedupInterval;
+
+/*
+ * Btree-private state needed to build posting tuples. htids is an array of
+ * ItemPointers for pending posting list.
+ *
+ * Iterating over tuples during index build or applying deduplication to a
+ * single page, we remember a "base" tuple, then compare the next one with it.
+ * If tuples are equal, save their TIDs in the posting list.
+ */
+typedef struct BTDedupState
+{
+ Relation rel;
+ /* Deduplication status info for entire page/operation */
+ Size maxitemsize; /* BTMaxItemSize() limit for page */
+ IndexTuple newitem;
+ bool checkingunique;
+
+ /* Metadata about current pending posting list */
+ ItemPointer htids; /* Heap TIDs in pending posting list */
+ int nhtids; /* # valid heap TIDs in htids array */
+ int nitems; /* See BTDedupInterval definition */
+ Size alltupsize; /* Includes line pointer overhead */
+ bool overlap; /* Avoid overlapping posting lists? */
+
+ /* Metadata about base tuple of current pending posting list */
+ IndexTuple base; /* Use to form new posting list */
+ OffsetNumber baseoff; /* page offset of base */
+ Size basetupsize; /* base size without posting list */
+
+ /*
+ * Pending posting list. Contains information about a group of
+ * consecutive items that will be deduplicated by creating a new posting
+ * list tuple.
+ */
+ BTDedupInterval interval;
+} BTDedupState;
+
+/*
+ * N.B.: BTreeTupleIsPivot() should only be used in code that deals with
+ * heapkeyspace indexes specifically. BTreeTupleIsPosting() works with all
+ * nbtree indexes, though.
+ */
+#define BTreeTupleIsPivot(itup) \
+ ( \
+ ((itup)->t_info & INDEX_ALT_TID_MASK && \
+ ((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) == 0))\
+ )
+#define BTreeTupleIsPosting(itup) \
+ ( \
+ ((itup)->t_info & INDEX_ALT_TID_MASK && \
+ ((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) != 0))\
+ )
+
+#define BTreeTupleClearBtIsPosting(itup) \
+ do { \
+ ItemPointerSetOffsetNumber(&(itup)->t_tid, \
+ ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & ~BT_IS_POSTING); \
+ } while(0)
+
+#define BTreeTupleGetNPosting(itup) \
+ ( \
+ AssertMacro(BTreeTupleIsPosting(itup)), \
+ ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_N_POSTING_OFFSET_MASK \
+ )
+#define BTreeTupleSetNPosting(itup, n) \
+ do { \
+ ItemPointerSetOffsetNumber(&(itup)->t_tid, (n) & BT_N_POSTING_OFFSET_MASK); \
+ Assert((itup)->t_info & INDEX_ALT_TID_MASK); \
+ Assert(!((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) != 0)); \
+ ItemPointerSetOffsetNumber(&(itup)->t_tid, \
+ ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) | BT_IS_POSTING); \
+ } while(0)
+
+/*
+ * If tuple is posting, t_tid.ip_blkid contains offset of the posting list
+ */
+#define BTreeTupleGetPostingOffset(itup) \
+ ( \
+ AssertMacro(BTreeTupleIsPosting(itup)), \
+ ItemPointerGetBlockNumberNoCheck(&((itup)->t_tid)) \
+ )
+#define BTreeSetPostingMeta(itup, nposting, off) \
+ do { \
+ BTreeTupleSetNPosting(itup, nposting); \
+ Assert(BTreeTupleIsPosting(itup)); \
+ ItemPointerSetBlockNumber(&((itup)->t_tid), (off)); \
+ } while(0)
+
+#define BTreeTupleGetPosting(itup) \
+ (ItemPointer) ((char*) (itup) + BTreeTupleGetPostingOffset(itup))
+#define BTreeTupleGetPostingN(itup,n) \
+ (BTreeTupleGetPosting(itup) + (n))
+
+/* Get/set downlink block number */
#define BTreeInnerTupleGetDownLink(itup) \
ItemPointerGetBlockNumberNoCheck(&((itup)->t_tid))
#define BTreeInnerTupleSetDownLink(itup, blkno) \
@@ -326,40 +518,73 @@ typedef struct BTMetaPageData
*/
#define BTreeTupleGetNAtts(itup, rel) \
( \
- (itup)->t_info & INDEX_ALT_TID_MASK ? \
+ ((itup)->t_info & INDEX_ALT_TID_MASK && \
+ ((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) == 0)) ? \
( \
ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_N_KEYS_OFFSET_MASK \
) \
: \
IndexRelationGetNumberOfAttributes(rel) \
)
-#define BTreeTupleSetNAtts(itup, n) \
- do { \
- (itup)->t_info |= INDEX_ALT_TID_MASK; \
- ItemPointerSetOffsetNumber(&(itup)->t_tid, (n) & BT_N_KEYS_OFFSET_MASK); \
- } while(0)
+
+static inline void
+BTreeTupleSetNAtts(IndexTuple itup, int n)
+{
+ Assert(!BTreeTupleIsPosting(itup));
+ itup->t_info |= INDEX_ALT_TID_MASK;
+ ItemPointerSetOffsetNumber(&itup->t_tid, n & BT_N_KEYS_OFFSET_MASK);
+}
/*
- * Get tiebreaker heap TID attribute, if any. Macro works with both pivot
- * and non-pivot tuples, despite differences in how heap TID is represented.
+ * Get tiebreaker heap TID attribute, if any. Works with both pivot and
+ * non-pivot tuples, despite differences in how heap TID is represented.
+ *
+ * This returns the first/lowest heap TID in the case of a posting list tuple.
*/
-#define BTreeTupleGetHeapTID(itup) \
- ( \
- (itup)->t_info & INDEX_ALT_TID_MASK && \
- (ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_HEAP_TID_ATTR) != 0 ? \
- ( \
- (ItemPointer) (((char *) (itup) + IndexTupleSize(itup)) - \
- sizeof(ItemPointerData)) \
- ) \
- : (itup)->t_info & INDEX_ALT_TID_MASK ? NULL : (ItemPointer) &((itup)->t_tid) \
- )
+static inline ItemPointer
+BTreeTupleGetHeapTID(IndexTuple itup)
+{
+ if (BTreeTupleIsPivot(itup))
+ {
+ /* Pivot tuple heap TID representation? */
+ if ((ItemPointerGetOffsetNumberNoCheck(&itup->t_tid) &
+ BT_HEAP_TID_ATTR) != 0)
+ return (ItemPointer) ((char *) itup + IndexTupleSize(itup) -
+ sizeof(ItemPointerData));
+
+ /* Heap TID attribute was truncated */
+ return NULL;
+ }
+ else if (BTreeTupleIsPosting(itup))
+ return BTreeTupleGetPosting(itup);
+
+ return &(itup->t_tid);
+}
+
+/*
+ * Get maximum heap TID attribute, which could be the only TID in the case of
+ * a non-pivot tuple that does not have a posting list. Works with
+ * non-pivot tuples only.
+ */
+static inline ItemPointer
+BTreeTupleGetMaxHeapTID(IndexTuple itup)
+{
+ Assert(!BTreeTupleIsPivot(itup));
+
+ if (BTreeTupleIsPosting(itup))
+ return (ItemPointer) (BTreeTupleGetPosting(itup) +
+ (BTreeTupleGetNPosting(itup) - 1));
+
+ return &(itup->t_tid);
+}
+
/*
* Set the heap TID attribute for a tuple that uses the INDEX_ALT_TID_MASK
- * representation (currently limited to pivot tuples)
+ * representation
*/
#define BTreeTupleSetAltHeapTID(itup) \
do { \
- Assert((itup)->t_info & INDEX_ALT_TID_MASK); \
+ Assert(BTreeTupleIsPivot(itup)); \
ItemPointerSetOffsetNumber(&(itup)->t_tid, \
ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) | BT_HEAP_TID_ATTR); \
} while(0)
@@ -472,6 +697,7 @@ typedef struct BTScanInsertData
bool anynullkeys;
bool nextkey;
bool pivotsearch;
+ bool dedup_is_possible;
ItemPointer scantid; /* tiebreaker for scankeys */
int keysz; /* Size of scankeys array */
ScanKeyData scankeys[INDEX_MAX_KEYS]; /* Must appear last */
@@ -499,6 +725,13 @@ typedef struct BTInsertStateData
/* Buffer containing leaf page we're likely to insert itup on */
Buffer buf;
+ /*
+ * If _bt_binsrch_insert() found the location inside an existing posting
+ * list, save the position inside the list. This will be -1 in rare cases
+ * where the overlapping posting list is LP_DEAD.
+ */
+ int postingoff;
+
/*
* Cache of bounds within the current buffer. Only used for insertions
* where _bt_check_unique is called. See _bt_binsrch_insert and
@@ -534,7 +767,9 @@ typedef BTInsertStateData *BTInsertState;
* If we are doing an index-only scan, we save the entire IndexTuple for each
* matched item, otherwise only its heap TID and offset. The IndexTuples go
* into a separate workspace array; each BTScanPosItem stores its tuple's
- * offset within that array.
+ * offset within that array. Posting list tuples store a version of the
+ * tuple that does not include the posting list, allowing the same key to be
+ * returned for each logical tuple associated with the posting list.
*/
typedef struct BTScanPosItem /* what we remember about each match */
@@ -563,9 +798,13 @@ typedef struct BTScanPosData
/*
* If we are doing an index-only scan, nextTupleOffset is the first free
- * location in the associated tuple storage workspace.
+ * location in the associated tuple storage workspace. Posting list
+ * tuples need postingTupleOffset to store the current location of the
+ * tuple that is returned multiple times (once per heap TID in posting
+ * list).
*/
int nextTupleOffset;
+ int postingTupleOffset;
/*
* The items array is always ordered in index order (ie, increasing
@@ -578,7 +817,7 @@ typedef struct BTScanPosData
int lastItem; /* last valid index in items[] */
int itemIndex; /* current index in items[] */
- BTScanPosItem items[MaxIndexTuplesPerPage]; /* MUST BE LAST */
+ BTScanPosItem items[MaxPostingIndexTuplesPerPage]; /* MUST BE LAST */
} BTScanPosData;
typedef BTScanPosData *BTScanPos;
@@ -730,8 +969,15 @@ extern void _bt_parallel_advance_array_keys(IndexScanDesc scan);
*/
extern bool _bt_doinsert(Relation rel, IndexTuple itup,
IndexUniqueCheck checkUnique, Relation heapRel);
+extern IndexTuple _bt_posting_split(IndexTuple newitem, IndexTuple oposting,
+ OffsetNumber postingoff);
extern void _bt_finish_split(Relation rel, Buffer bbuf, BTStack stack);
extern Buffer _bt_getstackbuf(Relation rel, BTStack stack, BlockNumber child);
+extern void _bt_dedup_start_pending(BTDedupState *state, IndexTuple base,
+ OffsetNumber base_off);
+extern bool _bt_dedup_save_htid(BTDedupState *state, IndexTuple itup);
+extern Size _bt_dedup_finish_pending(Buffer buffer, BTDedupState *state,
+ bool need_wal);
/*
* prototypes for functions in nbtsplitloc.c
@@ -743,7 +989,8 @@ extern OffsetNumber _bt_findsplitloc(Relation rel, Page page,
/*
* prototypes for functions in nbtpage.c
*/
-extern void _bt_initmetapage(Page page, BlockNumber rootbknum, uint32 level);
+extern void _bt_initmetapage(Page page, BlockNumber rootbknum, uint32 level,
+ bool dedup_is_possible);
extern void _bt_update_meta_cleanup_info(Relation rel,
TransactionId oldestBtpoXact, float8 numHeapTuples);
extern void _bt_upgrademetapage(Page page);
@@ -751,6 +998,7 @@ extern Buffer _bt_getroot(Relation rel, int access);
extern Buffer _bt_gettrueroot(Relation rel);
extern int _bt_getrootheight(Relation rel);
extern bool _bt_heapkeyspace(Relation rel);
+extern bool _bt_getdedupispossible(Relation rel);
extern void _bt_checkpage(Relation rel, Buffer buf);
extern Buffer _bt_getbuf(Relation rel, BlockNumber blkno, int access);
extern Buffer _bt_relandgetbuf(Relation rel, Buffer obuf,
@@ -762,6 +1010,8 @@ extern void _bt_delitems_delete(Relation rel, Buffer buf,
OffsetNumber *itemnos, int nitems, Relation heapRel);
extern void _bt_delitems_vacuum(Relation rel, Buffer buf,
OffsetNumber *itemnos, int nitems,
+ OffsetNumber *updateitemnos,
+ IndexTuple *updated, int nupdateable,
BlockNumber lastBlockVacuumed);
extern int _bt_pagedel(Relation rel, Buffer buf);
@@ -812,6 +1062,9 @@ extern bool _bt_check_natts(Relation rel, bool heapkeyspace, Page page,
OffsetNumber offnum);
extern void _bt_check_third_page(Relation rel, Relation heap,
bool needheaptidspace, Page page, IndexTuple newtup);
+extern IndexTuple BTreeFormPostingTuple(IndexTuple tuple, ItemPointer htids,
+ int nhtids);
+extern bool _bt_dedup_is_possible(Relation index);
/*
* prototypes for functions in nbtvalidate.c
diff --git a/src/include/access/nbtxlog.h b/src/include/access/nbtxlog.h
index 91b9ee00cf..71f6568234 100644
--- a/src/include/access/nbtxlog.h
+++ b/src/include/access/nbtxlog.h
@@ -28,7 +28,8 @@
#define XLOG_BTREE_INSERT_META 0x20 /* same, plus update metapage */
#define XLOG_BTREE_SPLIT_L 0x30 /* add index tuple with split */
#define XLOG_BTREE_SPLIT_R 0x40 /* as above, new item on right */
-/* 0x50 and 0x60 are unused */
+#define XLOG_BTREE_DEDUP_PAGE 0x50 /* deduplicate tuples on leaf page */
+/* 0x60 is unused */
#define XLOG_BTREE_DELETE 0x70 /* delete leaf index tuples for a page */
#define XLOG_BTREE_UNLINK_PAGE 0x80 /* delete a half-dead page */
#define XLOG_BTREE_UNLINK_PAGE_META 0x90 /* same, and update metapage */
@@ -53,6 +54,7 @@ typedef struct xl_btree_metadata
uint32 fastlevel;
TransactionId oldest_btpo_xact;
float8 last_cleanup_num_heap_tuples;
+ bool btm_dedup_is_possible;
} xl_btree_metadata;
/*
@@ -61,16 +63,21 @@ typedef struct xl_btree_metadata
* This data record is used for INSERT_LEAF, INSERT_UPPER, INSERT_META.
* Note that INSERT_META implies it's not a leaf page.
*
- * Backup Blk 0: original page (data contains the inserted tuple)
+ * Backup Blk 0: original page (data contains the inserted tuple);
+ * if postingoff is set, this started out as an insertion
+ * into an existing posting tuple at the offset before
+ * offnum (i.e. it's a posting list split). (REDO will
+ * have to update split posting list, too.)
* Backup Blk 1: child's left sibling, if INSERT_UPPER or INSERT_META
* Backup Blk 2: xl_btree_metadata, if INSERT_META
*/
typedef struct xl_btree_insert
{
OffsetNumber offnum;
+ OffsetNumber postingoff;
} xl_btree_insert;
-#define SizeOfBtreeInsert (offsetof(xl_btree_insert, offnum) + sizeof(OffsetNumber))
+#define SizeOfBtreeInsert (offsetof(xl_btree_insert, postingoff) + sizeof(OffsetNumber))
/*
* On insert with split, we save all the items going into the right sibling
@@ -91,9 +98,19 @@ typedef struct xl_btree_insert
*
* Backup Blk 0: original page / new left page
*
- * The left page's data portion contains the new item, if it's the _L variant.
- * An IndexTuple representing the high key of the left page must follow with
- * either variant.
+ * The left page's data portion contains the new item, if it's the _L variant
+ * (though _R variant page split records with a posting list split sometimes
+ * need to include newitem). An IndexTuple representing the high key of the
+ * left page must follow in all cases.
+ *
+ * The newitem is actually an "original" newitem when a posting list split
+ * occurs that requires that the original posting list be updated in passing.
+ * Recovery recognizes this case when postingoff is set, and must use the
+ * posting offset to do an in-place update of the existing posting list that
+ * was actually split, and change the newitem to the "final" newitem. This
+ * corresponds to the xl_btree_insert postingoff-is-set case. postingoff
+ * won't be set when a posting list split occurs where both original posting
+ * list and newitem go on the right page.
*
* Backup Blk 1: new right page
*
@@ -111,10 +128,26 @@ typedef struct xl_btree_split
{
uint32 level; /* tree level of page being split */
OffsetNumber firstright; /* first item moved to right page */
- OffsetNumber newitemoff; /* new item's offset (useful for _L variant) */
+ OffsetNumber newitemoff; /* new item's offset */
+ OffsetNumber postingoff; /* offset inside orig posting tuple */
} xl_btree_split;
-#define SizeOfBtreeSplit (offsetof(xl_btree_split, newitemoff) + sizeof(OffsetNumber))
+#define SizeOfBtreeSplit (offsetof(xl_btree_split, postingoff) + sizeof(OffsetNumber))
+
+/*
+ * When page is deduplicated, consecutive groups of tuples with equal keys are
+ * merged together into posting list tuples.
+ *
+ * The WAL record represents the interval that describes the posting tuple
+ * that should be added to the page.
+ */
+typedef struct xl_btree_dedup
+{
+ OffsetNumber baseoff;
+ OffsetNumber nitems;
+} xl_btree_dedup;
+
+#define SizeOfBtreeDedup (offsetof(xl_btree_dedup, nitems) + sizeof(OffsetNumber))
/*
* This is what we need to know about delete of individual leaf index tuples.
@@ -166,16 +199,27 @@ typedef struct xl_btree_reuse_page
* block numbers aren't given.
*
* Note that the *last* WAL record in any vacuum of an index is allowed to
- * have a zero length array of offsets. Earlier records must have at least one.
+ * have a zero length array of target offsets (i.e. no deletes or updates).
+ * Earlier records must have at least one.
*/
typedef struct xl_btree_vacuum
{
BlockNumber lastBlockVacuumed;
- /* TARGET OFFSET NUMBERS FOLLOW */
+ /*
+ * This field helps us to find beginning of the updated versions of tuples
+ * which follow array of offset numbers, needed when a posting list is
+ * vacuumed without killing all of its logical tuples.
+ */
+ uint32 nupdated;
+ uint32 ndeleted;
+
+ /* UPDATED TARGET OFFSET NUMBERS FOLLOW (if any) */
+ /* UPDATED TUPLES TO ADD BACK FOLLOW (if any) */
+ /* DELETED TARGET OFFSET NUMBERS FOLLOW (if any) */
} xl_btree_vacuum;
-#define SizeOfBtreeVacuum (offsetof(xl_btree_vacuum, lastBlockVacuumed) + sizeof(BlockNumber))
+#define SizeOfBtreeVacuum (offsetof(xl_btree_vacuum, ndeleted) + sizeof(uint32))
/*
* This is what we need to know about marking an empty branch for deletion.
@@ -256,6 +300,8 @@ typedef struct xl_btree_newroot
extern void btree_redo(XLogReaderState *record);
extern void btree_desc(StringInfo buf, XLogReaderState *record);
extern const char *btree_identify(uint8 info);
+extern void btree_xlog_startup(void);
+extern void btree_xlog_cleanup(void);
extern void btree_mask(char *pagedata, BlockNumber blkno);
#endif /* NBTXLOG_H */
diff --git a/src/include/access/rmgrlist.h b/src/include/access/rmgrlist.h
index 3c0db2ccf5..2b8c6c7fc8 100644
--- a/src/include/access/rmgrlist.h
+++ b/src/include/access/rmgrlist.h
@@ -36,7 +36,7 @@ PG_RMGR(RM_RELMAP_ID, "RelMap", relmap_redo, relmap_desc, relmap_identify, NULL,
PG_RMGR(RM_STANDBY_ID, "Standby", standby_redo, standby_desc, standby_identify, NULL, NULL, NULL)
PG_RMGR(RM_HEAP2_ID, "Heap2", heap2_redo, heap2_desc, heap2_identify, NULL, NULL, heap_mask)
PG_RMGR(RM_HEAP_ID, "Heap", heap_redo, heap_desc, heap_identify, NULL, NULL, heap_mask)
-PG_RMGR(RM_BTREE_ID, "Btree", btree_redo, btree_desc, btree_identify, NULL, NULL, btree_mask)
+PG_RMGR(RM_BTREE_ID, "Btree", btree_redo, btree_desc, btree_identify, btree_xlog_startup, btree_xlog_cleanup, btree_mask)
PG_RMGR(RM_HASH_ID, "Hash", hash_redo, hash_desc, hash_identify, NULL, NULL, hash_mask)
PG_RMGR(RM_GIN_ID, "Gin", gin_redo, gin_desc, gin_identify, gin_xlog_startup, gin_xlog_cleanup, gin_mask)
PG_RMGR(RM_GIST_ID, "Gist", gist_redo, gist_desc, gist_identify, gist_xlog_startup, gist_xlog_cleanup, gist_mask)
diff --git a/src/tools/valgrind.supp b/src/tools/valgrind.supp
index ec47a228ae..71a03e39d3 100644
--- a/src/tools/valgrind.supp
+++ b/src/tools/valgrind.supp
@@ -212,3 +212,24 @@
Memcheck:Cond
fun:PyObject_Realloc
}
+
+# Temporarily work around bug in datum_image_eq's handling of the cstring
+# (typLen == -2) case. datumIsEqual() is not affected, but also doesn't handle
+# TOAST'ed values correctly.
+#
+# FIXME: Remove both suppressions when bug is fixed on master branch
+{
+ temporary_workaround_1
+ Memcheck:Addr1
+ fun:bcmp
+ fun:datum_image_eq
+ fun:_bt_keep_natts_fast
+}
+
+{
+ temporary_workaround_8
+ Memcheck:Addr8
+ fun:bcmp
+ fun:datum_image_eq
+ fun:_bt_keep_natts_fast
+}
--
2.17.1
On Mon, Sep 30, 2019 at 7:39 PM Peter Geoghegan <pg@bowt.ie> wrote:
> I've found that my "regular pgbench, but with a low cardinality index
> on pgbench_accounts(abalance)" benchmark works best with the specific
> heuristics used in the patch, especially over many hours.
I ran pgbench without the pgbench_accounts(abalance) index, and with
slightly adjusted client counts -- you could say that this was a
classic pgbench benchmark of v20 of the patch. Still scale 500, with
single hour runs.
Here are the results for each 1 hour run, with client counts of 8, 16,
and 32, with two rounds of runs total:
master_1_run_8.out: "tps = 25156.689415 (including connections establishing)"
patch_1_run_8.out: "tps = 25135.472084 (including connections establishing)"
master_1_run_16.out: "tps = 30947.053714 (including connections establishing)"
patch_1_run_16.out: "tps = 31225.044305 (including connections establishing)"
master_1_run_32.out: "tps = 29550.231969 (including connections establishing)"
patch_1_run_32.out: "tps = 29425.011249 (including connections establishing)"
master_2_run_8.out: "tps = 24678.792084 (including connections establishing)"
patch_2_run_8.out: "tps = 24891.130465 (including connections establishing)"
master_2_run_16.out: "tps = 30878.930585 (including connections establishing)"
patch_2_run_16.out: "tps = 30982.306091 (including connections establishing)"
master_2_run_32.out: "tps = 29555.453436 (including connections establishing)"
patch_2_run_32.out: "tps = 29591.767136 (including connections establishing)"
This interlaced order is the same order that each 1 hour pgbench run
actually ran in. The patch wasn't expected to do any better here -- it
was expected to not be any slower for a workload that it cannot really
help. Though the two small pgbench indexes do remain a lot smaller
with the patch.
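To make the "not any slower" reading concrete, here is a minimal sketch that turns the tps figures quoted above into relative differences. The struct and function names (TpsPair, tps_pct_change, max_abs_pct_change, print_tps_deltas) are hypothetical and exist only for this illustration; the numbers are copied verbatim from the runs listed above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* One master/patch pair of 1 hour runs (hypothetical helper type). */
typedef struct TpsPair
{
	const char *run;
	double		master_tps;
	double		patch_tps;
} TpsPair;

/* tps figures copied from the pgbench output quoted in this message. */
static const TpsPair tps_runs[] = {
	{"1_run_8", 25156.689415, 25135.472084},
	{"1_run_16", 30947.053714, 31225.044305},
	{"1_run_32", 29550.231969, 29425.011249},
	{"2_run_8", 24678.792084, 24891.130465},
	{"2_run_16", 30878.930585, 30982.306091},
	{"2_run_32", 29555.453436, 29591.767136},
};

/* Relative change of patch over master, in percent. */
double
tps_pct_change(double master_tps, double patch_tps)
{
	assert(master_tps > 0.0);
	return (patch_tps - master_tps) / master_tps * 100.0;
}

/* Largest absolute relative change across all runs, in percent. */
double
max_abs_pct_change(void)
{
	double		worst = 0.0;

	for (size_t i = 0; i < sizeof(tps_runs) / sizeof(tps_runs[0]); i++)
	{
		double		d = tps_pct_change(tps_runs[i].master_tps,
									   tps_runs[i].patch_tps);

		if (d < 0.0)
			d = -d;
		if (d > worst)
			worst = d;
	}
	return worst;
}

/* Print one line per run with the signed percentage difference. */
void
print_tps_deltas(void)
{
	for (size_t i = 0; i < sizeof(tps_runs) / sizeof(tps_runs[0]); i++)
		printf("%-9s %+.3f%%\n", tps_runs[i].run,
			   tps_pct_change(tps_runs[i].master_tps, tps_runs[i].patch_tps));
}
```

Calling print_tps_deltas() from a small main() lists the per-run differences. Every pair lands within 1% of master, in both directions, which is what one would expect from run-to-run noise rather than a real regression.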
While a lot of work remains to validate the performance of the patch,
this looks good to me.
--
Peter Geoghegan
On Mon, Sep 30, 2019 at 7:39 PM Peter Geoghegan <pg@bowt.ie> wrote:
> Attached is v20, which adds a custom strategy for the checkingunique
> (unique index) case to _bt_dedup_one_page(). It also makes
> deduplication the default for both unique and non-unique indexes. I
> simply altered your new BtreeDefaultDoDedup() macro from v19 to make
> nbtree use deduplication wherever it is safe to do so. This default
> may not be the best one in the end, though deduplication in unique
> indexes now looks very compelling.
Attached is v21, which fixes some bitrot -- v20 of the patch was made
totally unusable by today's commit 8557a6f1. Other changes:
* New datum_image_eq() patch fixes up datum_image_eq() to work with
cstring/name columns, which we rely on. No need for the Valgrind
suppressions anymore. The suppression was only needed to paper over
the fact that datum_image_eq() would not really work properly with
cstring datums (the suppression was papering over a legitimate
complaint, but we fix the underlying problem with 8557a6f1 and the
v21-0001-* patch).
* New nbtdedup.c file added. This has all of the functions that dealt
with deduplication and posting lists that were previously in
nbtinsert.c and nbtutils.c. I think that this separation is somewhat
cleaner.
* Additional tweaks to the custom checkingunique algorithm used by
deduplication. This is based on further tuning from benchmarking. This
is certainly not final yet.
* Greatly simplified the code for unique index LP_DEAD killing in
_bt_check_unique(). This was pretty sloppy in v20 of the patch (it had
two "goto" labels). Now it works with the existing loop conditions
that advance to the next equal item on the page.
* Additional adjustments to the nbtree.h comments about the on-disk format.
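As a rough illustration of the on-disk format those nbtree.h comments describe, the sketch below models how the offset field of t_tid distinguishes posting tuples from pivot tuples. The bit values are copied from the nbtree.h hunk of the patch; SketchTuple and the sketch_* functions are hypothetical stand-ins rather than the patch's macros, and unlike BTreeTupleSetNPosting() the setter here also sets the alternative-TID flag itself to keep the example short.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit values copied from the nbtree.h hunk in the patch. */
#define BT_N_POSTING_OFFSET_MASK	0x0FFF
#define BT_HEAP_TID_ATTR			0x1000
#define BT_IS_POSTING				0x2000

/* Hypothetical stand-in for the two fields the real macros inspect. */
typedef struct SketchTuple
{
	bool		alt_tid;		/* models (t_info & INDEX_ALT_TID_MASK) != 0 */
	uint16_t	tid_offset;		/* models the offset number stored in t_tid */
} SketchTuple;

/* Store the number of posting list TIDs and mark the tuple as posting. */
void
sketch_set_nposting(SketchTuple *tup, unsigned n)
{
	tup->alt_tid = true;		/* simplification: the patch asserts this */
	tup->tid_offset = (uint16_t) ((n & BT_N_POSTING_OFFSET_MASK) |
								  BT_IS_POSTING);
}

/* Posting tuple: alternative TID representation with BT_IS_POSTING set. */
bool
sketch_is_posting(const SketchTuple *tup)
{
	return tup->alt_tid && (tup->tid_offset & BT_IS_POSTING) != 0;
}

/* Pivot tuple: alternative TID representation with BT_IS_POSTING unset. */
bool
sketch_is_pivot(const SketchTuple *tup)
{
	return tup->alt_tid && (tup->tid_offset & BT_IS_POSTING) == 0;
}

/* Number of heap TIDs in the posting list (low 12 bits of the offset). */
unsigned
sketch_get_nposting(const SketchTuple *tup)
{
	assert(sketch_is_posting(tup));
	return tup->tid_offset & BT_N_POSTING_OFFSET_MASK;
}
```

The point of the layout is that the low 12 bits are shared: for pivot tuples they hold the number of key attributes, for posting tuples the number of TIDs, and BT_IS_POSTING tells the two apart.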
Can you take a quick look at the first patch (the v21-0001-* patch),
Anastasia? I would like to get that one out of the way soon.
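For quick reference, the comparison that the v21-0001-* patch adds for the typLen == -2 case can be sketched outside the backend as follows. This is a standalone approximation: cstring_image_eq is a hypothetical name, and it takes plain C strings instead of Datum values, so DatumGetCString() does not appear.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/*
 * Image equality for NUL-terminated strings: compare lengths first
 * (including the terminator), then compare the bytes.
 */
bool
cstring_image_eq(const char *s1, const char *s2)
{
	size_t		len1 = strlen(s1) + 1;
	size_t		len2 = strlen(s2) + 1;

	if (len1 != len2)
		return false;
	return memcmp(s1, s2, len1) == 0;
}
```

Because the length is derived from each string's own terminator, the comparison never reads past the end of either input.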
--
Peter Geoghegan
Attachments:
v21-0001-Teach-datum_image_eq-about-cstring-datums.patch (application/x-patch)
From 49d1be9007130c0e80e423f99c7b043df654b0cc Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Mon, 4 Nov 2019 09:07:13 -0800
Subject: [PATCH v21 1/3] Teach datum_image_eq() about cstring datums.
An upcoming patch to add deduplication to nbtree indexes needs to be
able to use datum_image_eq() as a drop-in replacement for opclass
equality in certain contexts. This includes comparisons of TOASTable
datatypes such as text (at least when deterministic collations are in
use), and cstring datums in system catalog indexes. cstring is used as
the storage type of "name" columns when indexed by nbtree, despite the
fact that cstring is a pseudo-type.
Discussion: https://postgr.es/m/CAH2-Wzn3Ee49Gmxb7V1VJ3-AC8fWn-Fr8pfWQebHe8rYRxt5OQ@mail.gmail.com
---
src/backend/utils/adt/datum.c | 19 ++++++++++++++++---
1 file changed, 16 insertions(+), 3 deletions(-)
diff --git a/src/backend/utils/adt/datum.c b/src/backend/utils/adt/datum.c
index 73703efe05..b20d0640ea 100644
--- a/src/backend/utils/adt/datum.c
+++ b/src/backend/utils/adt/datum.c
@@ -263,6 +263,8 @@ datumIsEqual(Datum value1, Datum value2, bool typByVal, int typLen)
bool
datum_image_eq(Datum value1, Datum value2, bool typByVal, int typLen)
{
+ Size len1,
+ len2;
bool result = true;
if (typByVal)
@@ -277,9 +279,6 @@ datum_image_eq(Datum value1, Datum value2, bool typByVal, int typLen)
}
else if (typLen == -1)
{
- Size len1,
- len2;
-
len1 = toast_raw_datum_size(value1);
len2 = toast_raw_datum_size(value2);
/* No need to de-toast if lengths don't match. */
@@ -304,6 +303,20 @@ datum_image_eq(Datum value1, Datum value2, bool typByVal, int typLen)
pfree(arg2val);
}
}
+ else if (typLen == -2)
+ {
+ char *s1,
+ *s2;
+
+ /* Compare cstring datums */
+ s1 = DatumGetCString(value1);
+ s2 = DatumGetCString(value2);
+ len1 = strlen(s1) + 1;
+ len2 = strlen(s2) + 1;
+ if (len1 != len2)
+ return false;
+ result = (memcmp(s1, s2, len1) == 0);
+ }
else
elog(ERROR, "unexpected typLen: %d", typLen);
--
2.17.1
v21-0003-DEBUG-Add-pageinspect-instrumentation.patch (application/x-patch)
From d72829b729b2e028048ebc9fcbdb9a7f47c724b8 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Mon, 10 Sep 2018 19:53:51 -0700
Subject: [PATCH v21 3/3] DEBUG: Add pageinspect instrumentation.
Have pageinspect display user-visible attribute values, heap TID, max
heap TID, and the number of TIDs in a tuple (can be > 1 in the case of
posting list tuples). Also adds a column that shows whether or not the
LP_DEAD bit has been set.
This patch is not proposed for inclusion in PostgreSQL; it's included
for the convenience of reviewers.
The following query can be used with this hacked pageinspect, which
visualizes the internal pages:
"""
with recursive index_details as (
select
'my_test_index'::text idx
),
size_in_pages_index as (
select
(pg_relation_size(idx::regclass) / (2^13))::int4 size_pages
from
index_details
),
page_stats as (
select
index_details.*,
stats.*
from
index_details,
size_in_pages_index,
lateral (select i from generate_series(1, size_pages - 1) i) series,
lateral (select * from bt_page_stats(idx, i)) stats),
internal_page_stats as (
select
*
from
page_stats
where
type != 'l'),
meta_stats as (
select
*
from
index_details s,
lateral (select * from bt_metap(s.idx)) meta),
internal_items as (
select
*
from
internal_page_stats
order by
btpo desc),
-- XXX: Note ordering dependency within this CTE, on internal_items
ordered_internal_items(item, blk, level) as (
select
1,
blkno,
btpo
from
internal_items
where
btpo_prev = 0
and btpo = (select level from meta_stats)
union
select
case when level = btpo then o.item + 1 else 1 end,
blkno,
btpo
from
internal_items i,
ordered_internal_items o
where
i.btpo_prev = o.blk or (btpo_prev = 0 and btpo = o.level - 1)
)
select
--idx,
btpo as level,
item as l_item,
blkno,
--btpo_prev,
--btpo_next,
btpo_flags,
type,
live_items,
dead_items,
avg_item_size,
page_size,
free_size,
-- Only non-rightmost pages have high key. Show heap TID for both pivot and non-pivot tuples here.
case when btpo_next != 0 then (select data || coalesce(', (htid)=(''' || htid || ''')', '')
from bt_page_items(idx, blkno) where itemoffset = 1) end as highkey
from
ordered_internal_items o
join internal_items i on o.blk = i.blkno
order by btpo desc, item;
"""
---
contrib/pageinspect/btreefuncs.c | 92 ++++++++++++++++---
contrib/pageinspect/expected/btree.out | 6 +-
contrib/pageinspect/pageinspect--1.6--1.7.sql | 25 +++++
3 files changed, 109 insertions(+), 14 deletions(-)
diff --git a/contrib/pageinspect/btreefuncs.c b/contrib/pageinspect/btreefuncs.c
index 78cdc69ec7..435e71ae22 100644
--- a/contrib/pageinspect/btreefuncs.c
+++ b/contrib/pageinspect/btreefuncs.c
@@ -27,6 +27,7 @@
#include "postgres.h"
+#include "access/genam.h"
#include "access/nbtree.h"
#include "access/relation.h"
#include "catalog/namespace.h"
@@ -241,6 +242,7 @@ bt_page_stats(PG_FUNCTION_ARGS)
*/
struct user_args
{
+ Relation rel;
Page page;
OffsetNumber offset;
};
@@ -252,9 +254,9 @@ struct user_args
* ------------------------------------------------------
*/
static Datum
-bt_page_print_tuples(FuncCallContext *fctx, Page page, OffsetNumber offset)
+bt_page_print_tuples(FuncCallContext *fctx, Page page, OffsetNumber offset, Relation rel)
{
- char *values[6];
+ char *values[10];
HeapTuple tuple;
ItemId id;
IndexTuple itup;
@@ -263,6 +265,8 @@ bt_page_print_tuples(FuncCallContext *fctx, Page page, OffsetNumber offset)
int dlen;
char *dump;
char *ptr;
+ ItemPointer min_htid,
+ max_htid;
id = PageGetItemId(page, offset);
@@ -281,16 +285,77 @@ bt_page_print_tuples(FuncCallContext *fctx, Page page, OffsetNumber offset)
values[j++] = psprintf("%c", IndexTupleHasVarwidths(itup) ? 't' : 'f');
ptr = (char *) itup + IndexInfoFindDataOffset(itup->t_info);
- dlen = IndexTupleSize(itup) - IndexInfoFindDataOffset(itup->t_info);
- dump = palloc0(dlen * 3 + 1);
- values[j] = dump;
- for (off = 0; off < dlen; off++)
+ if (rel)
{
- if (off > 0)
- *dump++ = ' ';
- sprintf(dump, "%02x", *(ptr + off) & 0xff);
- dump += 2;
+ TupleDesc itupdesc = RelationGetDescr(rel);
+ Datum datvalues[INDEX_MAX_KEYS];
+ bool isnull[INDEX_MAX_KEYS];
+ int natts;
+ int indnkeyatts = rel->rd_index->indnkeyatts;
+
+ natts = BTreeTupleGetNAtts(itup, rel);
+
+ itupdesc->natts = Min(indnkeyatts, natts);
+ memset(&isnull, 0xFF, sizeof(isnull));
+ index_deform_tuple(itup, itupdesc, datvalues, isnull);
+ rel->rd_index->indnkeyatts = natts;
+ values[j++] = BuildIndexValueDescription(rel, datvalues, isnull);
+ itupdesc->natts = IndexRelationGetNumberOfAttributes(rel);
+ rel->rd_index->indnkeyatts = indnkeyatts;
}
+ else
+ {
+ dlen = IndexTupleSize(itup) - IndexInfoFindDataOffset(itup->t_info);
+ dump = palloc0(dlen * 3 + 1);
+ values[j++] = dump;
+ for (off = 0; off < dlen; off++)
+ {
+ if (off > 0)
+ *dump++ = ' ';
+ sprintf(dump, "%02x", *(ptr + off) & 0xff);
+ dump += 2;
+ }
+ }
+
+ if (rel && !_bt_heapkeyspace(rel))
+ {
+ min_htid = NULL;
+ max_htid = NULL;
+ }
+ else
+ {
+ min_htid = BTreeTupleGetHeapTID(itup);
+ if (BTreeTupleIsPosting(itup))
+ max_htid = BTreeTupleGetMaxHeapTID(itup);
+ else
+ max_htid = NULL;
+ }
+
+ if (min_htid)
+ values[j++] = psprintf("(%u,%u)",
+ ItemPointerGetBlockNumberNoCheck(min_htid),
+ ItemPointerGetOffsetNumberNoCheck(min_htid));
+ else
+ values[j++] = NULL;
+
+ if (max_htid)
+ values[j++] = psprintf("(%u,%u)",
+ ItemPointerGetBlockNumberNoCheck(max_htid),
+ ItemPointerGetOffsetNumberNoCheck(max_htid));
+ else
+ values[j++] = NULL;
+
+ if (min_htid == NULL)
+ values[j++] = psprintf("0");
+ else if (!BTreeTupleIsPosting(itup))
+ values[j++] = psprintf("1");
+ else
+ values[j++] = psprintf("%d", (int) BTreeTupleGetNPosting(itup));
+
+ if (!ItemIdIsDead(id))
+ values[j++] = psprintf("f");
+ else
+ values[j++] = psprintf("t");
tuple = BuildTupleFromCStrings(fctx->attinmeta, values);
@@ -364,11 +429,11 @@ bt_page_items(PG_FUNCTION_ARGS)
uargs = palloc(sizeof(struct user_args));
+ uargs->rel = rel;
uargs->page = palloc(BLCKSZ);
memcpy(uargs->page, BufferGetPage(buffer), BLCKSZ);
UnlockReleaseBuffer(buffer);
- relation_close(rel, AccessShareLock);
uargs->offset = FirstOffsetNumber;
@@ -395,12 +460,13 @@ bt_page_items(PG_FUNCTION_ARGS)
if (fctx->call_cntr < fctx->max_calls)
{
- result = bt_page_print_tuples(fctx, uargs->page, uargs->offset);
+ result = bt_page_print_tuples(fctx, uargs->page, uargs->offset, uargs->rel);
uargs->offset++;
SRF_RETURN_NEXT(fctx, result);
}
else
{
+ relation_close(uargs->rel, AccessShareLock);
pfree(uargs->page);
pfree(uargs);
SRF_RETURN_DONE(fctx);
@@ -480,7 +546,7 @@ bt_page_items_bytea(PG_FUNCTION_ARGS)
if (fctx->call_cntr < fctx->max_calls)
{
- result = bt_page_print_tuples(fctx, uargs->page, uargs->offset);
+ result = bt_page_print_tuples(fctx, uargs->page, uargs->offset, NULL);
uargs->offset++;
SRF_RETURN_NEXT(fctx, result);
}
diff --git a/contrib/pageinspect/expected/btree.out b/contrib/pageinspect/expected/btree.out
index 07c2dcd771..0f6dccaadc 100644
--- a/contrib/pageinspect/expected/btree.out
+++ b/contrib/pageinspect/expected/btree.out
@@ -40,7 +40,11 @@ ctid | (0,1)
itemlen | 16
nulls | f
vars | f
-data | 01 00 00 00 00 00 00 01
+data | (a)=(72057594037927937)
+htid | (0,1)
+max_htid |
+nheap_tids | 1
+isdead | f
SELECT * FROM bt_page_items('test1_a_idx', 2);
ERROR: block number out of range
diff --git a/contrib/pageinspect/pageinspect--1.6--1.7.sql b/contrib/pageinspect/pageinspect--1.6--1.7.sql
index 2433a21af2..00473da938 100644
--- a/contrib/pageinspect/pageinspect--1.6--1.7.sql
+++ b/contrib/pageinspect/pageinspect--1.6--1.7.sql
@@ -24,3 +24,28 @@ CREATE FUNCTION bt_metap(IN relname text,
OUT last_cleanup_num_tuples real)
AS 'MODULE_PATHNAME', 'bt_metap'
LANGUAGE C STRICT PARALLEL SAFE;
+
+--
+-- bt_page_items()
+--
+DROP FUNCTION bt_page_items(IN relname text, IN blkno int4,
+ OUT itemoffset smallint,
+ OUT ctid tid,
+ OUT itemlen smallint,
+ OUT nulls bool,
+ OUT vars bool,
+ OUT data text);
+CREATE FUNCTION bt_page_items(IN relname text, IN blkno int4,
+ OUT itemoffset smallint,
+ OUT ctid tid,
+ OUT itemlen smallint,
+ OUT nulls bool,
+ OUT vars bool,
+ OUT data text,
+ OUT htid tid,
+ OUT max_htid tid,
+ OUT nheap_tids int4,
+ OUT isdead boolean)
+RETURNS SETOF record
+AS 'MODULE_PATHNAME', 'bt_page_items'
+LANGUAGE C STRICT PARALLEL SAFE;
--
2.17.1
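For readers following the pageinspect change above: the fallback branch of bt_page_print_tuples (presumably the one taken when no relation is available, as in bt_page_items_bytea, which passes NULL) renders the raw tuple data as space-separated hex bytes. Below is a minimal standalone sketch of that formatting loop. It assumes plain calloc in place of palloc0, and the helper name hex_dump is hypothetical (it is not part of the patch).

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Hypothetical helper (not from the patch): format 'len' bytes as
 * lowercase hex pairs separated by single spaces, e.g. "01 00 ff".
 * Returns a NUL-terminated string the caller must free(), or NULL
 * if allocation fails.
 */
static char *
hex_dump(const unsigned char *ptr, size_t len)
{
    /* 2 hex digits per byte + (len - 1) separators + NUL fits in len * 3 + 1 */
    char   *buf = calloc(len * 3 + 1, 1);
    char   *p = buf;

    assert(ptr != NULL || len == 0);
    if (buf == NULL)
        return NULL;

    for (size_t off = 0; off < len; off++)
    {
        if (off > 0)
            *p++ = ' ';
        /* writes exactly two digits plus a terminating NUL */
        snprintf(p, 3, "%02x", ptr[off]);
        p += 2;
    }
    return buf;
}
```

Fed the eight data bytes of the test tuple from the old expected output, this sketch should reproduce the "01 00 00 00 00 00 00 01" string that the patch replaces with the decoded "(a)=(...)" form when a relation is available.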
Attachment: v21-0002-Add-deduplication-to-nbtree.patch (application/x-patch)
From f241bf58420665adff152e3bc4389a119977b24f Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Wed, 25 Sep 2019 10:08:53 -0700
Subject: [PATCH v21 2/3] Add deduplication to nbtree
---
src/include/access/nbtree.h | 324 ++++++++++--
src/include/access/nbtxlog.h | 68 ++-
src/include/access/rmgrlist.h | 2 +-
src/backend/access/common/reloptions.c | 11 +-
src/backend/access/index/genam.c | 4 +
src/backend/access/nbtree/Makefile | 2 +-
src/backend/access/nbtree/README | 74 ++-
src/backend/access/nbtree/nbtdedup.c | 633 ++++++++++++++++++++++++
src/backend/access/nbtree/nbtinsert.c | 330 ++++++++++--
src/backend/access/nbtree/nbtpage.c | 209 +++++++-
src/backend/access/nbtree/nbtree.c | 174 ++++++-
src/backend/access/nbtree/nbtsearch.c | 244 ++++++++-
src/backend/access/nbtree/nbtsort.c | 145 +++++-
src/backend/access/nbtree/nbtsplitloc.c | 49 +-
src/backend/access/nbtree/nbtutils.c | 217 ++++++--
src/backend/access/nbtree/nbtxlog.c | 218 +++++++-
src/backend/access/rmgrdesc/nbtdesc.c | 28 +-
contrib/amcheck/verify_nbtree.c | 177 +++++--
18 files changed, 2705 insertions(+), 204 deletions(-)
create mode 100644 src/backend/access/nbtree/nbtdedup.c
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 4a80e84aa7..56ab23ad79 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -107,11 +107,43 @@ typedef struct BTMetaPageData
* pages */
float8 btm_last_cleanup_num_heap_tuples; /* number of heap tuples
* during last cleanup */
+ bool btm_safededup; /* deduplication safe for index? */
} BTMetaPageData;
#define BTPageGetMeta(p) \
((BTMetaPageData *) PageGetContents(p))
+/* Storage type for Btree's reloptions */
+typedef struct BtreeOptions
+{
+ int32 vl_len_; /* varlena header (do not touch directly!) */
+ int fillfactor;
+ double vacuum_cleanup_index_scale_factor;
+ bool dedup_enabled; /* Use deduplication where safe? */
+} BtreeOptions;
+
+/*
+ * By default deduplication is enabled for non unique indexes
+ * and disabled for unique ones
+ *
+ * XXX: Actually, we use deduplication everywhere for now. Re-review this
+ * decision later on.
+ */
+#define BtreeDefaultDoDedup(relation) \
+ ((relation)->rd_index->indisunique ? true : true)
+
+#define BtreeGetDoDedupOption(relation) \
+ ((relation)->rd_options ? \
+ ((BtreeOptions *) (relation)->rd_options)->dedup_enabled : \
+ BtreeDefaultDoDedup(relation))
+
+#define BtreeGetFillFactor(relation, defaultff) \
+ ((relation)->rd_options ? \
+ ((BtreeOptions *) (relation)->rd_options)->fillfactor : (defaultff))
+
+#define BtreeGetTargetPageFreeSpace(relation, defaultff) \
+ (BLCKSZ * (100 - BtreeGetFillFactor(relation, defaultff)) / 100)
+
/*
* The current Btree version is 4. That's what you'll get when you create
* a new index.
@@ -234,8 +266,7 @@ typedef struct BTMetaPageData
* t_tid | t_info | key values | INCLUDE columns, if any
*
* t_tid points to the heap TID, which is a tiebreaker key column as of
- * BTREE_VERSION 4. Currently, the INDEX_ALT_TID_MASK status bit is never
- * set for non-pivot tuples.
+ * BTREE_VERSION 4.
*
* All other types of index tuples ("pivot" tuples) only have key columns,
* since pivot tuples only exist to represent how the key space is
@@ -282,20 +313,176 @@ typedef struct BTMetaPageData
* future use. BT_N_KEYS_OFFSET_MASK should be large enough to store any
* number of columns/attributes <= INDEX_MAX_KEYS.
*
+ * Sometimes non-pivot tuples also use a representation that repurposes
+ * t_tid to store metadata rather than a TID. Postgres 13 introduced a new
+ * non-pivot tuple format in order to fold together multiple equal and
+ * equivalent non-pivot tuples into a single logically equivalent, space
+ * efficient representation - a posting list tuple. A posting list is an
+ * array of ItemPointerData elements (there must be at least two elements
+ * when the posting list tuple format is used). Posting list tuples are
+ * created dynamically by deduplication, at the point where we'd otherwise
+ * have to split a leaf page.
+ *
+ * Posting tuple format (alternative non-pivot tuple representation):
+ *
+ * t_tid | t_info | key values | posting list (TID array)
+ *
+ * Posting list tuples are recognized as such by having the
+ * INDEX_ALT_TID_MASK status bit set in t_info and the BT_IS_POSTING status
+ * bit set in t_tid. These flags redefine the content of the posting
+ * tuple's t_tid to store an offset to the posting list, as well as the
+ * total number of posting list array elements.
+ *
+ * The 12 least significant offset bits from t_tid are used to represent
+ * the number of posting items present in the tuple, leaving 4 status
+ * bits (BT_RESERVED_OFFSET_MASK bits), 3 of which are reserved for
+ * future use. Like any non-pivot tuple, the number of columns stored is
+ * always implicitly the total number in the index (in practice there can
+ * never be non-key columns stored, since deduplication is not supported
+ * with INCLUDE indexes).
+ *
* Note well: The macros that deal with the number of attributes in tuples
- * assume that a tuple with INDEX_ALT_TID_MASK set must be a pivot tuple,
- * and that a tuple without INDEX_ALT_TID_MASK set must be a non-pivot
- * tuple (or must have the same number of attributes as the index has
- * generally in the case of !heapkeyspace indexes). They will need to be
- * updated if non-pivot tuples ever get taught to use INDEX_ALT_TID_MASK
- * for something else.
+ * assume that a tuple with INDEX_ALT_TID_MASK set must be a pivot tuple or
+ * non-pivot posting tuple, and that a tuple without INDEX_ALT_TID_MASK set
+ * must be a non-pivot tuple (or must have the same number of attributes as
+ * the index has generally in the case of !heapkeyspace indexes).
*/
#define INDEX_ALT_TID_MASK INDEX_AM_RESERVED_BIT
/* Item pointer offset bits */
#define BT_RESERVED_OFFSET_MASK 0xF000
#define BT_N_KEYS_OFFSET_MASK 0x0FFF
+#define BT_N_POSTING_OFFSET_MASK 0x0FFF
#define BT_HEAP_TID_ATTR 0x1000
+#define BT_IS_POSTING 0x2000
+
+/*
+ * MaxPostingIndexTuplesPerPage is an upper bound on the number of tuples
+ * that can fit on one btree leaf page.
+ *
+ * Btree leaf pages may contain posting tuples, which store duplicates
+ * in a more effective way, so MaxPostingIndexTuplesPerPage is larger than
+ * MaxIndexTuplesPerPage.
+ *
+ * Each leaf page must contain at least three items, so estimate it as
+ * if we have three posting tuples with minimal size keys.
+ */
+#define MaxPostingIndexTuplesPerPage \
+ ((int) ((BLCKSZ - SizeOfPageHeaderData - \
+ 3*((MAXALIGN(sizeof(IndexTupleData) + 1) + sizeof(ItemIdData))) )) / \
+ (sizeof(ItemPointerData)))
+
+/*
+ * State used to represent a pending posting list during deduplication.
+ *
+ * Each entry represents a group of consecutive items from the page, starting
+ * from page offset number 'baseoff', which is the offset number of the "base"
+ * tuple on the page undergoing deduplication. 'nitems' is the total number
+ * of items from the page that will be merged to make a new posting tuple.
+ *
+ * Note: 'nitems' means the number of physical index tuples/line pointers on
+ * the page, starting with and including the item at offset number 'baseoff'
+ * (so nitems should be at least 2 when interval is used). These existing
+ * tuples may be posting list tuples or regular tuples.
+ */
+typedef struct BTDedupInterval
+{
+ OffsetNumber baseoff;
+ OffsetNumber nitems;
+} BTDedupInterval;
+
+/*
+ * Btree-private state needed to build posting tuples. htids is an array of
+ * ItemPointers for pending posting list.
+ *
+ * Iterating over tuples during index build or applying deduplication to a
+ * single page, we remember a "base" tuple, then compare the next one with it.
+ * If tuples are equal, save their TIDs in the posting list.
+ */
+typedef struct BTDedupState
+{
+ Relation rel;
+ /* Deduplication status info for entire page/operation */
+ Size maxitemsize; /* BTMaxItemSize() limit for page */
+ IndexTuple newitem;
+ bool checkingunique; /* Use unique index strategy? */
+ OffsetNumber skippedbase; /* First offset skipped by checkingunique */
+
+ /* Metadata about current pending posting list */
+ ItemPointer htids; /* Heap TIDs in pending posting list */
+ int nhtids; /* # heap TIDs in htids array */
+ int nitems; /* See BTDedupInterval definition */
+ Size alltupsize; /* Includes line pointer overhead */
+ bool overlap; /* Avoid overlapping posting lists? */
+
+ /* Metadata about base tuple of current pending posting list */
+ IndexTuple base; /* Use to form new posting list */
+ OffsetNumber baseoff; /* page offset of base */
+ Size basetupsize; /* base size without posting list */
+
+ /*
+ * Pending posting list. Contains information about a group of
+ * consecutive items that will be deduplicated by creating a new posting
+ * list tuple.
+ */
+ BTDedupInterval interval;
+} BTDedupState;
+
+/*
+ * N.B.: BTreeTupleIsPivot() should only be used in code that deals with
+ * heapkeyspace indexes specifically. BTreeTupleIsPosting() works with all
+ * nbtree indexes, though.
+ */
+#define BTreeTupleIsPivot(itup) \
+ ( \
+ ((itup)->t_info & INDEX_ALT_TID_MASK && \
+ ((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) == 0))\
+ )
+#define BTreeTupleIsPosting(itup) \
+ ( \
+ ((itup)->t_info & INDEX_ALT_TID_MASK && \
+ ((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) != 0))\
+ )
+
+#define BTreeTupleClearBtIsPosting(itup) \
+ do { \
+ ItemPointerSetOffsetNumber(&(itup)->t_tid, \
+ ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & ~BT_IS_POSTING); \
+ } while(0)
+
+#define BTreeTupleGetNPosting(itup) \
+ ( \
+ AssertMacro(BTreeTupleIsPosting(itup)), \
+ ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_N_POSTING_OFFSET_MASK \
+ )
+#define BTreeTupleSetNPosting(itup, n) \
+ do { \
+ ItemPointerSetOffsetNumber(&(itup)->t_tid, (n) & BT_N_POSTING_OFFSET_MASK); \
+ Assert((itup)->t_info & INDEX_ALT_TID_MASK); \
+ Assert(!((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) != 0)); \
+ ItemPointerSetOffsetNumber(&(itup)->t_tid, \
+ ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) | BT_IS_POSTING); \
+ } while(0)
+
+/*
+ * If tuple is posting, t_tid.ip_blkid contains offset of the posting list
+ */
+#define BTreeTupleGetPostingOffset(itup) \
+ ( \
+ AssertMacro(BTreeTupleIsPosting(itup)), \
+ ItemPointerGetBlockNumberNoCheck(&((itup)->t_tid)) \
+ )
+#define BTreeSetPostingMeta(itup, nposting, off) \
+ do { \
+ BTreeTupleSetNPosting(itup, nposting); \
+ Assert(BTreeTupleIsPosting(itup)); \
+ ItemPointerSetBlockNumber(&((itup)->t_tid), (off)); \
+ } while(0)
+
+#define BTreeTupleGetPosting(itup) \
+ (ItemPointer) ((char*) (itup) + BTreeTupleGetPostingOffset(itup))
+#define BTreeTupleGetPostingN(itup,n) \
+ (BTreeTupleGetPosting(itup) + (n))
/* Get/set downlink block number */
#define BTreeInnerTupleGetDownLink(itup) \
@@ -326,40 +513,73 @@ typedef struct BTMetaPageData
*/
#define BTreeTupleGetNAtts(itup, rel) \
( \
- (itup)->t_info & INDEX_ALT_TID_MASK ? \
+ ((itup)->t_info & INDEX_ALT_TID_MASK && \
+ ((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) == 0)) ? \
( \
ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_N_KEYS_OFFSET_MASK \
) \
: \
IndexRelationGetNumberOfAttributes(rel) \
)
-#define BTreeTupleSetNAtts(itup, n) \
- do { \
- (itup)->t_info |= INDEX_ALT_TID_MASK; \
- ItemPointerSetOffsetNumber(&(itup)->t_tid, (n) & BT_N_KEYS_OFFSET_MASK); \
- } while(0)
+
+static inline void
+BTreeTupleSetNAtts(IndexTuple itup, int n)
+{
+ Assert(!BTreeTupleIsPosting(itup));
+ itup->t_info |= INDEX_ALT_TID_MASK;
+ ItemPointerSetOffsetNumber(&itup->t_tid, n & BT_N_KEYS_OFFSET_MASK);
+}
/*
- * Get tiebreaker heap TID attribute, if any. Macro works with both pivot
- * and non-pivot tuples, despite differences in how heap TID is represented.
+ * Get tiebreaker heap TID attribute, if any. Works with both pivot and
+ * non-pivot tuples, despite differences in how heap TID is represented.
+ *
+ * This returns the first/lowest heap TID in the case of a posting list tuple.
*/
-#define BTreeTupleGetHeapTID(itup) \
- ( \
- (itup)->t_info & INDEX_ALT_TID_MASK && \
- (ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_HEAP_TID_ATTR) != 0 ? \
- ( \
- (ItemPointer) (((char *) (itup) + IndexTupleSize(itup)) - \
- sizeof(ItemPointerData)) \
- ) \
- : (itup)->t_info & INDEX_ALT_TID_MASK ? NULL : (ItemPointer) &((itup)->t_tid) \
- )
+static inline ItemPointer
+BTreeTupleGetHeapTID(IndexTuple itup)
+{
+ if (BTreeTupleIsPivot(itup))
+ {
+ /* Pivot tuple heap TID representation? */
+ if ((ItemPointerGetOffsetNumberNoCheck(&itup->t_tid) &
+ BT_HEAP_TID_ATTR) != 0)
+ return (ItemPointer) ((char *) itup + IndexTupleSize(itup) -
+ sizeof(ItemPointerData));
+
+ /* Heap TID attribute was truncated */
+ return NULL;
+ }
+ else if (BTreeTupleIsPosting(itup))
+ return BTreeTupleGetPosting(itup);
+
+ return &(itup->t_tid);
+}
+
+/*
+ * Get maximum heap TID attribute, which could be the only TID in the case of
+ * a non-pivot tuple that is not a posting list tuple. Works with
+ * non-pivot tuples only.
+ */
+static inline ItemPointer
+BTreeTupleGetMaxHeapTID(IndexTuple itup)
+{
+ Assert(!BTreeTupleIsPivot(itup));
+
+ if (BTreeTupleIsPosting(itup))
+ return (ItemPointer) (BTreeTupleGetPosting(itup) +
+ (BTreeTupleGetNPosting(itup) - 1));
+
+ return &(itup->t_tid);
+}
+
/*
* Set the heap TID attribute for a tuple that uses the INDEX_ALT_TID_MASK
- * representation (currently limited to pivot tuples)
+ * representation
*/
#define BTreeTupleSetAltHeapTID(itup) \
do { \
- Assert((itup)->t_info & INDEX_ALT_TID_MASK); \
+ Assert(BTreeTupleIsPivot(itup)); \
ItemPointerSetOffsetNumber(&(itup)->t_tid, \
ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) | BT_HEAP_TID_ATTR); \
} while(0)
@@ -434,6 +654,11 @@ typedef BTStackData *BTStack;
* indexes whose version is >= version 4. It's convenient to keep this close
* by, rather than accessing the metapage repeatedly.
*
+ * safededup is set to indicate that index may use dynamic deduplication
+ * safely (index storage parameter separately indicates if deduplication is
+ * currently in use). This is also a property of the index relation rather
+ * than an indexscan that is kept around for convenience.
+ *
* anynullkeys indicates if any of the keys had NULL value when scankey was
* built from index tuple (note that already-truncated tuple key attributes
* set NULL as a placeholder key value, which also affects value of
@@ -469,6 +694,7 @@ typedef BTStackData *BTStack;
typedef struct BTScanInsertData
{
bool heapkeyspace;
+ bool safededup;
bool anynullkeys;
bool nextkey;
bool pivotsearch;
@@ -499,6 +725,13 @@ typedef struct BTInsertStateData
/* Buffer containing leaf page we're likely to insert itup on */
Buffer buf;
+ /*
+ * If _bt_binsrch_insert() found the location inside existing posting
+ * list, save the position inside the list. This will be -1 in rare cases
+ * where the overlapping posting list is LP_DEAD.
+ */
+ int postingoff;
+
/*
* Cache of bounds within the current buffer. Only used for insertions
* where _bt_check_unique is called. See _bt_binsrch_insert and
@@ -534,7 +767,10 @@ typedef BTInsertStateData *BTInsertState;
* If we are doing an index-only scan, we save the entire IndexTuple for each
* matched item, otherwise only its heap TID and offset. The IndexTuples go
* into a separate workspace array; each BTScanPosItem stores its tuple's
- * offset within that array.
+ * offset within that array. Posting list tuples store a "base" tuple once,
+ * allowing the same key to be returned for each logical tuple associated
+ * with the physical posting list tuple (i.e. for each TID from the posting
+ * list).
*/
typedef struct BTScanPosItem /* what we remember about each match */
@@ -563,9 +799,12 @@ typedef struct BTScanPosData
/*
* If we are doing an index-only scan, nextTupleOffset is the first free
- * location in the associated tuple storage workspace.
+ * location in the associated tuple storage workspace. Posting list
+ * tuples need postingTupleOffset to store the current location of the
+ * tuple that is returned multiple times.
*/
int nextTupleOffset;
+ int postingTupleOffset;
/*
* The items array is always ordered in index order (ie, increasing
@@ -578,7 +817,7 @@ typedef struct BTScanPosData
int lastItem; /* last valid index in items[] */
int itemIndex; /* current index in items[] */
- BTScanPosItem items[MaxIndexTuplesPerPage]; /* MUST BE LAST */
+ BTScanPosItem items[MaxPostingIndexTuplesPerPage]; /* MUST BE LAST */
} BTScanPosData;
typedef BTScanPosData *BTScanPos;
@@ -725,6 +964,22 @@ extern void _bt_parallel_release(IndexScanDesc scan, BlockNumber scan_page);
extern void _bt_parallel_done(IndexScanDesc scan);
extern void _bt_parallel_advance_array_keys(IndexScanDesc scan);
+/*
+ * prototypes for functions in nbtdedup.c
+ */
+extern void _bt_dedup_one_page(Relation rel, Buffer buffer, Relation heapRel,
+ IndexTuple newitem, Size newitemsz,
+ bool checkingunique);
+extern void _bt_dedup_start_pending(BTDedupState *state, IndexTuple base,
+ OffsetNumber base_off);
+extern bool _bt_dedup_save_htid(BTDedupState *state, IndexTuple itup);
+extern Size _bt_dedup_finish_pending(Buffer buffer, BTDedupState *state,
+ bool need_wal);
+extern IndexTuple _bt_form_posting(IndexTuple tuple, ItemPointer htids,
+ int nhtids);
+extern IndexTuple _bt_swap_posting(IndexTuple newitem, IndexTuple oposting,
+ OffsetNumber postingoff);
+
/*
* prototypes for functions in nbtinsert.c
*/
@@ -743,7 +998,8 @@ extern OffsetNumber _bt_findsplitloc(Relation rel, Page page,
/*
* prototypes for functions in nbtpage.c
*/
-extern void _bt_initmetapage(Page page, BlockNumber rootbknum, uint32 level);
+extern void _bt_initmetapage(Page page, BlockNumber rootbknum, uint32 level,
+ bool safededup);
extern void _bt_update_meta_cleanup_info(Relation rel,
TransactionId oldestBtpoXact, float8 numHeapTuples);
extern void _bt_upgrademetapage(Page page);
@@ -751,6 +1007,7 @@ extern Buffer _bt_getroot(Relation rel, int access);
extern Buffer _bt_gettrueroot(Relation rel);
extern int _bt_getrootheight(Relation rel);
extern bool _bt_heapkeyspace(Relation rel);
+extern bool _bt_safededup(Relation rel);
extern void _bt_checkpage(Relation rel, Buffer buf);
extern Buffer _bt_getbuf(Relation rel, BlockNumber blkno, int access);
extern Buffer _bt_relandgetbuf(Relation rel, Buffer obuf,
@@ -762,6 +1019,8 @@ extern void _bt_delitems_delete(Relation rel, Buffer buf,
OffsetNumber *itemnos, int nitems, Relation heapRel);
extern void _bt_delitems_vacuum(Relation rel, Buffer buf,
OffsetNumber *itemnos, int nitems,
+ OffsetNumber *updateitemnos,
+ IndexTuple *updated, int nupdateable,
BlockNumber lastBlockVacuumed);
extern int _bt_pagedel(Relation rel, Buffer buf);
@@ -812,6 +1071,7 @@ extern bool _bt_check_natts(Relation rel, bool heapkeyspace, Page page,
OffsetNumber offnum);
extern void _bt_check_third_page(Relation rel, Relation heap,
bool needheaptidspace, Page page, IndexTuple newtup);
+extern bool _bt_opclasses_support_dedup(Relation index);
/*
* prototypes for functions in nbtvalidate.c
diff --git a/src/include/access/nbtxlog.h b/src/include/access/nbtxlog.h
index 91b9ee00cf..b21e6f8082 100644
--- a/src/include/access/nbtxlog.h
+++ b/src/include/access/nbtxlog.h
@@ -28,7 +28,8 @@
#define XLOG_BTREE_INSERT_META 0x20 /* same, plus update metapage */
#define XLOG_BTREE_SPLIT_L 0x30 /* add index tuple with split */
#define XLOG_BTREE_SPLIT_R 0x40 /* as above, new item on right */
-/* 0x50 and 0x60 are unused */
+#define XLOG_BTREE_DEDUP_PAGE 0x50 /* deduplicate tuples on leaf page */
+/* 0x60 is unused */
#define XLOG_BTREE_DELETE 0x70 /* delete leaf index tuples for a page */
#define XLOG_BTREE_UNLINK_PAGE 0x80 /* delete a half-dead page */
#define XLOG_BTREE_UNLINK_PAGE_META 0x90 /* same, and update metapage */
@@ -53,6 +54,7 @@ typedef struct xl_btree_metadata
uint32 fastlevel;
TransactionId oldest_btpo_xact;
float8 last_cleanup_num_heap_tuples;
+ bool btm_safededup;
} xl_btree_metadata;
/*
@@ -61,16 +63,21 @@ typedef struct xl_btree_metadata
* This data record is used for INSERT_LEAF, INSERT_UPPER, INSERT_META.
* Note that INSERT_META implies it's not a leaf page.
*
- * Backup Blk 0: original page (data contains the inserted tuple)
+ * Backup Blk 0: original page (data contains the inserted tuple);
+ * if postingoff is set, this started out as an insertion
+ * into an existing posting tuple at the offset before
+ * offnum (i.e. it's a posting list split). (REDO will
+ * have to update split posting list, too.)
* Backup Blk 1: child's left sibling, if INSERT_UPPER or INSERT_META
* Backup Blk 2: xl_btree_metadata, if INSERT_META
*/
typedef struct xl_btree_insert
{
OffsetNumber offnum;
+ OffsetNumber postingoff;
} xl_btree_insert;
-#define SizeOfBtreeInsert (offsetof(xl_btree_insert, offnum) + sizeof(OffsetNumber))
+#define SizeOfBtreeInsert (offsetof(xl_btree_insert, postingoff) + sizeof(OffsetNumber))
/*
* On insert with split, we save all the items going into the right sibling
@@ -91,9 +98,19 @@ typedef struct xl_btree_insert
*
* Backup Blk 0: original page / new left page
*
- * The left page's data portion contains the new item, if it's the _L variant.
- * An IndexTuple representing the high key of the left page must follow with
- * either variant.
+ * The left page's data portion contains the new item, if it's the _L variant
+ * (though _R variant page split records with a posting list split sometimes
+ * need to include newitem). An IndexTuple representing the high key of the
+ * left page must follow in all cases.
+ *
+ * The newitem is actually an "original" newitem when a posting list split
+ * occurs that requires that the original posting list be updated in passing.
+ * Recovery recognizes this case when postingoff is set, and must use the
+ * posting offset to do an in-place update of the existing posting list that
+ * was actually split, and change the newitem to the "final" newitem. This
+ * corresponds to the xl_btree_insert postingoff-is-set case. postingoff
+ * won't be set when a posting list split occurs where both original posting
+ * list and newitem go on the right page.
*
* Backup Blk 1: new right page
*
@@ -111,10 +128,26 @@ typedef struct xl_btree_split
{
uint32 level; /* tree level of page being split */
OffsetNumber firstright; /* first item moved to right page */
- OffsetNumber newitemoff; /* new item's offset (useful for _L variant) */
+ OffsetNumber newitemoff; /* new item's offset */
+ OffsetNumber postingoff; /* offset inside orig posting tuple */
} xl_btree_split;
-#define SizeOfBtreeSplit (offsetof(xl_btree_split, newitemoff) + sizeof(OffsetNumber))
+#define SizeOfBtreeSplit (offsetof(xl_btree_split, postingoff) + sizeof(OffsetNumber))
+
+/*
+ * When a page is deduplicated, consecutive groups of tuples with equal keys
+ * are merged together into posting list tuples.
+ *
+ * The WAL record represents the interval that describes the posting tuple
+ * that should be added to the page.
+ */
+typedef struct xl_btree_dedup
+{
+ OffsetNumber baseoff;
+ OffsetNumber nitems;
+} xl_btree_dedup;
+
+#define SizeOfBtreeDedup (offsetof(xl_btree_dedup, nitems) + sizeof(OffsetNumber))
/*
* This is what we need to know about delete of individual leaf index tuples.
@@ -166,16 +199,27 @@ typedef struct xl_btree_reuse_page
* block numbers aren't given.
*
* Note that the *last* WAL record in any vacuum of an index is allowed to
- * have a zero length array of offsets. Earlier records must have at least one.
+ * have a zero length array of target offsets (i.e. no deletes or updates).
+ * Earlier records must have at least one.
*/
typedef struct xl_btree_vacuum
{
BlockNumber lastBlockVacuumed;
- /* TARGET OFFSET NUMBERS FOLLOW */
+ /*
+ * This field helps us find the beginning of the updated versions of tuples,
+ * which follow the array of offset numbers. It is needed when a posting
+ * list is vacuumed without killing all of its logical tuples.
+ */
+ uint32 nupdated;
+ uint32 ndeleted;
+
+ /* UPDATED TARGET OFFSET NUMBERS FOLLOW (if any) */
+ /* UPDATED TUPLES TO ADD BACK FOLLOW (if any) */
+ /* DELETED TARGET OFFSET NUMBERS FOLLOW (if any) */
} xl_btree_vacuum;
-#define SizeOfBtreeVacuum (offsetof(xl_btree_vacuum, lastBlockVacuumed) + sizeof(BlockNumber))
+#define SizeOfBtreeVacuum (offsetof(xl_btree_vacuum, ndeleted) + sizeof(uint32))
/*
* This is what we need to know about marking an empty branch for deletion.
@@ -256,6 +300,8 @@ typedef struct xl_btree_newroot
extern void btree_redo(XLogReaderState *record);
extern void btree_desc(StringInfo buf, XLogReaderState *record);
extern const char *btree_identify(uint8 info);
+extern void btree_xlog_startup(void);
+extern void btree_xlog_cleanup(void);
extern void btree_mask(char *pagedata, BlockNumber blkno);
#endif /* NBTXLOG_H */
diff --git a/src/include/access/rmgrlist.h b/src/include/access/rmgrlist.h
index 3c0db2ccf5..2b8c6c7fc8 100644
--- a/src/include/access/rmgrlist.h
+++ b/src/include/access/rmgrlist.h
@@ -36,7 +36,7 @@ PG_RMGR(RM_RELMAP_ID, "RelMap", relmap_redo, relmap_desc, relmap_identify, NULL,
PG_RMGR(RM_STANDBY_ID, "Standby", standby_redo, standby_desc, standby_identify, NULL, NULL, NULL)
PG_RMGR(RM_HEAP2_ID, "Heap2", heap2_redo, heap2_desc, heap2_identify, NULL, NULL, heap_mask)
PG_RMGR(RM_HEAP_ID, "Heap", heap_redo, heap_desc, heap_identify, NULL, NULL, heap_mask)
-PG_RMGR(RM_BTREE_ID, "Btree", btree_redo, btree_desc, btree_identify, NULL, NULL, btree_mask)
+PG_RMGR(RM_BTREE_ID, "Btree", btree_redo, btree_desc, btree_identify, btree_xlog_startup, btree_xlog_cleanup, btree_mask)
PG_RMGR(RM_HASH_ID, "Hash", hash_redo, hash_desc, hash_identify, NULL, NULL, hash_mask)
PG_RMGR(RM_GIN_ID, "Gin", gin_redo, gin_desc, gin_identify, gin_xlog_startup, gin_xlog_cleanup, gin_mask)
PG_RMGR(RM_GIST_ID, "Gist", gist_redo, gist_desc, gist_identify, gist_xlog_startup, gist_xlog_cleanup, gist_mask)
diff --git a/src/backend/access/common/reloptions.c b/src/backend/access/common/reloptions.c
index b5072c00fe..e6448e4a86 100644
--- a/src/backend/access/common/reloptions.c
+++ b/src/backend/access/common/reloptions.c
@@ -158,6 +158,15 @@ static relopt_bool boolRelOpts[] =
},
true
},
+ {
+ {
+ "deduplication",
+ "Enables deduplication on btree index leaf pages",
+ RELOPT_KIND_BTREE,
+ ShareUpdateExclusiveLock
+ },
+ true
+ },
/* list terminator */
{{NULL}}
};
@@ -1513,8 +1522,6 @@ default_reloptions(Datum reloptions, bool validate, relopt_kind kind)
offsetof(StdRdOptions, user_catalog_table)},
{"parallel_workers", RELOPT_TYPE_INT,
offsetof(StdRdOptions, parallel_workers)},
- {"vacuum_cleanup_index_scale_factor", RELOPT_TYPE_REAL,
- offsetof(StdRdOptions, vacuum_cleanup_index_scale_factor)},
{"vacuum_index_cleanup", RELOPT_TYPE_BOOL,
offsetof(StdRdOptions, vacuum_index_cleanup)},
{"vacuum_truncate", RELOPT_TYPE_BOOL,
diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
index 2599b5d342..6e1dc596e1 100644
--- a/src/backend/access/index/genam.c
+++ b/src/backend/access/index/genam.c
@@ -276,6 +276,10 @@ BuildIndexValueDescription(Relation indexRelation,
/*
* Get the latestRemovedXid from the table entries pointed at by the index
* tuples being deleted.
+ *
+ * Note: index access methods that don't consistently use the standard
+ * IndexTuple + heap TID item pointer representation will need to provide
+ * their own version of this function.
*/
TransactionId
index_compute_xid_horizon_for_tuples(Relation irel,
diff --git a/src/backend/access/nbtree/Makefile b/src/backend/access/nbtree/Makefile
index 9aab9cf64a..8140b08777 100644
--- a/src/backend/access/nbtree/Makefile
+++ b/src/backend/access/nbtree/Makefile
@@ -12,7 +12,7 @@ subdir = src/backend/access/nbtree
top_builddir = ../../../..
include $(top_builddir)/src/Makefile.global
-OBJS = nbtcompare.o nbtinsert.o nbtpage.o nbtree.o nbtsearch.o \
+OBJS = nbtcompare.o nbtdedup.o nbtinsert.o nbtpage.o nbtree.o nbtsearch.o \
nbtsplitloc.o nbtutils.o nbtsort.o nbtvalidate.o nbtxlog.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index 6db203e75c..54cb9db49d 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -432,7 +432,10 @@ because we allow LP_DEAD to be set with only a share lock (it's exactly
like a hint bit for a heap tuple), but physically removing tuples requires
exclusive lock. In the current code we try to remove LP_DEAD tuples when
we are otherwise faced with having to split a page to do an insertion (and
-hence have exclusive lock on it already).
+hence have exclusive lock on it already). Deduplication can also prevent
+a page split, but removing LP_DEAD tuples is the preferred approach.
+(Note that posting list tuples can only have their LP_DEAD bit set when
+every "logical" tuple represented within the posting list is known dead.)
This leaves the index in a state where it has no entry for a dead tuple
that still exists in the heap. This is not a problem for the current
@@ -710,6 +713,75 @@ the fallback strategy assumes that duplicates are mostly inserted in
ascending heap TID order. The page is split in a way that leaves the left
half of the page mostly full, and the right half of the page mostly empty.
+Notes about deduplication
+-------------------------
+
+We deduplicate non-pivot tuples in non-unique indexes to reduce storage
+overhead, and to avoid or at least delay page splits. Deduplication alters
+the physical representation of tuples without changing the logical contents
+of the index, and without adding overhead to read queries. Non-pivot
+tuples are folded together into a single physical tuple with a posting list
+(a simple array of heap TIDs with the standard item pointer format).
+Deduplication is always applied lazily, at the point where it would
+otherwise be necessary to perform a page split. It occurs only after
+LP_DEAD items have been removed, as our last line of defense against
+splitting a leaf page. We can set the LP_DEAD bit with posting list
+tuples, though only when all table tuples are known dead. (Bitmap scans
+cannot perform LP_DEAD bit setting, and are the common case with indexes
+that contain lots of duplicates, so this downside is considered
+acceptable.)
+
+Large groups of logical duplicates tend to appear together on the same leaf
+page due to the special duplicate logic used when choosing a split point.
+This facilitates lazy/dynamic deduplication. Deduplication can reliably
+deduplicate a large localized group of duplicates before it can span
+multiple leaf pages. Posting list tuples are subject to the same 1/3 of a
+page restriction as any other tuple.
+
+Lazy deduplication allows the page space accounting used during page splits
+to have absolutely minimal special case logic for posting lists. A posting
+list can be thought of as extra payload that suffix truncation will
+reliably truncate away as needed during page splits, just like non-key
+columns from an INCLUDE index tuple. An incoming tuple (which might cause
+a page split) can always be thought of as a non-posting-list tuple that
+must be inserted alongside existing items, without needing to consider
+deduplication. Most of the time, that's what actually happens: incoming
+tuples are either not duplicates, or are duplicates with a heap TID that
+doesn't overlap with any existing posting list tuple. When the incoming
+tuple really does overlap with an existing posting list, a posting list
+split is performed. Posting list splits work in a way that more or less
+preserves the illusion that all incoming tuples do not need to be merged
+with any existing posting list tuple.
+
+Posting list splits work by "overriding" the details of the incoming tuple.
+The heap TID of the incoming tuple is altered to make it match the
+rightmost heap TID from the existing/originally overlapping posting list.
+The offset number that the new/incoming tuple is to be inserted at is
+incremented so that it will be inserted to the right of the existing
+posting list. The insertion (or page split) operation that completes the
+insert does one extra step: an in-place update of the posting list. The
+update changes the posting list such that the "true" heap TID from the
+original incoming tuple is now contained in the posting list. We make
+space in the posting list by removing the heap TID that became the new
+item. The size of the posting list won't change, and so the page split
+space accounting does not need to care about posting lists. Also, overall
+space utilization is improved by keeping existing posting lists large.
+
+The representation of posting lists is identical to the posting lists used
+by GIN, so it would be straightforward to apply GIN's varbyte encoding
+compression scheme to individual posting lists. Posting list compression
+would break the assumptions made by posting list splits about page space
+accounting, though, so it's not clear how compression could be integrated
+with nbtree. Besides, posting list compression does not offer a compelling
+trade-off for nbtree, since in general nbtree is optimized for consistent
+performance with many concurrent readers and writers. A major goal of
+nbtree's lazy approach to deduplication is to limit the performance impact
+of deduplication with random updates. Even concurrent append-only inserts
+of the same key value will tend to have inserts of individual index tuples
+in an order that doesn't quite match heap TID order. In general, delaying
+deduplication avoids many unnecessary posting list splits, and minimizes
+page level fragmentation.
+
Notes About Data Representation
-------------------------------
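The posting list split described in the README hunk above can be illustrated with a small standalone sketch. This is not code from the patch: SketchTid, sketch_tid_cmp() and sketch_swap_posting() are hypothetical names, and a plain struct stands in for ItemPointerData. It assumes the posting list is sorted and that the incoming TID belongs at a non-zero position "pos" inside it.

```c
#include <assert.h>
#include <string.h>

/* Simplified stand-in for a heap TID (hypothetical, not ItemPointerData) */
typedef struct SketchTid
{
	unsigned	block;
	unsigned	offset;
} SketchTid;

/* Three-way comparison in (block, offset) order */
static int
sketch_tid_cmp(SketchTid a, SketchTid b)
{
	if (a.block != b.block)
		return a.block < b.block ? -1 : 1;
	if (a.offset != b.offset)
		return a.offset < b.offset ? -1 : 1;
	return 0;
}

/*
 * Place *newtid at position pos of the sorted array posting[0..n-1],
 * shifting later TIDs one place right and evicting the rightmost TID.
 * On return, *newtid holds the evicted TID, which is what would be
 * inserted as a plain tuple to the right of the posting list.  The
 * array length never changes.
 */
static void
sketch_swap_posting(SketchTid *posting, int n, int pos, SketchTid *newtid)
{
	SketchTid	evicted;

	assert(n > 1 && pos > 0 && pos < n);
	assert(sketch_tid_cmp(posting[pos - 1], *newtid) < 0 &&
		   sketch_tid_cmp(*newtid, posting[pos]) < 0);

	evicted = posting[n - 1];
	memmove(&posting[pos + 1], &posting[pos],
			(size_t) (n - pos - 1) * sizeof(SketchTid));
	posting[pos] = *newtid;
	*newtid = evicted;
}
```

Because the array keeps its length, the space accounting for the page only ever sees one extra plain tuple, which is the property the README relies on.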
diff --git a/src/backend/access/nbtree/nbtdedup.c b/src/backend/access/nbtree/nbtdedup.c
new file mode 100644
index 0000000000..c8a63f9617
--- /dev/null
+++ b/src/backend/access/nbtree/nbtdedup.c
@@ -0,0 +1,633 @@
+/*-------------------------------------------------------------------------
+ *
+ * nbtdedup.c
+ * Deduplicate items in Lehman and Yao btrees for Postgres.
+ *
+ * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ * src/backend/access/nbtree/nbtdedup.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/nbtree.h"
+#include "access/nbtxlog.h"
+#include "miscadmin.h"
+#include "utils/rel.h"
+
+
+/*
+ * Try to deduplicate items to free at least enough space to avoid a page
+ * split. This function should be called during insertion, only after LP_DEAD
+ * items were removed by _bt_vacuum_one_page() to prevent a page split.
+ * (We'll have to kill LP_DEAD items here when the page's BTP_HAS_GARBAGE hint
+ * was not set, but that should be rare.)
+ *
+ * The strategy for !checkingunique callers is to perform as much
+ * deduplication as possible to free as much space as possible now, since
+ * making it harder to set LP_DEAD bits is considered an acceptable price for
+ * not having to deduplicate the same page many times. It is unlikely that
+ * the items on the page will have their LP_DEAD bit set in the future, since
+ * that hasn't happened before now (besides, entire posting lists can still
+ * have their LP_DEAD bit set).
+ *
+ * The strategy for checkingunique callers is rather different, since the
+ * overall goal is different. Deduplication cooperates with and enhances
+ * garbage collection, especially the LP_DEAD bit setting that takes place in
+ * _bt_check_unique(). Deduplication does as little as possible while still
+ * preventing a page split for caller, since it's less likely that posting
+ * lists will have their LP_DEAD bit set. Deduplication avoids creating new
+ * posting lists with only two heap TIDs, and also avoids creating new posting
+ * lists from an existing posting list. Deduplication is only useful when it
+ * delays a page split long enough for garbage collection to prevent the page
+ * split altogether. checkingunique deduplication can make all the difference
+ * in cases where VACUUM keeps up with dead index tuples, but "recently dead"
+ * index tuples are still numerous enough to cause page splits that are truly
+ * unnecessary.
+ *
+ * Note: If newitem contains NULL values in key attributes, caller will be
+ * !checkingunique even when rel is a unique index. The page in question will
+ * usually have many existing items with NULLs.
+ */
+void
+_bt_dedup_one_page(Relation rel, Buffer buffer, Relation heapRel,
+ IndexTuple newitem, Size newitemsz, bool checkingunique)
+{
+ OffsetNumber offnum,
+ minoff,
+ maxoff;
+ Page page = BufferGetPage(buffer);
+ BTPageOpaque oopaque;
+ BTDedupState *state = NULL;
+ int natts = IndexRelationGetNumberOfAttributes(rel);
+ OffsetNumber deletable[MaxIndexTuplesPerPage];
+ bool minimal = checkingunique;
+ int ndeletable = 0;
+ Size pagesaving = 0;
+
+ oopaque = (BTPageOpaque) PageGetSpecialPointer(page);
+ /* init deduplication state needed to build posting tuples */
+ state = (BTDedupState *) palloc(sizeof(BTDedupState));
+ state->rel = rel;
+
+ state->maxitemsize = BTMaxItemSize(page);
+ state->newitem = newitem;
+ state->checkingunique = checkingunique;
+ state->skippedbase = InvalidOffsetNumber;
+ /* Metadata about current pending posting list */
+ state->htids = NULL;
+ state->nhtids = 0;
+ state->nitems = 0;
+ state->alltupsize = 0;
+ state->overlap = false;
+ /* Metadata about base tuple of current pending posting list */
+ state->base = NULL;
+ state->baseoff = InvalidOffsetNumber;
+ state->basetupsize = 0;
+
+ minoff = P_FIRSTDATAKEY(oopaque);
+ maxoff = PageGetMaxOffsetNumber(page);
+
+ /*
+ * Delete dead tuples if any. We cannot simply skip them in the loop
+ * below, because it's necessary to generate a special WAL record
+ * containing such tuples to compute latestRemovedXid on a standby server
+ * later.
+ *
+ * This should not affect performance, since it can only happen in a rare
+ * situation when BTP_HAS_GARBAGE flag was not set and _bt_vacuum_one_page
+ * was not called, or _bt_vacuum_one_page didn't remove all dead items.
+ */
+ for (offnum = minoff;
+ offnum <= maxoff;
+ offnum = OffsetNumberNext(offnum))
+ {
+ ItemId itemid = PageGetItemId(page, offnum);
+
+ if (ItemIdIsDead(itemid))
+ deletable[ndeletable++] = offnum;
+ }
+
+ if (ndeletable > 0)
+ {
+ /*
+ * Skip deduplication in rare cases where there were LP_DEAD items
+ * encountered here when that frees sufficient space for caller to
+ * avoid a page split
+ */
+ _bt_delitems_delete(rel, buffer, deletable, ndeletable, heapRel);
+ if (PageGetFreeSpace(page) >= newitemsz)
+ {
+ pfree(state);
+ return;
+ }
+
+ /* Continue with deduplication */
+ minoff = P_FIRSTDATAKEY(oopaque);
+ maxoff = PageGetMaxOffsetNumber(page);
+ }
+
+ /* Make sure that page won't have garbage flag set */
+ oopaque->btpo_flags &= ~BTP_HAS_GARBAGE;
+
+ /* Passed-in newitemsz is MAXALIGNED but does not include line pointer */
+ newitemsz += sizeof(ItemIdData);
+ /* Conservatively size array */
+ state->htids = palloc(state->maxitemsize);
+
+ /*
+ * Iterate over tuples on the page, try to deduplicate them into posting
+ * lists, updating the page in place. NOTE: It's essential to reassess
+ * the max offset on each iteration, since it will change as items are
+ * deduplicated.
+ */
+ offnum = minoff;
+retry:
+ while (offnum <= PageGetMaxOffsetNumber(page))
+ {
+ ItemId itemid = PageGetItemId(page, offnum);
+ IndexTuple itup = (IndexTuple) PageGetItem(page, itemid);
+
+ Assert(!ItemIdIsDead(itemid));
+
+ if (state->nitems == 0)
+ {
+ /*
+ * No previous/base tuple for the data item -- use the data item
+ * as base tuple of pending posting list
+ */
+ _bt_dedup_start_pending(state, itup, offnum);
+ }
+ else if (_bt_keep_natts_fast(rel, state->base, itup) > natts &&
+ _bt_dedup_save_htid(state, itup))
+ {
+ /*
+ * Tuple is equal to base tuple of pending posting list. Heap
+ * TID(s) for itup have been saved in state. The next iteration
+ * will also end up here if it's possible to merge the next tuple
+ * into the same pending posting list.
+ */
+ }
+ else
+ {
+ /*
+ * Tuple is not equal to pending posting list tuple, or
+ * _bt_dedup_save_htid() opted to not merge current item into
+ * pending posting list for some other reason (e.g., adding more
+ * TIDs would have caused posting list to exceed BTMaxItemSize()
+ * limit).
+ *
+ * If state contains pending posting list with more than one item,
+ * form new posting tuple, and update the page. Otherwise, reset
+ * the state and move on.
+ */
+ pagesaving += _bt_dedup_finish_pending(buffer, state,
+ RelationNeedsWAL(rel));
+
+ /*
+ * When caller is a checkingunique caller and we have deduplicated
+ * enough to avoid a page split, do minimal deduplication in case
+ * the remaining items are about to be marked dead within
+ * _bt_check_unique().
+ */
+ if (minimal && pagesaving >= newitemsz)
+ break;
+
+ /*
+ * Next iteration starts immediately after base tuple offset (this
+ * will be the next offset on the page when we didn't modify the
+ * page)
+ */
+ offnum = state->baseoff;
+ }
+
+ offnum = OffsetNumberNext(offnum);
+ }
+
+ /* Handle the last item when pending posting list is not empty */
+ if (state->nitems != 0)
+ pagesaving += _bt_dedup_finish_pending(buffer, state,
+ RelationNeedsWAL(rel));
+
+ if (pagesaving < newitemsz && state->skippedbase != InvalidOffsetNumber)
+ {
+ /*
+ * Didn't free enough space for new item in first checkingunique pass.
+ * Try making a second pass over the page, this time starting from the
+ * first candidate posting list base offset that was skipped over in
+ * the first pass (only do a second pass when this actually happened).
+ *
+ * The second pass over the page may deduplicate items that were
+ * initially passed over due to concerns about limiting the
+ * effectiveness of LP_DEAD bit setting within _bt_check_unique().
+ * Note that the second pass will still stop deduplicating as soon as
+ * enough space has been freed to avoid an immediate page split.
+ */
+ Assert(state->checkingunique);
+ offnum = state->skippedbase;
+
+ state->checkingunique = false;
+ state->skippedbase = InvalidOffsetNumber;
+ state->alltupsize = 0;
+ state->nitems = 0;
+ state->base = NULL;
+ state->baseoff = InvalidOffsetNumber;
+ state->basetupsize = 0;
+ goto retry;
+ }
+
+ /* Local space accounting should agree with page accounting */
+ Assert(pagesaving < newitemsz || PageGetExactFreeSpace(page) >= newitemsz);
+
+ /* be tidy */
+ pfree(state->htids);
+ pfree(state);
+}
+
+/*
+ * Create a new pending posting list tuple based on caller's tuple.
+ *
+ * Every tuple processed by the deduplication routines either becomes the base
+ * tuple for a posting list, or gets its heap TID(s) accepted into a pending
+ * posting list. A tuple that starts out as the base tuple for a posting list
+ * will only actually be rewritten within _bt_dedup_finish_pending() when
+ * there was at least one successful call to _bt_dedup_save_htid().
+ */
+void
+_bt_dedup_start_pending(BTDedupState *state, IndexTuple base,
+ OffsetNumber baseoff)
+{
+ Assert(state->nhtids == 0);
+ Assert(state->nitems == 0);
+
+ /*
+ * Copy heap TIDs from new base tuple for new candidate posting list into
+ * htids array. Assume that we'll eventually create a new posting tuple by
+ * merging later tuples with this existing one, though we may not.
+ */
+ if (!BTreeTupleIsPosting(base))
+ {
+ memcpy(state->htids, base, sizeof(ItemPointerData));
+ state->nhtids = 1;
+ /* Save size of tuple without any posting list */
+ state->basetupsize = IndexTupleSize(base);
+ }
+ else
+ {
+ int nposting;
+
+ nposting = BTreeTupleGetNPosting(base);
+ memcpy(state->htids, BTreeTupleGetPosting(base),
+ sizeof(ItemPointerData) * nposting);
+ state->nhtids = nposting;
+ /* Save size of tuple without any posting list */
+ state->basetupsize = BTreeTupleGetPostingOffset(base);
+ }
+
+ /*
+ * Save new base tuple itself -- it'll be needed if we actually create a
+ * new posting list from new pending posting list.
+ *
+ * Must maintain size of all tuples (including line pointer overhead) to
+ * calculate space savings on page within _bt_dedup_finish_pending().
+ * Also, save number of base tuple logical tuples so that we can save
+ * cycles in the common case where an existing posting list can't or won't
+ * be merged with other tuples on the page.
+ */
+ state->nitems = 1;
+ state->base = base;
+ state->baseoff = baseoff;
+ state->alltupsize = MAXALIGN(IndexTupleSize(base)) + sizeof(ItemIdData);
+ /* Also save baseoff in pending state for interval */
+ state->interval.baseoff = state->baseoff;
+ state->overlap = false;
+ if (state->newitem)
+ {
+ /* Might overlap with new item -- mark it as possible if it is */
+ state->overlap = ItemPointerCompare(BTreeTupleGetHeapTID(base),
+ BTreeTupleGetHeapTID(state->newitem)) < 0;
+ }
+}
+
+/*
+ * Save itup heap TID(s) into pending posting list where possible.
+ *
+ * Returns bool indicating if the pending posting list managed by state has
+ * itup's heap TID(s) saved. When this is false, enlarging the pending
+ * posting list by the required amount would exceed the maxitemsize limit, so
+ * caller must finish the pending posting list tuple. (Generally itup becomes
+ * the base tuple of caller's new pending posting list).
+ */
+bool
+_bt_dedup_save_htid(BTDedupState *state, IndexTuple itup)
+{
+ int nhtids;
+ ItemPointer htids;
+ Size mergedtupsz;
+
+ if (!BTreeTupleIsPosting(itup))
+ {
+ nhtids = 1;
+ htids = &itup->t_tid;
+ }
+ else
+ {
+ nhtids = BTreeTupleGetNPosting(itup);
+ htids = BTreeTupleGetPosting(itup);
+ }
+
+ /*
+ * Don't append (have caller finish pending posting list as-is) if
+ * appending heap TID(s) from itup would put us over limit
+ */
+ mergedtupsz = MAXALIGN(state->basetupsize +
+ (state->nhtids + nhtids) *
+ sizeof(ItemPointerData));
+
+ if (mergedtupsz > state->maxitemsize)
+ return false;
+
+ /* Don't merge existing posting lists with checkingunique */
+ if (state->checkingunique &&
+ (BTreeTupleIsPosting(state->base) || nhtids > 1))
+ {
+ /* May begin here if second pass over page is required */
+ if (state->skippedbase == InvalidOffsetNumber)
+ state->skippedbase = state->baseoff;
+ return false;
+ }
+
+ if (state->overlap &&
+ ItemPointerCompare(BTreeTupleGetMaxHeapTID(itup),
+ BTreeTupleGetHeapTID(state->newitem)) > 0)
+ {
+ /*
+ * newitem has heap TID in the range of the would-be new posting
+ * list. Avoid an immediate posting list split for caller by having
+ * caller finish the pending posting list without itup.
+ */
+ if (_bt_keep_natts_fast(state->rel, state->newitem, itup) >
+ IndexRelationGetNumberOfAttributes(state->rel))
+ {
+ state->newitem = NULL; /* avoid unnecessary comparisons */
+ return false;
+ }
+ }
+
+ /*
+ * Save heap TIDs to pending posting list tuple -- itup can be merged into
+ * pending posting list
+ */
+ state->nitems++;
+ memcpy(state->htids + state->nhtids, htids,
+ sizeof(ItemPointerData) * nhtids);
+ state->nhtids += nhtids;
+ state->alltupsize += MAXALIGN(IndexTupleSize(itup)) + sizeof(ItemIdData);
+
+ return true;
+}
+
+/*
+ * Finalize pending posting list tuple, and add it to the page. Final tuple
+ * is based on saved base tuple, and saved list of heap TIDs.
+ *
+ * Returns space saving from deduplicating to make a new posting list tuple.
+ * Note that this includes line pointer overhead. This is zero in the case
+ * where no deduplication was possible.
+ */
+Size
+_bt_dedup_finish_pending(Buffer buffer, BTDedupState *state, bool need_wal)
+{
+ Size spacesaving = 0;
+ Page page = BufferGetPage(buffer);
+ int minimum = 2;
+
+ Assert(state->nitems > 0);
+ Assert(state->nitems <= state->nhtids);
+ Assert(state->interval.baseoff == state->baseoff);
+
+ /*
+ * Only create a posting list when at least 3 heap TIDs will appear in the
+ * checkingunique case (checkingunique strategy won't merge existing
+ * posting list tuples, so we know that the number of items here must also
+ * be the total number of heap TIDs). Creating a new posting list with
+ * only two heap TIDs won't even save enough space to fit another
+ * duplicate with the same key as the posting list. This is a bad
+ * trade-off if there is a chance that the LP_DEAD bit can be set for
+ * either existing tuple by putting off deduplication.
+ *
+ * (Note that a second pass over the page can deduplicate the item if that
+ * is truly the only way to avoid a page split for checkingunique caller)
+ */
+ Assert(!state->checkingunique || state->nitems == 1 ||
+ state->nhtids == state->nitems);
+ if (state->checkingunique)
+ {
+ minimum = 3;
+ /* May begin here if second pass over page is required */
+ if (state->nitems == 2 && state->skippedbase == InvalidOffsetNumber)
+ state->skippedbase = state->baseoff;
+ }
+
+ if (state->nitems >= minimum)
+ {
+ IndexTuple final;
+ Size finalsz;
+ OffsetNumber offnum;
+ OffsetNumber deletable[MaxOffsetNumber];
+ int ndeletable = 0;
+
+ /* find all tuples that will be replaced with this new posting tuple */
+ for (offnum = state->baseoff;
+ offnum < state->baseoff + state->nitems;
+ offnum = OffsetNumberNext(offnum))
+ deletable[ndeletable++] = offnum;
+
+ /* Form a tuple with a posting list */
+ final = _bt_form_posting(state->base, state->htids, state->nhtids);
+ finalsz = IndexTupleSize(final);
+ spacesaving = state->alltupsize - (finalsz + sizeof(ItemIdData));
+ /* Must have saved some space */
+ Assert(spacesaving > 0 && spacesaving < BLCKSZ);
+
+ /* Save final number of items for posting list */
+ state->interval.nitems = state->nitems;
+
+ Assert(finalsz <= state->maxitemsize);
+ Assert(finalsz == MAXALIGN(IndexTupleSize(final)));
+
+ START_CRIT_SECTION();
+
+ /* Delete items to replace */
+ PageIndexMultiDelete(page, deletable, ndeletable);
+ /* Insert posting tuple */
+ if (PageAddItem(page, (Item) final, finalsz, state->baseoff, false,
+ false) == InvalidOffsetNumber)
+ elog(ERROR, "deduplication failed to add tuple to page");
+
+ MarkBufferDirty(buffer);
+
+ /* Log deduplicated items */
+ if (need_wal)
+ {
+ XLogRecPtr recptr;
+ xl_btree_dedup xlrec_dedup;
+
+ xlrec_dedup.baseoff = state->interval.baseoff;
+ xlrec_dedup.nitems = state->interval.nitems;
+
+ XLogBeginInsert();
+ XLogRegisterBuffer(0, buffer, REGBUF_STANDARD);
+ XLogRegisterData((char *) &xlrec_dedup, SizeOfBtreeDedup);
+
+ recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_DEDUP_PAGE);
+
+ PageSetLSN(page, recptr);
+ }
+
+ END_CRIT_SECTION();
+
+ pfree(final);
+ }
+
+ /* Reset state for next pending posting list */
+ state->nhtids = 0;
+ state->nitems = 0;
+ state->alltupsize = 0;
+
+ return spacesaving;
+}
+
+/*
+ * Build a posting list tuple from a "base" index tuple and a list of heap
+ * TIDs for posting list.
+ *
+ * Caller's "htids" array must be sorted in ascending order. Any heap TIDs
+ * from caller's base tuple will not appear in returned posting list.
+ *
+ * If nhtids == 1, builds a non-posting tuple (posting list tuples can never
+ * have a single heap TID).
+ */
+IndexTuple
+_bt_form_posting(IndexTuple tuple, ItemPointer htids, int nhtids)
+{
+ uint32 keysize,
+ newsize = 0;
+ IndexTuple itup;
+
+ /* We only need key part of the tuple */
+ if (BTreeTupleIsPosting(tuple))
+ keysize = BTreeTupleGetPostingOffset(tuple);
+ else
+ keysize = IndexTupleSize(tuple);
+
+ Assert(nhtids > 0);
+
+ /* Add space needed for posting list */
+ if (nhtids > 1)
+ newsize = SHORTALIGN(keysize) + sizeof(ItemPointerData) * nhtids;
+ else
+ newsize = keysize;
+
+ newsize = MAXALIGN(newsize);
+ itup = palloc0(newsize);
+ memcpy(itup, tuple, keysize);
+ itup->t_info &= ~INDEX_SIZE_MASK;
+ itup->t_info |= newsize;
+
+ if (nhtids > 1)
+ {
+ /* Form posting tuple, fill posting fields */
+
+ itup->t_info |= INDEX_ALT_TID_MASK;
+ BTreeSetPostingMeta(itup, nhtids, SHORTALIGN(keysize));
+ /* Copy posting list into the posting tuple */
+ memcpy(BTreeTupleGetPosting(itup), htids,
+ sizeof(ItemPointerData) * nhtids);
+
+#ifdef USE_ASSERT_CHECKING
+ {
+ /* Assert that htid array is sorted and has unique TIDs */
+ ItemPointerData last;
+ ItemPointer current;
+
+ ItemPointerCopy(BTreeTupleGetHeapTID(itup), &last);
+
+ for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
+ {
+ current = BTreeTupleGetPostingN(itup, i);
+ Assert(ItemPointerCompare(current, &last) > 0);
+ ItemPointerCopy(current, &last);
+ }
+ }
+#endif
+ }
+ else
+ {
+ /* To finish building a non-posting tuple, copy TID from htids */
+ itup->t_info &= ~INDEX_ALT_TID_MASK;
+ ItemPointerCopy(htids, &itup->t_tid);
+ }
+
+ return itup;
+}
+
+/*
+ * Prepare for a posting list split by swapping heap TID in newitem with heap
+ * TID from original posting list (the 'oposting' heap TID located at offset
+ * 'postingoff').
+ *
+ * Returns new posting list tuple, which is palloc()'d in caller's context.
+ * This is guaranteed to be the same size as 'oposting'. Modified version of
+ * newitem is what caller actually inserts inside the critical section that
+ * also performs an in-place update of posting list.
+ *
+ * Explicit WAL-logging of newitem must use the original version of newitem in
+ * order to make it possible for our nbtxlog.c callers to correctly REDO
+ * original steps. (This approach avoids any explicit WAL-logging of a
+ * posting list tuple. This is important because posting lists are often much
+ * larger than plain tuples.)
+ */
+IndexTuple
+_bt_swap_posting(IndexTuple newitem, IndexTuple oposting,
+ OffsetNumber postingoff)
+{
+ int nhtids;
+ char *replacepos;
+ char *rightpos;
+ Size nbytes;
+ IndexTuple nposting;
+
+ Assert(BTreeTupleIsPosting(oposting));
+ nhtids = BTreeTupleGetNPosting(oposting);
+ Assert(postingoff < nhtids);
+
+ nposting = CopyIndexTuple(oposting);
+ replacepos = (char *) BTreeTupleGetPostingN(nposting, postingoff);
+ rightpos = replacepos + sizeof(ItemPointerData);
+ nbytes = (nhtids - postingoff - 1) * sizeof(ItemPointerData);
+
+ /*
+ * Move item pointers in posting list to make a gap for the new item's
+ * heap TID (shift TIDs one place to the right, losing original rightmost
+ * TID)
+ */
+ memmove(rightpos, replacepos, nbytes);
+
+ /* Fill the gap with the TID of the new item */
+ ItemPointerCopy(&newitem->t_tid, (ItemPointer) replacepos);
+
+ /* Copy original posting list's rightmost TID into new item */
+ ItemPointerCopy(BTreeTupleGetPostingN(oposting, nhtids - 1),
+ &newitem->t_tid);
+ Assert(ItemPointerCompare(BTreeTupleGetMaxHeapTID(nposting),
+ BTreeTupleGetHeapTID(newitem)) < 0);
+ Assert(BTreeTupleGetNPosting(nposting) == BTreeTupleGetNPosting(oposting));
+
+ return nposting;
+}
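The space accounting in _bt_dedup_finish_pending() and _bt_form_posting() can be checked with a small standalone calculation. The sketch below is illustrative only and not part of the patch: the sketch_* names are hypothetical, and the constants assume 8-byte MAXALIGN, a 6-byte heap TID and a 4-byte line pointer, which is typical of 64-bit builds but not guaranteed everywhere. It also assumes every merged tuple is a plain tuple of the same size.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed platform constants (typical, but not guaranteed) */
#define SKETCH_MAXALIGN(len)	(((size_t) (len) + 7) & ~(size_t) 7)
#define SKETCH_SHORTALIGN(len)	(((size_t) (len) + 1) & ~(size_t) 1)
#define SKETCH_TID_SIZE			6	/* assumed sizeof(ItemPointerData) */
#define SKETCH_LINE_POINTER		4	/* assumed sizeof(ItemIdData) */

/*
 * Size of a tuple carrying nhtids heap TIDs, following the arithmetic in
 * _bt_form_posting(): a single TID stays in the tuple header, while two or
 * more TIDs are appended after the SHORTALIGN'd key portion.
 */
static size_t
sketch_posting_size(size_t keysize, int nhtids)
{
	assert(nhtids > 0);
	if (nhtids == 1)
		return SKETCH_MAXALIGN(keysize);
	return SKETCH_MAXALIGN(SKETCH_SHORTALIGN(keysize) +
						   (size_t) nhtids * SKETCH_TID_SIZE);
}

/*
 * Page space freed by folding nitems plain tuples of size keysize into one
 * posting tuple, counting line pointer overhead on both sides.
 */
static size_t
sketch_space_saving(size_t keysize, int nitems)
{
	size_t		before;
	size_t		after;

	before = (size_t) nitems *
		(SKETCH_MAXALIGN(keysize) + SKETCH_LINE_POINTER);
	after = sketch_posting_size(keysize, nitems) + SKETCH_LINE_POINTER;
	return before - after;
}
```

Under these assumptions, two 16-byte duplicates free only 4 bytes when merged, far less than the 20 bytes another such duplicate needs. That is consistent with the comment about avoiding two-TID posting lists for checkingunique callers.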
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index b84bf1c3df..0a866b832e 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -47,10 +47,12 @@ static void _bt_insertonpg(Relation rel, BTScanInsert itup_key,
BTStack stack,
IndexTuple itup,
OffsetNumber newitemoff,
+ int postingoff,
bool split_only_page);
static Buffer _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf,
Buffer cbuf, OffsetNumber newitemoff, Size newitemsz,
- IndexTuple newitem);
+ IndexTuple newitem, IndexTuple orignewitem,
+ IndexTuple nposting, OffsetNumber postingoff);
static void _bt_insert_parent(Relation rel, Buffer buf, Buffer rbuf,
BTStack stack, bool is_root, bool is_only);
static bool _bt_pgaddtup(Page page, Size itemsize, IndexTuple itup,
@@ -61,7 +63,8 @@ static void _bt_vacuum_one_page(Relation rel, Buffer buffer, Relation heapRel);
* _bt_doinsert() -- Handle insertion of a single index tuple in the tree.
*
* This routine is called by the public interface routine, btinsert.
- * By here, itup is filled in, including the TID.
+ * By here, itup is filled in, including the TID. Caller should be
+ * prepared for us to scribble on 'itup'.
*
* If checkUnique is UNIQUE_CHECK_NO or UNIQUE_CHECK_PARTIAL, this
* will allow duplicates. Otherwise (UNIQUE_CHECK_YES or
@@ -123,6 +126,7 @@ _bt_doinsert(Relation rel, IndexTuple itup,
/* PageAddItem will MAXALIGN(), but be consistent */
insertstate.itemsz = MAXALIGN(IndexTupleSize(itup));
insertstate.itup_key = itup_key;
+ insertstate.postingoff = 0;
insertstate.bounds_valid = false;
insertstate.buf = InvalidBuffer;
@@ -300,7 +304,7 @@ top:
newitemoff = _bt_findinsertloc(rel, &insertstate, checkingunique,
stack, heapRel);
_bt_insertonpg(rel, itup_key, insertstate.buf, InvalidBuffer, stack,
- itup, newitemoff, false);
+ itup, newitemoff, insertstate.postingoff, false);
}
else
{
@@ -353,6 +357,9 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
BTPageOpaque opaque;
Buffer nbuf = InvalidBuffer;
bool found = false;
+ bool inposting = false;
+ bool prev_all_dead = true;
+ int curposti = 0;
/* Assume unique until we find a duplicate */
*is_unique = true;
@@ -374,6 +381,11 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
/*
* Scan over all equal tuples, looking for live conflicts.
+ *
+ * Note that each iteration of the loop processes one heap TID, not one
+ * index tuple. The page offset number won't be advanced for iterations
+ * which process heap TIDs from posting list tuples until the last such
+ * heap TID for the posting list (curposti will be advanced instead).
*/
Assert(!insertstate->bounds_valid || insertstate->low == offset);
Assert(!itup_key->anynullkeys);
@@ -435,7 +447,27 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
/* okay, we gotta fetch the heap tuple ... */
curitup = (IndexTuple) PageGetItem(page, curitemid);
- htid = curitup->t_tid;
+
+ /*
+ * decide if this is the first heap TID in tuple we'll
+ * process, or if we should continue to process current
+ * posting list
+ */
+ if (!BTreeTupleIsPosting(curitup))
+ {
+ htid = curitup->t_tid;
+ inposting = false;
+ }
+ else if (!inposting)
+ {
+ /* First heap TID in posting list */
+ inposting = true;
+ prev_all_dead = true;
+ curposti = 0;
+ }
+
+ if (inposting)
+ htid = *BTreeTupleGetPostingN(curitup, curposti);
/*
* If we are doing a recheck, we expect to find the tuple we
@@ -511,8 +543,7 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
* not part of this chain because it had a different index
* entry.
*/
- htid = itup->t_tid;
- if (table_index_fetch_tuple_check(heapRel, &htid,
+ if (table_index_fetch_tuple_check(heapRel, &itup->t_tid,
SnapshotSelf, NULL))
{
/* Normal case --- it's still live */
@@ -570,12 +601,14 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
RelationGetRelationName(rel))));
}
}
- else if (all_dead)
+ else if (all_dead && (!inposting ||
+ (prev_all_dead &&
+ curposti == BTreeTupleGetNPosting(curitup) - 1)))
{
/*
- * The conflicting tuple (or whole HOT chain) is dead to
- * everyone, so we may as well mark the index entry
- * killed.
+ * The conflicting tuple (or all HOT chains pointed to by
+ * all posting list TIDs) is dead to everyone, so mark the
+ * index entry killed.
*/
ItemIdMarkDead(curitemid);
opaque->btpo_flags |= BTP_HAS_GARBAGE;
@@ -589,14 +622,29 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
else
MarkBufferDirtyHint(insertstate->buf, true);
}
+
+ /*
+ * Remember if posting list tuple has even a single HOT chain
+ * whose members are not all dead
+ */
+ if (!all_dead && inposting)
+ prev_all_dead = false;
}
}
- /*
- * Advance to next tuple to continue checking.
- */
- if (offset < maxoff)
+ if (inposting && curposti < BTreeTupleGetNPosting(curitup) - 1)
+ {
+ /* Advance to next TID in same posting list */
+ curposti++;
+ continue;
+ }
+ else if (offset < maxoff)
+ {
+ /* Advance to next tuple */
+ curposti = 0;
+ inposting = false;
offset = OffsetNumberNext(offset);
+ }
else
{
int highkeycmp;
@@ -621,6 +669,8 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
elog(ERROR, "fell off the end of index \"%s\"",
RelationGetRelationName(rel));
}
+ curposti = 0;
+ inposting = false;
maxoff = PageGetMaxOffsetNumber(page);
offset = P_FIRSTDATAKEY(opaque);
/* Don't invalidate binary search bounds */
@@ -689,6 +739,7 @@ _bt_findinsertloc(Relation rel,
BTScanInsert itup_key = insertstate->itup_key;
Page page = BufferGetPage(insertstate->buf);
BTPageOpaque lpageop;
+ OffsetNumber location;
lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -751,13 +802,26 @@ _bt_findinsertloc(Relation rel,
/*
* If the target page is full, see if we can obtain enough space by
- * erasing LP_DEAD items
+ * erasing LP_DEAD items. If that doesn't work out, and if the index
+ * deduplication is both possible and enabled, try deduplication.
*/
- if (PageGetFreeSpace(page) < insertstate->itemsz &&
- P_HAS_GARBAGE(lpageop))
+ if (PageGetFreeSpace(page) < insertstate->itemsz)
{
- _bt_vacuum_one_page(rel, insertstate->buf, heapRel);
- insertstate->bounds_valid = false;
+ if (P_HAS_GARBAGE(lpageop))
+ {
+ _bt_vacuum_one_page(rel, insertstate->buf, heapRel);
+ insertstate->bounds_valid = false;
+ }
+
+ if (insertstate->itup_key->safededup &&
+ BtreeGetDoDedupOption(rel) &&
+ PageGetFreeSpace(page) < insertstate->itemsz)
+ {
+ _bt_dedup_one_page(rel, insertstate->buf, heapRel,
+ insertstate->itup, insertstate->itemsz,
+ checkingunique);
+ insertstate->bounds_valid = false;
+ }
}
}
else
@@ -839,7 +903,38 @@ _bt_findinsertloc(Relation rel,
Assert(P_RIGHTMOST(lpageop) ||
_bt_compare(rel, itup_key, page, P_HIKEY) <= 0);
- return _bt_binsrch_insert(rel, insertstate);
+ location = _bt_binsrch_insert(rel, insertstate);
+
+ /*
+ * Insertion is not prepared for the case where an LP_DEAD posting list
+ * tuple must be split. In the unlikely event that this happens, call
+ * _bt_dedup_one_page() to force it to kill all LP_DEAD items.
+ */
+ if (unlikely(insertstate->postingoff == -1))
+ {
+ Assert(insertstate->itup_key->safededup);
+
+ /*
+ * Don't check if the option is enabled, since no actual deduplication
+ * will be done, just cleanup.
+ */
+ _bt_dedup_one_page(rel, insertstate->buf, heapRel, insertstate->itup,
+ 0, checkingunique);
+ Assert(!P_HAS_GARBAGE(lpageop));
+
+ /* Must reset insertstate ahead of new _bt_binsrch_insert() call */
+ insertstate->bounds_valid = false;
+ insertstate->postingoff = 0;
+ location = _bt_binsrch_insert(rel, insertstate);
+
+ /*
+ * Might still have to split some other posting list now, but that
+ * should never be LP_DEAD
+ */
+ Assert(insertstate->postingoff >= 0);
+ }
+
+ return location;
}
/*
@@ -905,10 +1000,12 @@ _bt_stepright(Relation rel, BTInsertState insertstate, BTStack stack)
*
* This recursive procedure does the following things:
*
+ * + if necessary, splits an existing posting list on page.
+ * This is only needed when 'postingoff' is non-zero.
* + if necessary, splits the target page, using 'itup_key' for
* suffix truncation on leaf pages (caller passes NULL for
* non-leaf pages).
- * + inserts the tuple.
+ * + inserts the new tuple (could be from split posting list).
* + if the page was split, pops the parent stack, and finds the
* right place to insert the new child pointer (by walking
* right using information stored in the parent stack).
@@ -918,7 +1015,8 @@ _bt_stepright(Relation rel, BTInsertState insertstate, BTStack stack)
*
* On entry, we must have the correct buffer in which to do the
* insertion, and the buffer must be pinned and write-locked. On return,
- * we will have dropped both the pin and the lock on the buffer.
+ * we will have dropped both the pin and the lock on the buffer. Caller
+ * should be prepared for us to scribble on 'itup'.
*
* This routine only performs retail tuple insertions. 'itup' should
* always be either a non-highkey leaf item, or a downlink (new high
@@ -936,11 +1034,15 @@ _bt_insertonpg(Relation rel,
BTStack stack,
IndexTuple itup,
OffsetNumber newitemoff,
+ int postingoff,
bool split_only_page)
{
Page page;
BTPageOpaque lpageop;
Size itemsz;
+ IndexTuple oposting;
+ IndexTuple origitup = NULL;
+ IndexTuple nposting = NULL;
page = BufferGetPage(buf);
lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -954,6 +1056,8 @@ _bt_insertonpg(Relation rel,
Assert(P_ISLEAF(lpageop) ||
BTreeTupleGetNAtts(itup, rel) <=
IndexRelationGetNumberOfKeyAttributes(rel));
+ /* retail insertions of posting list tuples are disallowed */
+ Assert(!BTreeTupleIsPosting(itup));
/* The caller should've finished any incomplete splits already. */
if (P_INCOMPLETE_SPLIT(lpageop))
@@ -964,6 +1068,43 @@ _bt_insertonpg(Relation rel,
itemsz = MAXALIGN(itemsz); /* be safe, PageAddItem will do this but we
* need to be consistent */
+ /*
+ * Do we need to split an existing posting list item?
+ */
+ if (postingoff != 0)
+ {
+ ItemId itemid = PageGetItemId(page, newitemoff);
+
+ /*
+ * The new tuple is a duplicate with a heap TID that falls inside the
+ * range of an existing posting list tuple on a leaf page. Prepare to
+ * split an existing posting list by swapping new item's heap TID with
+ * the rightmost heap TID from original posting list, and generating a
+ * new version of the posting list that has new item's heap TID.
+ *
+ * Posting list splits work by modifying the overlapping posting list
+ * as part of the same atomic operation that inserts the "new item".
+ * The space accounting is kept simple, since it does not need to
+ * consider posting list splits at all (this is particularly important
+ * for the case where we also have to split the page). Overwriting
+ * the posting list with its post-split version is treated as an extra
+ * step in either the insert or page split critical section.
+ */
+ Assert(P_ISLEAF(lpageop));
+ Assert(!ItemIdIsDead(itemid));
+ Assert(postingoff > 0);
+ oposting = (IndexTuple) PageGetItem(page, itemid);
+
+ /* save a copy of itup with unchanged TID for xlog record */
+ origitup = CopyIndexTuple(itup);
+ nposting = _bt_swap_posting(itup, oposting, postingoff);
+
+ Assert(BTreeTupleGetNPosting(nposting) ==
+ BTreeTupleGetNPosting(oposting));
+ /* Alter offset so that it goes after existing posting list */
+ newitemoff = OffsetNumberNext(newitemoff);
+ }
+
/*
* Do we need to split the page to fit the item on it?
*
@@ -996,7 +1137,8 @@ _bt_insertonpg(Relation rel,
BlockNumberIsValid(RelationGetTargetBlock(rel))));
/* split the buffer into left and right halves */
- rbuf = _bt_split(rel, itup_key, buf, cbuf, newitemoff, itemsz, itup);
+ rbuf = _bt_split(rel, itup_key, buf, cbuf, newitemoff, itemsz, itup,
+ origitup, nposting, postingoff);
PredicateLockPageSplit(rel,
BufferGetBlockNumber(buf),
BufferGetBlockNumber(rbuf));
@@ -1075,6 +1217,18 @@ _bt_insertonpg(Relation rel,
elog(PANIC, "failed to add new item to block %u in index \"%s\"",
itup_blkno, RelationGetRelationName(rel));
+ if (nposting)
+ {
+ /*
+ * Posting list split requires an in-place update of the existing
+ * posting list
+ */
+ Assert(P_ISLEAF(lpageop));
+ Assert(MAXALIGN(IndexTupleSize(oposting)) ==
+ MAXALIGN(IndexTupleSize(nposting)));
+ memcpy(oposting, nposting, MAXALIGN(IndexTupleSize(nposting)));
+ }
+
MarkBufferDirty(buf);
if (BufferIsValid(metabuf))
@@ -1116,6 +1270,7 @@ _bt_insertonpg(Relation rel,
XLogRecPtr recptr;
xlrec.offnum = itup_off;
+ xlrec.postingoff = postingoff;
XLogBeginInsert();
XLogRegisterData((char *) &xlrec, SizeOfBtreeInsert);
@@ -1144,6 +1299,7 @@ _bt_insertonpg(Relation rel,
xlmeta.oldest_btpo_xact = metad->btm_oldest_btpo_xact;
xlmeta.last_cleanup_num_heap_tuples =
metad->btm_last_cleanup_num_heap_tuples;
+ xlmeta.btm_safededup = metad->btm_safededup;
XLogRegisterBuffer(2, metabuf, REGBUF_WILL_INIT | REGBUF_STANDARD);
XLogRegisterBufData(2, (char *) &xlmeta, sizeof(xl_btree_metadata));
@@ -1152,7 +1308,19 @@ _bt_insertonpg(Relation rel,
}
XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
- XLogRegisterBufData(0, (char *) itup, IndexTupleSize(itup));
+
+ /*
+ * We always write newitem to the page, but when there is an
+ * original newitem due to a posting list split then we log the
+ * original item instead. REDO routine must reconstruct the final
+ * newitem at the same time it reconstructs nposting.
+ */
+ if (postingoff == 0)
+ XLogRegisterBufData(0, (char *) itup,
+ IndexTupleSize(itup));
+ else
+ XLogRegisterBufData(0, (char *) origitup,
+ IndexTupleSize(origitup));
recptr = XLogInsert(RM_BTREE_ID, xlinfo);
@@ -1194,6 +1362,13 @@ _bt_insertonpg(Relation rel,
_bt_getrootheight(rel) >= BTREE_FASTPATH_MIN_LEVEL)
RelationSetTargetBlock(rel, cachedBlock);
}
+
+ /* be tidy */
+ if (postingoff != 0)
+ {
+ pfree(nposting);
+ pfree(origitup);
+ }
}
/*
@@ -1209,12 +1384,25 @@ _bt_insertonpg(Relation rel,
* This function will clear the INCOMPLETE_SPLIT flag on it, and
* release the buffer.
*
+ * orignewitem, nposting, and postingoff are needed when an insert of
+ * orignewitem results in both a posting list split and a page split.
+ * newitem and nposting are replacements for orignewitem and the
+ * existing posting list on the page respectively. These extra
+ * posting list split details are used here in the same way as they
+ * are used in the more common case where a posting list split does
+ * not coincide with a page split. We need to deal with posting list
+ * splits directly in order to ensure that everything that follows
+ * from the insert of orignewitem is handled as a single atomic
+ * operation (though caller's insert of a new pivot/downlink into
+ * parent page will still be a separate operation).
+ *
* Returns the new right sibling of buf, pinned and write-locked.
* The pin and lock on buf are maintained.
*/
static Buffer
_bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
- OffsetNumber newitemoff, Size newitemsz, IndexTuple newitem)
+ OffsetNumber newitemoff, Size newitemsz, IndexTuple newitem,
+ IndexTuple orignewitem, IndexTuple nposting, OffsetNumber postingoff)
{
Buffer rbuf;
Page origpage;
@@ -1236,12 +1424,23 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
OffsetNumber firstright;
OffsetNumber maxoff;
OffsetNumber i;
+ OffsetNumber replacepostingoff = InvalidOffsetNumber;
bool newitemonleft,
isleaf;
IndexTuple lefthikey;
int indnatts = IndexRelationGetNumberOfAttributes(rel);
int indnkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
+ /*
+ * Determine offset number of existing posting list on page when a split
+ * of a posting list needs to take place as the page is split
+ */
+ if (nposting != NULL)
+ {
+ Assert(itup_key->heapkeyspace);
+ replacepostingoff = OffsetNumberPrev(newitemoff);
+ }
+
/*
* origpage is the original page to be split. leftpage is a temporary
* buffer that receives the left-sibling data, which will be copied back
@@ -1273,6 +1472,13 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
* newitemoff == firstright. In all other cases it's clear which side of
* the split every tuple goes on from context. newitemonleft is usually
* (but not always) redundant information.
+ *
+ * Note: In theory, the split point choice logic should operate against a
+ * version of the page that already replaced the posting list at offset
+ * replacepostingoff with nposting where applicable. We don't bother with
+ * that, though. Both versions of the posting list must be the same size,
+ * and both will have the same base tuple key values, so split point
+ * choice is never affected.
*/
firstright = _bt_findsplitloc(rel, origpage, newitemoff, newitemsz,
newitem, &newitemonleft);
@@ -1340,6 +1546,9 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
itemid = PageGetItemId(origpage, firstright);
itemsz = ItemIdGetLength(itemid);
item = (IndexTuple) PageGetItem(origpage, itemid);
+ /* Behave as if origpage posting list has already been swapped */
+ if (firstright == replacepostingoff)
+ item = nposting;
}
/*
@@ -1373,6 +1582,9 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
Assert(lastleftoff >= P_FIRSTDATAKEY(oopaque));
itemid = PageGetItemId(origpage, lastleftoff);
lastleft = (IndexTuple) PageGetItem(origpage, itemid);
+ /* Behave as if origpage posting list has already been swapped */
+ if (lastleftoff == replacepostingoff)
+ lastleft = nposting;
}
Assert(lastleft != item);
@@ -1480,8 +1692,23 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
itemsz = ItemIdGetLength(itemid);
item = (IndexTuple) PageGetItem(origpage, itemid);
+ /*
+ * did caller pass new replacement posting list tuple due to posting
+ * list split?
+ */
+ if (i == replacepostingoff)
+ {
+ /*
+ * swap origpage posting list with post-posting-list-split version
+ * from caller
+ */
+ Assert(isleaf);
+ Assert(itemsz == MAXALIGN(IndexTupleSize(nposting)));
+ item = nposting;
+ }
+
/* does new item belong before this one? */
- if (i == newitemoff)
+ else if (i == newitemoff)
{
if (newitemonleft)
{
@@ -1650,8 +1877,12 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
XLogRecPtr recptr;
xlrec.level = ropaque->btpo.level;
+ /* See comments below on newitem, orignewitem, and posting lists */
xlrec.firstright = firstright;
xlrec.newitemoff = newitemoff;
+ xlrec.postingoff = InvalidOffsetNumber;
+ if (replacepostingoff < firstright)
+ xlrec.postingoff = postingoff;
XLogBeginInsert();
XLogRegisterData((char *) &xlrec, SizeOfBtreeSplit);
@@ -1670,11 +1901,45 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
* because it's included with all the other items on the right page.)
* Show the new item as belonging to the left page buffer, so that it
* is not stored if XLogInsert decides it needs a full-page image of
- * the left page. We store the offset anyway, though, to support
- * archive compression of these records.
+ * the left page. We always store newitemoff in record, though.
+ *
+ * The details are often slightly different for page splits that
+ * coincide with a posting list split. If both the replacement
+ * posting list and newitem go on the right page, then we don't need
+ * to log anything extra, just like the simple !newitemonleft
+ * no-posting-split case (postingoff isn't set in the WAL record, so
+ * recovery can't even tell the difference). Otherwise, we set
+ * postingoff and log orignewitem instead of newitem, despite having
+ * actually inserted newitem. Recovery must reconstruct nposting and
+ * newitem by calling _bt_swap_posting().
+ *
+ * Note: It's possible that our page split point is the point that
+ * makes the posting list lastleft and newitem firstright. This is
+ * the only case where we log orignewitem despite newitem going on the
+ * right page. If XLogInsert decides that it can omit orignewitem due
+ * to logging a full-page image of the left page, everything still
+ * works out, since recovery only needs to log orignewitem for items
+ * on the left page (just like the regular newitem-logged case).
*/
- if (newitemonleft)
- XLogRegisterBufData(0, (char *) newitem, MAXALIGN(newitemsz));
+ if (newitemonleft || xlrec.postingoff != InvalidOffsetNumber)
+ {
+ if (xlrec.postingoff == InvalidOffsetNumber)
+ {
+ /* Must WAL-log newitem, since it's on left page */
+ Assert(newitemonleft);
+ Assert(orignewitem == NULL && nposting == NULL);
+ XLogRegisterBufData(0, (char *) newitem, MAXALIGN(newitemsz));
+ }
+ else
+ {
+ /* Must WAL-log orignewitem following posting list split */
+ Assert(newitemonleft || firstright == newitemoff);
+ Assert(ItemPointerCompare(&orignewitem->t_tid,
+ &newitem->t_tid) < 0);
+ XLogRegisterBufData(0, (char *) orignewitem,
+ MAXALIGN(IndexTupleSize(orignewitem)));
+ }
+ }
/* Log the left page's new high key */
itemid = PageGetItemId(origpage, P_HIKEY);
@@ -1834,7 +2099,7 @@ _bt_insert_parent(Relation rel,
/* Recursively insert into the parent */
_bt_insertonpg(rel, NULL, pbuf, buf, stack->bts_parent,
- new_item, stack->bts_offset + 1,
+ new_item, stack->bts_offset + 1, 0,
is_only);
/* be tidy */
@@ -2190,6 +2455,7 @@ _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf)
md.fastlevel = metad->btm_level;
md.oldest_btpo_xact = metad->btm_oldest_btpo_xact;
md.last_cleanup_num_heap_tuples = metad->btm_last_cleanup_num_heap_tuples;
+ md.btm_safededup = metad->btm_safededup;
XLogRegisterBufData(2, (char *) &md, sizeof(xl_btree_metadata));
@@ -2304,6 +2570,6 @@ _bt_vacuum_one_page(Relation rel, Buffer buffer, Relation heapRel)
* Note: if we didn't find any LP_DEAD items, then the page's
* BTP_HAS_GARBAGE hint bit is falsely set. We do not bother expending a
* separate write to clear it, however. We will clear it when we split
- * the page.
+ * the page (or when deduplication runs).
*/
}
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 268f869a36..ca25e856e7 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -24,6 +24,7 @@
#include "access/nbtree.h"
#include "access/nbtxlog.h"
+#include "access/tableam.h"
#include "access/transam.h"
#include "access/xlog.h"
#include "access/xloginsert.h"
@@ -42,12 +43,18 @@ static bool _bt_lock_branch_parent(Relation rel, BlockNumber child,
BlockNumber *target, BlockNumber *rightsib);
static void _bt_log_reuse_page(Relation rel, BlockNumber blkno,
TransactionId latestRemovedXid);
+static TransactionId _bt_compute_xid_horizon_for_tuples(Relation rel,
+ Relation heapRel,
+ Buffer buf,
+ OffsetNumber *itemnos,
+ int nitems);
/*
* _bt_initmetapage() -- Fill a page buffer with a correct metapage image
*/
void
-_bt_initmetapage(Page page, BlockNumber rootbknum, uint32 level)
+_bt_initmetapage(Page page, BlockNumber rootbknum, uint32 level,
+ bool safededup)
{
BTMetaPageData *metad;
BTPageOpaque metaopaque;
@@ -63,6 +70,7 @@ _bt_initmetapage(Page page, BlockNumber rootbknum, uint32 level)
metad->btm_fastlevel = level;
metad->btm_oldest_btpo_xact = InvalidTransactionId;
metad->btm_last_cleanup_num_heap_tuples = -1.0;
+ metad->btm_safededup = safededup;
metaopaque = (BTPageOpaque) PageGetSpecialPointer(page);
metaopaque->btpo_flags = BTP_META;
@@ -102,6 +110,7 @@ _bt_upgrademetapage(Page page)
metad->btm_version = BTREE_NOVAC_VERSION;
metad->btm_oldest_btpo_xact = InvalidTransactionId;
metad->btm_last_cleanup_num_heap_tuples = -1.0;
+ metad->btm_safededup = false;
/* Adjust pd_lower (see _bt_initmetapage() for details) */
((PageHeader) page)->pd_lower =
@@ -213,6 +222,7 @@ _bt_update_meta_cleanup_info(Relation rel, TransactionId oldestBtpoXact,
md.fastlevel = metad->btm_fastlevel;
md.oldest_btpo_xact = oldestBtpoXact;
md.last_cleanup_num_heap_tuples = numHeapTuples;
+ md.btm_safededup = metad->btm_safededup;
XLogRegisterBufData(0, (char *) &md, sizeof(xl_btree_metadata));
@@ -394,6 +404,7 @@ _bt_getroot(Relation rel, int access)
md.fastlevel = 0;
md.oldest_btpo_xact = InvalidTransactionId;
md.last_cleanup_num_heap_tuples = -1.0;
+ md.btm_safededup = metad->btm_safededup;
XLogRegisterBufData(2, (char *) &md, sizeof(xl_btree_metadata));
@@ -683,6 +694,59 @@ _bt_heapkeyspace(Relation rel)
return metad->btm_version > BTREE_NOVAC_VERSION;
}
+/*
+ * _bt_safededup() -- can deduplication safely be used by index?
+ *
+ * Uses field from index relation's metapage/cached metapage.
+ */
+bool
+_bt_safededup(Relation rel)
+{
+ BTMetaPageData *metad;
+
+ if (rel->rd_amcache == NULL)
+ {
+ Buffer metabuf;
+
+ metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_READ);
+ metad = _bt_getmeta(rel, metabuf);
+
+ /*
+ * If there's no root page yet, _bt_getroot() doesn't expect a cache
+ * to be made, so just stop here. (XXX perhaps _bt_getroot() should
+ * be changed to allow this case.)
+ *
+ * FIXME: Think some more about pg_upgrade'd !heapkeyspace indexes
+ * here, and the need for a version bump to go with new metapage
+ * field. I think that we may need to bump the major version because
+ * even v4 indexes (those built on Postgres 12) will have garbage in
+ * the new safededup field. Creating a v5 would mean "new field can be
+ * trusted to not be garbage".
+ */
+ if (metad->btm_root == P_NONE)
+ {
+ _bt_relbuf(rel, metabuf);
+ return metad->btm_safededup;
+ }
+
+ /* Cache the metapage data for next time */
+ rel->rd_amcache = MemoryContextAlloc(rel->rd_indexcxt,
+ sizeof(BTMetaPageData));
+ memcpy(rel->rd_amcache, metad, sizeof(BTMetaPageData));
+ _bt_relbuf(rel, metabuf);
+ }
+
+ /* Get cached page */
+ metad = (BTMetaPageData *) rel->rd_amcache;
+ /* We shouldn't have cached it if any of these fail */
+ Assert(metad->btm_magic == BTREE_MAGIC);
+ Assert(metad->btm_version >= BTREE_MIN_VERSION);
+ Assert(metad->btm_version <= BTREE_VERSION);
+ Assert(metad->btm_fastroot != P_NONE);
+
+ return metad->btm_safededup;
+}
+
/*
* _bt_checkpage() -- Verify that a freshly-read page looks sane.
*/
@@ -983,14 +1047,52 @@ _bt_page_recyclable(Page page)
void
_bt_delitems_vacuum(Relation rel, Buffer buf,
OffsetNumber *itemnos, int nitems,
+ OffsetNumber *updateitemnos,
+ IndexTuple *updated, int nupdatable,
BlockNumber lastBlockVacuumed)
{
Page page = BufferGetPage(buf);
BTPageOpaque opaque;
+ Size itemsz;
+ Size updated_sz = 0;
+ char *updated_buf = NULL;
+
+ /* XLOG stuff: build a flat buffer of the updated tuples */
+ if (nupdatable > 0 && RelationNeedsWAL(rel))
+ {
+ Size offset = 0;
+
+ for (int i = 0; i < nupdatable; i++)
+ updated_sz += MAXALIGN(IndexTupleSize(updated[i]));
+
+ updated_buf = palloc(updated_sz);
+ for (int i = 0; i < nupdatable; i++)
+ {
+ itemsz = IndexTupleSize(updated[i]);
+ memcpy(updated_buf + offset, (char *) updated[i], itemsz);
+ offset += MAXALIGN(itemsz);
+ }
+ Assert(offset == updated_sz);
+ }
/* No ereport(ERROR) until changes are logged */
START_CRIT_SECTION();
+ /* Handle posting tuples here */
+ for (int i = 0; i < nupdatable; i++)
+ {
+ /* First, delete the old tuple. */
+ PageIndexTupleDelete(page, updateitemnos[i]);
+
+ itemsz = IndexTupleSize(updated[i]);
+ itemsz = MAXALIGN(itemsz);
+
+ /* Add tuple with updated ItemPointers to the page. */
+ if (PageAddItem(page, (Item) updated[i], itemsz, updateitemnos[i],
+ false, false) == InvalidOffsetNumber)
+ elog(ERROR, "failed to rewrite posting list item in index while doing vacuum");
+ }
+
/* Fix the page */
if (nitems > 0)
PageIndexMultiDelete(page, itemnos, nitems);
@@ -1020,6 +1122,8 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
xl_btree_vacuum xlrec_vacuum;
xlrec_vacuum.lastBlockVacuumed = lastBlockVacuumed;
+ xlrec_vacuum.nupdated = nupdatable;
+ xlrec_vacuum.ndeleted = nitems;
XLogBeginInsert();
XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
@@ -1033,6 +1137,19 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
if (nitems > 0)
XLogRegisterBufData(0, (char *) itemnos, nitems * sizeof(OffsetNumber));
+ /*
+ * Also log the offset numbers of updated tuples and the tuples
+ * themselves. Replay must apply them in the right order: updated
+ * tuples first, and only then the deleted items.
+ */
+ if (nupdatable > 0)
+ {
+ Assert(updated_buf != NULL);
+ XLogRegisterBufData(0, (char *) updateitemnos,
+ nupdatable * sizeof(OffsetNumber));
+ XLogRegisterBufData(0, updated_buf, updated_sz);
+ }
+
recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_VACUUM);
PageSetLSN(page, recptr);
@@ -1041,6 +1158,91 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
END_CRIT_SECTION();
}
+/*
+ * Get the latestRemovedXid from the table entries pointed at by the index
+ * tuples being deleted.
+ *
+ * This is a version of index_compute_xid_horizon_for_tuples() specialized to
+ * nbtree, which can handle posting lists.
+ */
+static TransactionId
+_bt_compute_xid_horizon_for_tuples(Relation rel, Relation heapRel,
+ Buffer buf, OffsetNumber *itemnos,
+ int nitems)
+{
+ ItemPointer htids;
+ TransactionId latestRemovedXid = InvalidTransactionId;
+ Page page = BufferGetPage(buf);
+ int arraynitems;
+ int finalnitems;
+
+ /*
+ * Initial size of array can fit everything when it turns out that there
+ * are no posting lists
+ */
+ arraynitems = nitems;
+ htids = (ItemPointer) palloc(sizeof(ItemPointerData) * arraynitems);
+
+ finalnitems = 0;
+ /* identify what the index tuples about to be deleted point to */
+ for (int i = 0; i < nitems; i++)
+ {
+ ItemId itemid;
+ IndexTuple itup;
+
+ itemid = PageGetItemId(page, itemnos[i]);
+ itup = (IndexTuple) PageGetItem(page, itemid);
+
+ Assert(ItemIdIsDead(itemid));
+
+ if (!BTreeTupleIsPosting(itup))
+ {
+ /* Make sure that we have space for additional heap TID */
+ if (finalnitems + 1 > arraynitems)
+ {
+ arraynitems = arraynitems * 2;
+ htids = (ItemPointer)
+ repalloc(htids, sizeof(ItemPointerData) * arraynitems);
+ }
+
+ Assert(ItemPointerIsValid(&itup->t_tid));
+ ItemPointerCopy(&itup->t_tid, &htids[finalnitems]);
+ finalnitems++;
+ }
+ else
+ {
+ int nposting = BTreeTupleGetNPosting(itup);
+
+ /* Make sure that we have space for additional heap TIDs */
+ if (finalnitems + nposting > arraynitems)
+ {
+ arraynitems = Max(arraynitems * 2, finalnitems + nposting);
+ htids = (ItemPointer)
+ repalloc(htids, sizeof(ItemPointerData) * arraynitems);
+ }
+
+ for (int j = 0; j < nposting; j++)
+ {
+ ItemPointer htid = BTreeTupleGetPostingN(itup, j);
+
+ Assert(ItemPointerIsValid(htid));
+ ItemPointerCopy(htid, &htids[finalnitems]);
+ finalnitems++;
+ }
+ }
+ }
+
+ Assert(finalnitems >= nitems);
+
+ /* determine the actual xid horizon */
+ latestRemovedXid =
+ table_compute_xid_horizon_for_tuples(heapRel, htids, finalnitems);
+
+ pfree(htids);
+
+ return latestRemovedXid;
+}
+
/*
* Delete item(s) from a btree page during single-page cleanup.
*
@@ -1067,8 +1269,8 @@ _bt_delitems_delete(Relation rel, Buffer buf,
if (XLogStandbyInfoActive() && RelationNeedsWAL(rel))
latestRemovedXid =
- index_compute_xid_horizon_for_tuples(rel, heapRel, buf,
- itemnos, nitems);
+ _bt_compute_xid_horizon_for_tuples(rel, heapRel, buf,
+ itemnos, nitems);
/* No ereport(ERROR) until changes are logged */
START_CRIT_SECTION();
@@ -2066,6 +2268,7 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, bool *rightsib_empty)
xlmeta.fastlevel = metad->btm_fastlevel;
xlmeta.oldest_btpo_xact = metad->btm_oldest_btpo_xact;
xlmeta.last_cleanup_num_heap_tuples = metad->btm_last_cleanup_num_heap_tuples;
+ xlmeta.btm_safededup = metad->btm_safededup;
XLogRegisterBufData(4, (char *) &xlmeta, sizeof(xl_btree_metadata));
xlinfo = XLOG_BTREE_UNLINK_PAGE_META;
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 4cfd5289ad..d3f1b4ad27 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -97,6 +97,8 @@ static void btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
BTCycleId cycleid, TransactionId *oldestBtpoXact);
static void btvacuumpage(BTVacState *vstate, BlockNumber blkno,
BlockNumber orig_blkno);
+static ItemPointer btreevacuumposting(BTVacState *vstate, IndexTuple itup,
+ int *nremaining);
/*
@@ -160,7 +162,7 @@ btbuildempty(Relation index)
/* Construct metapage. */
metapage = (Page) palloc(BLCKSZ);
- _bt_initmetapage(metapage, P_NONE, 0);
+ _bt_initmetapage(metapage, P_NONE, 0, _bt_opclasses_support_dedup(index));
/*
* Write the page and log it. It might seem that an immediate sync would
@@ -263,8 +265,8 @@ btgettuple(IndexScanDesc scan, ScanDirection dir)
*/
if (so->killedItems == NULL)
so->killedItems = (int *)
- palloc(MaxIndexTuplesPerPage * sizeof(int));
- if (so->numKilled < MaxIndexTuplesPerPage)
+ palloc(MaxPostingIndexTuplesPerPage * sizeof(int));
+ if (so->numKilled < MaxPostingIndexTuplesPerPage)
so->killedItems[so->numKilled++] = so->currPos.itemIndex;
}
@@ -816,7 +818,7 @@ _bt_vacuum_needs_cleanup(IndexVacuumInfo *info)
}
else
{
- StdRdOptions *relopts;
+ BtreeOptions *relopts;
float8 cleanup_scale_factor;
float8 prev_num_heap_tuples;
@@ -827,7 +829,7 @@ _bt_vacuum_needs_cleanup(IndexVacuumInfo *info)
* tuples exceeds vacuum_cleanup_index_scale_factor fraction of
* original tuples count.
*/
- relopts = (StdRdOptions *) info->index->rd_options;
+ relopts = (BtreeOptions *) info->index->rd_options;
cleanup_scale_factor = (relopts &&
relopts->vacuum_cleanup_index_scale_factor >= 0)
? relopts->vacuum_cleanup_index_scale_factor
@@ -1069,7 +1071,8 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
RBM_NORMAL, info->strategy);
LockBufferForCleanup(buf);
_bt_checkpage(rel, buf);
- _bt_delitems_vacuum(rel, buf, NULL, 0, vstate.lastBlockVacuumed);
+ _bt_delitems_vacuum(rel, buf, NULL, 0, NULL, NULL, 0,
+ vstate.lastBlockVacuumed);
_bt_relbuf(rel, buf);
}
@@ -1188,8 +1191,17 @@ restart:
}
else if (P_ISLEAF(opaque))
{
+ /* Deletable item state */
OffsetNumber deletable[MaxOffsetNumber];
int ndeletable;
+ int nhtidsdead;
+ int nhtidslive;
+
+ /* Updatable item state (for posting lists) */
+ IndexTuple updated[MaxOffsetNumber];
+ OffsetNumber updatable[MaxOffsetNumber];
+ int nupdatable;
+
OffsetNumber offnum,
minoff,
maxoff;
@@ -1229,6 +1241,10 @@ restart:
* callback function.
*/
ndeletable = 0;
+ nupdatable = 0;
+ /* Maintain stats counters for index tuple versions/heap TIDs */
+ nhtidsdead = 0;
+ nhtidslive = 0;
minoff = P_FIRSTDATAKEY(opaque);
maxoff = PageGetMaxOffsetNumber(page);
if (callback)
@@ -1238,11 +1254,9 @@ restart:
offnum = OffsetNumberNext(offnum))
{
IndexTuple itup;
- ItemPointer htup;
itup = (IndexTuple) PageGetItem(page,
PageGetItemId(page, offnum));
- htup = &(itup->t_tid);
/*
* During Hot Standby we currently assume that
@@ -1265,8 +1279,71 @@ restart:
* applies to *any* type of index that marks index tuples as
* killed.
*/
- if (callback(htup, callback_state))
- deletable[ndeletable++] = offnum;
+ if (!BTreeTupleIsPosting(itup))
+ {
+ /* Regular tuple, standard heap TID representation */
+ ItemPointer htid = &(itup->t_tid);
+
+ if (callback(htid, callback_state))
+ {
+ deletable[ndeletable++] = offnum;
+ nhtidsdead++;
+ }
+ else
+ nhtidslive++;
+ }
+ else
+ {
+ ItemPointer newhtids;
+ int nremaining;
+
+ /*
+ * Posting list tuple, a physical tuple that represents
+ * two or more logical tuples, any of which could be an
+ * index row version that must be removed
+ */
+ newhtids = btreevacuumposting(vstate, itup, &nremaining);
+ if (newhtids == NULL)
+ {
+ /*
+ * All TIDs/logical tuples from the posting tuple
+ * remain, so no update or delete required
+ */
+ Assert(nremaining == BTreeTupleGetNPosting(itup));
+ }
+ else if (nremaining > 0)
+ {
+ IndexTuple updatedtuple;
+
+ /*
+ * Form new tuple that contains only remaining TIDs.
+ * Remember this tuple and the offset of the old tuple
+ * for when we update it in place
+ */
+ Assert(nremaining < BTreeTupleGetNPosting(itup));
+ updatedtuple = _bt_form_posting(itup, newhtids,
+ nremaining);
+ updated[nupdatable] = updatedtuple;
+ updatable[nupdatable++] = offnum;
+ nhtidsdead += BTreeTupleGetNPosting(itup) - nremaining;
+ pfree(newhtids);
+ }
+ else
+ {
+ /*
+ * All TIDs/logical tuples from the posting list must
+ * be deleted. We'll delete the physical tuple
+ * completely.
+ */
+ deletable[ndeletable++] = offnum;
+ nhtidsdead += BTreeTupleGetNPosting(itup);
+
+ /* Free empty array of live items */
+ pfree(newhtids);
+ }
+
+ nhtidslive += nremaining;
+ }
}
}
@@ -1274,7 +1351,7 @@ restart:
* Apply any needed deletes. We issue just one _bt_delitems_vacuum()
* call per page, so as to minimize WAL traffic.
*/
- if (ndeletable > 0)
+ if (ndeletable > 0 || nupdatable > 0)
{
/*
* Notice that the issued XLOG_BTREE_VACUUM WAL record includes
@@ -1290,7 +1367,8 @@ restart:
* doesn't seem worth the amount of bookkeeping it'd take to avoid
* that.
*/
- _bt_delitems_vacuum(rel, buf, deletable, ndeletable,
+ _bt_delitems_vacuum(rel, buf, deletable, ndeletable, updatable,
+ updated, nupdatable,
vstate->lastBlockVacuumed);
/*
@@ -1300,7 +1378,7 @@ restart:
if (blkno > vstate->lastBlockVacuumed)
vstate->lastBlockVacuumed = blkno;
- stats->tuples_removed += ndeletable;
+ stats->tuples_removed += nhtidsdead;
/* must recompute maxoff */
maxoff = PageGetMaxOffsetNumber(page);
}
@@ -1315,6 +1393,7 @@ restart:
* We treat this like a hint-bit update because there's no need to
* WAL-log it.
*/
+ Assert(nhtidsdead == 0);
if (vstate->cycleid != 0 &&
opaque->btpo_cycleid == vstate->cycleid)
{
@@ -1324,15 +1403,16 @@ restart:
}
/*
- * If it's now empty, try to delete; else count the live tuples. We
- * don't delete when recursing, though, to avoid putting entries into
+ * If it's now empty, try to delete; else count the live tuples (live
+ * heap TIDs in posting lists are counted as live tuples). We don't
+ * delete when recursing, though, to avoid putting entries into
* freePages out-of-order (doesn't seem worth any extra code to handle
* the case).
*/
if (minoff > maxoff)
delete_now = (blkno == orig_blkno);
else
- stats->num_index_tuples += maxoff - minoff + 1;
+ stats->num_index_tuples += nhtidslive;
}
if (delete_now)
@@ -1375,6 +1455,68 @@ restart:
}
}
+/*
+ * btreevacuumposting() -- determines which logical tuples must remain when
+ * VACUUMing a posting list tuple.
+ *
+ * Returns new palloc'd array of item pointers needed to build replacement
+ * posting list without the index row versions that are to be deleted.
+ *
+ * Note that returned array is NULL in the common case where there is nothing
+ * to delete in caller's posting list tuple. The number of TIDs that should
+ * remain in the posting list tuple is set for caller in *nremaining. This is
+ * also the size of the returned array (though only when array isn't just
+ * NULL).
+ */
+static ItemPointer
+btreevacuumposting(BTVacState *vstate, IndexTuple itup, int *nremaining)
+{
+ int live = 0;
+ int nitem = BTreeTupleGetNPosting(itup);
+ ItemPointer tmpitems = NULL,
+ items = BTreeTupleGetPosting(itup);
+
+ Assert(BTreeTupleIsPosting(itup));
+
+ /*
+ * Check each tuple in the posting list. Save live tuples into tmpitems,
+ * though try to avoid memory allocation as an optimization.
+ */
+ for (int i = 0; i < nitem; i++)
+ {
+ if (!vstate->callback(items + i, vstate->callback_state))
+ {
+ /*
+ * Live heap TID.
+ *
+ * Only save live TID when we know that we're going to have to
+ * kill at least one TID, and have already allocated memory.
+ */
+ if (tmpitems)
+ tmpitems[live] = items[i];
+ live++;
+ }
+
+ /* Dead heap TID */
+ else if (tmpitems == NULL)
+ {
+ /*
+ * Turns out we need to delete one or more dead heap TIDs, so
+ * start maintaining an array of live TIDs for caller to
+ * reconstruct smaller replacement posting list tuple
+ */
+ tmpitems = palloc(sizeof(ItemPointerData) * nitem);
+
+ /* Copy live heap TIDs from previous loop iterations */
+ if (live > 0)
+ memcpy(tmpitems, items, sizeof(ItemPointerData) * live);
+ }
+ }
+
+ *nremaining = live;
+ return tmpitems;
+}
+
/*
* btcanreturn() -- Check whether btree indexes support index-only scans.
*
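For readers following the nbtree.c changes, the lazy-allocation idea in btreevacuumposting() can be tried outside the backend. What follows is only a minimal standalone sketch, not the patch's code: Tid, DeadCallback, vacuum_posting() and even_block_is_dead() are hypothetical stand-ins for ItemPointerData, the bulk-delete callback and the real function, and malloc() replaces palloc().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for ItemPointerData: a (block, offset) heap address. */
typedef struct
{
	unsigned		block;
	unsigned short	offset;
} Tid;

/* Hypothetical stand-in for the bulk-delete callback: true means "dead". */
typedef bool (*DeadCallback) (const Tid *tid, void *state);

/*
 * Returns NULL when no TID is dead, so the caller can leave the tuple alone.
 * Otherwise returns a malloc'd array holding the surviving TIDs.  Either way
 * *nremaining is set to the number of live TIDs.
 */
Tid *
vacuum_posting(const Tid *items, int nitem, DeadCallback is_dead,
			   void *state, int *nremaining)
{
	Tid		   *tmpitems = NULL;
	int			live = 0;

	for (int i = 0; i < nitem; i++)
	{
		if (!is_dead(&items[i], state))
		{
			/* Live TID: only copy once we know a replacement is needed */
			if (tmpitems)
				tmpitems[live] = items[i];
			live++;
		}
		else if (tmpitems == NULL)
		{
			/* First dead TID: allocate, then back-fill earlier live TIDs */
			tmpitems = malloc(sizeof(Tid) * nitem);
			assert(tmpitems != NULL);
			if (live > 0)
				memcpy(tmpitems, items, sizeof(Tid) * live);
		}
	}

	*nremaining = live;
	return tmpitems;
}

/* Example callback: treats every TID on an even-numbered block as dead. */
bool
even_block_is_dead(const Tid *tid, void *state)
{
	(void) state;
	return tid->block % 2 == 0;
}
```

The back-fill with memcpy() is valid because every TID seen before the first dead one was live, so the first "live" entries of the input are exactly the ones to keep.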
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 8e512461a0..561b642b1d 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -26,10 +26,18 @@
static void _bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp);
static OffsetNumber _bt_binsrch(Relation rel, BTScanInsert key, Buffer buf);
+static int _bt_binsrch_posting(BTScanInsert key, Page page,
+ OffsetNumber offnum);
static bool _bt_readpage(IndexScanDesc scan, ScanDirection dir,
OffsetNumber offnum);
static void _bt_saveitem(BTScanOpaque so, int itemIndex,
OffsetNumber offnum, IndexTuple itup);
+static void _bt_setuppostingitems(BTScanOpaque so, int itemIndex,
+ OffsetNumber offnum, ItemPointer heapTid,
+ IndexTuple itup);
+static inline void _bt_savepostingitem(BTScanOpaque so, int itemIndex,
+ OffsetNumber offnum,
+ ItemPointer heapTid);
static bool _bt_steppage(IndexScanDesc scan, ScanDirection dir);
static bool _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir);
static bool _bt_parallel_readpage(IndexScanDesc scan, BlockNumber blkno,
@@ -434,7 +442,10 @@ _bt_binsrch(Relation rel,
* low) makes bounds invalid.
*
* Caller is responsible for invalidating bounds when it modifies the page
- * before calling here a second time.
+ * before calling here a second time, and for dealing with posting list
+ * tuple matches (callers can use insertstate's postingoff field to
+ * determine which existing heap TID will need to be replaced by their
+ * scantid/new heap TID).
*/
OffsetNumber
_bt_binsrch_insert(Relation rel, BTInsertState insertstate)
@@ -453,6 +464,7 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
Assert(P_ISLEAF(opaque));
Assert(!key->nextkey);
+ Assert(insertstate->postingoff == 0);
if (!insertstate->bounds_valid)
{
@@ -509,6 +521,16 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
if (result != 0)
stricthigh = high;
}
+
+ /*
+ * If tuple at offset located by binary search is a posting list whose
+ * TID range overlaps with caller's scantid, perform posting list
+ * binary search to set postingoff for caller. Caller must split the
+ * posting list when postingoff is set. This should happen
+ * infrequently.
+ */
+ if (unlikely(result == 0 && key->scantid != NULL))
+ insertstate->postingoff = _bt_binsrch_posting(key, page, mid);
}
/*
@@ -528,6 +550,68 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
return low;
}
+/*----------
+ * _bt_binsrch_posting() -- posting list binary search.
+ *
+ * Returns offset into posting list where caller's scantid belongs.
+ *----------
+ */
+static int
+_bt_binsrch_posting(BTScanInsert key, Page page, OffsetNumber offnum)
+{
+ IndexTuple itup;
+ ItemId itemid;
+ int low,
+ high,
+ mid,
+ res;
+
+ /*
+ * If this isn't a posting tuple, then the index must be corrupt (if it is
+ * an ordinary non-pivot tuple then there must be an existing tuple with a
+ * heap TID that equals inserter's new heap TID/scantid). Defensively
+ * check that tuple is a posting list tuple whose posting list range
+ * includes caller's scantid.
+ *
+ * (This is also needed because contrib/amcheck's rootdescend option needs
+ * to be able to relocate a non-pivot tuple using _bt_binsrch_insert().)
+ */
+ Assert(P_ISLEAF((BTPageOpaque) PageGetSpecialPointer(page)));
+ Assert(!key->nextkey);
+ Assert(key->scantid != NULL);
+ itemid = PageGetItemId(page, offnum);
+ itup = (IndexTuple) PageGetItem(page, itemid);
+ if (!BTreeTupleIsPosting(itup))
+ return 0;
+
+ /*
+ * In the unlikely event that posting list tuple has LP_DEAD bit set,
+ * signal to caller that it should kill the item and restart its binary
+ * search.
+ */
+ if (ItemIdIsDead(itemid))
+ return -1;
+
+ /* "high" is past end of posting list for loop invariant */
+ low = 0;
+ high = BTreeTupleGetNPosting(itup);
+ Assert(high >= 2);
+
+ while (high > low)
+ {
+ mid = low + ((high - low) / 2);
+ res = ItemPointerCompare(key->scantid,
+ BTreeTupleGetPostingN(itup, mid));
+
+ if (res >= 1)
+ low = mid + 1;
+ else
+ high = mid;
+ }
+
+ return low;
+}
+
/*----------
* _bt_compare() -- Compare insertion-type scankey to tuple on a page.
*
@@ -537,9 +621,18 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
* <0 if scankey < tuple at offnum;
* 0 if scankey == tuple at offnum;
* >0 if scankey > tuple at offnum.
- * NULLs in the keys are treated as sortable values. Therefore
- * "equality" does not necessarily mean that the item should be
- * returned to the caller as a matching key!
+ *
+ * NULLs in the keys are treated as sortable values. Therefore
+ * "equality" does not necessarily mean that the item should be returned
+ * to the caller as a matching key. Similarly, an insertion scankey
+ * with its scantid set is treated as equal to a posting tuple whose TID
+ * range overlaps with its scantid. There generally won't be a
+ * matching TID in the posting tuple, which caller must handle
+ * itself (e.g., by splitting the posting list tuple).
+ *
+ * It is generally guaranteed that any possible scankey with scantid set
+ * will have zero or one tuples in the index that are considered equal
+ * here.
*
* CRUCIAL NOTE: on a non-leaf page, the first data key is assumed to be
* "minus infinity": this routine will always claim it is less than the
@@ -563,6 +656,7 @@ _bt_compare(Relation rel,
ScanKey scankey;
int ncmpkey;
int ntupatts;
+ int32 result;
Assert(_bt_check_natts(rel, key->heapkeyspace, page, offnum));
Assert(key->keysz <= IndexRelationGetNumberOfKeyAttributes(rel));
@@ -597,7 +691,6 @@ _bt_compare(Relation rel,
{
Datum datum;
bool isNull;
- int32 result;
datum = index_getattr(itup, scankey->sk_attno, itupdesc, &isNull);
@@ -713,8 +806,25 @@ _bt_compare(Relation rel,
if (heapTid == NULL)
return 1;
+ /*
+ * scankey must be treated as equal to a posting list tuple if its scantid
+ * value falls within the range of the posting list. In all other cases
+ * there can only be a single heap TID value, which is compared directly
+ * as a simple scalar value.
+ */
Assert(ntupatts >= IndexRelationGetNumberOfKeyAttributes(rel));
- return ItemPointerCompare(key->scantid, heapTid);
+ result = ItemPointerCompare(key->scantid, heapTid);
+ if (!BTreeTupleIsPosting(itup) || result <= 0)
+ return result;
+ else
+ {
+ result = ItemPointerCompare(key->scantid,
+ BTreeTupleGetMaxHeapTID(itup));
+ if (result > 0)
+ return 1;
+ }
+
+ return 0;
}
/*
@@ -1230,6 +1340,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
/* Initialize remaining insertion scan key fields */
inskey.heapkeyspace = _bt_heapkeyspace(rel);
+ inskey.safededup = false; /* unused */
inskey.anynullkeys = false; /* unused */
inskey.nextkey = nextkey;
inskey.pivotsearch = false;
@@ -1451,6 +1562,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
/* initialize tuple workspace to empty */
so->currPos.nextTupleOffset = 0;
+ so->currPos.postingTupleOffset = 0;
/*
* Now that the current page has been made consistent, the macro should be
@@ -1485,8 +1597,29 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
if (_bt_checkkeys(scan, itup, indnatts, dir, &continuescan))
{
/* tuple passes all scan key conditions, so remember it */
- _bt_saveitem(so, itemIndex, offnum, itup);
- itemIndex++;
+ if (!BTreeTupleIsPosting(itup))
+ {
+ _bt_saveitem(so, itemIndex, offnum, itup);
+ itemIndex++;
+ }
+ else
+ {
+ /*
+ * Setup state to return posting list, and save first
+ * "logical" tuple
+ */
+ _bt_setuppostingitems(so, itemIndex, offnum,
+ BTreeTupleGetPostingN(itup, 0),
+ itup);
+ itemIndex++;
+ /* Save additional posting list "logical" tuples */
+ for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
+ {
+ _bt_savepostingitem(so, itemIndex, offnum,
+ BTreeTupleGetPostingN(itup, i));
+ itemIndex++;
+ }
+ }
}
/* When !continuescan, there can't be any more matches, so stop */
if (!continuescan)
@@ -1519,7 +1652,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
if (!continuescan)
so->currPos.moreRight = false;
- Assert(itemIndex <= MaxIndexTuplesPerPage);
+ Assert(itemIndex <= MaxPostingIndexTuplesPerPage);
so->currPos.firstItem = 0;
so->currPos.lastItem = itemIndex - 1;
so->currPos.itemIndex = 0;
@@ -1527,7 +1660,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
else
{
/* load items[] in descending order */
- itemIndex = MaxIndexTuplesPerPage;
+ itemIndex = MaxPostingIndexTuplesPerPage;
offnum = Min(offnum, maxoff);
@@ -1569,8 +1702,36 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
if (passes_quals && tuple_alive)
{
/* tuple passes all scan key conditions, so remember it */
- itemIndex--;
- _bt_saveitem(so, itemIndex, offnum, itup);
+ if (!BTreeTupleIsPosting(itup))
+ {
+ itemIndex--;
+ _bt_saveitem(so, itemIndex, offnum, itup);
+ }
+ else
+ {
+ int i = BTreeTupleGetNPosting(itup) - 1;
+
+ /*
+ * Setup state to return posting list, and save last
+ * "logical" tuple from posting list (since it's the first
+ * that will be returned to scan).
+ */
+ itemIndex--;
+ _bt_setuppostingitems(so, itemIndex, offnum,
+ BTreeTupleGetPostingN(itup, i--),
+ itup);
+
+ /*
+ * Return posting list "logical" tuples -- do this in
+ * descending order, to match overall scan order
+ */
+ for (; i >= 0; i--)
+ {
+ itemIndex--;
+ _bt_savepostingitem(so, itemIndex, offnum,
+ BTreeTupleGetPostingN(itup, i));
+ }
+ }
}
if (!continuescan)
{
@@ -1584,8 +1745,8 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
Assert(itemIndex >= 0);
so->currPos.firstItem = itemIndex;
- so->currPos.lastItem = MaxIndexTuplesPerPage - 1;
- so->currPos.itemIndex = MaxIndexTuplesPerPage - 1;
+ so->currPos.lastItem = MaxPostingIndexTuplesPerPage - 1;
+ so->currPos.itemIndex = MaxPostingIndexTuplesPerPage - 1;
}
return (so->currPos.firstItem <= so->currPos.lastItem);
@@ -1598,6 +1759,8 @@ _bt_saveitem(BTScanOpaque so, int itemIndex,
{
BTScanPosItem *currItem = &so->currPos.items[itemIndex];
+ Assert(!BTreeTupleIsPosting(itup));
+
currItem->heapTid = itup->t_tid;
currItem->indexOffset = offnum;
if (so->currTuples)
@@ -1610,6 +1773,59 @@ _bt_saveitem(BTScanOpaque so, int itemIndex,
}
}
+/*
+ * Setup state to save posting items from a single posting list tuple. Saves
+ * the logical tuple that will be returned to scan first in passing.
+ *
+ * Saves an index item into so->currPos.items[itemIndex] for logical tuple
+ * that is returned to scan first. Second or subsequent heap TID for posting
+ * list should be saved by calling _bt_savepostingitem().
+ */
+static void
+_bt_setuppostingitems(BTScanOpaque so, int itemIndex, OffsetNumber offnum,
+ ItemPointer heapTid, IndexTuple itup)
+{
+ BTScanPosItem *currItem = &so->currPos.items[itemIndex];
+
+ currItem->heapTid = *heapTid;
+ currItem->indexOffset = offnum;
+
+ if (so->currTuples)
+ {
+ /* Save a base version of the IndexTuple */
+ Size itupsz = BTreeTupleGetPostingOffset(itup);
+
+ itupsz = MAXALIGN(itupsz);
+ currItem->tupleOffset = so->currPos.nextTupleOffset;
+ memcpy(so->currTuples + so->currPos.nextTupleOffset, itup, itupsz);
+ so->currPos.nextTupleOffset += itupsz;
+ so->currPos.postingTupleOffset = currItem->tupleOffset;
+ }
+}
+
+/*
+ * Save an index item into so->currPos.items[itemIndex] for posting tuple.
+ *
+ * Assumes that _bt_setuppostingitems() has already been called for current
+ * posting list tuple.
+ */
+static inline void
+_bt_savepostingitem(BTScanOpaque so, int itemIndex, OffsetNumber offnum,
+ ItemPointer heapTid)
+{
+ BTScanPosItem *currItem = &so->currPos.items[itemIndex];
+
+ currItem->heapTid = *heapTid;
+ currItem->indexOffset = offnum;
+
+ /*
+ * Have index-only scans return the same base IndexTuple for every logical
+ * tuple that originates from the same posting list
+ */
+ if (so->currTuples)
+ currItem->tupleOffset = so->currPos.postingTupleOffset;
+}
+
/*
* _bt_steppage() -- Step to next page containing valid data for scan
*
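The nbtsearch.c changes combine two ideas: _bt_compare() treats a scantid that falls inside a posting list's TID range as "equal", and _bt_binsrch_posting() then finds the slot inside the list. The following is a rough standalone sketch of both, under the assumption that TIDs order by block and then offset; Tid, tid_compare(), compare_to_posting() and posting_insert_offset() are hypothetical names, not functions from the patch.

```c
#include <assert.h>

/* Hypothetical stand-in for ItemPointerData. */
typedef struct
{
	unsigned		block;
	unsigned short	offset;
} Tid;

/* Three-way comparison, ordering by block number and then by offset. */
int
tid_compare(const Tid *a, const Tid *b)
{
	if (a->block != b->block)
		return a->block < b->block ? -1 : 1;
	if (a->offset != b->offset)
		return a->offset < b->offset ? -1 : 1;
	return 0;
}

/*
 * Compare scantid against a whole posting list (sorted, at least 2 TIDs).
 * Returns <0 below the lowest TID, >0 above the highest TID, and 0 for
 * anything inside the range, even when no TID matches exactly.
 */
int
compare_to_posting(const Tid *scantid, const Tid *posting, int nposting)
{
	int			result;

	assert(nposting >= 2);
	result = tid_compare(scantid, &posting[0]);
	if (result <= 0)
		return result;
	if (tid_compare(scantid, &posting[nposting - 1]) > 0)
		return 1;
	return 0;
}

/*
 * Binary search for the offset where scantid belongs in the posting list.
 * "high" starts one past the end, which keeps the loop invariant simple.
 */
int
posting_insert_offset(const Tid *scantid, const Tid *posting, int nposting)
{
	int			low = 0;
	int			high = nposting;

	while (high > low)
	{
		int			mid = low + (high - low) / 2;

		if (tid_compare(scantid, &posting[mid]) > 0)
			low = mid + 1;
		else
			high = mid;
	}

	return low;
}
```

With a posting list of (10,1), (10,3), (10,7), (12,1), a scantid of (10,5) compares as equal to the tuple and belongs at offset 2, which is the point where an inserter would have to split the posting list.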
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index ab19692006..ddf4b164e1 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -287,6 +287,9 @@ static void _bt_sortaddtup(Page page, Size itemsize,
IndexTuple itup, OffsetNumber itup_off);
static void _bt_buildadd(BTWriteState *wstate, BTPageState *state,
IndexTuple itup);
+static void _bt_sort_dedup_finish_pending(BTWriteState *wstate,
+ BTPageState *state,
+ BTDedupState *dstate);
static void _bt_uppershutdown(BTWriteState *wstate, BTPageState *state);
static void _bt_load(BTWriteState *wstate,
BTSpool *btspool, BTSpool *btspool2);
@@ -725,8 +728,8 @@ _bt_pagestate(BTWriteState *wstate, uint32 level)
if (level > 0)
state->btps_full = (BLCKSZ * (100 - BTREE_NONLEAF_FILLFACTOR) / 100);
else
- state->btps_full = RelationGetTargetPageFreeSpace(wstate->index,
- BTREE_DEFAULT_FILLFACTOR);
+ state->btps_full = BtreeGetTargetPageFreeSpace(wstate->index,
+ BTREE_DEFAULT_FILLFACTOR);
/* no parent level, yet */
state->btps_next = NULL;
@@ -799,7 +802,8 @@ _bt_sortaddtup(Page page,
}
/*----------
- * Add an item to a disk page from the sort output.
+ * Add an item to a disk page from the sort output (or add a posting list
+ * item formed from the sort output).
*
* We must be careful to observe the page layout conventions of nbtsearch.c:
* - rightmost pages start data items at P_HIKEY instead of at P_FIRSTKEY.
@@ -1002,6 +1006,7 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
* the minimum key for the new page.
*/
state->btps_minkey = CopyIndexTuple(oitup);
+ Assert(BTreeTupleIsPivot(state->btps_minkey));
/*
* Set the sibling links for both pages.
@@ -1043,6 +1048,7 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
Assert(state->btps_minkey == NULL);
state->btps_minkey = CopyIndexTuple(itup);
/* _bt_sortaddtup() will perform full truncation later */
+ BTreeTupleClearBtIsPosting(state->btps_minkey);
BTreeTupleSetNAtts(state->btps_minkey, 0);
}
@@ -1057,6 +1063,42 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
state->btps_lastoff = last_off;
}
+/*
+ * Finalize pending posting list tuple, and add it to the index. Final tuple
+ * is based on saved base tuple, and saved list of heap TIDs.
+ *
+ * This is almost like _bt_dedup_finish_pending(), but it adds a new tuple
+ * using _bt_buildadd() and does not maintain the intervals array.
+ */
+static void
+_bt_sort_dedup_finish_pending(BTWriteState *wstate, BTPageState *state,
+ BTDedupState *dstate)
+{
+ IndexTuple final;
+
+ Assert(dstate->nitems > 0);
+ if (dstate->nitems == 1)
+ final = dstate->base;
+ else
+ {
+ IndexTuple postingtuple;
+
+ /* form a tuple with a posting list */
+ postingtuple = _bt_form_posting(dstate->base,
+ dstate->htids,
+ dstate->nhtids);
+ final = postingtuple;
+ }
+
+ _bt_buildadd(wstate, state, final);
+
+ if (dstate->nitems > 1)
+ pfree(final);
+ /* Don't maintain dedup_intervals array, or alltupsize */
+ dstate->nhtids = 0;
+ dstate->nitems = 0;
+}
+
/*
* Finish writing out the completed btree.
*/
@@ -1123,7 +1165,8 @@ _bt_uppershutdown(BTWriteState *wstate, BTPageState *state)
* by filling in a valid magic number in the metapage.
*/
metapage = (Page) palloc(BLCKSZ);
- _bt_initmetapage(metapage, rootblkno, rootlevel);
+ _bt_initmetapage(metapage, rootblkno, rootlevel,
+ wstate->inskey->safededup);
_bt_blwritepage(wstate, metapage, BTREE_METAPAGE);
}
@@ -1144,6 +1187,10 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
keysz = IndexRelationGetNumberOfKeyAttributes(wstate->index);
SortSupport sortKeys;
int64 tuples_done = 0;
+ bool deduplicate;
+
+ deduplicate = wstate->inskey->safededup &&
+ BtreeGetDoDedupOption(wstate->index);
if (merge)
{
@@ -1255,9 +1302,97 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
}
pfree(sortKeys);
}
+ else if (deduplicate)
+ {
+ /* merge is unnecessary, deduplicate into posting lists */
+ BTDedupState *dstate;
+ IndexTuple newbase;
+
+ dstate = (BTDedupState *) palloc(sizeof(BTDedupState));
+ dstate->maxitemsize = 0; /* set later */
+ dstate->checkingunique = false; /* unused */
+ dstate->skippedbase = InvalidOffsetNumber;
+ dstate->newitem = NULL;
+ /* Metadata about current pending posting list */
+ dstate->htids = NULL;
+ dstate->nhtids = 0;
+ dstate->nitems = 0;
+ dstate->overlap = false;
+ dstate->alltupsize = 0; /* unused */
+ /* Metadata about base tuple of current pending posting list */
+ dstate->base = NULL;
+ dstate->baseoff = InvalidOffsetNumber; /* unused */
+ dstate->basetupsize = 0;
+
+ while ((itup = tuplesort_getindextuple(btspool->sortstate,
+ true)) != NULL)
+ {
+ /* When we see first tuple, create first index page */
+ if (state == NULL)
+ {
+ state = _bt_pagestate(wstate, 0);
+ dstate->maxitemsize = BTMaxItemSize(state->btps_page);
+ /* Conservatively size array */
+ dstate->htids = palloc(dstate->maxitemsize);
+
+ /*
+ * No previous/base tuple, since itup is the first item
+ * returned by the tuplesort -- use itup as base tuple of
+ * first pending posting list for entire index build
+ */
+ newbase = CopyIndexTuple(itup);
+ _bt_dedup_start_pending(dstate, newbase, InvalidOffsetNumber);
+ }
+ else if (_bt_keep_natts_fast(wstate->index, dstate->base,
+ itup) > keysz &&
+ _bt_dedup_save_htid(dstate, itup))
+ {
+ /*
+ * Tuple is equal to base tuple of pending posting list, and
+ * merging itup into pending posting list won't exceed the
+ * BTMaxItemSize() limit. Heap TID(s) for itup have been
+ * saved in state. The next iteration will also end up here
+ * if it's possible to merge the next tuple into the same
+ * pending posting list.
+ */
+ }
+ else
+ {
+ /*
+ * Tuple is not equal to pending posting list tuple, or
+ * BTMaxItemSize() limit was reached
+ */
+ _bt_sort_dedup_finish_pending(wstate, state, dstate);
+ /* Base tuple is always a copy */
+ pfree(dstate->base);
+
+ /* itup starts new pending posting list */
+ newbase = CopyIndexTuple(itup);
+ _bt_dedup_start_pending(dstate, newbase, InvalidOffsetNumber);
+ }
+
+ /* Report progress */
+ pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
+ ++tuples_done);
+ }
+
+ /*
+ * Handle the last item (there must be a last item when the tuplesort
+ * returned one or more tuples)
+ */
+ if (state)
+ {
+ _bt_sort_dedup_finish_pending(wstate, state, dstate);
+ /* Base tuple is always a copy */
+ pfree(dstate->base);
+ pfree(dstate->htids);
+ }
+
+ pfree(dstate);
+ }
else
{
- /* merge is unnecessary */
+ /* merging and deduplication are both unnecessary */
while ((itup = tuplesort_getindextuple(btspool->sortstate,
true)) != NULL)
{
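The deduplicating branch of _bt_load() is essentially a single pass over sorted input that keeps one pending posting list and flushes it when the key changes or a size limit is hit. Here is a simplified standalone sketch of that control flow. It assumes integer keys and a cap expressed as a TID count rather than the byte-based BTMaxItemSize() limit; Entry, Group and dedup_sorted() are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sorted input item: an index key plus one heap TID. */
typedef struct
{
	int			key;
	unsigned	tid;
} Entry;

/* One emitted tuple: key, position of its first TID in the input, TID count. */
typedef struct
{
	int			key;
	size_t		first;
	size_t		ntids;
} Group;

/*
 * Merge runs of equal keys from sorted input into groups of at most max_tids
 * entries.  "out" must have room for n groups.  Returns the group count.
 */
size_t
dedup_sorted(const Entry *in, size_t n, size_t max_tids, Group *out)
{
	size_t		ngroups = 0;

	assert(max_tids >= 1);

	for (size_t i = 0; i < n; i++)
	{
		if (ngroups > 0 &&
			out[ngroups - 1].key == in[i].key &&
			out[ngroups - 1].ntids < max_tids)
		{
			/* Same key and still room: extend the pending group */
			out[ngroups - 1].ntids++;
		}
		else
		{
			/* Key changed or limit reached: this entry starts a new group */
			out[ngroups].key = in[i].key;
			out[ngroups].first = i;
			out[ngroups].ntids = 1;
			ngroups++;
		}
	}

	return ngroups;
}
```

A group with ntids == 1 corresponds to the case where the patch writes the base tuple unchanged instead of forming a posting list.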
diff --git a/src/backend/access/nbtree/nbtsplitloc.c b/src/backend/access/nbtree/nbtsplitloc.c
index a04d4e25d6..7758d74101 100644
--- a/src/backend/access/nbtree/nbtsplitloc.c
+++ b/src/backend/access/nbtree/nbtsplitloc.c
@@ -167,7 +167,7 @@ _bt_findsplitloc(Relation rel,
/* Count up total space in data items before actually scanning 'em */
olddataitemstotal = rightspace - (int) PageGetExactFreeSpace(page);
- leaffillfactor = RelationGetFillFactor(rel, BTREE_DEFAULT_FILLFACTOR);
+ leaffillfactor = BtreeGetFillFactor(rel, BTREE_DEFAULT_FILLFACTOR);
/* Passed-in newitemsz is MAXALIGNED but does not include line pointer */
newitemsz += sizeof(ItemIdData);
@@ -183,6 +183,9 @@ _bt_findsplitloc(Relation rel,
state.minfirstrightsz = SIZE_MAX;
state.newitemoff = newitemoff;
+ /* newitem cannot be a posting list item */
+ Assert(!BTreeTupleIsPosting(newitem));
+
/*
* maxsplits should never exceed maxoff because there will be at most as
* many candidate split points as there are points _between_ tuples, once
@@ -459,17 +462,52 @@ _bt_recsplitloc(FindSplitData *state,
int16 leftfree,
rightfree;
Size firstrightitemsz;
+ Size postingsubhikey = 0;
bool newitemisfirstonright;
/* Is the new item going to be the first item on the right page? */
newitemisfirstonright = (firstoldonright == state->newitemoff
&& !newitemonleft);
+ /*
+ * FIXME: Accessing every single tuple like this adds cycles to cases that
+ * cannot possibly benefit (i.e. cases where we know that there cannot be
+ * posting lists). Maybe we should add a way to not bother when we are
+ * certain that this is the case.
+ *
+ * We could either have _bt_split() pass us a flag, or invent a page flag
+ * that indicates that the page might have posting lists, as an
+ * optimization. There is no shortage of btpo_flags bits for stuff like
+ * this.
+ */
if (newitemisfirstonright)
+ {
firstrightitemsz = state->newitemsz;
+
+ /* Calculate posting list overhead, if any */
+ if (state->is_leaf && BTreeTupleIsPosting(state->newitem))
+ postingsubhikey = IndexTupleSize(state->newitem) -
+ BTreeTupleGetPostingOffset(state->newitem);
+ }
else
+ {
firstrightitemsz = firstoldonrightsz;
+ /* Calculate posting list overhead, if any */
+ if (state->is_leaf)
+ {
+ ItemId itemid;
+ IndexTuple newhighkey;
+
+ itemid = PageGetItemId(state->page, firstoldonright);
+ newhighkey = (IndexTuple) PageGetItem(state->page, itemid);
+
+ if (BTreeTupleIsPosting(newhighkey))
+ postingsubhikey = IndexTupleSize(newhighkey) -
+ BTreeTupleGetPostingOffset(newhighkey);
+ }
+ }
+
/* Account for all the old tuples */
leftfree = state->leftspace - olddataitemstoleft;
rightfree = state->rightspace -
@@ -492,9 +530,13 @@ _bt_recsplitloc(FindSplitData *state,
* adding a heap TID to the left half's new high key when splitting at the
* leaf level. In practice the new high key will often be smaller and
* will rarely be larger, but conservatively assume the worst case.
+ * Truncation always truncates away any posting list that appears in the
+ * first right tuple, though, so it's safe to subtract that overhead
+ * (while still conservatively assuming that truncation might have to add
+ * back a single heap TID using the pivot tuple heap TID representation).
*/
if (state->is_leaf)
- leftfree -= (int16) (firstrightitemsz +
+ leftfree -= (int16) ((firstrightitemsz - postingsubhikey) +
MAXALIGN(sizeof(ItemPointerData)));
else
leftfree -= (int16) firstrightitemsz;
@@ -691,7 +733,8 @@ _bt_afternewitemoff(FindSplitData *state, OffsetNumber maxoff,
itemid = PageGetItemId(state->page, OffsetNumberPrev(state->newitemoff));
tup = (IndexTuple) PageGetItem(state->page, itemid);
/* Do cheaper test first */
- if (!_bt_adjacenthtid(&tup->t_tid, &state->newitem->t_tid))
+ if (BTreeTupleIsPosting(tup) ||
+ !_bt_adjacenthtid(&tup->t_tid, &state->newitem->t_tid))
return false;
/* Check same conditions as rightmost item case, too */
keepnatts = _bt_keep_natts_fast(state->rel, tup, state->newitem);
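The effect of postingsubhikey on the split-point accounting in _bt_recsplitloc() is easiest to see with numbers. The sketch below is illustrative arithmetic only. It assumes 8-byte maximum alignment and a 6-byte heap TID, which are typical but platform-dependent, and highkey_charge() is a hypothetical helper, not part of the patch.

```c
#include <stddef.h>

/* Assumption: 8-byte maximum alignment, as on most 64-bit platforms. */
#define SKETCH_MAXALIGN(len) (((size_t) (len) + 7) & ~((size_t) 7))

/* Assumption: a heap TID occupies 6 bytes. */
#define SKETCH_TID_SIZE 6

/*
 * Worst-case space charged to the left page for its new high key when the
 * first right item is a leaf tuple.  posting_bytes is the tuple size minus
 * its posting list offset, or 0 for a tuple without a posting list.
 */
size_t
highkey_charge(size_t firstrightitemsz, size_t posting_bytes)
{
	return (firstrightitemsz - posting_bytes) +
		SKETCH_MAXALIGN(SKETCH_TID_SIZE);
}
```

For example, a first right item of 412 bytes whose posting list accounts for 384 of them is charged 28 + 8 = 36 bytes, where ignoring the posting list would charge 420.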
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index bc855dd25d..92c1830d82 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -97,8 +97,6 @@ _bt_mkscankey(Relation rel, IndexTuple itup)
indoption = rel->rd_indoption;
tupnatts = itup ? BTreeTupleGetNAtts(itup, rel) : 0;
- Assert(tupnatts <= IndexRelationGetNumberOfAttributes(rel));
-
/*
* We'll execute search using scan key constructed on key columns.
* Truncated attributes and non-key attributes are omitted from the final
@@ -107,12 +105,25 @@ _bt_mkscankey(Relation rel, IndexTuple itup)
key = palloc(offsetof(BTScanInsertData, scankeys) +
sizeof(ScanKeyData) * indnkeyatts);
key->heapkeyspace = itup == NULL || _bt_heapkeyspace(rel);
+ key->safededup = itup == NULL ? _bt_opclasses_support_dedup(rel) :
+ _bt_safededup(rel);
key->anynullkeys = false; /* initial assumption */
key->nextkey = false;
key->pivotsearch = false;
+ key->scantid = NULL;
key->keysz = Min(indnkeyatts, tupnatts);
- key->scantid = key->heapkeyspace && itup ?
- BTreeTupleGetHeapTID(itup) : NULL;
+
+ Assert(tupnatts <= IndexRelationGetNumberOfAttributes(rel));
+ Assert(!itup || !BTreeTupleIsPosting(itup) || key->heapkeyspace);
+
+ /*
+ * When caller passes a tuple with a heap TID, use it to set scantid. Note
+ * that this handles posting list tuples by setting scantid to the lowest
+ * heap TID in the posting list.
+ */
+ if (itup && key->heapkeyspace)
+ key->scantid = BTreeTupleGetHeapTID(itup);
+
skey = key->scankeys;
for (i = 0; i < indnkeyatts; i++)
{
@@ -1386,6 +1397,7 @@ _bt_checkkeys(IndexScanDesc scan, IndexTuple tuple, int tupnatts,
* attribute passes the qual.
*/
Assert(ScanDirectionIsForward(dir));
+ Assert(BTreeTupleIsPivot(tuple));
continue;
}
@@ -1547,6 +1559,7 @@ _bt_check_rowcompare(ScanKey skey, IndexTuple tuple, int tupnatts,
* attribute passes the qual.
*/
Assert(ScanDirectionIsForward(dir));
+ Assert(BTreeTupleIsPivot(tuple));
cmpresult = 0;
if (subkey->sk_flags & SK_ROW_END)
break;
@@ -1786,10 +1799,35 @@ _bt_killitems(IndexScanDesc scan)
{
ItemId iid = PageGetItemId(page, offnum);
IndexTuple ituple = (IndexTuple) PageGetItem(page, iid);
+ bool killtuple = false;
- if (ItemPointerEquals(&ituple->t_tid, &kitem->heapTid))
+ if (BTreeTupleIsPosting(ituple))
{
- /* found the item */
+ int pi = i + 1;
+ int nposting = BTreeTupleGetNPosting(ituple);
+ int j;
+
+ for (j = 0; j < nposting; j++)
+ {
+ ItemPointer item = BTreeTupleGetPostingN(ituple, j);
+
+ if (!ItemPointerEquals(item, &kitem->heapTid))
+ break; /* out of posting list loop */
+
+ /* Read-ahead to later kitems */
+ if (pi < numKilled)
+ kitem = &so->currPos.items[so->killedItems[pi++]];
+ }
+
+ if (j == nposting)
+ killtuple = true;
+ }
+ else if (ItemPointerEquals(&ituple->t_tid, &kitem->heapTid))
+ killtuple = true;
+
+ if (killtuple)
+ {
+ /* found the item/all posting list items */
ItemIdMarkDead(iid);
killedsomething = true;
break; /* out of inner search loop */
@@ -2027,7 +2065,30 @@ BTreeShmemInit(void)
bytea *
btoptions(Datum reloptions, bool validate)
{
- return default_reloptions(reloptions, validate, RELOPT_KIND_BTREE);
+ relopt_value *options;
+ BtreeOptions *rdopts;
+ int numoptions;
+ static const relopt_parse_elt tab[] = {
+ {"fillfactor", RELOPT_TYPE_INT, offsetof(BtreeOptions, fillfactor)},
+ {"vacuum_cleanup_index_scale_factor", RELOPT_TYPE_REAL,
+ offsetof(BtreeOptions, vacuum_cleanup_index_scale_factor)},
+ {"deduplication", RELOPT_TYPE_BOOL, offsetof(BtreeOptions, dedup_enabled)}
+ };
+
+ options = parseRelOptions(reloptions, validate, RELOPT_KIND_BTREE,
+ &numoptions);
+
+ /* if none set, we're done */
+ if (numoptions == 0)
+ return NULL;
+
+ rdopts = allocateReloptStruct(sizeof(BtreeOptions), options, numoptions);
+
+ fillRelOptions((void *) rdopts, sizeof(BtreeOptions), options, numoptions,
+ validate, tab, lengthof(tab));
+
+ pfree(options);
+ return (bytea *) rdopts;
}
/*
@@ -2140,6 +2201,24 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
pivot = index_truncate_tuple(itupdesc, firstright, keepnatts);
+ if (BTreeTupleIsPosting(firstright))
+ {
+ BTreeTupleClearBtIsPosting(pivot);
+ BTreeTupleSetNAtts(pivot, keepnatts);
+ if (keepnatts == natts)
+ {
+ /*
+ * index_truncate_tuple() just returned a copy of the
+ * original, so make sure that the size of the new pivot tuple
+ * doesn't have posting list overhead
+ */
+ pivot->t_info &= ~INDEX_SIZE_MASK;
+ pivot->t_info |= MAXALIGN(BTreeTupleGetPostingOffset(firstright));
+ }
+ }
+
+ Assert(!BTreeTupleIsPosting(pivot));
+
/*
* If there is a distinguishing key attribute within new pivot tuple,
* there is no need to add an explicit heap TID attribute
@@ -2156,6 +2235,8 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
* attribute to the new pivot tuple.
*/
Assert(natts != nkeyatts);
+ Assert(!BTreeTupleIsPosting(lastleft) &&
+ !BTreeTupleIsPosting(firstright));
newsize = IndexTupleSize(pivot) + MAXALIGN(sizeof(ItemPointerData));
tidpivot = palloc0(newsize);
memcpy(tidpivot, pivot, IndexTupleSize(pivot));
@@ -2163,6 +2244,24 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
pfree(pivot);
pivot = tidpivot;
}
+ else if (BTreeTupleIsPosting(firstright))
+ {
+ /*
+ * No truncation was possible, since key attributes are all equal. We
+ * can always truncate away a posting list, though.
+ *
+ * It's necessary to add a heap TID attribute to the new pivot tuple.
+ */
+ newsize = MAXALIGN(BTreeTupleGetPostingOffset(firstright)) +
+ MAXALIGN(sizeof(ItemPointerData));
+ pivot = palloc0(newsize);
+ memcpy(pivot, firstright, BTreeTupleGetPostingOffset(firstright));
+
+ pivot->t_info &= ~INDEX_SIZE_MASK;
+ pivot->t_info |= newsize;
+ BTreeTupleClearBtIsPosting(pivot);
+ BTreeTupleSetAltHeapTID(pivot);
+ }
else
{
/*
@@ -2170,7 +2269,8 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
* It's necessary to add a heap TID attribute to the new pivot tuple.
*/
Assert(natts == nkeyatts);
- newsize = IndexTupleSize(firstright) + MAXALIGN(sizeof(ItemPointerData));
+ newsize = MAXALIGN(IndexTupleSize(firstright)) +
+ MAXALIGN(sizeof(ItemPointerData));
pivot = palloc0(newsize);
memcpy(pivot, firstright, IndexTupleSize(firstright));
}
@@ -2188,6 +2288,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
* nbtree (e.g., there is no pg_attribute entry).
*/
Assert(itup_key->heapkeyspace);
+ Assert(!BTreeTupleIsPosting(pivot));
pivot->t_info &= ~INDEX_SIZE_MASK;
pivot->t_info |= newsize;
@@ -2200,7 +2301,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
*/
pivotheaptid = (ItemPointer) ((char *) pivot + newsize -
sizeof(ItemPointerData));
- ItemPointerCopy(&lastleft->t_tid, pivotheaptid);
+ ItemPointerCopy(BTreeTupleGetMaxHeapTID(lastleft), pivotheaptid);
/*
* Lehman and Yao require that the downlink to the right page, which is to
@@ -2211,9 +2312,12 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
* tiebreaker.
*/
#ifndef DEBUG_NO_TRUNCATE
- Assert(ItemPointerCompare(&lastleft->t_tid, &firstright->t_tid) < 0);
- Assert(ItemPointerCompare(pivotheaptid, &lastleft->t_tid) >= 0);
- Assert(ItemPointerCompare(pivotheaptid, &firstright->t_tid) < 0);
+ Assert(ItemPointerCompare(BTreeTupleGetMaxHeapTID(lastleft),
+ BTreeTupleGetHeapTID(firstright)) < 0);
+ Assert(ItemPointerCompare(pivotheaptid,
+ BTreeTupleGetHeapTID(lastleft)) >= 0);
+ Assert(ItemPointerCompare(pivotheaptid,
+ BTreeTupleGetHeapTID(firstright)) < 0);
#else
/*
@@ -2226,7 +2330,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
* attribute values along with lastleft's heap TID value when lastleft's
* TID happens to be greater than firstright's TID.
*/
- ItemPointerCopy(&firstright->t_tid, pivotheaptid);
+ ItemPointerCopy(BTreeTupleGetHeapTID(firstright), pivotheaptid);
/*
* Pivot heap TID should never be fully equal to firstright. Note that
@@ -2235,7 +2339,8 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
*/
ItemPointerSetOffsetNumber(pivotheaptid,
OffsetNumberPrev(ItemPointerGetOffsetNumber(pivotheaptid)));
- Assert(ItemPointerCompare(pivotheaptid, &firstright->t_tid) < 0);
+ Assert(ItemPointerCompare(pivotheaptid,
+ BTreeTupleGetHeapTID(firstright)) < 0);
#endif
BTreeTupleSetNAtts(pivot, nkeyatts);
@@ -2316,15 +2421,22 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
* The approach taken here usually provides the same answer as _bt_keep_natts
* will (for the same pair of tuples from a heapkeyspace index), since the
* majority of btree opclasses can never indicate that two datums are equal
- * unless they're bitwise equal (once detoasted). Similarly, result may
- * differ from the _bt_keep_natts result when either tuple has TOASTed datums,
- * though this is barely possible in practice.
+ * unless they're bitwise equal after detoasting.
*
* These issues must be acceptable to callers, typically because they're only
* concerned about making suffix truncation as effective as possible without
* leaving excessive amounts of free space on either side of page split.
* Callers can rely on the fact that attributes considered equal here are
* definitely also equal according to _bt_keep_natts.
+ *
+ * When an index only uses opclasses where _bt_opclasses_support_dedup()
+ * reports that deduplication is safe, this function is guaranteed to give the
+ * same result as _bt_keep_natts().
+ *
+ * FIXME: Actually invent the needed "equality-is-precise" opclass
+ * infrastructure. See dedicated -hackers thread:
+ *
+ * https://postgr.es/m/CAH2-Wzn3Ee49Gmxb7V1VJ3-AC8fWn-Fr8pfWQebHe8rYRxt5OQ@mail.gmail.com
*/
int
_bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
@@ -2350,7 +2462,7 @@ _bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
break;
if (!isNull1 &&
- !datumIsEqual(datum1, datum2, att->attbyval, att->attlen))
+ !datum_image_eq(datum1, datum2, att->attbyval, att->attlen))
break;
keepnatts++;
@@ -2402,22 +2514,30 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
tupnatts = BTreeTupleGetNAtts(itup, rel);
+ /* !heapkeyspace indexes do not support deduplication */
+ if (!heapkeyspace && BTreeTupleIsPosting(itup))
+ return false;
+
+ /* INCLUDE indexes do not support deduplication */
+ if (natts != nkeyatts && BTreeTupleIsPosting(itup))
+ return false;
+
if (P_ISLEAF(opaque))
{
if (offnum >= P_FIRSTDATAKEY(opaque))
{
/*
- * Non-pivot tuples currently never use alternative heap TID
- * representation -- even those within heapkeyspace indexes
+ * Non-pivot tuple should never be explicitly marked as a pivot
+ * tuple
*/
- if ((itup->t_info & INDEX_ALT_TID_MASK) != 0)
+ if (BTreeTupleIsPivot(itup))
return false;
/*
* Leaf tuples that are not the page high key (non-pivot tuples)
* should never be truncated. (Note that tupnatts must have been
- * inferred, rather than coming from an explicit on-disk
- * representation.)
+ * inferred, even with a posting list tuple, because only pivot
+ * tuples store tupnatts directly.)
*/
return tupnatts == natts;
}
@@ -2461,12 +2581,12 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
* non-zero, or when there is no explicit representation and the
* tuple is evidently not a pre-pg_upgrade tuple.
*
- * Prior to v11, downlinks always had P_HIKEY as their offset. Use
- * that to decide if the tuple is a pre-v11 tuple.
+ * Prior to v11, downlinks always had P_HIKEY as their offset.
+ * Accept that as an alternative indication of a valid
+ * !heapkeyspace negative infinity tuple.
*/
return tupnatts == 0 ||
- ((itup->t_info & INDEX_ALT_TID_MASK) == 0 &&
- ItemPointerGetOffsetNumber(&(itup->t_tid)) == P_HIKEY);
+ ItemPointerGetOffsetNumber(&(itup->t_tid)) == P_HIKEY;
}
else
{
@@ -2492,7 +2612,11 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
* heapkeyspace index pivot tuples, regardless of whether or not there are
* non-key attributes.
*/
- if ((itup->t_info & INDEX_ALT_TID_MASK) == 0)
+ if (!BTreeTupleIsPivot(itup))
+ return false;
+
+ /* Pivot tuple should not use posting list representation (redundant) */
+ if (BTreeTupleIsPosting(itup))
return false;
/*
@@ -2562,11 +2686,44 @@ _bt_check_third_page(Relation rel, Relation heap, bool needheaptidspace,
BTMaxItemSizeNoHeapTid(page),
RelationGetRelationName(rel)),
errdetail("Index row references tuple (%u,%u) in relation \"%s\".",
- ItemPointerGetBlockNumber(&newtup->t_tid),
- ItemPointerGetOffsetNumber(&newtup->t_tid),
+ ItemPointerGetBlockNumber(BTreeTupleGetHeapTID(newtup)),
+ ItemPointerGetOffsetNumber(BTreeTupleGetHeapTID(newtup)),
RelationGetRelationName(heap)),
errhint("Values larger than 1/3 of a buffer page cannot be indexed.\n"
"Consider a function index of an MD5 hash of the value, "
"or use full text indexing."),
errtableconstraint(heap, RelationGetRelationName(rel))));
}
+
+/*
+ * Is it safe to perform deduplication for an index, given the opclasses and
+ * collations used?
+ *
+ * Returned value is stored in index metapage during index builds.
+ *
+ * Note: This does not account for pg_upgrade'd !heapkeyspace indexes
+ */
+bool
+_bt_opclasses_support_dedup(Relation index)
+{
+ /* INCLUDE indexes don't support deduplication */
+ if (IndexRelationGetNumberOfAttributes(index) !=
+ IndexRelationGetNumberOfKeyAttributes(index))
+ return false;
+
+ for (int i = 0; i < IndexRelationGetNumberOfKeyAttributes(index); i++)
+ {
+ Oid opfamily = index->rd_opfamily[i];
+ Oid collation = index->rd_indcollation[i];
+
+ /* TODO add adequate check of opclasses and collations */
+ elog(DEBUG4, "index %s column i %d opfamilyOid %u collationOid %u",
+ RelationGetRelationName(index), i, opfamily, collation);
+
+ /* NUMERIC btree opfamily OID is 1988 */
+ if (opfamily == 1988)
+ return false;
+ }
+
+ return true;
+}
diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index dd5315c1aa..27694246e2 100644
--- a/src/backend/access/nbtree/nbtxlog.c
+++ b/src/backend/access/nbtree/nbtxlog.c
@@ -21,8 +21,11 @@
#include "access/xlog.h"
#include "access/xlogutils.h"
#include "storage/procarray.h"
+#include "utils/memutils.h"
#include "miscadmin.h"
+static MemoryContext opCtx; /* working memory for operations */
+
/*
* _bt_restore_page -- re-enter all the index tuples on a page
*
@@ -111,6 +114,7 @@ _bt_restore_meta(XLogReaderState *record, uint8 block_id)
Assert(md->btm_version >= BTREE_NOVAC_VERSION);
md->btm_oldest_btpo_xact = xlrec->oldest_btpo_xact;
md->btm_last_cleanup_num_heap_tuples = xlrec->last_cleanup_num_heap_tuples;
+ md->btm_safededup = xlrec->btm_safededup;
pageop = (BTPageOpaque) PageGetSpecialPointer(metapg);
pageop->btpo_flags = BTP_META;
@@ -181,9 +185,45 @@ btree_xlog_insert(bool isleaf, bool ismeta, XLogReaderState *record)
page = BufferGetPage(buffer);
- if (PageAddItem(page, (Item) datapos, datalen, xlrec->offnum,
- false, false) == InvalidOffsetNumber)
- elog(PANIC, "btree_xlog_insert: failed to add item");
+ if (xlrec->postingoff == InvalidOffsetNumber)
+ {
+ /* Simple retail insertion */
+ if (PageAddItem(page, (Item) datapos, datalen, xlrec->offnum,
+ false, false) == InvalidOffsetNumber)
+ elog(PANIC, "btree_xlog_insert: failed to add item");
+ }
+ else
+ {
+ ItemId itemid;
+ IndexTuple oposting,
+ newitem,
+ nposting;
+
+ /*
+ * A posting list split occurred during insertion.
+ *
+ * Use _bt_swap_posting() to repeat posting list split steps from
+ * primary. Note that newitem from WAL record is 'orignewitem',
+ * not the final version of newitem that is actually inserted on
+ * page.
+ */
+ Assert(isleaf);
+ itemid = PageGetItemId(page, OffsetNumberPrev(xlrec->offnum));
+ oposting = (IndexTuple) PageGetItem(page, itemid);
+
+ /* newitem must be mutable copy for _bt_swap_posting() */
+ newitem = CopyIndexTuple((IndexTuple) datapos);
+ nposting = _bt_swap_posting(newitem, oposting, xlrec->postingoff);
+
+ /* Replace existing posting list with post-split version */
+ memcpy(oposting, nposting, MAXALIGN(IndexTupleSize(nposting)));
+
+ /* insert new item */
+ Assert(IndexTupleSize(newitem) == datalen);
+ if (PageAddItem(page, (Item) newitem, datalen, xlrec->offnum,
+ false, false) == InvalidOffsetNumber)
+ elog(PANIC, "btree_xlog_insert: failed to add posting split new item");
+ }
PageSetLSN(page, lsn);
MarkBufferDirty(buffer);
@@ -265,20 +305,38 @@ btree_xlog_split(bool onleft, XLogReaderState *record)
BTPageOpaque lopaque = (BTPageOpaque) PageGetSpecialPointer(lpage);
OffsetNumber off;
IndexTuple newitem = NULL,
- left_hikey = NULL;
+ left_hikey = NULL,
+ nposting = NULL;
Size newitemsz = 0,
left_hikeysz = 0;
Page newlpage;
- OffsetNumber leftoff;
+ OffsetNumber leftoff,
+ replacepostingoff = InvalidOffsetNumber;
datapos = XLogRecGetBlockData(record, 0, &datalen);
- if (onleft)
+ if (onleft || xlrec->postingoff != 0)
{
newitem = (IndexTuple) datapos;
newitemsz = MAXALIGN(IndexTupleSize(newitem));
datapos += newitemsz;
datalen -= newitemsz;
+
+ if (xlrec->postingoff != 0)
+ {
+ ItemId itemid;
+ IndexTuple oposting;
+
+ /* Posting list must be at offset number before new item's */
+ replacepostingoff = OffsetNumberPrev(xlrec->newitemoff);
+
+ /* newitem must be mutable copy for _bt_swap_posting() */
+ newitem = CopyIndexTuple(newitem);
+ itemid = PageGetItemId(lpage, replacepostingoff);
+ oposting = (IndexTuple) PageGetItem(lpage, itemid);
+ nposting = _bt_swap_posting(newitem, oposting,
+ xlrec->postingoff);
+ }
}
/* Extract left hikey and its size (assuming 16-bit alignment) */
@@ -304,8 +362,20 @@ btree_xlog_split(bool onleft, XLogReaderState *record)
Size itemsz;
IndexTuple item;
+ /* Add replacement posting list when required */
+ if (off == replacepostingoff)
+ {
+ Assert(onleft || xlrec->firstright == xlrec->newitemoff);
+ if (PageAddItem(newlpage, (Item) nposting,
+ MAXALIGN(IndexTupleSize(nposting)), leftoff,
+ false, false) == InvalidOffsetNumber)
+ elog(ERROR, "failed to add new posting list item to left page after split");
+ leftoff = OffsetNumberNext(leftoff);
+ continue;
+ }
+
/* add the new item if it was inserted on left page */
- if (onleft && off == xlrec->newitemoff)
+ else if (onleft && off == xlrec->newitemoff)
{
if (PageAddItem(newlpage, (Item) newitem, newitemsz, leftoff,
false, false) == InvalidOffsetNumber)
@@ -379,6 +449,84 @@ btree_xlog_split(bool onleft, XLogReaderState *record)
}
}
+static void
+btree_xlog_dedup(XLogReaderState *record)
+{
+ XLogRecPtr lsn = record->EndRecPtr;
+ Buffer buf;
+ xl_btree_dedup *xlrec = (xl_btree_dedup *) XLogRecGetData(record);
+
+ if (XLogReadBufferForRedo(record, 0, &buf) == BLK_NEEDS_REDO)
+ {
+ /*
+ * Initialize a temporary empty page and copy all the items to that in
+ * item number order.
+ */
+ Page page = (Page) BufferGetPage(buf);
+ OffsetNumber offnum;
+ BTDedupState *state;
+
+ state = (BTDedupState *) palloc(sizeof(BTDedupState));
+
+ state->maxitemsize = BTMaxItemSize(page);
+ state->checkingunique = false; /* unused */
+ state->skippedbase = InvalidOffsetNumber;
+ state->newitem = NULL;
+ /* Metadata about current pending posting list */
+ state->htids = NULL;
+ state->nhtids = 0;
+ state->nitems = 0;
+ state->alltupsize = 0;
+ state->overlap = false;
+ /* Metadata about base tuple of current pending posting list */
+ state->base = NULL;
+ state->baseoff = InvalidOffsetNumber;
+ state->basetupsize = 0;
+
+ /* Conservatively size array */
+ state->htids = palloc(state->maxitemsize);
+
+ /*
+ * Iterate over tuples on the page belonging to the interval to
+ * deduplicate them into a posting list.
+ */
+ for (offnum = xlrec->baseoff;
+ offnum < xlrec->baseoff + xlrec->nitems;
+ offnum = OffsetNumberNext(offnum))
+ {
+ ItemId itemid = PageGetItemId(page, offnum);
+ IndexTuple itup = (IndexTuple) PageGetItem(page, itemid);
+
+ Assert(!ItemIdIsDead(itemid));
+
+ if (offnum == xlrec->baseoff)
+ {
+ /*
+ * No previous/base tuple for first data item -- use first
+ * data item as base tuple of first pending posting list
+ */
+ _bt_dedup_start_pending(state, itup, offnum);
+ }
+ else
+ {
+ /* Heap TID(s) for itup will be saved in state */
+ if (!_bt_dedup_save_htid(state, itup))
+ elog(ERROR, "could not add heap tid to pending posting list");
+ }
+ }
+
+ Assert(state->nitems == xlrec->nitems);
+ /* Handle the last item */
+ _bt_dedup_finish_pending(buf, state, false);
+
+ PageSetLSN(page, lsn);
+ MarkBufferDirty(buf);
+ }
+
+ if (BufferIsValid(buf))
+ UnlockReleaseBuffer(buf);
+}
+
static void
btree_xlog_vacuum(XLogReaderState *record)
{
@@ -386,8 +534,8 @@ btree_xlog_vacuum(XLogReaderState *record)
Buffer buffer;
Page page;
BTPageOpaque opaque;
-#ifdef UNUSED
xl_btree_vacuum *xlrec = (xl_btree_vacuum *) XLogRecGetData(record);
+#ifdef UNUSED
/*
* This section of code is thought to be no longer needed, after analysis
@@ -478,14 +626,34 @@ btree_xlog_vacuum(XLogReaderState *record)
if (len > 0)
{
- OffsetNumber *unused;
- OffsetNumber *unend;
+ if (xlrec->nupdated > 0)
+ {
+ OffsetNumber *updatedoffsets;
+ IndexTuple updated;
+ Size itemsz;
- unused = (OffsetNumber *) ptr;
- unend = (OffsetNumber *) ((char *) ptr + len);
+ updatedoffsets = (OffsetNumber *)
+ (ptr + xlrec->ndeleted * sizeof(OffsetNumber));
+ updated = (IndexTuple) ((char *) updatedoffsets +
+ xlrec->nupdated * sizeof(OffsetNumber));
- if ((unend - unused) > 0)
- PageIndexMultiDelete(page, unused, unend - unused);
+ /* Handle posting tuples */
+ for (int i = 0; i < xlrec->nupdated; i++)
+ {
+ PageIndexTupleDelete(page, updatedoffsets[i]);
+
+ itemsz = MAXALIGN(IndexTupleSize(updated));
+
+ if (PageAddItem(page, (Item) updated, itemsz, updatedoffsets[i],
+ false, false) == InvalidOffsetNumber)
+ elog(PANIC, "btree_xlog_vacuum: failed to add updated posting list item");
+
+ updated = (IndexTuple) ((char *) updated + itemsz);
+ }
+ }
+
+ if (xlrec->ndeleted)
+ PageIndexMultiDelete(page, (OffsetNumber *) ptr, xlrec->ndeleted);
}
/*
@@ -820,7 +988,9 @@ void
btree_redo(XLogReaderState *record)
{
uint8 info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+ MemoryContext oldCtx;
+ oldCtx = MemoryContextSwitchTo(opCtx);
switch (info)
{
case XLOG_BTREE_INSERT_LEAF:
@@ -838,6 +1008,9 @@ btree_redo(XLogReaderState *record)
case XLOG_BTREE_SPLIT_R:
btree_xlog_split(false, record);
break;
+ case XLOG_BTREE_DEDUP_PAGE:
+ btree_xlog_dedup(record);
+ break;
case XLOG_BTREE_VACUUM:
btree_xlog_vacuum(record);
break;
@@ -863,6 +1036,23 @@ btree_redo(XLogReaderState *record)
default:
elog(PANIC, "btree_redo: unknown op code %u", info);
}
+ MemoryContextSwitchTo(oldCtx);
+ MemoryContextReset(opCtx);
+}
+
+void
+btree_xlog_startup(void)
+{
+ opCtx = AllocSetContextCreate(CurrentMemoryContext,
+ "Btree recovery temporary context",
+ ALLOCSET_DEFAULT_SIZES);
+}
+
+void
+btree_xlog_cleanup(void)
+{
+ MemoryContextDelete(opCtx);
+ opCtx = NULL;
}
/*
diff --git a/src/backend/access/rmgrdesc/nbtdesc.c b/src/backend/access/rmgrdesc/nbtdesc.c
index 4ee6d04a68..1dde2da285 100644
--- a/src/backend/access/rmgrdesc/nbtdesc.c
+++ b/src/backend/access/rmgrdesc/nbtdesc.c
@@ -30,7 +30,8 @@ btree_desc(StringInfo buf, XLogReaderState *record)
{
xl_btree_insert *xlrec = (xl_btree_insert *) rec;
- appendStringInfo(buf, "off %u", xlrec->offnum);
+ appendStringInfo(buf, "off %u; postingoff %u",
+ xlrec->offnum, xlrec->postingoff);
break;
}
case XLOG_BTREE_SPLIT_L:
@@ -38,16 +39,30 @@ btree_desc(StringInfo buf, XLogReaderState *record)
{
xl_btree_split *xlrec = (xl_btree_split *) rec;
- appendStringInfo(buf, "level %u, firstright %d, newitemoff %d",
- xlrec->level, xlrec->firstright, xlrec->newitemoff);
+ appendStringInfo(buf, "level %u, firstright %d, newitemoff %d, postingoff %d",
+ xlrec->level,
+ xlrec->firstright,
+ xlrec->newitemoff,
+ xlrec->postingoff);
+ break;
+ }
+ case XLOG_BTREE_DEDUP_PAGE:
+ {
+ xl_btree_dedup *xlrec = (xl_btree_dedup *) rec;
+
+ appendStringInfo(buf, "baseoff %u; nitems %u",
+ xlrec->baseoff,
+ xlrec->nitems);
break;
}
case XLOG_BTREE_VACUUM:
{
xl_btree_vacuum *xlrec = (xl_btree_vacuum *) rec;
- appendStringInfo(buf, "lastBlockVacuumed %u",
- xlrec->lastBlockVacuumed);
+ appendStringInfo(buf, "lastBlockVacuumed %u; nupdated %u; ndeleted %u",
+ xlrec->lastBlockVacuumed,
+ xlrec->nupdated,
+ xlrec->ndeleted);
break;
}
case XLOG_BTREE_DELETE:
@@ -131,6 +146,9 @@ btree_identify(uint8 info)
case XLOG_BTREE_SPLIT_R:
id = "SPLIT_R";
break;
+ case XLOG_BTREE_DEDUP_PAGE:
+ id = "DEDUPLICATE";
+ break;
case XLOG_BTREE_VACUUM:
id = "VACUUM";
break;
diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index 05e7d678ed..4e76c39a6c 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -145,6 +145,7 @@ static void bt_tuple_present_callback(Relation index, HeapTuple htup,
bool tupleIsAlive, void *checkstate);
static IndexTuple bt_normalize_tuple(BtreeCheckState *state,
IndexTuple itup);
+static inline IndexTuple bt_posting_logical_tuple(IndexTuple itup, int n);
static bool bt_rootdescend(BtreeCheckState *state, IndexTuple itup);
static inline bool offset_is_negative_infinity(BTPageOpaque opaque,
OffsetNumber offset);
@@ -419,12 +420,13 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
/*
* Size Bloom filter based on estimated number of tuples in index,
* while conservatively assuming that each block must contain at least
- * MaxIndexTuplesPerPage / 5 non-pivot tuples. (Non-leaf pages cannot
- * contain non-pivot tuples. That's okay because they generally make
- * up no more than about 1% of all pages in the index.)
+ * MaxPostingIndexTuplesPerPage / 3 "logical" tuples. heapallindexed
+ * verification fingerprints posting list heap TIDs as plain non-pivot
+ * tuples, complete with index keys. This allows its heap scan to
+ * behave as if posting lists do not exist.
*/
total_pages = RelationGetNumberOfBlocks(rel);
- total_elems = Max(total_pages * (MaxIndexTuplesPerPage / 5),
+ total_elems = Max(total_pages * (MaxPostingIndexTuplesPerPage / 3),
(int64) state->rel->rd_rel->reltuples);
/* Random seed relies on backend srandom() call to avoid repetition */
seed = random();
@@ -924,6 +926,7 @@ bt_target_page_check(BtreeCheckState *state)
size_t tupsize;
BTScanInsert skey;
bool lowersizelimit;
+ ItemPointer scantid;
CHECK_FOR_INTERRUPTS();
@@ -994,29 +997,73 @@ bt_target_page_check(BtreeCheckState *state)
/*
* Readonly callers may optionally verify that non-pivot tuples can
- * each be found by an independent search that starts from the root
+ * each be found by an independent search that starts from the root.
+ * Note that we deliberately don't do individual searches for each
+ * "logical" posting list tuple, since the posting list itself is
+ * validated by other checks.
*/
if (state->rootdescend && P_ISLEAF(topaque) &&
!bt_rootdescend(state, itup))
{
char *itid,
*htid;
+ ItemPointer tid = BTreeTupleGetHeapTID(itup);
itid = psprintf("(%u,%u)", state->targetblock, offset);
htid = psprintf("(%u,%u)",
- ItemPointerGetBlockNumber(&(itup->t_tid)),
- ItemPointerGetOffsetNumber(&(itup->t_tid)));
+ ItemPointerGetBlockNumber(tid),
+ ItemPointerGetOffsetNumber(tid));
ereport(ERROR,
(errcode(ERRCODE_INDEX_CORRUPTED),
errmsg("could not find tuple using search from root page in index \"%s\"",
RelationGetRelationName(state->rel)),
- errdetail_internal("Index tid=%s points to heap tid=%s page lsn=%X/%X.",
+ errdetail_internal("Index tid=%s min heap tid=%s page lsn=%X/%X.",
itid, htid,
(uint32) (state->targetlsn >> 32),
(uint32) state->targetlsn)));
}
+ /*
+ * If tuple is actually a posting list, make sure posting list TIDs
+ * are in order.
+ */
+ if (BTreeTupleIsPosting(itup))
+ {
+ ItemPointerData last;
+ ItemPointer current;
+
+ ItemPointerCopy(BTreeTupleGetHeapTID(itup), &last);
+
+ for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
+ {
+
+ current = BTreeTupleGetPostingN(itup, i);
+
+ if (ItemPointerCompare(current, &last) <= 0)
+ {
+ char *itid,
+ *htid;
+
+ itid = psprintf("(%u,%u)", state->targetblock, offset);
+ htid = psprintf("(%u,%u)",
+ ItemPointerGetBlockNumberNoCheck(current),
+ ItemPointerGetOffsetNumberNoCheck(current));
+
+ ereport(ERROR,
+ (errcode(ERRCODE_INDEX_CORRUPTED),
+ errmsg("posting list heap TIDs out of order in index \"%s\"",
+ RelationGetRelationName(state->rel)),
+ errdetail_internal("Index tid=%s min heap tid=%s page lsn=%X/%X.",
+ itid, htid,
+ (uint32) (state->targetlsn >> 32),
+ (uint32) state->targetlsn)));
+ }
+
+ ItemPointerCopy(current, &last);
+ }
+ }
+
/* Build insertion scankey for current page offset */
skey = bt_mkscankey_pivotsearch(state->rel, itup);
@@ -1074,12 +1121,32 @@ bt_target_page_check(BtreeCheckState *state)
{
IndexTuple norm;
- norm = bt_normalize_tuple(state, itup);
- bloom_add_element(state->filter, (unsigned char *) norm,
- IndexTupleSize(norm));
- /* Be tidy */
- if (norm != itup)
- pfree(norm);
+ if (BTreeTupleIsPosting(itup))
+ {
+ /* Fingerprint all elements as distinct "logical" tuples */
+ for (int i = 0; i < BTreeTupleGetNPosting(itup); i++)
+ {
+ IndexTuple logtuple;
+
+ logtuple = bt_posting_logical_tuple(itup, i);
+ norm = bt_normalize_tuple(state, logtuple);
+ bloom_add_element(state->filter, (unsigned char *) norm,
+ IndexTupleSize(norm));
+ /* Be tidy */
+ if (norm != logtuple)
+ pfree(norm);
+ pfree(logtuple);
+ }
+ }
+ else
+ {
+ norm = bt_normalize_tuple(state, itup);
+ bloom_add_element(state->filter, (unsigned char *) norm,
+ IndexTupleSize(norm));
+ /* Be tidy */
+ if (norm != itup)
+ pfree(norm);
+ }
}
/*
@@ -1087,7 +1154,8 @@ bt_target_page_check(BtreeCheckState *state)
*
* If there is a high key (if this is not the rightmost page on its
* entire level), check that high key actually is upper bound on all
- * page items.
+ * page items. If this is a posting list tuple, we'll need to set
+ * scantid to be highest TID in posting list.
*
* We prefer to check all items against high key rather than checking
* just the last and trusting that the operator class obeys the
@@ -1127,6 +1195,9 @@ bt_target_page_check(BtreeCheckState *state)
* tuple. (See also: "Notes About Data Representation" in the nbtree
* README.)
*/
+ scantid = skey->scantid;
+ if (state->heapkeyspace && !BTreeTupleIsPivot(itup))
+ skey->scantid = BTreeTupleGetMaxHeapTID(itup);
if (!P_RIGHTMOST(topaque) &&
!(P_ISLEAF(topaque) ? invariant_leq_offset(state, skey, P_HIKEY) :
invariant_l_offset(state, skey, P_HIKEY)))
@@ -1150,6 +1221,7 @@ bt_target_page_check(BtreeCheckState *state)
(uint32) (state->targetlsn >> 32),
(uint32) state->targetlsn)));
}
+ skey->scantid = scantid;
/*
* * Item order check *
@@ -1164,11 +1236,13 @@ bt_target_page_check(BtreeCheckState *state)
*htid,
*nitid,
*nhtid;
+ ItemPointer tid;
itid = psprintf("(%u,%u)", state->targetblock, offset);
+ tid = BTreeTupleGetHeapTID(itup);
htid = psprintf("(%u,%u)",
- ItemPointerGetBlockNumberNoCheck(&(itup->t_tid)),
- ItemPointerGetOffsetNumberNoCheck(&(itup->t_tid)));
+ ItemPointerGetBlockNumberNoCheck(tid),
+ ItemPointerGetOffsetNumberNoCheck(tid));
nitid = psprintf("(%u,%u)", state->targetblock,
OffsetNumberNext(offset));
@@ -1177,9 +1251,11 @@ bt_target_page_check(BtreeCheckState *state)
state->target,
OffsetNumberNext(offset));
itup = (IndexTuple) PageGetItem(state->target, itemid);
+
+ tid = BTreeTupleGetHeapTID(itup);
nhtid = psprintf("(%u,%u)",
- ItemPointerGetBlockNumberNoCheck(&(itup->t_tid)),
- ItemPointerGetOffsetNumberNoCheck(&(itup->t_tid)));
+ ItemPointerGetBlockNumberNoCheck(tid),
+ ItemPointerGetOffsetNumberNoCheck(tid));
ereport(ERROR,
(errcode(ERRCODE_INDEX_CORRUPTED),
@@ -1189,10 +1265,10 @@ bt_target_page_check(BtreeCheckState *state)
"higher index tid=%s (points to %s tid=%s) "
"page lsn=%X/%X.",
itid,
- P_ISLEAF(topaque) ? "heap" : "index",
+ P_ISLEAF(topaque) ? "min heap" : "index",
htid,
nitid,
- P_ISLEAF(topaque) ? "heap" : "index",
+ P_ISLEAF(topaque) ? "min heap" : "index",
nhtid,
(uint32) (state->targetlsn >> 32),
(uint32) state->targetlsn)));
@@ -1953,10 +2029,10 @@ bt_tuple_present_callback(Relation index, HeapTuple htup, Datum *values,
* verification. In particular, it won't try to normalize opclass-equal
* datums with potentially distinct representations (e.g., btree/numeric_ops
* index datums will not get their display scale normalized-away here).
- * Normalization may need to be expanded to handle more cases in the future,
- * though. For example, it's possible that non-pivot tuples could in the
- * future have alternative logically equivalent representations due to using
- * the INDEX_ALT_TID_MASK bit to implement intelligent deduplication.
+ * Caller does normalization for non-pivot tuples that have a posting list,
+ * since dummy CREATE INDEX callback code generates new tuples with the same
+ * normalized representation. Deduplication is performed opportunistically,
+ * and in general there is no guarantee about how or when it will be applied.
*/
static IndexTuple
bt_normalize_tuple(BtreeCheckState *state, IndexTuple itup)
@@ -1969,6 +2045,9 @@ bt_normalize_tuple(BtreeCheckState *state, IndexTuple itup)
IndexTuple reformed;
int i;
+ /* Caller should only pass "logical" non-pivot tuples here */
+ Assert(!BTreeTupleIsPosting(itup) && !BTreeTupleIsPivot(itup));
+
/* Easy case: It's immediately clear that tuple has no varlena datums */
if (!IndexTupleHasVarwidths(itup))
return itup;
@@ -2031,6 +2110,30 @@ bt_normalize_tuple(BtreeCheckState *state, IndexTuple itup)
return reformed;
}
+/*
+ * Produce palloc()'d "logical" tuple for nth posting list entry.
+ *
+ * In general, deduplication is not supposed to change the logical contents of
+ * an index. Multiple logical index tuples are folded together into one
+ * physical posting list index tuple when convenient.
+ *
+ * heapallindexed verification must normalize-away this variation in
+ * representation by converting posting list tuples into two or more "logical"
+ * tuples. Each logical tuple must be fingerprinted separately -- there must
+ * be one logical tuple for each corresponding Bloom filter probe during the
+ * heap scan.
+ *
+ * Note: Caller needs to call bt_normalize_tuple() with returned tuple.
+ */
+static inline IndexTuple
+bt_posting_logical_tuple(IndexTuple itup, int n)
+{
+ Assert(BTreeTupleIsPosting(itup));
+
+ /* Returns non-posting-list tuple */
+ return _bt_form_posting(itup, BTreeTupleGetPostingN(itup, n), 1);
+}
+
/*
* Search for itup in index, starting from fast root page. itup must be a
* non-pivot tuple. This is only supported with heapkeyspace indexes, since
@@ -2087,6 +2190,7 @@ bt_rootdescend(BtreeCheckState *state, IndexTuple itup)
insertstate.itup = itup;
insertstate.itemsz = MAXALIGN(IndexTupleSize(itup));
insertstate.itup_key = key;
+ insertstate.postingoff = 0;
insertstate.bounds_valid = false;
insertstate.buf = lbuf;
@@ -2094,7 +2198,9 @@ bt_rootdescend(BtreeCheckState *state, IndexTuple itup)
offnum = _bt_binsrch_insert(state->rel, &insertstate);
/* Compare first >= matching item on leaf page, if any */
page = BufferGetPage(lbuf);
+ /* Should match on first heap TID when tuple has a posting list */
if (offnum <= PageGetMaxOffsetNumber(page) &&
+ insertstate.postingoff <= 0 &&
_bt_compare(state->rel, key, page, offnum) == 0)
exists = true;
_bt_relbuf(state->rel, lbuf);
@@ -2548,26 +2654,29 @@ PageGetItemIdCareful(BtreeCheckState *state, BlockNumber block, Page page,
}
/*
- * BTreeTupleGetHeapTID() wrapper that lets caller enforce that a heap TID must
- * be present in cases where that is mandatory.
- *
- * This doesn't add much as of BTREE_VERSION 4, since the INDEX_ALT_TID_MASK
- * bit is effectively a proxy for whether or not the tuple is a pivot tuple.
- * It may become more useful in the future, when non-pivot tuples support their
- * own alternative INDEX_ALT_TID_MASK representation.
+ * BTreeTupleGetHeapTID() wrapper that enforces that a heap TID is present in
+ * cases where that is mandatory (i.e. for non-pivot tuples).
*/
static inline ItemPointer
BTreeTupleGetHeapTIDCareful(BtreeCheckState *state, IndexTuple itup,
bool nonpivot)
{
- ItemPointer result = BTreeTupleGetHeapTID(itup);
+ ItemPointer result;
BlockNumber targetblock = state->targetblock;
- if (result == NULL && nonpivot)
+ Assert(state->heapkeyspace);
+
+ /*
+ * Make sure that tuple type (pivot vs non-pivot) matches caller's
+ * expectation
+ */
+ if (BTreeTupleIsPivot(itup) == nonpivot)
ereport(ERROR,
(errcode(ERRCODE_INDEX_CORRUPTED),
errmsg("block %u or its right sibling block or child block in index \"%s\" contains non-pivot tuple that lacks a heap TID",
targetblock, RelationGetRelationName(state->rel))));
+ result = BTreeTupleGetHeapTID(itup);
+
return result;
}
--
2.17.1
On Mon, Nov 4, 2019 at 11:52 AM Peter Geoghegan <pg@bowt.ie> wrote:
> Attached is v21, which fixes some bitrot -- v20 of the patch was made
> totally unusable by today's commit 8557a6f1. Other changes:
There is more bitrot, so I attach v22. This also has some new changes
centered around fixing particular issues with space utilization. These
changes are:
* nbtsort.c now intelligently considers the contribution of suffix
truncation of posting list tuples when considering whether or not a
leaf page is "full". I mean "full" in the sense that it has exceeded
the soft limit (fillfactor-wise limit) on space utilization for the
page (no change in how the hard limit in _bt_buildadd() works).
We don't usually bother predicting the space saving from suffix
truncation when considering split points, even in nbtsplitloc.c, but
it's worth making an exception for posting lists (actually, this is
the same exception that nbtsplitloc.c already had in much earlier
versions of the patch). Posting lists are very often large enough to
really make a big contribution to how balanced free space is. I
observe that weird cases where CREATE INDEX packs leaf pages too empty
(or too full) are now all but eliminated. CREATE INDEX now does a
pretty good job of respecting leaf fillfactor, while also allowing
deduplication to be very effective (CREATE INDEX did neither of these
two things in earlier versions of the patch).
* Added "single value" strategy for retail insert deduplication --
this is closely related to nbtsplitloc.c's single value strategy.
The general idea is that _bt_dedup_one_page() anticipates that a
future "single value" page split is likely to occur, and therefore
limits deduplication after two "1/3 of a page"-wide posting lists at
the start of the page. It arranges for deduplication to leave a neat
split point for nbtsplitloc.c to use when the time comes. In other
words, the patch now allows "single value" page splits to leave leaf
pages BTREE_SINGLEVAL_FILLFACTOR% full, just like v12/master. Leaving
a small amount of free space on pages that are packed full of
duplicates is always a good idea. Also, we no longer force page splits
to leave pages 2/3 full (only two large posting lists plus a high
key), which sometimes happened with v21. On balance, this change seems
to slightly improve space utilization.
In general, it's now unusual for retail insertions to get better space
utilization than CREATE INDEX -- in that sense normality/balance has
been restored in v22. Actually, I wrote the v22 changes by working
through a list of weird space utilization issues from my personal
notes. I'm pretty sure I've fixed all of those -- only nbtsplitloc.c's
single value strategy wants to split at a point that leaves a heap TID
in the new high key for the page, so that's the only thing we need to
worry about within nbtdedup.c.
* "deduplication" storage parameter now has psql completion.
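To make the "single value" strategy described above a bit more concrete,
here is a toy model in C. It is only a sketch under stated assumptions:
the names (toy_dedup_single_value, ToyDedupResult,
TOY_MAX_TIDS_PER_POSTING, TOY_SINGLEVAL_MAX_POSTINGS) and the capacity
of 100 TIDs per posting list are invented for illustration. The patch
itself works in bytes against the real page layout, which this model
ignores.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical toy constants, not taken from the patch */
#define TOY_MAX_TIDS_PER_POSTING   100 /* stands in for a "1/3 of a page" posting list */
#define TOY_SINGLEVAL_MAX_POSTINGS 2   /* stop merging after this many posting lists */

typedef struct ToyDedupResult
{
    int nposting;   /* posting list tuples formed */
    int nplain;     /* duplicates left as plain (unmerged) tuples */
} ToyDedupResult;

/*
 * Model deduplicating ntids duplicates of one key on a single leaf page.
 * When singleval is true, merging stops once two posting lists exist, so
 * the remaining duplicates stay as plain tuples.
 */
ToyDedupResult
toy_dedup_single_value(int ntids, bool singleval)
{
    ToyDedupResult res = {0, 0};

    assert(ntids >= 0);

    while (ntids > 0)
    {
        int take;

        if (singleval && res.nposting >= TOY_SINGLEVAL_MAX_POSTINGS)
        {
            /* Leave the tail alone; it marks a neat future split point */
            res.nplain += ntids;
            break;
        }

        take = (ntids < TOY_MAX_TIDS_PER_POSTING) ? ntids : TOY_MAX_TIDS_PER_POSTING;
        if (take == 1)
            res.nplain++;       /* a lone TID is not worth a posting list */
        else
            res.nposting++;
        ntids -= take;
    }

    return res;
}
```

In this model, 250 duplicates in single value mode end up as two full
posting lists followed by 50 plain tuples, whereas without the limit
they would be folded into three posting lists.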
I intend to push the datum_image_eq() preparatory patch soon. I will
also push a commit that makes _bt_keep_natts_fast() use
datum_image_eq() separately. Anybody have an opinion on that?
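For anyone who has not looked at _bt_keep_natts_fast(), the idea of
counting leading attributes by binary image equality can be sketched
roughly as below. This is a standalone toy, not the nbtree code:
ToyAttr, toy_image_eq() and toy_keep_natts_fast() are hypothetical
names, and real datums need the attbyval/attlen handling that
datum_image_eq() takes care of.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for one index attribute value */
typedef struct ToyAttr
{
    bool        isnull;
    size_t      len;    /* length of the value's binary image */
    const void *data;   /* must be non-NULL when !isnull */
} ToyAttr;

/* Two values are "image equal" only if their bytes match exactly */
bool
toy_image_eq(const ToyAttr *a, const ToyAttr *b)
{
    return a->len == b->len && memcmp(a->data, b->data, a->len) == 0;
}

/*
 * Return one more than the number of leading attributes that are equal
 * between lastleft and firstright.  A result of nkeyatts + 1 means that
 * every key attribute matched.
 */
int
toy_keep_natts_fast(const ToyAttr *lastleft, const ToyAttr *firstright,
                    int nkeyatts)
{
    int keepnatts = 1;

    assert(nkeyatts >= 1);

    for (int i = 0; i < nkeyatts; i++)
    {
        /* A NULL never matches a non-NULL value */
        if (lastleft[i].isnull != firstright[i].isnull)
            break;
        /* Two NULLs count as equal; otherwise compare binary images */
        if (!lastleft[i].isnull &&
            !toy_image_eq(&lastleft[i], &firstright[i]))
            break;
        keepnatts++;
    }

    return keepnatts;
}
```

The point of comparing images rather than calling the opclass comparator
is speed. The price is that two values the opclass considers equal can
still have different bytes, in which case this count comes out lower
than an opclass-based one would.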
--
Peter Geoghegan
Attachments:
v22-0003-DEBUG-Add-pageinspect-instrumentation.patch (application/octet-stream)
From 1d1fb340bf57f2e515a08af8ca8f22ba82fc9af9 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Mon, 10 Sep 2018 19:53:51 -0700
Subject: [PATCH v22 3/3] DEBUG: Add pageinspect instrumentation.
Have pageinspect display user-visible attribute values, heap TID, max
heap TID, and the number of TIDs in a tuple (can be > 1 in the case of
posting list tuples). Also adds a column that shows whether or not the
LP_DEAD bit has been set.
This patch is not proposed for inclusion in PostgreSQL; it's included
for the convenience of reviewers.
The following query, which visualizes the internal pages, can be used
with this hacked pageinspect:
"""
with recursive index_details as (
select
'my_test_index'::text idx
),
size_in_pages_index as (
select
(pg_relation_size(idx::regclass) / (2^13))::int4 size_pages
from
index_details
),
page_stats as (
select
index_details.*,
stats.*
from
index_details,
size_in_pages_index,
lateral (select i from generate_series(1, size_pages - 1) i) series,
lateral (select * from bt_page_stats(idx, i)) stats),
internal_page_stats as (
select
*
from
page_stats
where
type != 'l'),
meta_stats as (
select
*
from
index_details s,
lateral (select * from bt_metap(s.idx)) meta),
internal_items as (
select
*
from
internal_page_stats
order by
btpo desc),
-- XXX: Note ordering dependency within this CTE, on internal_items
ordered_internal_items(item, blk, level) as (
select
1,
blkno,
btpo
from
internal_items
where
btpo_prev = 0
and btpo = (select level from meta_stats)
union
select
case when level = btpo then o.item + 1 else 1 end,
blkno,
btpo
from
internal_items i,
ordered_internal_items o
where
i.btpo_prev = o.blk or (btpo_prev = 0 and btpo = o.level - 1)
)
select
--idx,
btpo as level,
item as l_item,
blkno,
--btpo_prev,
--btpo_next,
btpo_flags,
type,
live_items,
dead_items,
avg_item_size,
page_size,
free_size,
-- Only non-rightmost pages have high key. Show heap TID for both pivot and non-pivot tuples here.
case when btpo_next != 0 then (select data || coalesce(', (htid)=(''' || htid || ''')', '')
from bt_page_items(idx, blkno) where itemoffset = 1) end as highkey
from
ordered_internal_items o
join internal_items i on o.blk = i.blkno
order by btpo desc, item;
"""
---
contrib/pageinspect/btreefuncs.c | 92 ++++++++++++++++---
contrib/pageinspect/expected/btree.out | 6 +-
contrib/pageinspect/pageinspect--1.6--1.7.sql | 25 +++++
3 files changed, 109 insertions(+), 14 deletions(-)
diff --git a/contrib/pageinspect/btreefuncs.c b/contrib/pageinspect/btreefuncs.c
index 78cdc69ec7..435e71ae22 100644
--- a/contrib/pageinspect/btreefuncs.c
+++ b/contrib/pageinspect/btreefuncs.c
@@ -27,6 +27,7 @@
#include "postgres.h"
+#include "access/genam.h"
#include "access/nbtree.h"
#include "access/relation.h"
#include "catalog/namespace.h"
@@ -241,6 +242,7 @@ bt_page_stats(PG_FUNCTION_ARGS)
*/
struct user_args
{
+ Relation rel;
Page page;
OffsetNumber offset;
};
@@ -252,9 +254,9 @@ struct user_args
* ------------------------------------------------------
*/
static Datum
-bt_page_print_tuples(FuncCallContext *fctx, Page page, OffsetNumber offset)
+bt_page_print_tuples(FuncCallContext *fctx, Page page, OffsetNumber offset, Relation rel)
{
- char *values[6];
+ char *values[10];
HeapTuple tuple;
ItemId id;
IndexTuple itup;
@@ -263,6 +265,8 @@ bt_page_print_tuples(FuncCallContext *fctx, Page page, OffsetNumber offset)
int dlen;
char *dump;
char *ptr;
+ ItemPointer min_htid,
+ max_htid;
id = PageGetItemId(page, offset);
@@ -281,16 +285,77 @@ bt_page_print_tuples(FuncCallContext *fctx, Page page, OffsetNumber offset)
values[j++] = psprintf("%c", IndexTupleHasVarwidths(itup) ? 't' : 'f');
ptr = (char *) itup + IndexInfoFindDataOffset(itup->t_info);
- dlen = IndexTupleSize(itup) - IndexInfoFindDataOffset(itup->t_info);
- dump = palloc0(dlen * 3 + 1);
- values[j] = dump;
- for (off = 0; off < dlen; off++)
+ if (rel)
{
- if (off > 0)
- *dump++ = ' ';
- sprintf(dump, "%02x", *(ptr + off) & 0xff);
- dump += 2;
+ TupleDesc itupdesc = RelationGetDescr(rel);
+ Datum datvalues[INDEX_MAX_KEYS];
+ bool isnull[INDEX_MAX_KEYS];
+ int natts;
+ int indnkeyatts = rel->rd_index->indnkeyatts;
+
+ natts = BTreeTupleGetNAtts(itup, rel);
+
+ itupdesc->natts = Min(indnkeyatts, natts);
+ memset(&isnull, 0xFF, sizeof(isnull));
+ index_deform_tuple(itup, itupdesc, datvalues, isnull);
+ rel->rd_index->indnkeyatts = natts;
+ values[j++] = BuildIndexValueDescription(rel, datvalues, isnull);
+ itupdesc->natts = IndexRelationGetNumberOfAttributes(rel);
+ rel->rd_index->indnkeyatts = indnkeyatts;
}
+ else
+ {
+ dlen = IndexTupleSize(itup) - IndexInfoFindDataOffset(itup->t_info);
+ dump = palloc0(dlen * 3 + 1);
+ values[j++] = dump;
+ for (off = 0; off < dlen; off++)
+ {
+ if (off > 0)
+ *dump++ = ' ';
+ sprintf(dump, "%02x", *(ptr + off) & 0xff);
+ dump += 2;
+ }
+ }
+
+ if (rel && !_bt_heapkeyspace(rel))
+ {
+ min_htid = NULL;
+ max_htid = NULL;
+ }
+ else
+ {
+ min_htid = BTreeTupleGetHeapTID(itup);
+ if (BTreeTupleIsPosting(itup))
+ max_htid = BTreeTupleGetMaxHeapTID(itup);
+ else
+ max_htid = NULL;
+ }
+
+ if (min_htid)
+ values[j++] = psprintf("(%u,%u)",
+ ItemPointerGetBlockNumberNoCheck(min_htid),
+ ItemPointerGetOffsetNumberNoCheck(min_htid));
+ else
+ values[j++] = NULL;
+
+ if (max_htid)
+ values[j++] = psprintf("(%u,%u)",
+ ItemPointerGetBlockNumberNoCheck(max_htid),
+ ItemPointerGetOffsetNumberNoCheck(max_htid));
+ else
+ values[j++] = NULL;
+
+ if (min_htid == NULL)
+ values[j++] = psprintf("0");
+ else if (!BTreeTupleIsPosting(itup))
+ values[j++] = psprintf("1");
+ else
+ values[j++] = psprintf("%d", (int) BTreeTupleGetNPosting(itup));
+
+ if (!ItemIdIsDead(id))
+ values[j++] = psprintf("f");
+ else
+ values[j++] = psprintf("t");
tuple = BuildTupleFromCStrings(fctx->attinmeta, values);
@@ -364,11 +429,11 @@ bt_page_items(PG_FUNCTION_ARGS)
uargs = palloc(sizeof(struct user_args));
+ uargs->rel = rel;
uargs->page = palloc(BLCKSZ);
memcpy(uargs->page, BufferGetPage(buffer), BLCKSZ);
UnlockReleaseBuffer(buffer);
- relation_close(rel, AccessShareLock);
uargs->offset = FirstOffsetNumber;
@@ -395,12 +460,13 @@ bt_page_items(PG_FUNCTION_ARGS)
if (fctx->call_cntr < fctx->max_calls)
{
- result = bt_page_print_tuples(fctx, uargs->page, uargs->offset);
+ result = bt_page_print_tuples(fctx, uargs->page, uargs->offset, uargs->rel);
uargs->offset++;
SRF_RETURN_NEXT(fctx, result);
}
else
{
+ relation_close(uargs->rel, AccessShareLock);
pfree(uargs->page);
pfree(uargs);
SRF_RETURN_DONE(fctx);
@@ -480,7 +546,7 @@ bt_page_items_bytea(PG_FUNCTION_ARGS)
if (fctx->call_cntr < fctx->max_calls)
{
- result = bt_page_print_tuples(fctx, uargs->page, uargs->offset);
+ result = bt_page_print_tuples(fctx, uargs->page, uargs->offset, NULL);
uargs->offset++;
SRF_RETURN_NEXT(fctx, result);
}
diff --git a/contrib/pageinspect/expected/btree.out b/contrib/pageinspect/expected/btree.out
index 07c2dcd771..0f6dccaadc 100644
--- a/contrib/pageinspect/expected/btree.out
+++ b/contrib/pageinspect/expected/btree.out
@@ -40,7 +40,11 @@ ctid | (0,1)
itemlen | 16
nulls | f
vars | f
-data | 01 00 00 00 00 00 00 01
+data | (a)=(72057594037927937)
+htid | (0,1)
+max_htid |
+nheap_tids | 1
+isdead | f
SELECT * FROM bt_page_items('test1_a_idx', 2);
ERROR: block number out of range
diff --git a/contrib/pageinspect/pageinspect--1.6--1.7.sql b/contrib/pageinspect/pageinspect--1.6--1.7.sql
index 2433a21af2..00473da938 100644
--- a/contrib/pageinspect/pageinspect--1.6--1.7.sql
+++ b/contrib/pageinspect/pageinspect--1.6--1.7.sql
@@ -24,3 +24,28 @@ CREATE FUNCTION bt_metap(IN relname text,
OUT last_cleanup_num_tuples real)
AS 'MODULE_PATHNAME', 'bt_metap'
LANGUAGE C STRICT PARALLEL SAFE;
+
+--
+-- bt_page_items()
+--
+DROP FUNCTION bt_page_items(IN relname text, IN blkno int4,
+ OUT itemoffset smallint,
+ OUT ctid tid,
+ OUT itemlen smallint,
+ OUT nulls bool,
+ OUT vars bool,
+ OUT data text);
+CREATE FUNCTION bt_page_items(IN relname text, IN blkno int4,
+ OUT itemoffset smallint,
+ OUT ctid tid,
+ OUT itemlen smallint,
+ OUT nulls bool,
+ OUT vars bool,
+ OUT data text,
+ OUT htid tid,
+ OUT max_htid tid,
+ OUT nheap_tids int4,
+ OUT isdead boolean)
+RETURNS SETOF record
+AS 'MODULE_PATHNAME', 'bt_page_items'
+LANGUAGE C STRICT PARALLEL SAFE;
--
2.17.1
v22-0002-Add-deduplication-to-nbtree.patch (application/octet-stream)
From d3cca4d4cca643f6754c710dee11f869d5edb200 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Wed, 25 Sep 2019 10:08:53 -0700
Subject: [PATCH v22 2/3] Add deduplication to nbtree
---
src/include/access/nbtree.h | 327 +++++++++--
src/include/access/nbtxlog.h | 68 ++-
src/include/access/rmgrlist.h | 2 +-
src/backend/access/common/reloptions.c | 11 +-
src/backend/access/index/genam.c | 4 +
src/backend/access/nbtree/Makefile | 1 +
src/backend/access/nbtree/README | 74 ++-
src/backend/access/nbtree/nbtdedup.c | 704 ++++++++++++++++++++++++
src/backend/access/nbtree/nbtinsert.c | 327 +++++++++--
src/backend/access/nbtree/nbtpage.c | 209 ++++++-
src/backend/access/nbtree/nbtree.c | 174 +++++-
src/backend/access/nbtree/nbtsearch.c | 249 ++++++++-
src/backend/access/nbtree/nbtsort.c | 209 ++++++-
src/backend/access/nbtree/nbtsplitloc.c | 49 +-
src/backend/access/nbtree/nbtutils.c | 218 +++++++-
src/backend/access/nbtree/nbtxlog.c | 218 +++++++-
src/backend/access/rmgrdesc/nbtdesc.c | 28 +-
src/bin/psql/tab-complete.c | 4 +-
contrib/amcheck/verify_nbtree.c | 177 ++++--
19 files changed, 2834 insertions(+), 219 deletions(-)
create mode 100644 src/backend/access/nbtree/nbtdedup.c
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 4a80e84aa7..afaa6b4bd8 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -23,6 +23,39 @@
#include "storage/bufmgr.h"
#include "storage/shm_toc.h"
+/*
+ * Storage type for Btree's reloptions
+ */
+typedef struct BtreeOptions
+{
+ int32 vl_len_; /* varlena header (do not touch directly!) */
+ int fillfactor;
+ double vacuum_cleanup_index_scale_factor;
+ bool deduplication; /* Use deduplication where safe? */
+} BtreeOptions;
+
+/*
+ * By default deduplication is enabled for non unique indexes
+ * and disabled for unique ones
+ *
+ * XXX: Actually, we use deduplication everywhere for now. Re-review this
+ * decision later on.
+ */
+#define BtreeDefaultDoDedup(relation) \
+ (relation->rd_index->indisunique ? true : true)
+
+#define BtreeGetDoDedupOption(relation) \
+ ((relation)->rd_options ? \
+ ((BtreeOptions *) (relation)->rd_options)->deduplication : \
+ BtreeDefaultDoDedup(relation))
+
+#define BtreeGetFillFactor(relation, defaultff) \
+ ((relation)->rd_options ? \
+ ((BtreeOptions *) (relation)->rd_options)->fillfactor : (defaultff))
+
+#define BtreeGetTargetPageFreeSpace(relation, defaultff) \
+ (BLCKSZ * (100 - BtreeGetFillFactor(relation, defaultff)) / 100)
+
/* There's room for a 16-bit vacuum cycle ID in BTPageOpaqueData */
typedef uint16 BTCycleId;
@@ -102,11 +135,13 @@ typedef struct BTMetaPageData
uint32 btm_level; /* tree level of the root page */
BlockNumber btm_fastroot; /* current "fast" root location */
uint32 btm_fastlevel; /* tree level of the "fast" root page */
- /* remaining fields only valid when btm_version >= BTREE_NOVAC_VERSION */
+ /* These fields only valid when btm_version >= BTREE_NOVAC_VERSION */
TransactionId btm_oldest_btpo_xact; /* oldest btpo_xact among all deleted
* pages */
float8 btm_last_cleanup_num_heap_tuples; /* number of heap tuples
* during last cleanup */
+ /* This field only valid when btm_version >= FIXME */
+ bool btm_safededup; /* deduplication safe for index? */
} BTMetaPageData;
#define BTPageGetMeta(p) \
@@ -154,6 +189,26 @@ typedef struct BTMetaPageData
MAXALIGN_DOWN((PageGetPageSize(page) - \
MAXALIGN(SizeOfPageHeaderData + 3*sizeof(ItemIdData)) - \
MAXALIGN(sizeof(BTPageOpaqueData))) / 3)
+/*
+ * MaxBTreeIndexTuplesPerPage is an upper bound on the number of "logical"
+ * tuples that may be stored on a btree leaf page. This is comparable to
+ * the generic/physical MaxIndexTuplesPerPage upper bound. A separate
+ * upper bound is needed in certain contexts due to posting list tuples,
+ * which only use a single physical page entry to store many logical
+ * tuples. (MaxBTreeIndexTuplesPerPage is used to size the per-page
+ * temporary buffers used by index scans.)
+ *
+ * Note: we don't bother considering per-physical-tuple overheads here to
+ * keep things simple (value is based on how many elements a single array
+ * of heap TIDs must have to fill the space between the page header and
+ * special area). The value is slightly higher (i.e. more conservative)
+ * than necessary as a result, which is considered acceptable. There will
+ * only be three (very large) physical posting list tuples in leaf pages
+ * that have the largest possible number of heap TIDs/logical tuples.
+ */
+#define MaxBTreeIndexTuplesPerPage \
+ (int) ((BLCKSZ - SizeOfPageHeaderData - sizeof(BTPageOpaqueData)) / \
+ sizeof(ItemPointerData))
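To make the bound concrete, here is a standalone sketch of the same arithmetic. The sizes are assumptions for a default build (8192-byte BLCKSZ, 24-byte page header, 16-byte BTPageOpaqueData, 6-byte ItemPointerData), and the SKETCH_/sketch_ names are made up for illustration; none of them appear in the patch.

```c
#include <assert.h>

/* Assumed sizes for a default 8KB-block build; not taken from the patch. */
enum
{
    SKETCH_BLCKSZ = 8192,          /* default block size */
    SKETCH_PAGE_HEADER_SIZE = 24,  /* stands in for SizeOfPageHeaderData */
    SKETCH_BT_OPAQUE_SIZE = 16,    /* stands in for sizeof(BTPageOpaqueData) */
    SKETCH_TID_SIZE = 6            /* stands in for sizeof(ItemPointerData) */
};

/*
 * Number of heap TIDs that fit between the page header and the special
 * area when per-tuple overhead is ignored, mirroring the shape of
 * MaxBTreeIndexTuplesPerPage.
 */
static int
sketch_max_logical_tuples(int blcksz)
{
    assert(blcksz > SKETCH_PAGE_HEADER_SIZE + SKETCH_BT_OPAQUE_SIZE);

    return (blcksz - SKETCH_PAGE_HEADER_SIZE - SKETCH_BT_OPAQUE_SIZE) /
        SKETCH_TID_SIZE;
}
```

Under these assumed sizes the bound works out to 1358 logical tuples for an 8KB page, well above the few hundred physical tuples that a page can hold, which is why the scan's items array needs the larger limit.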
/*
* The leaf-page fillfactor defaults to 90% but is user-adjustable.
@@ -234,8 +289,7 @@ typedef struct BTMetaPageData
* t_tid | t_info | key values | INCLUDE columns, if any
*
* t_tid points to the heap TID, which is a tiebreaker key column as of
- * BTREE_VERSION 4. Currently, the INDEX_ALT_TID_MASK status bit is never
- * set for non-pivot tuples.
+ * BTREE_VERSION 4.
*
* All other types of index tuples ("pivot" tuples) only have key columns,
* since pivot tuples only exist to represent how the key space is
@@ -282,20 +336,104 @@ typedef struct BTMetaPageData
* future use. BT_N_KEYS_OFFSET_MASK should be large enough to store any
* number of columns/attributes <= INDEX_MAX_KEYS.
*
+ * Sometimes non-pivot tuples also use a representation that repurposes
+ * t_tid to store metadata rather than a TID. Postgres 13 introduced a new
+ * non-pivot tuple format in order to fold together multiple equal and
+ * equivalent non-pivot tuples into a single logically equivalent, space
+ * efficient representation - a posting list tuple. A posting list is an
+ * array of ItemPointerData elements (there must be at least two elements
+ * when the posting list tuple format is used). Posting list tuples are
+ * created dynamically by deduplication, at the point where we'd otherwise
+ * have to split a leaf page.
+ *
+ * Posting tuple format (alternative non-pivot tuple representation):
+ *
+ * t_tid | t_info | key values | posting list (TID array)
+ *
+ * Posting list tuples are recognized as such by having the
+ * INDEX_ALT_TID_MASK status bit set in t_info and the BT_IS_POSTING status
+ * bit set in t_tid. These flags redefine the content of the posting
+ * tuple's t_tid to store an offset to the posting list, as well as the
+ * total number of posting list array elements.
+ *
+ * The 12 least significant offset bits from t_tid are used to represent
+ * the number of posting items present in the tuple, leaving 4 status
+ * bits (BT_RESERVED_OFFSET_MASK bits), 3 of which are reserved for
+ * future use. Like any non-pivot tuple, the number of columns stored is
+ * always implicitly the total number in the index (in practice there can
+ * never be non-key columns stored, since deduplication is not supported
+ * with INCLUDE indexes).
+ *
* Note well: The macros that deal with the number of attributes in tuples
- * assume that a tuple with INDEX_ALT_TID_MASK set must be a pivot tuple,
- * and that a tuple without INDEX_ALT_TID_MASK set must be a non-pivot
- * tuple (or must have the same number of attributes as the index has
- * generally in the case of !heapkeyspace indexes). They will need to be
- * updated if non-pivot tuples ever get taught to use INDEX_ALT_TID_MASK
- * for something else.
+ * assume that a tuple with INDEX_ALT_TID_MASK set must be a pivot tuple or
+ * non-pivot posting tuple, and that a tuple without INDEX_ALT_TID_MASK set
+ * must be a non-pivot tuple (or must have the same number of attributes as
+ * the index has generally in the case of !heapkeyspace indexes).
*/
#define INDEX_ALT_TID_MASK INDEX_AM_RESERVED_BIT
/* Item pointer offset bits */
#define BT_RESERVED_OFFSET_MASK 0xF000
#define BT_N_KEYS_OFFSET_MASK 0x0FFF
+#define BT_N_POSTING_OFFSET_MASK 0x0FFF
#define BT_HEAP_TID_ATTR 0x1000
+#define BT_IS_POSTING 0x2000
+
+/*
+ * N.B.: BTreeTupleIsPivot() should only be used in code that deals with
+ * heapkeyspace indexes specifically. BTreeTupleIsPosting() works with all
+ * nbtree indexes, though.
+ */
+#define BTreeTupleIsPivot(itup) \
+ ( \
+ ((itup)->t_info & INDEX_ALT_TID_MASK && \
+ ((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) == 0))\
+ )
+#define BTreeTupleIsPosting(itup) \
+ ( \
+ ((itup)->t_info & INDEX_ALT_TID_MASK && \
+ ((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) != 0))\
+ )
+
+#define BTreeTupleClearBtIsPosting(itup) \
+ do { \
+ ItemPointerSetOffsetNumber(&(itup)->t_tid, \
+ ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & ~BT_IS_POSTING); \
+ } while(0)
+
+#define BTreeTupleGetNPosting(itup) \
+ ( \
+ AssertMacro(BTreeTupleIsPosting(itup)), \
+ ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_N_POSTING_OFFSET_MASK \
+ )
+#define BTreeTupleSetNPosting(itup, n) \
+ do { \
+ ItemPointerSetOffsetNumber(&(itup)->t_tid, (n) & BT_N_POSTING_OFFSET_MASK); \
+ Assert((itup)->t_info & INDEX_ALT_TID_MASK); \
+ Assert(!((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) != 0)); \
+ ItemPointerSetOffsetNumber(&(itup)->t_tid, \
+ ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) | BT_IS_POSTING); \
+ } while(0)
+
+/*
+ * If tuple is posting, t_tid.ip_blkid contains offset of the posting list
+ */
+#define BTreeTupleGetPostingOffset(itup) \
+ ( \
+ AssertMacro(BTreeTupleIsPosting(itup)), \
+ ItemPointerGetBlockNumberNoCheck(&((itup)->t_tid)) \
+ )
+#define BTreeSetPostingMeta(itup, nposting, off) \
+ do { \
+ BTreeTupleSetNPosting(itup, nposting); \
+ Assert(BTreeTupleIsPosting(itup)); \
+ ItemPointerSetBlockNumber(&((itup)->t_tid), (off)); \
+ } while(0)
+
+#define BTreeTupleGetPosting(itup) \
+ (ItemPointer) ((char*) (itup) + BTreeTupleGetPostingOffset(itup))
+#define BTreeTupleGetPostingN(itup,n) \
+ (BTreeTupleGetPosting(itup) + (n))
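The way a posting tuple repurposes t_tid can be sketched outside of PostgreSQL as follows. This is only an illustration: SketchTid and the sketch_ functions are hypothetical stand-ins, the two masks are copied from the definitions above, and the INDEX_ALT_TID_MASK check on t_info that the real macros also perform is left out.

```c
#include <assert.h>
#include <stdint.h>

/* Bit layout copied from the patch's nbtree.h definitions. */
#define SKETCH_N_POSTING_MASK 0x0FFF  /* BT_N_POSTING_OFFSET_MASK */
#define SKETCH_IS_POSTING     0x2000  /* BT_IS_POSTING */

/* Hypothetical stand-in for the two t_tid fields a posting tuple repurposes. */
typedef struct SketchTid
{
    uint32_t blkno;   /* byte offset of the posting list within the tuple */
    uint16_t offnum;  /* status bits plus number of TIDs */
} SketchTid;

/* Store the TID count and posting list offset, in the spirit of BTreeSetPostingMeta(). */
static void
sketch_set_posting_meta(SketchTid *tid, unsigned nposting, uint32_t postingoff)
{
    /* A posting list has at least two TIDs and must fit in 12 bits. */
    assert(nposting >= 2 && nposting <= SKETCH_N_POSTING_MASK);

    tid->offnum = (uint16_t) ((nposting & SKETCH_N_POSTING_MASK) |
                              SKETCH_IS_POSTING);
    tid->blkno = postingoff;
}

/* Nonzero when the status bit marks this as a posting tuple. */
static int
sketch_is_posting(const SketchTid *tid)
{
    return (tid->offnum & SKETCH_IS_POSTING) != 0;
}

/* Extract the number of TIDs from the low 12 bits. */
static unsigned
sketch_get_nposting(const SketchTid *tid)
{
    return tid->offnum & SKETCH_N_POSTING_MASK;
}
```

The 12-bit count field caps a single posting list at 4095 TIDs in this layout; the real per-tuple limit is tighter because of the maximum tuple size.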
/* Get/set downlink block number */
#define BTreeInnerTupleGetDownLink(itup) \
@@ -326,40 +464,71 @@ typedef struct BTMetaPageData
*/
#define BTreeTupleGetNAtts(itup, rel) \
( \
- (itup)->t_info & INDEX_ALT_TID_MASK ? \
+ ((itup)->t_info & INDEX_ALT_TID_MASK && \
+ ((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) == 0)) ? \
( \
ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_N_KEYS_OFFSET_MASK \
) \
: \
IndexRelationGetNumberOfAttributes(rel) \
)
-#define BTreeTupleSetNAtts(itup, n) \
- do { \
- (itup)->t_info |= INDEX_ALT_TID_MASK; \
- ItemPointerSetOffsetNumber(&(itup)->t_tid, (n) & BT_N_KEYS_OFFSET_MASK); \
- } while(0)
+
+static inline void
+BTreeTupleSetNAtts(IndexTuple itup, int n)
+{
+ Assert(!BTreeTupleIsPosting(itup));
+ itup->t_info |= INDEX_ALT_TID_MASK;
+ ItemPointerSetOffsetNumber(&itup->t_tid, n & BT_N_KEYS_OFFSET_MASK);
+}
/*
- * Get tiebreaker heap TID attribute, if any. Macro works with both pivot
- * and non-pivot tuples, despite differences in how heap TID is represented.
+ * Get tiebreaker heap TID attribute, if any.
+ *
+ * This returns the first/lowest heap TID in the case of a posting list tuple.
*/
-#define BTreeTupleGetHeapTID(itup) \
- ( \
- (itup)->t_info & INDEX_ALT_TID_MASK && \
- (ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_HEAP_TID_ATTR) != 0 ? \
- ( \
- (ItemPointer) (((char *) (itup) + IndexTupleSize(itup)) - \
- sizeof(ItemPointerData)) \
- ) \
- : (itup)->t_info & INDEX_ALT_TID_MASK ? NULL : (ItemPointer) &((itup)->t_tid) \
- )
+static inline ItemPointer
+BTreeTupleGetHeapTID(IndexTuple itup)
+{
+ if (BTreeTupleIsPivot(itup))
+ {
+ /* Pivot tuple heap TID representation? */
+ if ((ItemPointerGetOffsetNumberNoCheck(&itup->t_tid) &
+ BT_HEAP_TID_ATTR) != 0)
+ return (ItemPointer) ((char *) itup + IndexTupleSize(itup) -
+ sizeof(ItemPointerData));
+
+ /* Heap TID attribute was truncated */
+ return NULL;
+ }
+ else if (BTreeTupleIsPosting(itup))
+ return BTreeTupleGetPosting(itup);
+
+ return &(itup->t_tid);
+}
+
+/*
+ * Get maximum heap TID attribute, which could be the only TID in the case of
+ * a non-pivot tuple that does not have a posting list. Works with
+ * non-pivot tuples only.
+ */
+static inline ItemPointer
+BTreeTupleGetMaxHeapTID(IndexTuple itup)
+{
+ Assert(!BTreeTupleIsPivot(itup));
+
+ if (BTreeTupleIsPosting(itup))
+ return BTreeTupleGetPosting(itup) + (BTreeTupleGetNPosting(itup) - 1);
+
+ return &(itup->t_tid);
+}
+
/*
* Set the heap TID attribute for a tuple that uses the INDEX_ALT_TID_MASK
- * representation (currently limited to pivot tuples)
+ * representation
*/
#define BTreeTupleSetAltHeapTID(itup) \
do { \
- Assert((itup)->t_info & INDEX_ALT_TID_MASK); \
+ Assert(BTreeTupleIsPivot(itup)); \
ItemPointerSetOffsetNumber(&(itup)->t_tid, \
ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) | BT_HEAP_TID_ATTR); \
} while(0)
@@ -434,6 +603,11 @@ typedef BTStackData *BTStack;
* indexes whose version is >= version 4. It's convenient to keep this close
* by, rather than accessing the metapage repeatedly.
*
+ * safededup is set to indicate that index may use dynamic deduplication
+ * safely (index storage parameter separately indicates if deduplication is
+ * currently in use). This is also a property of the index relation rather
+ * than an indexscan that is kept around for convenience.
+ *
* anynullkeys indicates if any of the keys had NULL value when scankey was
* built from index tuple (note that already-truncated tuple key attributes
* set NULL as a placeholder key value, which also affects value of
@@ -469,6 +643,7 @@ typedef BTStackData *BTStack;
typedef struct BTScanInsertData
{
bool heapkeyspace;
+ bool safededup;
bool anynullkeys;
bool nextkey;
bool pivotsearch;
@@ -507,6 +682,13 @@ typedef struct BTInsertStateData
bool bounds_valid;
OffsetNumber low;
OffsetNumber stricthigh;
+
+ /*
+ * If _bt_binsrch_insert() found the location inside an existing posting
+ * list, save the position inside the list. This will be -1 in rare cases
+ * where the overlapping posting list is LP_DEAD.
+ */
+ int postingoff;
} BTInsertStateData;
typedef BTInsertStateData *BTInsertState;
@@ -534,7 +716,10 @@ typedef BTInsertStateData *BTInsertState;
* If we are doing an index-only scan, we save the entire IndexTuple for each
* matched item, otherwise only its heap TID and offset. The IndexTuples go
* into a separate workspace array; each BTScanPosItem stores its tuple's
- * offset within that array.
+ * offset within that array. Posting list tuples store a "base" tuple once,
+ * allowing the same key to be returned for each logical tuple associated
+ * with the physical posting list tuple (i.e. for each TID from the posting
+ * list).
*/
typedef struct BTScanPosItem /* what we remember about each match */
@@ -567,6 +752,12 @@ typedef struct BTScanPosData
*/
int nextTupleOffset;
+ /*
+ * Posting list tuples use postingTupleOffset to store the current
+ * location of the tuple that is returned multiple times.
+ */
+ int postingTupleOffset;
+
/*
* The items array is always ordered in index order (ie, increasing
* indexoffset). When scanning backwards it is convenient to fill the
@@ -578,7 +769,7 @@ typedef struct BTScanPosData
int lastItem; /* last valid index in items[] */
int itemIndex; /* current index in items[] */
- BTScanPosItem items[MaxIndexTuplesPerPage]; /* MUST BE LAST */
+ BTScanPosItem items[MaxBTreeIndexTuplesPerPage]; /* MUST BE LAST */
} BTScanPosData;
typedef BTScanPosData *BTScanPos;
@@ -680,6 +871,57 @@ typedef BTScanOpaqueData *BTScanOpaque;
#define SK_BT_DESC (INDOPTION_DESC << SK_BT_INDOPTION_SHIFT)
#define SK_BT_NULLS_FIRST (INDOPTION_NULLS_FIRST << SK_BT_INDOPTION_SHIFT)
+/*
+ * State used to represent a pending posting list during deduplication.
+ *
+ * Each entry represents a group of consecutive items from the page, starting
+ * from page offset number 'baseoff', which is the offset number of the "base"
+ * tuple on the page undergoing deduplication. 'nitems' is the total number
+ * of items from the page that will be merged to make a new posting tuple.
+ *
+ * Note: 'nitems' means the number of physical index tuples/line pointers on
+ * the page, starting with and including the item at offset number 'baseoff'
+ * (so nitems should be at least 2 when interval is used). These existing
+ * tuples may be posting list tuples or regular tuples.
+ */
+typedef struct BTDedupInterval
+{
+ OffsetNumber baseoff;
+ OffsetNumber nitems;
+} BTDedupInterval;
+
+/*
+ * Btree-private state used to deduplicate items on a leaf page
+ */
+typedef struct BTDedupState
+{
+ Relation rel;
+ /* Deduplication status info for entire page/operation */
+ Size maxitemsize; /* Limit on size of final tuple */
+ IndexTuple newitem;
+ bool checkingunique; /* Use unique index strategy? */
+ OffsetNumber skippedbase; /* First offset skipped by checkingunique */
+
+ /* Metadata about current pending posting list */
+ ItemPointer htids; /* Heap TIDs in pending posting list */
+ int nhtids; /* # heap TIDs in htids array */
+ int nitems; /* See BTDedupInterval definition */
+ Size alltupsize; /* Includes line pointer overhead */
+ bool overlap; /* Avoid overlapping posting lists? */
+
+ /* Metadata about base tuple of current pending posting list */
+ IndexTuple base; /* Use to form new posting list */
+ OffsetNumber baseoff; /* page offset of base */
+ Size basetupsize; /* base size without posting list */
+
+ /*
+ * Pending posting list. Contains information about a group of
+ * consecutive items that will be deduplicated by creating a new posting
+ * list tuple.
+ */
+ BTDedupInterval interval;
+} BTDedupState;
+
/*
* Constant definition for progress reporting. Phase numbers must match
* btbuildphasename.
@@ -725,6 +967,22 @@ extern void _bt_parallel_release(IndexScanDesc scan, BlockNumber scan_page);
extern void _bt_parallel_done(IndexScanDesc scan);
extern void _bt_parallel_advance_array_keys(IndexScanDesc scan);
+/*
+ * prototypes for functions in nbtdedup.c
+ */
+extern void _bt_dedup_one_page(Relation rel, Buffer buffer, Relation heapRel,
+ IndexTuple newitem, Size newitemsz,
+ bool checkingunique);
+extern void _bt_dedup_start_pending(BTDedupState *state, IndexTuple base,
+ OffsetNumber base_off);
+extern bool _bt_dedup_save_htid(BTDedupState *state, IndexTuple itup);
+extern Size _bt_dedup_finish_pending(Buffer buffer, BTDedupState *state,
+ bool need_wal);
+extern IndexTuple _bt_form_posting(IndexTuple tuple, ItemPointer htids,
+ int nhtids);
+extern IndexTuple _bt_swap_posting(IndexTuple newitem, IndexTuple oposting,
+ OffsetNumber postingoff);
+
/*
* prototypes for functions in nbtinsert.c
*/
@@ -743,7 +1001,8 @@ extern OffsetNumber _bt_findsplitloc(Relation rel, Page page,
/*
* prototypes for functions in nbtpage.c
*/
-extern void _bt_initmetapage(Page page, BlockNumber rootbknum, uint32 level);
+extern void _bt_initmetapage(Page page, BlockNumber rootbknum, uint32 level,
+ bool safededup);
extern void _bt_update_meta_cleanup_info(Relation rel,
TransactionId oldestBtpoXact, float8 numHeapTuples);
extern void _bt_upgrademetapage(Page page);
@@ -751,6 +1010,7 @@ extern Buffer _bt_getroot(Relation rel, int access);
extern Buffer _bt_gettrueroot(Relation rel);
extern int _bt_getrootheight(Relation rel);
extern bool _bt_heapkeyspace(Relation rel);
+extern bool _bt_safededup(Relation rel);
extern void _bt_checkpage(Relation rel, Buffer buf);
extern Buffer _bt_getbuf(Relation rel, BlockNumber blkno, int access);
extern Buffer _bt_relandgetbuf(Relation rel, Buffer obuf,
@@ -762,6 +1022,8 @@ extern void _bt_delitems_delete(Relation rel, Buffer buf,
OffsetNumber *itemnos, int nitems, Relation heapRel);
extern void _bt_delitems_vacuum(Relation rel, Buffer buf,
OffsetNumber *itemnos, int nitems,
+ OffsetNumber *updateitemnos,
+ IndexTuple *updated, int nupdateable,
BlockNumber lastBlockVacuumed);
extern int _bt_pagedel(Relation rel, Buffer buf);
@@ -812,6 +1074,7 @@ extern bool _bt_check_natts(Relation rel, bool heapkeyspace, Page page,
OffsetNumber offnum);
extern void _bt_check_third_page(Relation rel, Relation heap,
bool needheaptidspace, Page page, IndexTuple newtup);
+extern bool _bt_opclasses_support_dedup(Relation index);
/*
* prototypes for functions in nbtvalidate.c
diff --git a/src/include/access/nbtxlog.h b/src/include/access/nbtxlog.h
index 91b9ee00cf..b21e6f8082 100644
--- a/src/include/access/nbtxlog.h
+++ b/src/include/access/nbtxlog.h
@@ -28,7 +28,8 @@
#define XLOG_BTREE_INSERT_META 0x20 /* same, plus update metapage */
#define XLOG_BTREE_SPLIT_L 0x30 /* add index tuple with split */
#define XLOG_BTREE_SPLIT_R 0x40 /* as above, new item on right */
-/* 0x50 and 0x60 are unused */
+#define XLOG_BTREE_DEDUP_PAGE 0x50 /* deduplicate tuples on leaf page */
+/* 0x60 is unused */
#define XLOG_BTREE_DELETE 0x70 /* delete leaf index tuples for a page */
#define XLOG_BTREE_UNLINK_PAGE 0x80 /* delete a half-dead page */
#define XLOG_BTREE_UNLINK_PAGE_META 0x90 /* same, and update metapage */
@@ -53,6 +54,7 @@ typedef struct xl_btree_metadata
uint32 fastlevel;
TransactionId oldest_btpo_xact;
float8 last_cleanup_num_heap_tuples;
+ bool btm_safededup;
} xl_btree_metadata;
/*
@@ -61,16 +63,21 @@ typedef struct xl_btree_metadata
* This data record is used for INSERT_LEAF, INSERT_UPPER, INSERT_META.
* Note that INSERT_META implies it's not a leaf page.
*
- * Backup Blk 0: original page (data contains the inserted tuple)
+ * Backup Blk 0: original page (data contains the inserted tuple);
+ * if postingoff is set, this started out as an insertion
+ * into an existing posting tuple at the offset before
+ * offnum (i.e. it's a posting list split). (REDO will
+ * have to update split posting list, too.)
* Backup Blk 1: child's left sibling, if INSERT_UPPER or INSERT_META
* Backup Blk 2: xl_btree_metadata, if INSERT_META
*/
typedef struct xl_btree_insert
{
OffsetNumber offnum;
+ OffsetNumber postingoff;
} xl_btree_insert;
-#define SizeOfBtreeInsert (offsetof(xl_btree_insert, offnum) + sizeof(OffsetNumber))
+#define SizeOfBtreeInsert (offsetof(xl_btree_insert, postingoff) + sizeof(OffsetNumber))
/*
* On insert with split, we save all the items going into the right sibling
@@ -91,9 +98,19 @@ typedef struct xl_btree_insert
*
* Backup Blk 0: original page / new left page
*
- * The left page's data portion contains the new item, if it's the _L variant.
- * An IndexTuple representing the high key of the left page must follow with
- * either variant.
+ * The left page's data portion contains the new item, if it's the _L variant
+ * (though _R variant page split records with a posting list split sometimes
+ * need to include newitem). An IndexTuple representing the high key of the
+ * left page must follow in all cases.
+ *
+ * The newitem is actually an "original" newitem when a posting list split
+ * occurs that requires that the original posting list be updated in passing.
+ * Recovery recognizes this case when postingoff is set, and must use the
+ * posting offset to do an in-place update of the existing posting list that
+ * was actually split, and change the newitem to the "final" newitem. This
+ * corresponds to the xl_btree_insert postingoff-is-set case. postingoff
+ * won't be set when a posting list split occurs where both original posting
+ * list and newitem go on the right page.
*
* Backup Blk 1: new right page
*
@@ -111,10 +128,26 @@ typedef struct xl_btree_split
{
uint32 level; /* tree level of page being split */
OffsetNumber firstright; /* first item moved to right page */
- OffsetNumber newitemoff; /* new item's offset (useful for _L variant) */
+ OffsetNumber newitemoff; /* new item's offset */
+ OffsetNumber postingoff; /* offset inside orig posting tuple */
} xl_btree_split;
-#define SizeOfBtreeSplit (offsetof(xl_btree_split, newitemoff) + sizeof(OffsetNumber))
+#define SizeOfBtreeSplit (offsetof(xl_btree_split, postingoff) + sizeof(OffsetNumber))
+
+/*
+ * When page is deduplicated, consecutive groups of tuples with equal keys are
+ * merged together into posting list tuples.
+ *
+ * The WAL record represents the interval that describes the posting tuple
+ * that should be added to the page.
+ */
+typedef struct xl_btree_dedup
+{
+ OffsetNumber baseoff;
+ OffsetNumber nitems;
+} xl_btree_dedup;
+
+#define SizeOfBtreeDedup (offsetof(xl_btree_dedup, nitems) + sizeof(OffsetNumber))
/*
* This is what we need to know about delete of individual leaf index tuples.
@@ -166,16 +199,27 @@ typedef struct xl_btree_reuse_page
* block numbers aren't given.
*
* Note that the *last* WAL record in any vacuum of an index is allowed to
- * have a zero length array of offsets. Earlier records must have at least one.
+ * have a zero length array of target offsets (i.e. no deletes or updates).
+ * Earlier records must have at least one.
*/
typedef struct xl_btree_vacuum
{
BlockNumber lastBlockVacuumed;
- /* TARGET OFFSET NUMBERS FOLLOW */
+ /*
+ * This field helps us find the beginning of the updated versions of tuples,
+ * which follow the array of offset numbers. It is needed when a posting
+ * list is vacuumed without killing all of its logical tuples.
+ */
+ uint32 nupdated;
+ uint32 ndeleted;
+
+ /* UPDATED TARGET OFFSET NUMBERS FOLLOW (if any) */
+ /* UPDATED TUPLES TO ADD BACK FOLLOW (if any) */
+ /* DELETED TARGET OFFSET NUMBERS FOLLOW (if any) */
} xl_btree_vacuum;
-#define SizeOfBtreeVacuum (offsetof(xl_btree_vacuum, lastBlockVacuumed) + sizeof(BlockNumber))
+#define SizeOfBtreeVacuum (offsetof(xl_btree_vacuum, ndeleted) + sizeof(uint32))
/*
* This is what we need to know about marking an empty branch for deletion.
@@ -256,6 +300,8 @@ typedef struct xl_btree_newroot
extern void btree_redo(XLogReaderState *record);
extern void btree_desc(StringInfo buf, XLogReaderState *record);
extern const char *btree_identify(uint8 info);
+extern void btree_xlog_startup(void);
+extern void btree_xlog_cleanup(void);
extern void btree_mask(char *pagedata, BlockNumber blkno);
#endif /* NBTXLOG_H */
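
An aside on the SizeOf* macros in nbtxlog.h: they count bytes up to the end of the last field instead of using sizeof on the whole struct, which keeps trailing struct padding out of the logged record. Below is a minimal standalone sketch of that idiom. It assumes OffsetNumber is a 16-bit unsigned integer, and the demo_* names are hypothetical stand-ins that only mirror the patched xl_btree_split layout; they are not part of the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: OffsetNumber is a 16-bit unsigned integer. */
typedef uint16_t OffsetNumber;

/* Hypothetical mirror of the patched xl_btree_split, for illustration only. */
typedef struct demo_btree_split
{
    uint32_t     level;       /* tree level of page being split */
    OffsetNumber firstright;  /* first item moved to right page */
    OffsetNumber newitemoff;  /* new item's offset */
    OffsetNumber postingoff;  /* offset inside orig posting tuple */
} demo_btree_split;

/* Bytes up to and including the last field, excluding trailing padding. */
#define SizeOfDemoSplit \
    (offsetof(demo_btree_split, postingoff) + sizeof(OffsetNumber))

/* The logged size can never exceed the in-memory size of the struct. */
static_assert(SizeOfDemoSplit <= sizeof(demo_btree_split),
              "logged size must fit inside the struct");

/* Number of bytes that would be registered as WAL record data. */
static size_t
demo_split_logged_bytes(void)
{
    return SizeOfDemoSplit;
}

/* In-memory size, which typically includes two bytes of tail padding. */
static size_t
demo_split_padded_bytes(void)
{
    return sizeof(demo_btree_split);
}
```

With these assumed field widths, the macro stops right after postingoff, so adding a trailing field to the struct means the macro must be updated to name the new last field, as the hunk above does.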
diff --git a/src/include/access/rmgrlist.h b/src/include/access/rmgrlist.h
index 3c0db2ccf5..2b8c6c7fc8 100644
--- a/src/include/access/rmgrlist.h
+++ b/src/include/access/rmgrlist.h
@@ -36,7 +36,7 @@ PG_RMGR(RM_RELMAP_ID, "RelMap", relmap_redo, relmap_desc, relmap_identify, NULL,
PG_RMGR(RM_STANDBY_ID, "Standby", standby_redo, standby_desc, standby_identify, NULL, NULL, NULL)
PG_RMGR(RM_HEAP2_ID, "Heap2", heap2_redo, heap2_desc, heap2_identify, NULL, NULL, heap_mask)
PG_RMGR(RM_HEAP_ID, "Heap", heap_redo, heap_desc, heap_identify, NULL, NULL, heap_mask)
-PG_RMGR(RM_BTREE_ID, "Btree", btree_redo, btree_desc, btree_identify, NULL, NULL, btree_mask)
+PG_RMGR(RM_BTREE_ID, "Btree", btree_redo, btree_desc, btree_identify, btree_xlog_startup, btree_xlog_cleanup, btree_mask)
PG_RMGR(RM_HASH_ID, "Hash", hash_redo, hash_desc, hash_identify, NULL, NULL, hash_mask)
PG_RMGR(RM_GIN_ID, "Gin", gin_redo, gin_desc, gin_identify, gin_xlog_startup, gin_xlog_cleanup, gin_mask)
PG_RMGR(RM_GIST_ID, "Gist", gist_redo, gist_desc, gist_identify, gist_xlog_startup, gist_xlog_cleanup, gist_mask)
diff --git a/src/backend/access/common/reloptions.c b/src/backend/access/common/reloptions.c
index d8790ad7a3..d69402c08d 100644
--- a/src/backend/access/common/reloptions.c
+++ b/src/backend/access/common/reloptions.c
@@ -158,6 +158,15 @@ static relopt_bool boolRelOpts[] =
},
true
},
+ {
+ {
+ "deduplication",
+ "Enables deduplication on btree index leaf pages",
+ RELOPT_KIND_BTREE,
+ ShareUpdateExclusiveLock
+ },
+ true
+ },
/* list terminator */
{{NULL}}
};
@@ -1510,8 +1519,6 @@ default_reloptions(Datum reloptions, bool validate, relopt_kind kind)
offsetof(StdRdOptions, user_catalog_table)},
{"parallel_workers", RELOPT_TYPE_INT,
offsetof(StdRdOptions, parallel_workers)},
- {"vacuum_cleanup_index_scale_factor", RELOPT_TYPE_REAL,
- offsetof(StdRdOptions, vacuum_cleanup_index_scale_factor)},
{"vacuum_index_cleanup", RELOPT_TYPE_BOOL,
offsetof(StdRdOptions, vacuum_index_cleanup)},
{"vacuum_truncate", RELOPT_TYPE_BOOL,
diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
index 2599b5d342..6e1dc596e1 100644
--- a/src/backend/access/index/genam.c
+++ b/src/backend/access/index/genam.c
@@ -276,6 +276,10 @@ BuildIndexValueDescription(Relation indexRelation,
/*
* Get the latestRemovedXid from the table entries pointed at by the index
* tuples being deleted.
+ *
+ * Note: index access methods that don't consistently use the standard
+ * IndexTuple + heap TID item pointer representation will need to provide
+ * their own version of this function.
*/
TransactionId
index_compute_xid_horizon_for_tuples(Relation irel,
diff --git a/src/backend/access/nbtree/Makefile b/src/backend/access/nbtree/Makefile
index bf245f5dab..d69808e78c 100644
--- a/src/backend/access/nbtree/Makefile
+++ b/src/backend/access/nbtree/Makefile
@@ -14,6 +14,7 @@ include $(top_builddir)/src/Makefile.global
OBJS = \
nbtcompare.o \
+ nbtdedup.o \
nbtinsert.o \
nbtpage.o \
nbtree.o \
diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index 6db203e75c..54cb9db49d 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -432,7 +432,10 @@ because we allow LP_DEAD to be set with only a share lock (it's exactly
like a hint bit for a heap tuple), but physically removing tuples requires
exclusive lock. In the current code we try to remove LP_DEAD tuples when
we are otherwise faced with having to split a page to do an insertion (and
-hence have exclusive lock on it already).
+hence have exclusive lock on it already). Deduplication can also prevent
+a page split, but removing LP_DEAD tuples is the preferred approach.
+(Note that posting list tuples can only have their LP_DEAD bit set when
+every "logical" tuple represented within the posting list is known dead.)
This leaves the index in a state where it has no entry for a dead tuple
that still exists in the heap. This is not a problem for the current
@@ -710,6 +713,75 @@ the fallback strategy assumes that duplicates are mostly inserted in
ascending heap TID order. The page is split in a way that leaves the left
half of the page mostly full, and the right half of the page mostly empty.
+Notes about deduplication
+-------------------------
+
+We deduplicate non-pivot tuples in non-unique indexes to reduce storage
+overhead, and to avoid or at least delay page splits. Deduplication alters
+the physical representation of tuples without changing the logical contents
+of the index, and without adding overhead to read queries. Non-pivot
+tuples are folded together into a single physical tuple with a posting list
+(a simple array of heap TIDs with the standard item pointer format).
+Deduplication is always applied lazily, at the point where it would
+otherwise be necessary to perform a page split. It occurs only after
+LP_DEAD items have been removed, as our last line of defense against
+splitting a leaf page. We can set the LP_DEAD bit on posting list
+tuples, though only when all table tuples are known dead. (Bitmap scans
+cannot perform LP_DEAD bit setting, and are the common case with indexes
+that contain lots of duplicates, so this downside is considered
+acceptable.)
+
+Large groups of logical duplicates tend to appear together on the same leaf
+page due to the special duplicate logic used when choosing a split point.
+This facilitates lazy/dynamic deduplication. Deduplication can reliably
+deduplicate a large localized group of duplicates before it can span
+multiple leaf pages. Posting list tuples are subject to the same 1/3 of a
+page restriction as any other tuple.
+
+Lazy deduplication allows the page space accounting used during page splits
+to have absolutely minimal special case logic for posting lists. A posting
+list can be thought of as extra payload that suffix truncation will
+reliably truncate away as needed during page splits, just like non-key
+columns from an INCLUDE index tuple. An incoming tuple (which might cause
+a page split) can always be thought of as a non-posting-list tuple that
+must be inserted alongside existing items, without needing to consider
+deduplication. Most of the time, that's what actually happens: incoming
+tuples are either not duplicates, or are duplicates with a heap TID that
+doesn't overlap with any existing posting list tuple. When the incoming
+tuple really does overlap with an existing posting list, a posting list
+split is performed. Posting list splits work in a way that more or less
+preserves the illusion that all incoming tuples do not need to be merged
+with any existing posting list tuple.
+
+Posting list splits work by "overriding" the details of the incoming tuple.
+The heap TID of the incoming tuple is altered to make it match the
+rightmost heap TID from the existing/originally overlapping posting list.
+The offset number that the new/incoming tuple is to be inserted at is
+incremented so that it will be inserted to the right of the existing
+posting list. The insertion (or page split) operation that completes the
+insert does one extra step: an in-place update of the posting list. The
+update changes the posting list such that the "true" heap TID from the
+original incoming tuple is now contained in the posting list. We make
+space in the posting list by removing the heap TID that became the new
+item. The size of the posting list won't change, and so the page split
+space accounting does not need to care about posting lists. Also, overall
+space utilization is improved by keeping existing posting lists large.
+
+The representation of posting lists is identical to the posting lists used
+by GIN, so it would be straightforward to apply GIN's varbyte encoding
+compression scheme to individual posting lists. Posting list compression
+would break the assumptions made by posting list splits about page space
+accounting, though, so it's not clear how compression could be integrated
+with nbtree. Besides, posting list compression does not offer a compelling
+trade-off for nbtree, since in general nbtree is optimized for consistent
+performance with many concurrent readers and writers. A major goal of
+nbtree's lazy approach to deduplication is to limit the performance impact
+of deduplication with random updates. Even concurrent append-only inserts
+of the same key value will tend to have inserts of individual index tuples
+in an order that doesn't quite match heap TID order. In general, delaying
+deduplication avoids many unnecessary posting list splits, and minimizes
+page level fragmentation.
+
Notes About Data Representation
-------------------------------
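
The posting list split described in the README text (override the incoming tuple's heap TID, then update the posting list in place) can be sketched in standalone C. This is only an illustration under stated assumptions: DemoTid is a hypothetical simplified stand-in for ItemPointerData, postingoff is taken as given rather than found by a search, and none of the demo_* names come from the patch.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical simplified heap TID: (block number, offset number). */
typedef struct DemoTid
{
    uint32_t block;
    uint16_t offset;
} DemoTid;

/* Three-way comparison in (block, offset) order. */
static int
demo_tid_cmp(const DemoTid *a, const DemoTid *b)
{
    if (a->block != b->block)
        return a->block < b->block ? -1 : 1;
    if (a->offset != b->offset)
        return a->offset < b->offset ? -1 : 1;
    return 0;
}

/*
 * Swap the incoming TID into a sorted posting list at position postingoff.
 * The list keeps its length: TIDs from postingoff onward shift one place to
 * the right, and the original rightmost TID is handed back through newtid.
 */
static void
demo_swap_posting(DemoTid *posting, int ntids, int postingoff,
                  DemoTid *newtid)
{
    DemoTid rightmost = posting[ntids - 1];

    assert(postingoff >= 0 && postingoff < ntids);
    memmove(&posting[postingoff + 1], &posting[postingoff],
            (size_t) (ntids - postingoff - 1) * sizeof(DemoTid));
    posting[postingoff] = *newtid;
    *newtid = rightmost;
}

/* Returns 1 when the list stays sorted and newtid sorts after all of it. */
static int
demo_swap_example(void)
{
    DemoTid posting[4] = {{1, 1}, {1, 3}, {1, 5}, {1, 7}};
    DemoTid newtid = {1, 4};
    int     i;

    demo_swap_posting(posting, 4, 2, &newtid);

    for (i = 1; i < 4; i++)
        if (demo_tid_cmp(&posting[i - 1], &posting[i]) >= 0)
            return 0;
    return demo_tid_cmp(&posting[3], &newtid) < 0;
}
```

In the example, the list (1,1) (1,3) (1,5) (1,7) receives (1,4) and becomes (1,1) (1,3) (1,4) (1,5), while the tuple that is actually inserted to the right of the posting list carries (1,7). The list size is unchanged, which is the property the page split space accounting relies on.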
diff --git a/src/backend/access/nbtree/nbtdedup.c b/src/backend/access/nbtree/nbtdedup.c
new file mode 100644
index 0000000000..a9f9cd30f5
--- /dev/null
+++ b/src/backend/access/nbtree/nbtdedup.c
@@ -0,0 +1,704 @@
+/*-------------------------------------------------------------------------
+ *
+ * nbtdedup.c
+ * Deduplicate items in Lehman and Yao btrees for Postgres.
+ *
+ * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ * src/backend/access/nbtree/nbtdedup.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/nbtree.h"
+#include "access/nbtxlog.h"
+#include "miscadmin.h"
+#include "utils/rel.h"
+
+
+/*
+ * Try to deduplicate items to free at least enough space to avoid a page
+ * split. This function should be called during insertion, only after LP_DEAD
+ * items were removed by _bt_vacuum_one_page() to prevent a page split.
+ * (We'll have to kill LP_DEAD items here when the page's BTP_HAS_GARBAGE hint
+ * was not set, but that should be rare.)
+ *
+ * The strategy for !checkingunique callers is to perform as much
+ * deduplication as possible to free as much space as possible now, since
+ * making it harder to set LP_DEAD bits is considered an acceptable price for
+ * not having to deduplicate the same page many times. It is unlikely that
+ * the items on the page will have their LP_DEAD bit set in the future, since
+ * that hasn't happened before now (besides, entire posting lists can still
+ * have their LP_DEAD bit set).
+ *
+ * The strategy for checkingunique callers is rather different, since the
+ * overall goal is different. Deduplication cooperates with and enhances
+ * garbage collection, especially the LP_DEAD bit setting that takes place in
+ * _bt_check_unique(). Deduplication does as little as possible while still
+ * preventing a page split for caller, since it's less likely that posting
+ * lists will have their LP_DEAD bit set. Deduplication avoids creating new
+ * posting lists with only two heap TIDs, and also avoids creating new posting
+ * lists from an existing posting list. Deduplication is only useful when it
+ * delays a page split long enough for garbage collection to prevent the page
+ * split altogether. checkingunique deduplication can make all the difference
+ * in cases where VACUUM keeps up with dead index tuples, but "recently dead"
+ * index tuples are still numerous enough to cause page splits that are truly
+ * unnecessary.
+ *
+ * Note: If newitem contains NULL values in key attributes, caller will be
+ * !checkingunique even when rel is a unique index. The page in question will
+ * usually have many existing items with NULLs.
+ */
+void
+_bt_dedup_one_page(Relation rel, Buffer buffer, Relation heapRel,
+ IndexTuple newitem, Size newitemsz, bool checkingunique)
+{
+ OffsetNumber offnum,
+ minoff,
+ maxoff;
+ Page page = BufferGetPage(buffer);
+ BTPageOpaque oopaque;
+ BTDedupState *state = NULL;
+ int natts = IndexRelationGetNumberOfAttributes(rel);
+ OffsetNumber deletable[MaxIndexTuplesPerPage];
+ bool minimal = checkingunique;
+ int ndeletable = 0;
+ Size pagesaving = 0;
+ int count = 0;
+ bool singlevalue = false;
+
+ oopaque = (BTPageOpaque) PageGetSpecialPointer(page);
+ /* init deduplication state needed to build posting tuples */
+ state = (BTDedupState *) palloc(sizeof(BTDedupState));
+ state->rel = rel;
+
+ state->maxitemsize = BTMaxItemSize(page);
+ state->newitem = newitem;
+ state->checkingunique = checkingunique;
+ state->skippedbase = InvalidOffsetNumber;
+ /* Metadata about current pending posting list */
+ state->htids = NULL;
+ state->nhtids = 0;
+ state->nitems = 0;
+ state->alltupsize = 0;
+ state->overlap = false;
+ /* Metadata about base tuple of current pending posting list */
+ state->base = NULL;
+ state->baseoff = InvalidOffsetNumber;
+ state->basetupsize = 0;
+
+ minoff = P_FIRSTDATAKEY(oopaque);
+ maxoff = PageGetMaxOffsetNumber(page);
+
+ /*
+ * Delete dead tuples if any. We cannot simply skip them in the loop
+ * below, because it's necessary to generate a special WAL record
+ * containing such tuples to compute latestRemovedXid on a standby server
+ * later.
+ *
+ * This should not affect performance, since it only can happen in a rare
+ * situation when BTP_HAS_GARBAGE flag was not set and _bt_vacuum_one_page
+ * was not called, or _bt_vacuum_one_page didn't remove all dead items.
+ */
+ for (offnum = minoff;
+ offnum <= maxoff;
+ offnum = OffsetNumberNext(offnum))
+ {
+ ItemId itemid = PageGetItemId(page, offnum);
+
+ if (ItemIdIsDead(itemid))
+ deletable[ndeletable++] = offnum;
+ }
+
+ if (ndeletable > 0)
+ {
+ /*
+ * Skip deduplication in rare cases where there were LP_DEAD items
+ * encountered here when that frees sufficient space for caller to
+ * avoid a page split
+ */
+ _bt_delitems_delete(rel, buffer, deletable, ndeletable, heapRel);
+ if (PageGetFreeSpace(page) >= newitemsz)
+ {
+ pfree(state);
+ return;
+ }
+
+ /* Continue with deduplication */
+ minoff = P_FIRSTDATAKEY(oopaque);
+ maxoff = PageGetMaxOffsetNumber(page);
+ }
+
+ /* Make sure that the page won't have its garbage flag set */
+ oopaque->btpo_flags &= ~BTP_HAS_GARBAGE;
+
+ /* Passed-in newitemsz is MAXALIGNED but does not include line pointer */
+ newitemsz += sizeof(ItemIdData);
+ /* Conservatively size array */
+ state->htids = palloc(state->maxitemsize);
+
+ /*
+ * Determine if a "single value" strategy page split is likely to occur
+ * shortly after deduplication finishes. It should be possible for the
+ * single value split to find a split point that packs the left half of
+ * the split BTREE_SINGLEVAL_FILLFACTOR% full.
+ */
+ if (!checkingunique)
+ {
+ ItemId itemid;
+ IndexTuple itup;
+
+ itemid = PageGetItemId(page, minoff);
+ itup = (IndexTuple) PageGetItem(page, itemid);
+
+ if (_bt_keep_natts_fast(rel, newitem, itup) > natts)
+ {
+ itemid = PageGetItemId(page, PageGetMaxOffsetNumber(page));
+ itup = (IndexTuple) PageGetItem(page, itemid);
+
+ /*
+ * Use different strategy if future page split likely to need to
+ * use "single value" strategy
+ */
+ if (_bt_keep_natts_fast(rel, newitem, itup) > natts)
+ singlevalue = true;
+ }
+ }
+
+ /*
+ * Iterate over tuples on the page, try to deduplicate them into posting
+ * lists, and update the page in place. NOTE: It's essential to reassess
+ * the max offset on each iteration, since it will change as items are
+ * deduplicated.
+ */
+ offnum = minoff;
+retry:
+ while (offnum <= PageGetMaxOffsetNumber(page))
+ {
+ ItemId itemid = PageGetItemId(page, offnum);
+ IndexTuple itup = (IndexTuple) PageGetItem(page, itemid);
+
+ Assert(!ItemIdIsDead(itemid));
+
+ if (state->nitems == 0)
+ {
+ /*
+ * No previous/base tuple for the data item -- use the data item
+ * as base tuple of pending posting list
+ */
+ _bt_dedup_start_pending(state, itup, offnum);
+ }
+ else if (_bt_keep_natts_fast(rel, state->base, itup) > natts &&
+ _bt_dedup_save_htid(state, itup))
+ {
+ /*
+ * Tuple is equal to base tuple of pending posting list. Heap
+ * TID(s) for itup have been saved in state. The next iteration
+ * will also end up here if it's possible to merge the next tuple
+ * into the same pending posting list.
+ */
+ }
+ else
+ {
+ /*
+ * Tuple is not equal to pending posting list tuple, or
+ * _bt_dedup_save_htid() opted to not merge current item into
+ * pending posting list for some other reason (e.g., adding more
+ * TIDs would have caused posting list to exceed BTMaxItemSize()
+ * limit).
+ *
+ * If state contains pending posting list with more than one item,
+ * form new posting tuple, and update the page. Otherwise, reset
+ * the state and move on.
+ */
+ pagesaving += _bt_dedup_finish_pending(buffer, state,
+ RelationNeedsWAL(rel));
+
+ count++;
+
+ /*
+ * When caller is a checkingunique caller and we have deduplicated
+ * enough to avoid a page split, do minimal deduplication in case
+ * the remaining items are about to be marked dead within
+ * _bt_check_unique().
+ */
+ if (minimal && pagesaving >= newitemsz)
+ break;
+
+ /*
+ * Consider special steps when a future page split of the leaf
+ * page is likely to occur using nbtsplitloc.c's "single value"
+ * strategy
+ */
+ if (singlevalue)
+ {
+ /*
+ * Adjust maxitemsize so that there isn't a third and final
+ * 1/3 of a page width tuple that fills the page to capacity.
+ * The third tuple produced should be smaller than the first
+ * two by an amount equal to the free space that nbtsplitloc.c
+ * is likely to want to leave behind when the page is split.
+ * When there are 3 posting lists on the page, then we end
+ * deduplication. Remaining tuples on the page can be
+ * deduplicated later, when they're on the new right sibling
+ * of this page, and the new sibling page needs to be split in
+ * turn.
+ *
+ * Note that it doesn't matter if there are items on the page
+ * that were already 1/3 of a page during current pass;
+ * they'll still count as the first two posting list tuples.
+ */
+ if (count == 2)
+ {
+ Size totalspace;
+
+ totalspace = PageGetPageSize(page) - SizeOfPageHeaderData -
+ MAXALIGN(sizeof(BTPageOpaqueData));
+ state->maxitemsize -= totalspace *
+ ((100 - BTREE_SINGLEVAL_FILLFACTOR) / 100.0);
+ }
+ else if (count == 3)
+ break;
+ }
+
+ /*
+ * Next iteration starts immediately after base tuple offset (this
+ * will be the next offset on the page when we didn't modify the
+ * page)
+ */
+ offnum = state->baseoff;
+ }
+
+ offnum = OffsetNumberNext(offnum);
+ }
+
+ /* Handle the last item when pending posting list is not empty */
+ if (state->nitems != 0)
+ {
+ pagesaving += _bt_dedup_finish_pending(buffer, state,
+ RelationNeedsWAL(rel));
+ count++;
+ }
+
+ if (pagesaving < newitemsz && state->skippedbase != InvalidOffsetNumber)
+ {
+ /*
+ * Didn't free enough space for new item in first checkingunique pass.
+ * Try making a second pass over the page, this time starting from the
+ * first candidate posting list base offset that was skipped over in
+ * the first pass (only do a second pass when this actually happened).
+ *
+ * The second pass over the page may deduplicate items that were
+ * initially passed over due to concerns about limiting the
+ * effectiveness of LP_DEAD bit setting within _bt_check_unique().
+ * Note that the second pass will still stop deduplicating as soon as
+ * enough space has been freed to avoid an immediate page split.
+ */
+ Assert(state->checkingunique);
+ offnum = state->skippedbase;
+
+ state->checkingunique = false;
+ state->skippedbase = InvalidOffsetNumber;
+ state->alltupsize = 0;
+ state->nitems = 0;
+ state->base = NULL;
+ state->baseoff = InvalidOffsetNumber;
+ state->basetupsize = 0;
+ goto retry;
+ }
+
+ /* Local space accounting should agree with page accounting */
+ Assert(pagesaving < newitemsz || PageGetExactFreeSpace(page) >= newitemsz);
+
+ /* be tidy */
+ pfree(state->htids);
+ pfree(state);
+}
+
+/*
+ * Create a new pending posting list tuple based on caller's tuple.
+ *
+ * Every tuple processed by the deduplication routines either becomes the base
+ * tuple for a posting list, or gets its heap TID(s) accepted into a pending
+ * posting list. A tuple that starts out as the base tuple for a posting list
+ * will only actually be rewritten within _bt_dedup_finish_pending() when
+ * there was at least one successful call to _bt_dedup_save_htid().
+ */
+void
+_bt_dedup_start_pending(BTDedupState *state, IndexTuple base,
+ OffsetNumber baseoff)
+{
+ Assert(state->nhtids == 0);
+ Assert(state->nitems == 0);
+
+ /*
+ * Copy heap TIDs from new base tuple for new candidate posting list into
+ * htids array. Assume that we'll eventually create a new posting tuple by
+ * merging later tuples with this existing one, though we may not.
+ */
+ if (!BTreeTupleIsPosting(base))
+ {
+ memcpy(state->htids, base, sizeof(ItemPointerData));
+ state->nhtids = 1;
+ /* Save size of tuple without any posting list */
+ state->basetupsize = IndexTupleSize(base);
+ }
+ else
+ {
+ int nposting;
+
+ nposting = BTreeTupleGetNPosting(base);
+ memcpy(state->htids, BTreeTupleGetPosting(base),
+ sizeof(ItemPointerData) * nposting);
+ state->nhtids = nposting;
+ /* Save size of tuple without any posting list */
+ state->basetupsize = BTreeTupleGetPostingOffset(base);
+ }
+
+ /*
+ * Save new base tuple itself -- it'll be needed if we actually create a
+ * new posting list from new pending posting list.
+ *
+ * Must maintain size of all tuples (including line pointer overhead) to
+ * calculate space savings on page within _bt_dedup_finish_pending().
+ * Also, save number of base tuple logical tuples so that we can save
+ * cycles in the common case where an existing posting list can't or won't
+ * be merged with other tuples on the page.
+ */
+ state->nitems = 1;
+ state->base = base;
+ state->baseoff = baseoff;
+ state->alltupsize = MAXALIGN(IndexTupleSize(base)) + sizeof(ItemIdData);
+ /* Also save baseoff in pending state for interval */
+ state->interval.baseoff = state->baseoff;
+ state->overlap = false;
+ if (state->newitem)
+ {
+ /* Might overlap with new item -- mark it as possible if it is */
+ if (ItemPointerCompare(BTreeTupleGetHeapTID(base), BTreeTupleGetHeapTID(state->newitem)) < 0)
+ state->overlap = true;
+ }
+}
+
+/*
+ * Save itup heap TID(s) into pending posting list where possible.
+ *
+ * Returns bool indicating if the pending posting list managed by state has
+ * itup's heap TID(s) saved. When this is false, enlarging the pending
+ * posting list by the required amount would exceed the maxitemsize limit, so
+ * caller must finish the pending posting list tuple. (Generally itup becomes
+ * the base tuple of caller's new pending posting list).
+ */
+bool
+_bt_dedup_save_htid(BTDedupState *state, IndexTuple itup)
+{
+ int nhtids;
+ ItemPointer htids;
+ Size mergedtupsz;
+
+ if (!BTreeTupleIsPosting(itup))
+ {
+ nhtids = 1;
+ htids = &itup->t_tid;
+ }
+ else
+ {
+ nhtids = BTreeTupleGetNPosting(itup);
+ htids = BTreeTupleGetPosting(itup);
+ }
+
+ /*
+ * Don't append (have caller finish pending posting list as-is) if
+ * appending heap TID(s) from itup would put us over limit
+ */
+ mergedtupsz = MAXALIGN(state->basetupsize +
+ (state->nhtids + nhtids) *
+ sizeof(ItemPointerData));
+
+ if (mergedtupsz > state->maxitemsize)
+ return false;
+
+ /* Don't merge existing posting lists with checkingunique */
+ if (state->checkingunique &&
+ (BTreeTupleIsPosting(state->base) || nhtids > 1))
+ {
+ /* May begin here if second pass over page is required */
+ if (state->skippedbase == InvalidOffsetNumber)
+ state->skippedbase = state->baseoff;
+ return false;
+ }
+
+ if (state->overlap)
+ {
+ if (ItemPointerCompare(BTreeTupleGetMaxHeapTID(itup), BTreeTupleGetHeapTID(state->newitem)) > 0)
+ {
+ /*
+ * newitem has heap TID in the range of the would-be new posting
+ * list. Avoid an immediate posting list split for caller.
+ */
+ if (_bt_keep_natts_fast(state->rel, state->newitem, itup) >
+ IndexRelationGetNumberOfAttributes(state->rel))
+ {
+ state->newitem = NULL; /* avoid unnecessary comparisons */
+ return false;
+ }
+ }
+ }
+
+ /*
+ * Save heap TIDs to pending posting list tuple -- itup can be merged into
+ * pending posting list
+ */
+ state->nitems++;
+ memcpy(state->htids + state->nhtids, htids,
+ sizeof(ItemPointerData) * nhtids);
+ state->nhtids += nhtids;
+ state->alltupsize += MAXALIGN(IndexTupleSize(itup)) + sizeof(ItemIdData);
+
+ return true;
+}
+
+/*
+ * Finalize pending posting list tuple, and add it to the page. Final tuple
+ * is based on saved base tuple, and saved list of heap TIDs.
+ *
+ * Returns space saving from deduplicating to make a new posting list tuple.
+ * Note that this includes line pointer overhead. This is zero in the case
+ * where no deduplication was possible.
+ */
+Size
+_bt_dedup_finish_pending(Buffer buffer, BTDedupState *state, bool need_wal)
+{
+ Size spacesaving = 0;
+ Page page = BufferGetPage(buffer);
+ int minimum = 2;
+
+ Assert(state->nitems > 0);
+ Assert(state->nitems <= state->nhtids);
+ Assert(state->interval.baseoff == state->baseoff);
+
+ /*
+ * Only create a posting list when at least 3 heap TIDs will appear in the
+ * checkingunique case (checkingunique strategy won't merge existing
+ * posting list tuples, so we know that the number of items here must also
+ * be the total number of heap TIDs). Creating a new posting list with
+ * only two heap TIDs won't even save enough space to fit another
+ * duplicate with the same key as the posting list. This is a bad
+ * trade-off if there is a chance that the LP_DEAD bit can be set for
+ * either existing tuple by putting off deduplication.
+ *
+ * (Note that a second pass over the page can deduplicate the item if that
+ * is truly the only way to avoid a page split for checkingunique caller)
+ */
+ Assert(!state->checkingunique || state->nitems == 1 ||
+ state->nhtids == state->nitems);
+ if (state->checkingunique)
+ {
+ minimum = 3;
+ /* May begin here if second pass over page is required */
+ if (state->nitems == 2 && state->skippedbase == InvalidOffsetNumber)
+ state->skippedbase = state->baseoff;
+ }
+
+ if (state->nitems >= minimum)
+ {
+ IndexTuple final;
+ Size finalsz;
+ OffsetNumber offnum;
+ OffsetNumber deletable[MaxOffsetNumber];
+ int ndeletable = 0;
+
+ /* find all tuples that will be replaced with this new posting tuple */
+ for (offnum = state->baseoff;
+ offnum < state->baseoff + state->nitems;
+ offnum = OffsetNumberNext(offnum))
+ deletable[ndeletable++] = offnum;
+
+ /* Form a tuple with a posting list */
+ final = _bt_form_posting(state->base, state->htids, state->nhtids);
+ finalsz = IndexTupleSize(final);
+ spacesaving = state->alltupsize - (finalsz + sizeof(ItemIdData));
+ /* Must have saved some space */
+ Assert(spacesaving > 0 && spacesaving < BLCKSZ);
+
+ /* Save final number of items for posting list */
+ state->interval.nitems = state->nitems;
+
+ Assert(finalsz <= state->maxitemsize);
+ Assert(finalsz == MAXALIGN(IndexTupleSize(final)));
+
+ START_CRIT_SECTION();
+
+ /* Delete items to replace */
+ PageIndexMultiDelete(page, deletable, ndeletable);
+ /* Insert posting tuple */
+ if (PageAddItem(page, (Item) final, finalsz, state->baseoff, false,
+ false) == InvalidOffsetNumber)
+ elog(ERROR, "deduplication failed to add tuple to page");
+
+ MarkBufferDirty(buffer);
+
+ /* Log deduplicated items */
+ if (need_wal)
+ {
+ XLogRecPtr recptr;
+ xl_btree_dedup xlrec_dedup;
+
+ xlrec_dedup.baseoff = state->interval.baseoff;
+ xlrec_dedup.nitems = state->interval.nitems;
+
+ XLogBeginInsert();
+ XLogRegisterBuffer(0, buffer, REGBUF_STANDARD);
+ XLogRegisterData((char *) &xlrec_dedup, SizeOfBtreeDedup);
+
+ recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_DEDUP_PAGE);
+
+ PageSetLSN(page, recptr);
+ }
+
+ END_CRIT_SECTION();
+
+ pfree(final);
+ }
+
+ /* Reset state for next pending posting list */
+ state->nhtids = 0;
+ state->nitems = 0;
+ state->alltupsize = 0;
+
+ return spacesaving;
+}
+
+/*
+ * Build a posting list tuple from a "base" index tuple and a list of heap
+ * TIDs for posting list.
+ *
+ * Caller's "htids" array must be sorted in ascending order. Any heap TIDs
+ * from caller's base tuple will not appear in returned posting list.
+ *
+ * If nhtids == 1, builds a non-posting tuple (posting list tuples can never
+ * have a single heap TID).
+ */
+IndexTuple
+_bt_form_posting(IndexTuple tuple, ItemPointer htids, int nhtids)
+{
+ uint32 keysize,
+ newsize = 0;
+ IndexTuple itup;
+
+ /* We only need key part of the tuple */
+ if (BTreeTupleIsPosting(tuple))
+ keysize = BTreeTupleGetPostingOffset(tuple);
+ else
+ keysize = IndexTupleSize(tuple);
+
+ Assert(nhtids > 0);
+
+ /* Add space needed for posting list */
+ if (nhtids > 1)
+ newsize = SHORTALIGN(keysize) + sizeof(ItemPointerData) * nhtids;
+ else
+ newsize = keysize;
+
+ newsize = MAXALIGN(newsize);
+ itup = palloc0(newsize);
+ memcpy(itup, tuple, keysize);
+ itup->t_info &= ~INDEX_SIZE_MASK;
+ itup->t_info |= newsize;
+
+ if (nhtids > 1)
+ {
+ /* Form posting tuple, fill posting fields */
+
+ itup->t_info |= INDEX_ALT_TID_MASK;
+ BTreeSetPostingMeta(itup, nhtids, SHORTALIGN(keysize));
+ /* Copy posting list into the posting tuple */
+ memcpy(BTreeTupleGetPosting(itup), htids,
+ sizeof(ItemPointerData) * nhtids);
+
+#ifdef USE_ASSERT_CHECKING
+ {
+ /* Assert that htid array is sorted and has unique TIDs */
+ ItemPointerData last;
+ ItemPointer current;
+
+ ItemPointerCopy(BTreeTupleGetHeapTID(itup), &last);
+
+ for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
+ {
+ current = BTreeTupleGetPostingN(itup, i);
+ Assert(ItemPointerCompare(current, &last) > 0);
+ ItemPointerCopy(current, &last);
+ }
+ }
+#endif
+ }
+ else
+ {
+ /* To finish building a non-posting tuple, copy TID from htids */
+ itup->t_info &= ~INDEX_ALT_TID_MASK;
+ ItemPointerCopy(htids, &itup->t_tid);
+ }
+
+ return itup;
+}
+
+/*
+ * Prepare for a posting list split by swapping heap TID in newitem with heap
+ * TID from original posting list (the 'oposting' heap TID located at offset
+ * 'postingoff').
+ *
+ * Returns new posting list tuple, which is palloc()'d in caller's context.
+ * This is guaranteed to be the same size as 'oposting'. Modified version of
+ * newitem is what caller actually inserts inside the critical section that
+ * also performs an in-place update of posting list.
+ *
+ * Explicit WAL-logging of newitem must use the original version of newitem in
+ * order to make it possible for our nbtxlog.c callers to correctly REDO
+ * original steps. (This approach avoids any explicit WAL-logging of a
+ * posting list tuple. This is important because posting lists are often much
+ * larger than plain tuples.)
+ */
+IndexTuple
+_bt_swap_posting(IndexTuple newitem, IndexTuple oposting,
+ OffsetNumber postingoff)
+{
+ int nhtids;
+ char *replacepos;
+ char *rightpos;
+ Size nbytes;
+ IndexTuple nposting;
+
+ Assert(BTreeTupleIsPosting(oposting));
+ nhtids = BTreeTupleGetNPosting(oposting);
+ Assert(postingoff < nhtids);
+
+ nposting = CopyIndexTuple(oposting);
+ replacepos = (char *) BTreeTupleGetPostingN(nposting, postingoff);
+ rightpos = replacepos + sizeof(ItemPointerData);
+ nbytes = (nhtids - postingoff - 1) * sizeof(ItemPointerData);
+
+ /*
+ * Move item pointers in posting list to make a gap for the new item's
+ * heap TID (shift TIDs one place to the right, losing original rightmost
+ * TID)
+ */
+ memmove(rightpos, replacepos, nbytes);
+
+ /* Fill the gap with the TID of the new item */
+ ItemPointerCopy(&newitem->t_tid, (ItemPointer) replacepos);
+
+ /* Copy original posting list's rightmost TID into new item */
+ ItemPointerCopy(BTreeTupleGetPostingN(oposting, nhtids - 1),
+ &newitem->t_tid);
+ Assert(ItemPointerCompare(BTreeTupleGetMaxHeapTID(nposting),
+ BTreeTupleGetHeapTID(newitem)) < 0);
+ Assert(BTreeTupleGetNPosting(nposting) == BTreeTupleGetNPosting(oposting));
+
+ return nposting;
+}
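
The TID swap performed by _bt_swap_posting() above may be easier to follow on a plain array. What follows is only a minimal sketch, not the patch's code: the Tid struct, tid_compare() and swap_posting() are hypothetical stand-ins for ItemPointerData, ItemPointerCompare() and the real routine, and the sketch modifies the array in place where the patch works on a palloc()'d copy of the posting list tuple.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for ItemPointerData: a (block, offset) pair. */
typedef struct Tid
{
    unsigned block;
    unsigned offset;
} Tid;

/* Three-way comparison, similar in spirit to ItemPointerCompare(). */
int
tid_compare(const Tid *a, const Tid *b)
{
    if (a->block != b->block)
        return (a->block < b->block) ? -1 : 1;
    if (a->offset != b->offset)
        return (a->offset < b->offset) ? -1 : 1;
    return 0;
}

/*
 * Place *newtid at position postingoff of the sorted array tids[0..ntids-1],
 * shifting the entries from postingoff onwards one place to the right.  The
 * original rightmost TID falls off the end and is handed back through
 * *newtid, so the array keeps exactly the same length.
 */
void
swap_posting(Tid *tids, int ntids, int postingoff, Tid *newtid)
{
    Tid rightmost = tids[ntids - 1];

    assert(postingoff > 0 && postingoff < ntids);

    /* Open a gap at postingoff; the last element is overwritten */
    memmove(&tids[postingoff + 1], &tids[postingoff],
            (size_t) (ntids - postingoff - 1) * sizeof(Tid));

    /* Fill the gap with the incoming TID */
    tids[postingoff] = *newtid;

    /* The caller's item now carries the displaced rightmost TID */
    *newtid = rightmost;

    assert(tid_compare(&tids[ntids - 1], newtid) < 0);
}
```

Because the list keeps its length and the caller's item keeps its size, the space accounting for the insert does not change, which is the property the comments in the patch rely on.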
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index b84bf1c3df..3103d8eb56 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -47,10 +47,12 @@ static void _bt_insertonpg(Relation rel, BTScanInsert itup_key,
BTStack stack,
IndexTuple itup,
OffsetNumber newitemoff,
+ int postingoff,
bool split_only_page);
static Buffer _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf,
Buffer cbuf, OffsetNumber newitemoff, Size newitemsz,
- IndexTuple newitem);
+ IndexTuple newitem, IndexTuple orignewitem,
+ IndexTuple nposting, OffsetNumber postingoff);
static void _bt_insert_parent(Relation rel, Buffer buf, Buffer rbuf,
BTStack stack, bool is_root, bool is_only);
static bool _bt_pgaddtup(Page page, Size itemsize, IndexTuple itup,
@@ -61,7 +63,8 @@ static void _bt_vacuum_one_page(Relation rel, Buffer buffer, Relation heapRel);
* _bt_doinsert() -- Handle insertion of a single index tuple in the tree.
*
* This routine is called by the public interface routine, btinsert.
- * By here, itup is filled in, including the TID.
+ * By here, itup is filled in, including the TID. Caller should be
+ * prepared for us to scribble on 'itup'.
*
* If checkUnique is UNIQUE_CHECK_NO or UNIQUE_CHECK_PARTIAL, this
* will allow duplicates. Otherwise (UNIQUE_CHECK_YES or
@@ -125,6 +128,7 @@ _bt_doinsert(Relation rel, IndexTuple itup,
insertstate.itup_key = itup_key;
insertstate.bounds_valid = false;
insertstate.buf = InvalidBuffer;
+ insertstate.postingoff = 0;
/*
* It's very common to have an index on an auto-incremented or
@@ -300,7 +304,7 @@ top:
newitemoff = _bt_findinsertloc(rel, &insertstate, checkingunique,
stack, heapRel);
_bt_insertonpg(rel, itup_key, insertstate.buf, InvalidBuffer, stack,
- itup, newitemoff, false);
+ itup, newitemoff, insertstate.postingoff, false);
}
else
{
@@ -353,6 +357,9 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
BTPageOpaque opaque;
Buffer nbuf = InvalidBuffer;
bool found = false;
+ bool inposting = false;
+ bool prev_all_dead = true;
+ int curposti = 0;
/* Assume unique until we find a duplicate */
*is_unique = true;
@@ -374,6 +381,11 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
/*
* Scan over all equal tuples, looking for live conflicts.
+ *
+ * Note that each iteration of the loop processes one heap TID, not one
+ * index tuple. The page offset number won't be advanced for iterations
+ * which process heap TIDs from posting list tuples until the last such
+ * heap TID for the posting list (curposti will be advanced instead).
*/
Assert(!insertstate->bounds_valid || insertstate->low == offset);
Assert(!itup_key->anynullkeys);
@@ -435,7 +447,27 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
/* okay, we gotta fetch the heap tuple ... */
curitup = (IndexTuple) PageGetItem(page, curitemid);
- htid = curitup->t_tid;
+
+ /*
+ * decide if this is the first heap TID in tuple we'll
+ * process, or if we should continue to process current
+ * posting list
+ */
+ if (!BTreeTupleIsPosting(curitup))
+ {
+ htid = curitup->t_tid;
+ inposting = false;
+ }
+ else if (!inposting)
+ {
+ /* First heap TID in posting list */
+ inposting = true;
+ prev_all_dead = true;
+ curposti = 0;
+ }
+
+ if (inposting)
+ htid = *BTreeTupleGetPostingN(curitup, curposti);
/*
* If we are doing a recheck, we expect to find the tuple we
@@ -511,8 +543,7 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
* not part of this chain because it had a different index
* entry.
*/
- htid = itup->t_tid;
- if (table_index_fetch_tuple_check(heapRel, &htid,
+ if (table_index_fetch_tuple_check(heapRel, &itup->t_tid,
SnapshotSelf, NULL))
{
/* Normal case --- it's still live */
@@ -570,12 +601,14 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
RelationGetRelationName(rel))));
}
}
- else if (all_dead)
+ else if (all_dead && (!inposting ||
+ (prev_all_dead &&
+ curposti == BTreeTupleGetNPosting(curitup) - 1)))
{
/*
- * The conflicting tuple (or whole HOT chain) is dead to
- * everyone, so we may as well mark the index entry
- * killed.
+ * The conflicting tuple (or all HOT chains pointed to by
+ * all posting list TIDs) is dead to everyone, so mark the
+ * index entry killed.
*/
ItemIdMarkDead(curitemid);
opaque->btpo_flags |= BTP_HAS_GARBAGE;
@@ -589,14 +622,29 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
else
MarkBufferDirtyHint(insertstate->buf, true);
}
+
+ /*
+ * Remember if posting list tuple has even a single HOT chain
+ * whose members are not all dead
+ */
+ if (!all_dead && inposting)
+ prev_all_dead = false;
}
}
- /*
- * Advance to next tuple to continue checking.
- */
- if (offset < maxoff)
+ if (inposting && curposti < BTreeTupleGetNPosting(curitup) - 1)
+ {
+ /* Advance to next TID in same posting list */
+ curposti++;
+ continue;
+ }
+ else if (offset < maxoff)
+ {
+ /* Advance to next tuple */
+ curposti = 0;
+ inposting = false;
offset = OffsetNumberNext(offset);
+ }
else
{
int highkeycmp;
@@ -621,6 +669,8 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
elog(ERROR, "fell off the end of index \"%s\"",
RelationGetRelationName(rel));
}
+ curposti = 0;
+ inposting = false;
maxoff = PageGetMaxOffsetNumber(page);
offset = P_FIRSTDATAKEY(opaque);
/* Don't invalidate binary search bounds */
@@ -689,6 +739,7 @@ _bt_findinsertloc(Relation rel,
BTScanInsert itup_key = insertstate->itup_key;
Page page = BufferGetPage(insertstate->buf);
BTPageOpaque lpageop;
+ OffsetNumber location;
lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -751,13 +802,26 @@ _bt_findinsertloc(Relation rel,
/*
* If the target page is full, see if we can obtain enough space by
- * erasing LP_DEAD items
+ * erasing LP_DEAD items. If that doesn't work out, and if the index
+ * deduplication is both possible and enabled, try deduplication.
*/
- if (PageGetFreeSpace(page) < insertstate->itemsz &&
- P_HAS_GARBAGE(lpageop))
+ if (PageGetFreeSpace(page) < insertstate->itemsz)
{
- _bt_vacuum_one_page(rel, insertstate->buf, heapRel);
- insertstate->bounds_valid = false;
+ if (P_HAS_GARBAGE(lpageop))
+ {
+ _bt_vacuum_one_page(rel, insertstate->buf, heapRel);
+ insertstate->bounds_valid = false;
+ }
+
+ if (insertstate->itup_key->safededup &&
+ BtreeGetDoDedupOption(rel) &&
+ PageGetFreeSpace(page) < insertstate->itemsz)
+ {
+ _bt_dedup_one_page(rel, insertstate->buf, heapRel,
+ insertstate->itup, insertstate->itemsz,
+ checkingunique);
+ insertstate->bounds_valid = false;
+ }
}
}
else
@@ -839,7 +903,38 @@ _bt_findinsertloc(Relation rel,
Assert(P_RIGHTMOST(lpageop) ||
_bt_compare(rel, itup_key, page, P_HIKEY) <= 0);
- return _bt_binsrch_insert(rel, insertstate);
+ location = _bt_binsrch_insert(rel, insertstate);
+
+ /*
+ * Insertion is not prepared for the case where an LP_DEAD posting list
+ * tuple must be split. In the unlikely event that this happens, call
+ * _bt_dedup_one_page() to force it to kill all LP_DEAD items.
+ */
+ if (unlikely(insertstate->postingoff == -1))
+ {
+ Assert(insertstate->itup_key->safededup);
+
+ /*
+ * Don't check if the option is enabled, since no actual deduplication
+ * will be done, just cleanup.
+ */
+ _bt_dedup_one_page(rel, insertstate->buf, heapRel, insertstate->itup,
+ 0, checkingunique);
+ Assert(!P_HAS_GARBAGE(lpageop));
+
+ /* Must reset insertstate ahead of new _bt_binsrch_insert() call */
+ insertstate->bounds_valid = false;
+ insertstate->postingoff = 0;
+ location = _bt_binsrch_insert(rel, insertstate);
+
+ /*
+ * Might still have to split some other posting list now, but that
+ * should never be LP_DEAD
+ */
+ Assert(insertstate->postingoff >= 0);
+ }
+
+ return location;
}
/*
@@ -905,10 +1000,12 @@ _bt_stepright(Relation rel, BTInsertState insertstate, BTStack stack)
*
* This recursive procedure does the following things:
*
+ * + if necessary, splits an existing posting list on page.
+ * This is only needed when 'postingoff' is non-zero.
* + if necessary, splits the target page, using 'itup_key' for
* suffix truncation on leaf pages (caller passes NULL for
* non-leaf pages).
- * + inserts the tuple.
+ * + inserts the new tuple (could be from split posting list).
* + if the page was split, pops the parent stack, and finds the
* right place to insert the new child pointer (by walking
* right using information stored in the parent stack).
@@ -918,7 +1015,8 @@ _bt_stepright(Relation rel, BTInsertState insertstate, BTStack stack)
*
* On entry, we must have the correct buffer in which to do the
* insertion, and the buffer must be pinned and write-locked. On return,
- * we will have dropped both the pin and the lock on the buffer.
+ * we will have dropped both the pin and the lock on the buffer. Caller
+ * should be prepared for us to scribble on 'itup'.
*
* This routine only performs retail tuple insertions. 'itup' should
* always be either a non-highkey leaf item, or a downlink (new high
@@ -936,11 +1034,15 @@ _bt_insertonpg(Relation rel,
BTStack stack,
IndexTuple itup,
OffsetNumber newitemoff,
+ int postingoff,
bool split_only_page)
{
Page page;
BTPageOpaque lpageop;
Size itemsz;
+ IndexTuple oposting;
+ IndexTuple origitup = NULL;
+ IndexTuple nposting = NULL;
page = BufferGetPage(buf);
lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -954,6 +1056,8 @@ _bt_insertonpg(Relation rel,
Assert(P_ISLEAF(lpageop) ||
BTreeTupleGetNAtts(itup, rel) <=
IndexRelationGetNumberOfKeyAttributes(rel));
+ /* retail insertions of posting list tuples are disallowed */
+ Assert(!BTreeTupleIsPosting(itup));
/* The caller should've finished any incomplete splits already. */
if (P_INCOMPLETE_SPLIT(lpageop))
@@ -964,6 +1068,43 @@ _bt_insertonpg(Relation rel,
itemsz = MAXALIGN(itemsz); /* be safe, PageAddItem will do this but we
* need to be consistent */
+ /*
+ * Do we need to split an existing posting list item?
+ */
+ if (postingoff != 0)
+ {
+ ItemId itemid = PageGetItemId(page, newitemoff);
+
+ /*
+ * The new tuple is a duplicate with a heap TID that falls inside the
+ * range of an existing posting list tuple on a leaf page. Prepare to
+ * split an existing posting list by swapping new item's heap TID with
+ * the rightmost heap TID from original posting list, and generating a
+ * new version of the posting list that has new item's heap TID.
+ *
+ * Posting list splits work by modifying the overlapping posting list
+ * as part of the same atomic operation that inserts the "new item".
+ * The space accounting is kept simple, since it does not need to
+ * consider posting list splits at all (this is particularly important
+ * for the case where we also have to split the page). Overwriting
+ * the posting list with its post-split version is treated as an extra
+ * step in either the insert or page split critical section.
+ */
+ Assert(P_ISLEAF(lpageop));
+ Assert(!ItemIdIsDead(itemid));
+ Assert(postingoff > 0);
+ oposting = (IndexTuple) PageGetItem(page, itemid);
+
+ /* save a copy of itup with unchanged TID for xlog record */
+ origitup = CopyIndexTuple(itup);
+ nposting = _bt_swap_posting(itup, oposting, postingoff);
+
+ Assert(BTreeTupleGetNPosting(nposting) ==
+ BTreeTupleGetNPosting(oposting));
+ /* Alter offset so that it goes after existing posting list */
+ newitemoff = OffsetNumberNext(newitemoff);
+ }
+
/*
* Do we need to split the page to fit the item on it?
*
@@ -996,7 +1137,8 @@ _bt_insertonpg(Relation rel,
BlockNumberIsValid(RelationGetTargetBlock(rel))));
/* split the buffer into left and right halves */
- rbuf = _bt_split(rel, itup_key, buf, cbuf, newitemoff, itemsz, itup);
+ rbuf = _bt_split(rel, itup_key, buf, cbuf, newitemoff, itemsz, itup,
+ origitup, nposting, postingoff);
PredicateLockPageSplit(rel,
BufferGetBlockNumber(buf),
BufferGetBlockNumber(rbuf));
@@ -1075,6 +1217,15 @@ _bt_insertonpg(Relation rel,
elog(PANIC, "failed to add new item to block %u in index \"%s\"",
itup_blkno, RelationGetRelationName(rel));
+ if (nposting)
+ {
+ /*
+ * Posting list split requires an in-place update of the existing
+ * posting list
+ */
+ memcpy(oposting, nposting, MAXALIGN(IndexTupleSize(nposting)));
+ }
+
MarkBufferDirty(buf);
if (BufferIsValid(metabuf))
@@ -1116,6 +1267,7 @@ _bt_insertonpg(Relation rel,
XLogRecPtr recptr;
xlrec.offnum = itup_off;
+ xlrec.postingoff = postingoff;
XLogBeginInsert();
XLogRegisterData((char *) &xlrec, SizeOfBtreeInsert);
@@ -1144,6 +1296,7 @@ _bt_insertonpg(Relation rel,
xlmeta.oldest_btpo_xact = metad->btm_oldest_btpo_xact;
xlmeta.last_cleanup_num_heap_tuples =
metad->btm_last_cleanup_num_heap_tuples;
+ xlmeta.btm_safededup = metad->btm_safededup;
XLogRegisterBuffer(2, metabuf, REGBUF_WILL_INIT | REGBUF_STANDARD);
XLogRegisterBufData(2, (char *) &xlmeta, sizeof(xl_btree_metadata));
@@ -1152,7 +1305,19 @@ _bt_insertonpg(Relation rel,
}
XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
- XLogRegisterBufData(0, (char *) itup, IndexTupleSize(itup));
+
+ /*
+ * We always write newitem to the page, but when there is an
+ * original newitem due to a posting list split then we log the
+ * original item instead. REDO routine must reconstruct the final
+ * newitem at the same time it reconstructs nposting.
+ */
+ if (postingoff == 0)
+ XLogRegisterBufData(0, (char *) itup,
+ IndexTupleSize(itup));
+ else
+ XLogRegisterBufData(0, (char *) origitup,
+ IndexTupleSize(origitup));
recptr = XLogInsert(RM_BTREE_ID, xlinfo);
@@ -1194,6 +1359,13 @@ _bt_insertonpg(Relation rel,
_bt_getrootheight(rel) >= BTREE_FASTPATH_MIN_LEVEL)
RelationSetTargetBlock(rel, cachedBlock);
}
+
+ /* be tidy */
+ if (postingoff != 0)
+ {
+ pfree(nposting);
+ pfree(origitup);
+ }
}
/*
@@ -1209,12 +1381,25 @@ _bt_insertonpg(Relation rel,
* This function will clear the INCOMPLETE_SPLIT flag on it, and
* release the buffer.
*
+ * orignewitem, nposting, and postingoff are needed when an insert of
+ * orignewitem results in both a posting list split and a page split.
+ * newitem and nposting are replacements for orignewitem and the
+ * existing posting list on the page respectively. These extra
+ * posting list split details are used here in the same way as they
+ * are used in the more common case where a posting list split does
+ * not coincide with a page split. We need to deal with posting list
+ * splits directly in order to ensure that everything that follows
+ * from the insert of orignewitem is handled as a single atomic
+ * operation (though caller's insert of a new pivot/downlink into
+ * parent page will still be a separate operation).
+ *
* Returns the new right sibling of buf, pinned and write-locked.
* The pin and lock on buf are maintained.
*/
static Buffer
_bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
- OffsetNumber newitemoff, Size newitemsz, IndexTuple newitem)
+ OffsetNumber newitemoff, Size newitemsz, IndexTuple newitem,
+ IndexTuple orignewitem, IndexTuple nposting, OffsetNumber postingoff)
{
Buffer rbuf;
Page origpage;
@@ -1236,12 +1421,23 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
OffsetNumber firstright;
OffsetNumber maxoff;
OffsetNumber i;
+ OffsetNumber replacepostingoff = InvalidOffsetNumber;
bool newitemonleft,
isleaf;
IndexTuple lefthikey;
int indnatts = IndexRelationGetNumberOfAttributes(rel);
int indnkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
+ /*
+ * Determine offset number of existing posting list on page when a split
+ * of a posting list needs to take place as the page is split
+ */
+ if (nposting != NULL)
+ {
+ Assert(itup_key->heapkeyspace);
+ replacepostingoff = OffsetNumberPrev(newitemoff);
+ }
+
/*
* origpage is the original page to be split. leftpage is a temporary
* buffer that receives the left-sibling data, which will be copied back
@@ -1273,6 +1469,13 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
* newitemoff == firstright. In all other cases it's clear which side of
* the split every tuple goes on from context. newitemonleft is usually
* (but not always) redundant information.
+ *
+ * Note: In theory, the split point choice logic should operate against a
+ * version of the page that already replaced the posting list at offset
+ * replacepostingoff with nposting where applicable. We don't bother with
+ * that, though. Both versions of the posting list must be the same size,
+ * and both will have the same base tuple key values, so split point
+ * choice is never affected.
*/
firstright = _bt_findsplitloc(rel, origpage, newitemoff, newitemsz,
newitem, &newitemonleft);
@@ -1340,6 +1543,9 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
itemid = PageGetItemId(origpage, firstright);
itemsz = ItemIdGetLength(itemid);
item = (IndexTuple) PageGetItem(origpage, itemid);
+ /* Behave as if origpage posting list has already been swapped */
+ if (firstright == replacepostingoff)
+ item = nposting;
}
/*
@@ -1373,6 +1579,9 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
Assert(lastleftoff >= P_FIRSTDATAKEY(oopaque));
itemid = PageGetItemId(origpage, lastleftoff);
lastleft = (IndexTuple) PageGetItem(origpage, itemid);
+ /* Behave as if origpage posting list has already been swapped */
+ if (lastleftoff == replacepostingoff)
+ lastleft = nposting;
}
Assert(lastleft != item);
@@ -1480,8 +1689,23 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
itemsz = ItemIdGetLength(itemid);
item = (IndexTuple) PageGetItem(origpage, itemid);
+ /*
+ * did caller pass new replacement posting list tuple due to posting
+ * list split?
+ */
+ if (i == replacepostingoff)
+ {
+ /*
+ * swap origpage posting list with post-posting-list-split version
+ * from caller
+ */
+ Assert(isleaf);
+ Assert(itemsz == MAXALIGN(IndexTupleSize(nposting)));
+ item = nposting;
+ }
+
/* does new item belong before this one? */
- if (i == newitemoff)
+ else if (i == newitemoff)
{
if (newitemonleft)
{
@@ -1650,8 +1874,12 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
XLogRecPtr recptr;
xlrec.level = ropaque->btpo.level;
+ /* See comments below on newitem, orignewitem, and posting lists */
xlrec.firstright = firstright;
xlrec.newitemoff = newitemoff;
+ xlrec.postingoff = InvalidOffsetNumber;
+ if (replacepostingoff < firstright)
+ xlrec.postingoff = postingoff;
XLogBeginInsert();
XLogRegisterData((char *) &xlrec, SizeOfBtreeSplit);
@@ -1670,11 +1898,45 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
* because it's included with all the other items on the right page.)
* Show the new item as belonging to the left page buffer, so that it
* is not stored if XLogInsert decides it needs a full-page image of
- * the left page. We store the offset anyway, though, to support
- * archive compression of these records.
+ * the left page. We always store newitemoff in the record, though.
+ *
+ * The details are sometimes slightly different for page splits that
+ * coincide with a posting list split. If both the replacement
+ * posting list and newitem go on the right page, then we don't need
+ * to log anything extra, just like the simple !newitemonleft
+ * no-posting-split case (postingoff isn't set in the WAL record, so
+ * recovery doesn't need to process a posting list split at all).
+ * Otherwise, we set postingoff and log orignewitem instead of
+ * newitem, despite having actually inserted newitem. Recovery must
+ * reconstruct nposting and newitem by calling _bt_swap_posting().
+ *
+ * Note: It's possible that our page split point is the point that
+ * makes the posting list lastleft and newitem firstright. This is
+ * the only case where we log orignewitem despite newitem going on the
+ * right page. If XLogInsert decides that it can omit orignewitem due
+ * to logging a full-page image of the left page, everything still
+ * works out, since recovery only needs to log orignewitem for items
+ * on the left page (just like the regular newitem-logged case).
*/
- if (newitemonleft)
- XLogRegisterBufData(0, (char *) newitem, MAXALIGN(newitemsz));
+ if (newitemonleft || xlrec.postingoff != InvalidOffsetNumber)
+ {
+ if (xlrec.postingoff == InvalidOffsetNumber)
+ {
+ /* Must WAL-log newitem, since it's on left page */
+ Assert(newitemonleft);
+ Assert(orignewitem == NULL && nposting == NULL);
+ XLogRegisterBufData(0, (char *) newitem, MAXALIGN(newitemsz));
+ }
+ else
+ {
+ /* Must WAL-log orignewitem following posting list split */
+ Assert(newitemonleft || firstright == newitemoff);
+ Assert(ItemPointerCompare(&orignewitem->t_tid,
+ &newitem->t_tid) < 0);
+ XLogRegisterBufData(0, (char *) orignewitem,
+ MAXALIGN(IndexTupleSize(orignewitem)));
+ }
+ }
/* Log the left page's new high key */
itemid = PageGetItemId(origpage, P_HIKEY);
@@ -1834,7 +2096,7 @@ _bt_insert_parent(Relation rel,
/* Recursively insert into the parent */
_bt_insertonpg(rel, NULL, pbuf, buf, stack->bts_parent,
- new_item, stack->bts_offset + 1,
+ new_item, stack->bts_offset + 1, 0,
is_only);
/* be tidy */
@@ -2190,6 +2452,7 @@ _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf)
md.fastlevel = metad->btm_level;
md.oldest_btpo_xact = metad->btm_oldest_btpo_xact;
md.last_cleanup_num_heap_tuples = metad->btm_last_cleanup_num_heap_tuples;
+ md.btm_safededup = metad->btm_safededup;
XLogRegisterBufData(2, (char *) &md, sizeof(xl_btree_metadata));
@@ -2304,6 +2567,6 @@ _bt_vacuum_one_page(Relation rel, Buffer buffer, Relation heapRel)
* Note: if we didn't find any LP_DEAD items, then the page's
* BTP_HAS_GARBAGE hint bit is falsely set. We do not bother expending a
* separate write to clear it, however. We will clear it when we split
- * the page.
+ * the page (or when deduplication runs).
*/
}
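
The _bt_check_unique() changes above make each loop iteration process one heap TID rather than one index tuple, and only mark a posting list tuple LP_DEAD when every one of its TIDs turned out to be dead. The following is a rough sketch of that control flow under simplifying assumptions: Item, MAX_TIDS and scan_items() are hypothetical names, and the tid_dead[] flags stand in for the heap visibility check that the real code performs.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_TIDS 8

/* Hypothetical simplified leaf item; ntids == 1 models a plain tuple. */
typedef struct Item
{
    int  ntids;
    bool tid_dead[MAX_TIDS];    /* stand-in for the heap visibility check */
    bool lp_dead;               /* set only when every TID is dead */
} Item;

/*
 * Visit every heap TID on the "page", one TID per loop iteration.  The item
 * offset only advances after the last TID of the current item has been
 * processed.  Returns the number of TIDs visited.
 */
int
scan_items(Item *items, int nitems)
{
    int  offset = 0;
    int  curposti = 0;
    bool prev_all_dead = true;
    int  visited = 0;

    while (offset < nitems)
    {
        Item *item = &items[offset];
        bool  all_dead = item->tid_dead[curposti];

        /* Starting a new item: forget what we learned about the last one */
        if (curposti == 0)
            prev_all_dead = true;
        visited++;

        /* Kill the item only on its last TID, and only if all were dead */
        if (all_dead && prev_all_dead && curposti == item->ntids - 1)
            item->lp_dead = true;

        /* Remember that this item has at least one TID that is not dead */
        if (!all_dead)
            prev_all_dead = false;

        if (curposti < item->ntids - 1)
            curposti++;         /* next TID in the same item */
        else
        {
            curposti = 0;       /* next item */
            offset++;
        }
    }

    return visited;
}
```

A plain tuple is just the ntids == 1 case here, so the same test covers both kinds of tuple.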
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 268f869a36..ca25e856e7 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -24,6 +24,7 @@
#include "access/nbtree.h"
#include "access/nbtxlog.h"
+#include "access/tableam.h"
#include "access/transam.h"
#include "access/xlog.h"
#include "access/xloginsert.h"
@@ -42,12 +43,18 @@ static bool _bt_lock_branch_parent(Relation rel, BlockNumber child,
BlockNumber *target, BlockNumber *rightsib);
static void _bt_log_reuse_page(Relation rel, BlockNumber blkno,
TransactionId latestRemovedXid);
+static TransactionId _bt_compute_xid_horizon_for_tuples(Relation rel,
+ Relation heapRel,
+ Buffer buf,
+ OffsetNumber *itemnos,
+ int nitems);
/*
* _bt_initmetapage() -- Fill a page buffer with a correct metapage image
*/
void
-_bt_initmetapage(Page page, BlockNumber rootbknum, uint32 level)
+_bt_initmetapage(Page page, BlockNumber rootbknum, uint32 level,
+ bool safededup)
{
BTMetaPageData *metad;
BTPageOpaque metaopaque;
@@ -63,6 +70,7 @@ _bt_initmetapage(Page page, BlockNumber rootbknum, uint32 level)
metad->btm_fastlevel = level;
metad->btm_oldest_btpo_xact = InvalidTransactionId;
metad->btm_last_cleanup_num_heap_tuples = -1.0;
+ metad->btm_safededup = safededup;
metaopaque = (BTPageOpaque) PageGetSpecialPointer(page);
metaopaque->btpo_flags = BTP_META;
@@ -102,6 +110,7 @@ _bt_upgrademetapage(Page page)
metad->btm_version = BTREE_NOVAC_VERSION;
metad->btm_oldest_btpo_xact = InvalidTransactionId;
metad->btm_last_cleanup_num_heap_tuples = -1.0;
+ metad->btm_safededup = false;
/* Adjust pd_lower (see _bt_initmetapage() for details) */
((PageHeader) page)->pd_lower =
@@ -213,6 +222,7 @@ _bt_update_meta_cleanup_info(Relation rel, TransactionId oldestBtpoXact,
md.fastlevel = metad->btm_fastlevel;
md.oldest_btpo_xact = oldestBtpoXact;
md.last_cleanup_num_heap_tuples = numHeapTuples;
+ md.btm_safededup = metad->btm_safededup;
XLogRegisterBufData(0, (char *) &md, sizeof(xl_btree_metadata));
@@ -394,6 +404,7 @@ _bt_getroot(Relation rel, int access)
md.fastlevel = 0;
md.oldest_btpo_xact = InvalidTransactionId;
md.last_cleanup_num_heap_tuples = -1.0;
+ md.btm_safededup = metad->btm_safededup;
XLogRegisterBufData(2, (char *) &md, sizeof(xl_btree_metadata));
@@ -683,6 +694,59 @@ _bt_heapkeyspace(Relation rel)
return metad->btm_version > BTREE_NOVAC_VERSION;
}
+/*
+ * _bt_safededup() -- can deduplication safely be used by index?
+ *
+ * Uses field from index relation's metapage/cached metapage.
+ */
+bool
+_bt_safededup(Relation rel)
+{
+ BTMetaPageData *metad;
+
+ if (rel->rd_amcache == NULL)
+ {
+ Buffer metabuf;
+
+ metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_READ);
+ metad = _bt_getmeta(rel, metabuf);
+
+ /*
+ * If there's no root page yet, _bt_getroot() doesn't expect a cache
+ * to be made, so just stop here. (XXX perhaps _bt_getroot() should
+ * be changed to allow this case.)
+ *
+ * FIXME: Think some more about pg_upgrade'd !heapkeyspace indexes
+ * here, and the need for a version bump to go with new metapage
+ * field. I think that we may need to bump the major version because
+ * even v4 indexes (those built on Postgres 12) will have garbage in
+ * the new safededup field. Creating a v5 would mean "new field can be
+ * trusted to not be garbage".
+ */
+ if (metad->btm_root == P_NONE)
+ {
+ _bt_relbuf(rel, metabuf);
+ return metad->btm_safededup;
+ }
+
+ /* Cache the metapage data for next time */
+ rel->rd_amcache = MemoryContextAlloc(rel->rd_indexcxt,
+ sizeof(BTMetaPageData));
+ memcpy(rel->rd_amcache, metad, sizeof(BTMetaPageData));
+ _bt_relbuf(rel, metabuf);
+ }
+
+ /* Get cached page */
+ metad = (BTMetaPageData *) rel->rd_amcache;
+ /* We shouldn't have cached it if any of these fail */
+ Assert(metad->btm_magic == BTREE_MAGIC);
+ Assert(metad->btm_version >= BTREE_MIN_VERSION);
+ Assert(metad->btm_version <= BTREE_VERSION);
+ Assert(metad->btm_fastroot != P_NONE);
+
+ return metad->btm_safededup;
+}
+
/*
* _bt_checkpage() -- Verify that a freshly-read page looks sane.
*/
@@ -983,14 +1047,52 @@ _bt_page_recyclable(Page page)
void
_bt_delitems_vacuum(Relation rel, Buffer buf,
OffsetNumber *itemnos, int nitems,
+ OffsetNumber *updateitemnos,
+ IndexTuple *updated, int nupdatable,
BlockNumber lastBlockVacuumed)
{
Page page = BufferGetPage(buf);
BTPageOpaque opaque;
+ Size itemsz;
+ Size updated_sz = 0;
+ char *updated_buf = NULL;
+
+ /* XLOG stuff, buffer for updated tuples */
+ if (nupdatable > 0 && RelationNeedsWAL(rel))
+ {
+ Size offset = 0;
+
+ for (int i = 0; i < nupdatable; i++)
+ updated_sz += MAXALIGN(IndexTupleSize(updated[i]));
+
+ updated_buf = palloc(updated_sz);
+ for (int i = 0; i < nupdatable; i++)
+ {
+ itemsz = IndexTupleSize(updated[i]);
+ memcpy(updated_buf + offset, (char *) updated[i], itemsz);
+ offset += MAXALIGN(itemsz);
+ }
+ Assert(offset == updated_sz);
+ }
/* No ereport(ERROR) until changes are logged */
START_CRIT_SECTION();
+ /* Handle posting tuples here */
+ for (int i = 0; i < nupdatable; i++)
+ {
+ /* First, delete the old tuple. */
+ PageIndexTupleDelete(page, updateitemnos[i]);
+
+ itemsz = IndexTupleSize(updated[i]);
+ itemsz = MAXALIGN(itemsz);
+
+ /* Add tuple with updated ItemPointers to the page. */
+ if (PageAddItem(page, (Item) updated[i], itemsz, updateitemnos[i],
+ false, false) == InvalidOffsetNumber)
+ elog(ERROR, "failed to rewrite posting list item in index while doing vacuum");
+ }
+
/* Fix the page */
if (nitems > 0)
PageIndexMultiDelete(page, itemnos, nitems);
@@ -1020,6 +1122,8 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
xl_btree_vacuum xlrec_vacuum;
xlrec_vacuum.lastBlockVacuumed = lastBlockVacuumed;
+ xlrec_vacuum.nupdated = nupdatable;
+ xlrec_vacuum.ndeleted = nitems;
XLogBeginInsert();
XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
@@ -1033,6 +1137,19 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
if (nitems > 0)
XLogRegisterBufData(0, (char *) itemnos, nitems * sizeof(OffsetNumber));
+ /*
+ * Save the offset numbers and the updated tuples themselves. It's
+ * important to restore them in the correct order: REDO must handle
+ * updated tuples first, and only then the other deleted items.
+ */
+ if (nupdatable > 0)
+ {
+ Assert(updated_buf != NULL);
+ XLogRegisterBufData(0, (char *) updateitemnos,
+ nupdatable * sizeof(OffsetNumber));
+ XLogRegisterBufData(0, updated_buf, updated_sz);
+ }
+
recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_VACUUM);
PageSetLSN(page, recptr);
@@ -1041,6 +1158,91 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
END_CRIT_SECTION();
}
+/*
+ * Get the latestRemovedXid from the table entries pointed at by the index
+ * tuples being deleted.
+ *
+ * This is a version of index_compute_xid_horizon_for_tuples() specialized to
+ * nbtree, which can handle posting lists.
+ */
+static TransactionId
+_bt_compute_xid_horizon_for_tuples(Relation rel, Relation heapRel,
+ Buffer buf, OffsetNumber *itemnos,
+ int nitems)
+{
+ ItemPointer htids;
+ TransactionId latestRemovedXid = InvalidTransactionId;
+ Page page = BufferGetPage(buf);
+ int arraynitems;
+ int finalnitems;
+
+ /*
+ * Initial size of array can fit everything when it turns out that there
+ * are no posting lists
+ */
+ arraynitems = nitems;
+ htids = (ItemPointer) palloc(sizeof(ItemPointerData) * arraynitems);
+
+ finalnitems = 0;
+ /* identify what the index tuples about to be deleted point to */
+ for (int i = 0; i < nitems; i++)
+ {
+ ItemId itemid;
+ IndexTuple itup;
+
+ itemid = PageGetItemId(page, itemnos[i]);
+ itup = (IndexTuple) PageGetItem(page, itemid);
+
+ Assert(ItemIdIsDead(itemid));
+
+ if (!BTreeTupleIsPosting(itup))
+ {
+ /* Make sure that we have space for additional heap TID */
+ if (finalnitems + 1 > arraynitems)
+ {
+ arraynitems = arraynitems * 2;
+ htids = (ItemPointer)
+ repalloc(htids, sizeof(ItemPointerData) * arraynitems);
+ }
+
+ Assert(ItemPointerIsValid(&itup->t_tid));
+ ItemPointerCopy(&itup->t_tid, &htids[finalnitems]);
+ finalnitems++;
+ }
+ else
+ {
+ int nposting = BTreeTupleGetNPosting(itup);
+
+ /* Make sure that we have space for additional heap TIDs */
+ if (finalnitems + nposting > arraynitems)
+ {
+ arraynitems = Max(arraynitems * 2, finalnitems + nposting);
+ htids = (ItemPointer)
+ repalloc(htids, sizeof(ItemPointerData) * arraynitems);
+ }
+
+ for (int j = 0; j < nposting; j++)
+ {
+ ItemPointer htid = BTreeTupleGetPostingN(itup, j);
+
+ Assert(ItemPointerIsValid(htid));
+ ItemPointerCopy(htid, &htids[finalnitems]);
+ finalnitems++;
+ }
+ }
+ }
+
+ Assert(finalnitems >= nitems);
+
+ /* determine the actual xid horizon */
+ latestRemovedXid =
+ table_compute_xid_horizon_for_tuples(heapRel, htids, finalnitems);
+
+ pfree(htids);
+
+ return latestRemovedXid;
+}
+
/*
* Delete item(s) from a btree page during single-page cleanup.
*
@@ -1067,8 +1269,8 @@ _bt_delitems_delete(Relation rel, Buffer buf,
if (XLogStandbyInfoActive() && RelationNeedsWAL(rel))
latestRemovedXid =
- index_compute_xid_horizon_for_tuples(rel, heapRel, buf,
- itemnos, nitems);
+ _bt_compute_xid_horizon_for_tuples(rel, heapRel, buf,
+ itemnos, nitems);
/* No ereport(ERROR) until changes are logged */
START_CRIT_SECTION();
@@ -2066,6 +2268,7 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, bool *rightsib_empty)
xlmeta.fastlevel = metad->btm_fastlevel;
xlmeta.oldest_btpo_xact = metad->btm_oldest_btpo_xact;
xlmeta.last_cleanup_num_heap_tuples = metad->btm_last_cleanup_num_heap_tuples;
+ xlmeta.btm_safededup = metad->btm_safededup;
XLogRegisterBufData(4, (char *) &xlmeta, sizeof(xl_btree_metadata));
xlinfo = XLOG_BTREE_UNLINK_PAGE_META;
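
The array handling in _bt_compute_xid_horizon_for_tuples() above starts with room for one heap TID per index tuple and grows only when posting lists turn up. Below is a small sketch of that growth policy, assuming plain malloc()/realloc() in place of palloc()/repalloc(); Tid, Item and flatten_tids() are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for ItemPointerData. */
typedef struct Tid
{
    unsigned block;
    unsigned offset;
} Tid;

/* Hypothetical index item: ntids == 1 models a plain (non-posting) tuple. */
typedef struct Item
{
    int        ntids;
    const Tid *tids;
} Item;

/*
 * Collect the heap TIDs of all items into one malloc'd array.  The array is
 * first sized for one TID per item, then grown to the larger of "double the
 * capacity" and "exactly what is needed" whenever an item does not fit.
 * Returns NULL on allocation failure; the TID count is stored in *nout.
 */
Tid *
flatten_tids(const Item *items, int nitems, int *nout)
{
    int  cap = (nitems > 0) ? nitems : 1;
    int  n = 0;
    Tid *out = malloc(sizeof(Tid) * (size_t) cap);

    if (out == NULL)
        return NULL;

    for (int i = 0; i < nitems; i++)
    {
        int need = n + items[i].ntids;

        /* Make sure that we have space for this item's TIDs */
        if (need > cap)
        {
            int  newcap = (cap * 2 > need) ? cap * 2 : need;
            Tid *tmp = realloc(out, sizeof(Tid) * (size_t) newcap);

            if (tmp == NULL)
            {
                free(out);
                return NULL;
            }
            out = tmp;
            cap = newcap;
        }

        for (int j = 0; j < items[i].ntids; j++)
            out[n++] = items[i].tids[j];
    }

    *nout = n;
    return out;
}
```

When no item is a posting list, the initial allocation is already large enough and no reallocation happens, which matches the comment on the initial array size in the patch.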
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 4cfd5289ad..2cdc3d499f 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -97,6 +97,8 @@ static void btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
BTCycleId cycleid, TransactionId *oldestBtpoXact);
static void btvacuumpage(BTVacState *vstate, BlockNumber blkno,
BlockNumber orig_blkno);
+static ItemPointer btreevacuumposting(BTVacState *vstate, IndexTuple itup,
+ int *nremaining);
/*
@@ -160,7 +162,7 @@ btbuildempty(Relation index)
/* Construct metapage. */
metapage = (Page) palloc(BLCKSZ);
- _bt_initmetapage(metapage, P_NONE, 0);
+ _bt_initmetapage(metapage, P_NONE, 0, _bt_opclasses_support_dedup(index));
/*
* Write the page and log it. It might seem that an immediate sync would
@@ -263,8 +265,8 @@ btgettuple(IndexScanDesc scan, ScanDirection dir)
*/
if (so->killedItems == NULL)
so->killedItems = (int *)
- palloc(MaxIndexTuplesPerPage * sizeof(int));
- if (so->numKilled < MaxIndexTuplesPerPage)
+ palloc(MaxBTreeIndexTuplesPerPage * sizeof(int));
+ if (so->numKilled < MaxBTreeIndexTuplesPerPage)
so->killedItems[so->numKilled++] = so->currPos.itemIndex;
}
@@ -816,7 +818,7 @@ _bt_vacuum_needs_cleanup(IndexVacuumInfo *info)
}
else
{
- StdRdOptions *relopts;
+ BtreeOptions *relopts;
float8 cleanup_scale_factor;
float8 prev_num_heap_tuples;
@@ -827,7 +829,7 @@ _bt_vacuum_needs_cleanup(IndexVacuumInfo *info)
* tuples exceeds vacuum_cleanup_index_scale_factor fraction of
* original tuples count.
*/
- relopts = (StdRdOptions *) info->index->rd_options;
+ relopts = (BtreeOptions *) info->index->rd_options;
cleanup_scale_factor = (relopts &&
relopts->vacuum_cleanup_index_scale_factor >= 0)
? relopts->vacuum_cleanup_index_scale_factor
@@ -1069,7 +1071,8 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
RBM_NORMAL, info->strategy);
LockBufferForCleanup(buf);
_bt_checkpage(rel, buf);
- _bt_delitems_vacuum(rel, buf, NULL, 0, vstate.lastBlockVacuumed);
+ _bt_delitems_vacuum(rel, buf, NULL, 0, NULL, NULL, 0,
+ vstate.lastBlockVacuumed);
_bt_relbuf(rel, buf);
}
@@ -1188,8 +1191,17 @@ restart:
}
else if (P_ISLEAF(opaque))
{
+ /* Deletable item state */
OffsetNumber deletable[MaxOffsetNumber];
int ndeletable;
+ int nhtidsdead;
+ int nhtidslive;
+
+ /* Updatable item state (for posting lists) */
+ IndexTuple updated[MaxOffsetNumber];
+ OffsetNumber updatable[MaxOffsetNumber];
+ int nupdatable;
+
OffsetNumber offnum,
minoff,
maxoff;
@@ -1229,6 +1241,10 @@ restart:
* callback function.
*/
ndeletable = 0;
+ nupdatable = 0;
+ /* Maintain stats counters for index tuple versions/heap TIDs */
+ nhtidsdead = 0;
+ nhtidslive = 0;
minoff = P_FIRSTDATAKEY(opaque);
maxoff = PageGetMaxOffsetNumber(page);
if (callback)
@@ -1238,11 +1254,9 @@ restart:
offnum = OffsetNumberNext(offnum))
{
IndexTuple itup;
- ItemPointer htup;
itup = (IndexTuple) PageGetItem(page,
PageGetItemId(page, offnum));
- htup = &(itup->t_tid);
/*
* During Hot Standby we currently assume that
@@ -1265,8 +1279,71 @@ restart:
* applies to *any* type of index that marks index tuples as
* killed.
*/
- if (callback(htup, callback_state))
- deletable[ndeletable++] = offnum;
+ if (!BTreeTupleIsPosting(itup))
+ {
+ /* Regular tuple, standard heap TID representation */
+ ItemPointer htid = &(itup->t_tid);
+
+ if (callback(htid, callback_state))
+ {
+ deletable[ndeletable++] = offnum;
+ nhtidsdead++;
+ }
+ else
+ nhtidslive++;
+ }
+ else
+ {
+ ItemPointer newhtids;
+ int nremaining;
+
+ /*
+ * Posting list tuple, a physical tuple that represents
+ * two or more logical tuples, any of which could be an
+ * index row version that must be removed
+ */
+ newhtids = btreevacuumposting(vstate, itup, &nremaining);
+ if (newhtids == NULL)
+ {
+ /*
+ * All TIDs/logical tuples from the posting tuple
+ * remain, so no update or delete required
+ */
+ Assert(nremaining == BTreeTupleGetNPosting(itup));
+ }
+ else if (nremaining > 0)
+ {
+ IndexTuple updatedtuple;
+
+ /*
+ * Form new tuple that contains only remaining TIDs.
+ * Remember this tuple and the offset of the old tuple
+ * for when we update it in place
+ */
+ Assert(nremaining < BTreeTupleGetNPosting(itup));
+ updatedtuple = _bt_form_posting(itup, newhtids,
+ nremaining);
+ updated[nupdatable] = updatedtuple;
+ updatable[nupdatable++] = offnum;
+ nhtidsdead += BTreeTupleGetNPosting(itup) - nremaining;
+ pfree(newhtids);
+ }
+ else
+ {
+ /*
+ * All TIDs/logical tuples from the posting list must
+ * be deleted. We'll delete the physical tuple
+ * completely.
+ */
+ deletable[ndeletable++] = offnum;
+ nhtidsdead += BTreeTupleGetNPosting(itup);
+
+ /* Free empty array of live items */
+ pfree(newhtids);
+ }
+
+ nhtidslive += nremaining;
+ }
}
}
@@ -1274,7 +1351,7 @@ restart:
* Apply any needed deletes. We issue just one _bt_delitems_vacuum()
* call per page, so as to minimize WAL traffic.
*/
- if (ndeletable > 0)
+ if (ndeletable > 0 || nupdatable > 0)
{
/*
* Notice that the issued XLOG_BTREE_VACUUM WAL record includes
@@ -1290,7 +1367,8 @@ restart:
* doesn't seem worth the amount of bookkeeping it'd take to avoid
* that.
*/
- _bt_delitems_vacuum(rel, buf, deletable, ndeletable,
+ _bt_delitems_vacuum(rel, buf, deletable, ndeletable, updatable,
+ updated, nupdatable,
vstate->lastBlockVacuumed);
/*
@@ -1300,7 +1378,7 @@ restart:
if (blkno > vstate->lastBlockVacuumed)
vstate->lastBlockVacuumed = blkno;
- stats->tuples_removed += ndeletable;
+ stats->tuples_removed += nhtidsdead;
/* must recompute maxoff */
maxoff = PageGetMaxOffsetNumber(page);
}
@@ -1315,6 +1393,7 @@ restart:
* We treat this like a hint-bit update because there's no need to
* WAL-log it.
*/
+ Assert(nhtidsdead == 0);
if (vstate->cycleid != 0 &&
opaque->btpo_cycleid == vstate->cycleid)
{
@@ -1324,15 +1403,16 @@ restart:
}
/*
- * If it's now empty, try to delete; else count the live tuples. We
- * don't delete when recursing, though, to avoid putting entries into
+ * If it's now empty, try to delete; else count the live tuples (live
+ * heap TIDs in posting lists are counted as live tuples). We don't
+ * delete when recursing, though, to avoid putting entries into
* freePages out-of-order (doesn't seem worth any extra code to handle
* the case).
*/
if (minoff > maxoff)
delete_now = (blkno == orig_blkno);
else
- stats->num_index_tuples += maxoff - minoff + 1;
+ stats->num_index_tuples += nhtidslive;
}
if (delete_now)
@@ -1375,6 +1455,68 @@ restart:
}
}
+/*
+ * btreevacuumposting() -- determines which logical tuples must remain when
+ * VACUUMing a posting list tuple.
+ *
+ * Returns new palloc'd array of item pointers needed to build replacement
+ * posting list without the index row versions that are to be deleted.
+ *
+ * Note that returned array is NULL in the common case where there is nothing
+ * to delete in caller's posting list tuple. The number of TIDs that should
+ * remain in the posting list tuple is set for caller in *nremaining. This is
+ * also the size of the returned array (though only when array isn't just
+ * NULL).
+ */
+static ItemPointer
+btreevacuumposting(BTVacState *vstate, IndexTuple itup, int *nremaining)
+{
+ int live = 0;
+ int nitem = BTreeTupleGetNPosting(itup);
+ ItemPointer tmpitems = NULL,
+ items = BTreeTupleGetPosting(itup);
+
+ Assert(BTreeTupleIsPosting(itup));
+
+ /*
+ * Check each tuple in the posting list. Save live tuples into tmpitems,
+ * though try to avoid memory allocation as an optimization.
+ */
+ for (int i = 0; i < nitem; i++)
+ {
+ if (!vstate->callback(items + i, vstate->callback_state))
+ {
+ /*
+ * Live heap TID.
+ *
+ * Only save live TID when we know that we're going to have to
+ * kill at least one TID, and have already allocated memory.
+ */
+ if (tmpitems)
+ tmpitems[live] = items[i];
+ live++;
+ }
+
+ /* Dead heap TID */
+ else if (tmpitems == NULL)
+ {
+ /*
+ * Turns out we need to delete one or more dead heap TIDs, so
+ * start maintaining an array of live TIDs for caller to
+ * reconstruct smaller replacement posting list tuple
+ */
+ tmpitems = palloc(sizeof(ItemPointerData) * nitem);
+
+ /* Copy live heap TIDs from previous loop iterations */
+ if (live > 0)
+ memcpy(tmpitems, items, sizeof(ItemPointerData) * live);
+ }
+ }
+
+ *nremaining = live;
+ return tmpitems;
+}
+
/*
* btcanreturn() -- Check whether btree indexes support index-only scans.
*
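The lazy allocation in btreevacuumposting() above is easy to get wrong, so here is a minimal standalone sketch of the same idea. It assumes malloc() in place of palloc() and a simplified Tid struct in place of ItemPointerData. The names filter_live_tids() and tid_in_block() are hypothetical and exist only for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for ItemPointerData */
typedef struct Tid
{
    uint32_t    block;
    uint16_t    offset;
} Tid;

/* Callback shape loosely modelled on the bulk-delete callback */
typedef bool (*TidIsDead) (const Tid *tid, void *state);

/* Example callback: a TID is dead if it lives in block *state */
bool
tid_in_block(const Tid *tid, void *state)
{
    return tid->block == *(uint32_t *) state;
}

/*
 * Returns NULL when every TID is live (no allocation at all).  Otherwise
 * returns a malloc'd array holding the live TIDs.  *nremaining is always
 * set to the number of live TIDs.
 */
Tid *
filter_live_tids(const Tid *items, int nitem, TidIsDead is_dead,
                 void *state, int *nremaining)
{
    Tid        *tmpitems = NULL;
    int         live = 0;

    assert(nitem > 0);

    for (int i = 0; i < nitem; i++)
    {
        if (!is_dead(&items[i], state))
        {
            /* Only copy once we know that something will be removed */
            if (tmpitems != NULL)
                tmpitems[live] = items[i];
            live++;
        }
        else if (tmpitems == NULL)
        {
            /* First dead TID: allocate, then copy the live prefix */
            tmpitems = malloc(sizeof(Tid) * nitem);
            if (tmpitems == NULL)
                abort();
            if (live > 0)
                memcpy(tmpitems, items, sizeof(Tid) * live);
        }
    }

    *nremaining = live;
    return tmpitems;
}
```

The memcpy() of the live prefix is safe because every item seen before the first dead TID was live, so the first "live" entries of the input are exactly the ones that must be kept. As in the patch, a caller has to distinguish three outcomes: NULL (nothing to do), a non-NULL array with *nremaining > 0 (rewrite the tuple), and a non-NULL array with *nremaining == 0 (delete the tuple).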
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 8e512461a0..23621cdd37 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -26,10 +26,18 @@
static void _bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp);
static OffsetNumber _bt_binsrch(Relation rel, BTScanInsert key, Buffer buf);
+static int _bt_binsrch_posting(BTScanInsert key, Page page,
+ OffsetNumber offnum);
static bool _bt_readpage(IndexScanDesc scan, ScanDirection dir,
OffsetNumber offnum);
static void _bt_saveitem(BTScanOpaque so, int itemIndex,
OffsetNumber offnum, IndexTuple itup);
+static void _bt_setuppostingitems(BTScanOpaque so, int itemIndex,
+ OffsetNumber offnum, ItemPointer heapTid,
+ IndexTuple itup);
+static inline void _bt_savepostingitem(BTScanOpaque so, int itemIndex,
+ OffsetNumber offnum,
+ ItemPointer heapTid);
static bool _bt_steppage(IndexScanDesc scan, ScanDirection dir);
static bool _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir);
static bool _bt_parallel_readpage(IndexScanDesc scan, BlockNumber blkno,
@@ -434,7 +442,10 @@ _bt_binsrch(Relation rel,
* low) makes bounds invalid.
*
* Caller is responsible for invalidating bounds when it modifies the page
- * before calling here a second time.
+ * before calling here a second time, and for dealing with posting list
+ * tuple matches (callers can use insertstate's postingoff field to
+ * determine which existing heap TID will need to be replaced by their
+ * scantid/new heap TID).
*/
OffsetNumber
_bt_binsrch_insert(Relation rel, BTInsertState insertstate)
@@ -453,6 +464,7 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
Assert(P_ISLEAF(opaque));
Assert(!key->nextkey);
+ Assert(insertstate->postingoff == 0);
if (!insertstate->bounds_valid)
{
@@ -509,6 +521,16 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
if (result != 0)
stricthigh = high;
}
+
+ /*
+ * If tuple at offset located by binary search is a posting list whose
+ * TID range overlaps with caller's scantid, perform posting list
+ * binary search to set postingoff for caller. Caller must split the
+ * posting list when postingoff is set. This should happen
+ * infrequently.
+ */
+ if (unlikely(result == 0 && key->scantid != NULL))
+ insertstate->postingoff = _bt_binsrch_posting(key, page, mid);
}
/*
@@ -528,6 +550,68 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
return low;
}
+/*----------
+ * _bt_binsrch_posting() -- posting list binary search.
+ *
+ * Returns offset into posting list where caller's scantid belongs.
+ *----------
+ */
+static int
+_bt_binsrch_posting(BTScanInsert key, Page page, OffsetNumber offnum)
+{
+ IndexTuple itup;
+ ItemId itemid;
+ int low,
+ high,
+ mid,
+ res;
+
+ /*
+ * If this isn't a posting tuple, then the index must be corrupt (if it is
+ * an ordinary non-pivot tuple then there must be an existing tuple with a
+ * heap TID that equals inserter's new heap TID/scantid). Defensively
+ * check that tuple is a posting list tuple whose posting list range
+ * includes caller's scantid.
+ *
+ * (This is also needed because contrib/amcheck's rootdescend option needs
+ * to be able to relocate a non-pivot tuple using _bt_binsrch_insert().)
+ */
+ itemid = PageGetItemId(page, offnum);
+ itup = (IndexTuple) PageGetItem(page, itemid);
+ if (!BTreeTupleIsPosting(itup))
+ return 0;
+
+ /*
+ * In the unlikely event that posting list tuple has LP_DEAD bit set,
+ * signal to caller that it should kill the item and restart its binary
+ * search.
+ */
+ if (ItemIdIsDead(itemid))
+ return -1;
+
+ /* "high" is past end of posting list for loop invariant */
+ low = 0;
+ high = BTreeTupleGetNPosting(itup);
+ Assert(high >= 2);
+
+ while (high > low)
+ {
+ mid = low + ((high - low) / 2);
+ res = ItemPointerCompare(key->scantid,
+ BTreeTupleGetPostingN(itup, mid));
+
+ if (res > 0)
+ low = mid + 1;
+ else if (res < 0)
+ high = mid;
+ else
+ return mid;
+ }
+
+ /* Exact match not found */
+ return low;
+}
+
/*----------
* _bt_compare() -- Compare insertion-type scankey to tuple on a page.
*
@@ -537,9 +621,18 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
* <0 if scankey < tuple at offnum;
* 0 if scankey == tuple at offnum;
* >0 if scankey > tuple at offnum.
- * NULLs in the keys are treated as sortable values. Therefore
- * "equality" does not necessarily mean that the item should be
- * returned to the caller as a matching key!
+ *
+ * NULLs in the keys are treated as sortable values. Therefore
+ * "equality" does not necessarily mean that the item should be returned
+ * to the caller as a matching key. Similarly, an insertion scankey
+ * with its scantid set is treated as equal to a posting tuple whose TID
+ * range overlaps with its scantid. There generally won't be a
+ * matching TID in the posting tuple, which caller must handle
+ * themselves (e.g., by splitting the posting list tuple).
+ *
+ * It is generally guaranteed that any possible scankey with scantid set
+ * will have zero or one tuples in the index that are considered equal
+ * here.
*
* CRUCIAL NOTE: on a non-leaf page, the first data key is assumed to be
* "minus infinity": this routine will always claim it is less than the
@@ -563,6 +656,7 @@ _bt_compare(Relation rel,
ScanKey scankey;
int ncmpkey;
int ntupatts;
+ int32 result;
Assert(_bt_check_natts(rel, key->heapkeyspace, page, offnum));
Assert(key->keysz <= IndexRelationGetNumberOfKeyAttributes(rel));
@@ -597,7 +691,6 @@ _bt_compare(Relation rel,
{
Datum datum;
bool isNull;
- int32 result;
datum = index_getattr(itup, scankey->sk_attno, itupdesc, &isNull);
@@ -713,8 +806,25 @@ _bt_compare(Relation rel,
if (heapTid == NULL)
return 1;
+ /*
+ * scankey must be treated as equal to a posting list tuple if its scantid
+ * value falls within the range of the posting list. In all other cases
+ * there can only be a single heap TID value, which is compared directly
+ * as a simple scalar value.
+ */
Assert(ntupatts >= IndexRelationGetNumberOfKeyAttributes(rel));
- return ItemPointerCompare(key->scantid, heapTid);
+ result = ItemPointerCompare(key->scantid, heapTid);
+ if (!BTreeTupleIsPosting(itup) || result <= 0)
+ return result;
+ else
+ {
+ result = ItemPointerCompare(key->scantid,
+ BTreeTupleGetMaxHeapTID(itup));
+ if (result > 0)
+ return 1;
+ }
+
+ return 0;
}
/*
@@ -1230,6 +1340,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
/* Initialize remaining insertion scan key fields */
inskey.heapkeyspace = _bt_heapkeyspace(rel);
+ inskey.safededup = false; /* unused */
inskey.anynullkeys = false; /* unused */
inskey.nextkey = nextkey;
inskey.pivotsearch = false;
@@ -1451,6 +1562,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
/* initialize tuple workspace to empty */
so->currPos.nextTupleOffset = 0;
+ so->currPos.postingTupleOffset = 0;
/*
* Now that the current page has been made consistent, the macro should be
@@ -1485,8 +1597,29 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
if (_bt_checkkeys(scan, itup, indnatts, dir, &continuescan))
{
/* tuple passes all scan key conditions, so remember it */
- _bt_saveitem(so, itemIndex, offnum, itup);
- itemIndex++;
+ if (!BTreeTupleIsPosting(itup))
+ {
+ _bt_saveitem(so, itemIndex, offnum, itup);
+ itemIndex++;
+ }
+ else
+ {
+ /*
+ * Set up state to return posting list, and save first
+ * "logical" tuple
+ */
+ _bt_setuppostingitems(so, itemIndex, offnum,
+ BTreeTupleGetPostingN(itup, 0),
+ itup);
+ itemIndex++;
+ /* Save additional posting list "logical" tuples */
+ for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
+ {
+ _bt_savepostingitem(so, itemIndex, offnum,
+ BTreeTupleGetPostingN(itup, i));
+ itemIndex++;
+ }
+ }
}
/* When !continuescan, there can't be any more matches, so stop */
if (!continuescan)
@@ -1519,7 +1652,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
if (!continuescan)
so->currPos.moreRight = false;
- Assert(itemIndex <= MaxIndexTuplesPerPage);
+ Assert(itemIndex <= MaxBTreeIndexTuplesPerPage);
so->currPos.firstItem = 0;
so->currPos.lastItem = itemIndex - 1;
so->currPos.itemIndex = 0;
@@ -1527,7 +1660,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
else
{
/* load items[] in descending order */
- itemIndex = MaxIndexTuplesPerPage;
+ itemIndex = MaxBTreeIndexTuplesPerPage;
offnum = Min(offnum, maxoff);
@@ -1569,8 +1702,36 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
if (passes_quals && tuple_alive)
{
/* tuple passes all scan key conditions, so remember it */
- itemIndex--;
- _bt_saveitem(so, itemIndex, offnum, itup);
+ if (!BTreeTupleIsPosting(itup))
+ {
+ itemIndex--;
+ _bt_saveitem(so, itemIndex, offnum, itup);
+ }
+ else
+ {
+ int i = BTreeTupleGetNPosting(itup) - 1;
+
+ /*
+ * Set up state to return posting list, and save last
+ * "logical" tuple from posting list (since it's the first
+ * that will be returned to scan).
+ */
+ itemIndex--;
+ _bt_setuppostingitems(so, itemIndex, offnum,
+ BTreeTupleGetPostingN(itup, i--),
+ itup);
+
+ /*
+ * Return posting list "logical" tuples -- do this in
+ * descending order, to match overall scan order
+ */
+ for (; i >= 0; i--)
+ {
+ itemIndex--;
+ _bt_savepostingitem(so, itemIndex, offnum,
+ BTreeTupleGetPostingN(itup, i));
+ }
+ }
}
if (!continuescan)
{
@@ -1584,8 +1745,8 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
Assert(itemIndex >= 0);
so->currPos.firstItem = itemIndex;
- so->currPos.lastItem = MaxIndexTuplesPerPage - 1;
- so->currPos.itemIndex = MaxIndexTuplesPerPage - 1;
+ so->currPos.lastItem = MaxBTreeIndexTuplesPerPage - 1;
+ so->currPos.itemIndex = MaxBTreeIndexTuplesPerPage - 1;
}
return (so->currPos.firstItem <= so->currPos.lastItem);
@@ -1598,6 +1759,8 @@ _bt_saveitem(BTScanOpaque so, int itemIndex,
{
BTScanPosItem *currItem = &so->currPos.items[itemIndex];
+ Assert(!BTreeTupleIsPosting(itup));
+
currItem->heapTid = itup->t_tid;
currItem->indexOffset = offnum;
if (so->currTuples)
@@ -1610,6 +1773,64 @@ _bt_saveitem(BTScanOpaque so, int itemIndex,
}
}
+/*
+ * Set up state to save posting items from a single posting list tuple. Saves
+ * the logical tuple that will be returned to scan first in passing.
+ *
+ * Saves an index item into so->currPos.items[itemIndex] for logical tuple
+ * that is returned to scan first. Second or subsequent heap TID for posting
+ * list should be saved by calling _bt_savepostingitem().
+ */
+static void
+_bt_setuppostingitems(BTScanOpaque so, int itemIndex, OffsetNumber offnum,
+ ItemPointer heapTid, IndexTuple itup)
+{
+ BTScanPosItem *currItem = &so->currPos.items[itemIndex];
+
+ currItem->heapTid = *heapTid;
+ currItem->indexOffset = offnum;
+
+ if (so->currTuples)
+ {
+ /* Save base IndexTuple (truncate posting list) */
+ IndexTuple base;
+ Size itupsz = BTreeTupleGetPostingOffset(itup);
+
+ itupsz = MAXALIGN(itupsz);
+ currItem->tupleOffset = so->currPos.nextTupleOffset;
+ base = (IndexTuple) (so->currTuples + so->currPos.nextTupleOffset);
+ memcpy(base, itup, itupsz);
+ /* Defensively reduce work area index tuple header size */
+ base->t_info &= ~INDEX_SIZE_MASK;
+ base->t_info |= itupsz;
+ so->currPos.nextTupleOffset += itupsz;
+ so->currPos.postingTupleOffset = currItem->tupleOffset;
+ }
+}
+
+/*
+ * Save an index item into so->currPos.items[itemIndex] for posting tuple.
+ *
+ * Assumes that _bt_setuppostingitems() has already been called for current
+ * posting list tuple.
+ */
+static inline void
+_bt_savepostingitem(BTScanOpaque so, int itemIndex, OffsetNumber offnum,
+ ItemPointer heapTid)
+{
+ BTScanPosItem *currItem = &so->currPos.items[itemIndex];
+
+ currItem->heapTid = *heapTid;
+ currItem->indexOffset = offnum;
+
+ /*
+ * Have index-only scans return the same base IndexTuple for every logical
+ * tuple that originates from the same posting list
+ */
+ if (so->currTuples)
+ currItem->tupleOffset = so->currPos.postingTupleOffset;
+}
+
/*
* _bt_steppage() -- Step to next page containing valid data for scan
*
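The two comparison rules added to nbtsearch.c above interact: _bt_compare() reports "equal" whenever scantid falls inside a posting list's TID range, and _bt_binsrch_posting() then finds the offset inside the list where scantid belongs. A minimal standalone sketch of both follows, assuming a simplified Tid struct and a plain sorted array in place of a posting list tuple. The names tid_compare(), posting_range_compare() and posting_binsrch() are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for ItemPointerData */
typedef struct Tid
{
    uint32_t    block;
    uint16_t    offset;
} Tid;

/* Orders by block number, then offset: returns -1, 0 or 1 */
int
tid_compare(const Tid *a, const Tid *b)
{
    if (a->block != b->block)
        return (a->block < b->block) ? -1 : 1;
    if (a->offset != b->offset)
        return (a->offset < b->offset) ? -1 : 1;
    return 0;
}

/*
 * Range comparison: scantid is "equal" to a posting list when it is
 * greater than the first TID but not greater than the last one.
 */
int
posting_range_compare(const Tid *scantid, const Tid *posting, int n)
{
    int         result;

    assert(n >= 1);
    result = tid_compare(scantid, &posting[0]);
    if (n == 1 || result <= 0)
        return result;
    if (tid_compare(scantid, &posting[n - 1]) > 0)
        return 1;
    return 0;
}

/* Offset into posting[] where scantid belongs ("high" is past the end) */
int
posting_binsrch(const Tid *scantid, const Tid *posting, int n)
{
    int         low = 0;
    int         high = n;

    while (high > low)
    {
        int         mid = low + ((high - low) / 2);
        int         res = tid_compare(scantid, &posting[mid]);

        if (res > 0)
            low = mid + 1;
        else if (res < 0)
            high = mid;
        else
            return mid;
    }

    /* Exact match not found */
    return low;
}
```

Note that the sketch, like the patch, returns a plain comparison against the first TID when scantid is less than or equal to it, so an exact match on the first TID also yields 0.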
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index b5f0857598..29cc49e4b9 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -243,6 +243,7 @@ typedef struct BTPageState
BlockNumber btps_blkno; /* block # to write this page at */
IndexTuple btps_lowkey; /* page's strict lower bound pivot tuple */
OffsetNumber btps_lastoff; /* last item offset loaded */
+ Size btps_lastextra; /* last item's extra posting list space */
uint32 btps_level; /* tree level (0 = leaf) */
Size btps_full; /* "full" if less than this much free space */
struct BTPageState *btps_next; /* link to parent level, if any */
@@ -277,7 +278,10 @@ static void _bt_slideleft(Page page);
static void _bt_sortaddtup(Page page, Size itemsize,
IndexTuple itup, OffsetNumber itup_off);
static void _bt_buildadd(BTWriteState *wstate, BTPageState *state,
- IndexTuple itup);
+ IndexTuple itup, Size truncextra);
+static void _bt_sort_dedup_finish_pending(BTWriteState *wstate,
+ BTPageState *state,
+ BTDedupState *dstate);
static void _bt_uppershutdown(BTWriteState *wstate, BTPageState *state);
static void _bt_load(BTWriteState *wstate,
BTSpool *btspool, BTSpool *btspool2);
@@ -711,13 +715,14 @@ _bt_pagestate(BTWriteState *wstate, uint32 level)
state->btps_lowkey = NULL;
/* initialize lastoff so first item goes into P_FIRSTKEY */
state->btps_lastoff = P_HIKEY;
+ state->btps_lastextra = 0;
state->btps_level = level;
/* set "full" threshold based on level. See notes at head of file. */
if (level > 0)
state->btps_full = (BLCKSZ * (100 - BTREE_NONLEAF_FILLFACTOR) / 100);
else
- state->btps_full = RelationGetTargetPageFreeSpace(wstate->index,
- BTREE_DEFAULT_FILLFACTOR);
+ state->btps_full = BtreeGetTargetPageFreeSpace(wstate->index,
+ BTREE_DEFAULT_FILLFACTOR);
/* no parent level, yet */
state->btps_next = NULL;
@@ -790,7 +795,8 @@ _bt_sortaddtup(Page page,
}
/*----------
- * Add an item to a disk page from the sort output.
+ * Add an item to a disk page from the sort output (or add a posting list
+ * item formed from the sort output).
*
* We must be careful to observe the page layout conventions of nbtsearch.c:
* - rightmost pages start data items at P_HIKEY instead of at P_FIRSTKEY.
@@ -822,14 +828,27 @@ _bt_sortaddtup(Page page,
* the truncated high key at offset 1.
*
* 'last' pointer indicates the last offset added to the page.
+ *
+ * 'truncextra' is the size of the posting list in itup, if any. This
+ * information is stashed for the next call here, when we may benefit
+ * from considering the impact of truncating away the posting list on
+ * the page before deciding to finish the page off. Posting lists are
+ * often relatively large, so it is worth going to the trouble of
+ * accounting for the saving from truncating away the posting list of
+ * the tuple that becomes the high key (that may be the only way to
+ * get close to target free space on the page). Note that this is
+ * only used for the soft fillfactor-wise limit, not the critical hard
+ * limit.
*----------
*/
static void
-_bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
+_bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup,
+ Size truncextra)
{
Page npage;
BlockNumber nblkno;
OffsetNumber last_off;
+ Size last_truncextra;
Size pgspc;
Size itupsz;
bool isleaf;
@@ -843,6 +862,8 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
npage = state->btps_page;
nblkno = state->btps_blkno;
last_off = state->btps_lastoff;
+ last_truncextra = state->btps_lastextra;
+ state->btps_lastextra = truncextra;
pgspc = PageGetFreeSpace(npage);
itupsz = IndexTupleSize(itup);
@@ -884,10 +905,10 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
* page. Disregard fillfactor and insert on "full" current page if we
* don't have the minimum number of items yet. (Note that we deliberately
* assume that suffix truncation neither enlarges nor shrinks new high key
- * when applying soft limit.)
+ * when applying soft limit, except when last tuple had a posting list.)
*/
if (pgspc < itupsz + (isleaf ? MAXALIGN(sizeof(ItemPointerData)) : 0) ||
- (pgspc < state->btps_full && last_off > P_FIRSTKEY))
+ (pgspc + last_truncextra < state->btps_full && last_off > P_FIRSTKEY))
{
/*
* Finish off the page and write it out.
@@ -945,11 +966,11 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
* We don't try to bias our choice of split point to make it more
* likely that _bt_truncate() can truncate away more attributes,
* whereas the split point used within _bt_split() is chosen much
- * more delicately. Suffix truncation is mostly useful because it
- * improves space utilization for workloads with random
- * insertions. It doesn't seem worthwhile to add logic for
- * choosing a split point here for a benefit that is bound to be
- * much smaller.
+ * more delicately. On the other hand, non-unique index builds
+ * usually deduplicate, which often results in every "physical"
+ * tuple on the page having distinct key values. When that
+ * happens, _bt_truncate() will never need to include a heap TID
+ * in the new high key.
*
* Overwrite the old item with new truncated high key directly.
* oitup is already located at the physical beginning of tuple
@@ -984,7 +1005,7 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
Assert(BTreeTupleGetNAtts(state->btps_lowkey, wstate->index) == 0 ||
!P_LEFTMOST((BTPageOpaque) PageGetSpecialPointer(opage)));
BTreeInnerTupleSetDownLink(state->btps_lowkey, oblkno);
- _bt_buildadd(wstate, state->btps_next, state->btps_lowkey);
+ _bt_buildadd(wstate, state->btps_next, state->btps_lowkey, 0);
pfree(state->btps_lowkey);
/*
@@ -1046,6 +1067,47 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
state->btps_lastoff = last_off;
}
+/*
+ * Finalize pending posting list tuple, and add it to the index. Final tuple
+ * is based on saved base tuple, and saved list of heap TIDs.
+ *
+ * This is almost like _bt_dedup_finish_pending(), but it adds a new tuple
+ * using _bt_buildadd() and does not maintain the intervals array.
+ */
+static void
+_bt_sort_dedup_finish_pending(BTWriteState *wstate, BTPageState *state,
+ BTDedupState *dstate)
+{
+ IndexTuple final;
+ Size truncextra;
+
+ Assert(dstate->nitems > 0);
+ truncextra = 0;
+ if (dstate->nitems == 1)
+ final = dstate->base;
+ else
+ {
+ IndexTuple postingtuple;
+
+ /* form a tuple with a posting list */
+ postingtuple = _bt_form_posting(dstate->base,
+ dstate->htids,
+ dstate->nhtids);
+ final = postingtuple;
+ /* Determine size of posting list */
+ truncextra = IndexTupleSize(final) -
+ BTreeTupleGetPostingOffset(final);
+ }
+
+ _bt_buildadd(wstate, state, final, truncextra);
+
+ if (dstate->nitems > 1)
+ pfree(final);
+ /* Don't maintain dedup_intervals array, or alltupsize */
+ dstate->nhtids = 0;
+ dstate->nitems = 0;
+}
+
/*
* Finish writing out the completed btree.
*/
@@ -1091,7 +1153,7 @@ _bt_uppershutdown(BTWriteState *wstate, BTPageState *state)
Assert(BTreeTupleGetNAtts(s->btps_lowkey, wstate->index) == 0 ||
!P_LEFTMOST(opaque));
BTreeInnerTupleSetDownLink(s->btps_lowkey, blkno);
- _bt_buildadd(wstate, s->btps_next, s->btps_lowkey);
+ _bt_buildadd(wstate, s->btps_next, s->btps_lowkey, 0);
pfree(s->btps_lowkey);
s->btps_lowkey = NULL;
}
@@ -1112,7 +1174,8 @@ _bt_uppershutdown(BTWriteState *wstate, BTPageState *state)
* by filling in a valid magic number in the metapage.
*/
metapage = (Page) palloc(BLCKSZ);
- _bt_initmetapage(metapage, rootblkno, rootlevel);
+ _bt_initmetapage(metapage, rootblkno, rootlevel,
+ wstate->inskey->safededup);
_bt_blwritepage(wstate, metapage, BTREE_METAPAGE);
}
@@ -1133,6 +1196,10 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
keysz = IndexRelationGetNumberOfKeyAttributes(wstate->index);
SortSupport sortKeys;
int64 tuples_done = 0;
+ bool deduplicate;
+
+ deduplicate = wstate->inskey->safededup &&
+ BtreeGetDoDedupOption(wstate->index);
if (merge)
{
@@ -1229,12 +1296,12 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
if (load1)
{
- _bt_buildadd(wstate, state, itup);
+ _bt_buildadd(wstate, state, itup, 0);
itup = tuplesort_getindextuple(btspool->sortstate, true);
}
else
{
- _bt_buildadd(wstate, state, itup2);
+ _bt_buildadd(wstate, state, itup2, 0);
itup2 = tuplesort_getindextuple(btspool2->sortstate, true);
}
@@ -1244,9 +1311,113 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
}
pfree(sortKeys);
}
+ else if (deduplicate)
+ {
+ /* merge is unnecessary, deduplicate into posting lists */
+ BTDedupState *dstate;
+ IndexTuple newbase;
+
+ dstate = (BTDedupState *) palloc(sizeof(BTDedupState));
+ dstate->maxitemsize = 0; /* set later */
+ dstate->checkingunique = false; /* unused */
+ dstate->skippedbase = InvalidOffsetNumber;
+ dstate->newitem = NULL;
+ /* Metadata about current pending posting list */
+ dstate->htids = NULL;
+ dstate->nhtids = 0;
+ dstate->nitems = 0;
+ dstate->overlap = false;
+ dstate->alltupsize = 0; /* unused */
+ /* Metadata about base tuple of current pending posting list */
+ dstate->base = NULL;
+ dstate->baseoff = InvalidOffsetNumber; /* unused */
+ dstate->basetupsize = 0;
+
+ while ((itup = tuplesort_getindextuple(btspool->sortstate,
+ true)) != NULL)
+ {
+ /* When we see first tuple, create first index page */
+ if (state == NULL)
+ {
+ state = _bt_pagestate(wstate, 0);
+
+ /*
+ * Limit size of posting list tuples to the size of the free
+ * space we want to leave behind on the page, plus space for
+ * final item's line pointer (but make sure that posting list
+ * tuple size won't exceed the generic 1/3 of a page limit).
+ *
+ * This is more conservative than the approach taken in the
+ * retail insert path, but it allows us to get most of the
+ * space savings deduplication provides without noticeably
+ * impacting how much free space is left behind on each leaf
+ * page.
+ */
+ dstate->maxitemsize =
+ Min(BTMaxItemSize(state->btps_page),
+ MAXALIGN_DOWN(state->btps_full) - sizeof(ItemIdData));
+ /* Minimum posting tuple size used here is arbitrary: */
+ dstate->maxitemsize = Max(dstate->maxitemsize, 100);
+ dstate->htids = palloc(dstate->maxitemsize);
+
+ /*
+ * No previous/base tuple, since itup is the first item
+ * returned by the tuplesort -- use itup as base tuple of
+ * first pending posting list for entire index build
+ */
+ newbase = CopyIndexTuple(itup);
+ _bt_dedup_start_pending(dstate, newbase, InvalidOffsetNumber);
+ }
+ else if (_bt_keep_natts_fast(wstate->index, dstate->base,
+ itup) > keysz &&
+ _bt_dedup_save_htid(dstate, itup))
+ {
+ /*
+ * Tuple is equal to base tuple of pending posting list, and
+ * merging itup into pending posting list won't exceed the
+ * maxitemsize limit. Heap TID(s) for itup have been saved in
+ * state. The next iteration will also end up here if it's
+ * possible to merge the next tuple into the same pending
+ * posting list.
+ */
+ }
+ else
+ {
+ /*
+ * Tuple is not equal to pending posting list tuple, or
+ * maxitemsize limit was reached
+ */
+ _bt_sort_dedup_finish_pending(wstate, state, dstate);
+ /* Base tuple is always a copy */
+ pfree(dstate->base);
+
+ /* itup starts new pending posting list */
+ newbase = CopyIndexTuple(itup);
+ _bt_dedup_start_pending(dstate, newbase, InvalidOffsetNumber);
+ }
+
+ /* Report progress */
+ pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
+ ++tuples_done);
+ }
+
+ /*
+ * Handle the last item (there must be a last item when the tuplesort
+ * returned one or more tuples)
+ */
+ if (state)
+ {
+ _bt_sort_dedup_finish_pending(wstate, state, dstate);
+ /* Base tuple is always a copy */
+ pfree(dstate->base);
+ pfree(dstate->htids);
+ }
+
+ pfree(dstate);
+ }
else
{
- /* merge is unnecessary */
+ /* merging and deduplication are both unnecessary */
while ((itup = tuplesort_getindextuple(btspool->sortstate,
true)) != NULL)
{
@@ -1254,7 +1425,7 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
if (state == NULL)
state = _bt_pagestate(wstate, 0);
- _bt_buildadd(wstate, state, itup);
+ _bt_buildadd(wstate, state, itup, 0);
/* Report progress */
pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
diff --git a/src/backend/access/nbtree/nbtsplitloc.c b/src/backend/access/nbtree/nbtsplitloc.c
index a04d4e25d6..7758d74101 100644
--- a/src/backend/access/nbtree/nbtsplitloc.c
+++ b/src/backend/access/nbtree/nbtsplitloc.c
@@ -167,7 +167,7 @@ _bt_findsplitloc(Relation rel,
/* Count up total space in data items before actually scanning 'em */
olddataitemstotal = rightspace - (int) PageGetExactFreeSpace(page);
- leaffillfactor = RelationGetFillFactor(rel, BTREE_DEFAULT_FILLFACTOR);
+ leaffillfactor = BtreeGetFillFactor(rel, BTREE_DEFAULT_FILLFACTOR);
/* Passed-in newitemsz is MAXALIGNED but does not include line pointer */
newitemsz += sizeof(ItemIdData);
@@ -183,6 +183,9 @@ _bt_findsplitloc(Relation rel,
state.minfirstrightsz = SIZE_MAX;
state.newitemoff = newitemoff;
+ /* newitem cannot be a posting list item */
+ Assert(!BTreeTupleIsPosting(newitem));
+
/*
* maxsplits should never exceed maxoff because there will be at most as
* many candidate split points as there are points _between_ tuples, once
@@ -459,17 +462,52 @@ _bt_recsplitloc(FindSplitData *state,
int16 leftfree,
rightfree;
Size firstrightitemsz;
+ Size postingsubhikey = 0;
bool newitemisfirstonright;
/* Is the new item going to be the first item on the right page? */
newitemisfirstonright = (firstoldonright == state->newitemoff
&& !newitemonleft);
+ /*
+ * FIXME: Accessing every single tuple like this adds cycles to cases that
+ * cannot possibly benefit (i.e. cases where we know that there cannot be
+ * posting lists). Maybe we should add a way to not bother when we are
+ * certain that this is the case.
+ *
+ * We could either have _bt_split() pass us a flag, or invent a page flag
+ * that indicates that the page might have posting lists, as an
+ * optimization. There is no shortage of btpo_flags bits for stuff like
+ * this.
+ */
if (newitemisfirstonright)
+ {
firstrightitemsz = state->newitemsz;
+
+ /* Calculate posting list overhead, if any */
+ if (state->is_leaf && BTreeTupleIsPosting(state->newitem))
+ postingsubhikey = IndexTupleSize(state->newitem) -
+ BTreeTupleGetPostingOffset(state->newitem);
+ }
else
+ {
firstrightitemsz = firstoldonrightsz;
+ /* Calculate posting list overhead, if any */
+ if (state->is_leaf)
+ {
+ ItemId itemid;
+ IndexTuple newhighkey;
+
+ itemid = PageGetItemId(state->page, firstoldonright);
+ newhighkey = (IndexTuple) PageGetItem(state->page, itemid);
+
+ if (BTreeTupleIsPosting(newhighkey))
+ postingsubhikey = IndexTupleSize(newhighkey) -
+ BTreeTupleGetPostingOffset(newhighkey);
+ }
+ }
+
/* Account for all the old tuples */
leftfree = state->leftspace - olddataitemstoleft;
rightfree = state->rightspace -
@@ -492,9 +530,13 @@ _bt_recsplitloc(FindSplitData *state,
* adding a heap TID to the left half's new high key when splitting at the
* leaf level. In practice the new high key will often be smaller and
* will rarely be larger, but conservatively assume the worst case.
+ * Truncation always truncates away any posting list that appears in the
+ * first right tuple, though, so it's safe to subtract that overhead
+ * (while still conservatively assuming that truncation might have to add
+ * back a single heap TID using the pivot tuple heap TID representation).
*/
if (state->is_leaf)
- leftfree -= (int16) (firstrightitemsz +
+ leftfree -= (int16) ((firstrightitemsz - postingsubhikey) +
MAXALIGN(sizeof(ItemPointerData)));
else
leftfree -= (int16) firstrightitemsz;
@@ -691,7 +733,8 @@ _bt_afternewitemoff(FindSplitData *state, OffsetNumber maxoff,
itemid = PageGetItemId(state->page, OffsetNumberPrev(state->newitemoff));
tup = (IndexTuple) PageGetItem(state->page, itemid);
/* Do cheaper test first */
- if (!_bt_adjacenthtid(&tup->t_tid, &state->newitem->t_tid))
+ if (BTreeTupleIsPosting(tup) ||
+ !_bt_adjacenthtid(&tup->t_tid, &state->newitem->t_tid))
return false;
/* Check same conditions as rightmost item case, too */
keepnatts = _bt_keep_natts_fast(state->rel, tup, state->newitem);
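The nbtsplitloc.c hunks above change how much left-page space a candidate leaf split reserves for the new high key. The posting list part of the first right tuple is subtracted because truncation always removes it, while room for one pivot heap TID is still reserved. Below is a minimal standalone sketch of that arithmetic. It assumes 8-byte maximum alignment and a 6-byte heap TID; SKETCH_MAXALIGN, SKETCH_TID_SIZE and leaf_highkey_reserve() are hypothetical names for illustration, not part of the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Assumptions for illustration: 8-byte maximum alignment, 6-byte heap TID */
#define SKETCH_MAXALIGN(len) (((size_t) (len) + 7) & ~((size_t) 7))
#define SKETCH_TID_SIZE 6

/*
 * Hypothetical helper: worst-case space the left page must reserve for its
 * new high key when a leaf page is split.
 *
 * firstrightitemsz: size of the first tuple on the right page, as tracked
 *                   by the split code (line pointer included)
 * postingsubhikey:  bytes of that tuple taken up by its posting list
 *                   (0 for a plain, non-posting tuple)
 */
static size_t
leaf_highkey_reserve(size_t firstrightitemsz, size_t postingsubhikey)
{
	/* Posting list is truncated away; one pivot heap TID may be added back */
	return (firstrightitemsz - postingsubhikey) +
		SKETCH_MAXALIGN(SKETCH_TID_SIZE);
}
```

With these assumptions, a 120-byte first right tuple that carries a 96-byte posting list reserves the same space as a plain 24-byte tuple, which is the point of subtracting postingsubhikey.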
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 6a3008dd48..6fec8cb745 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -98,8 +98,6 @@ _bt_mkscankey(Relation rel, IndexTuple itup)
indoption = rel->rd_indoption;
tupnatts = itup ? BTreeTupleGetNAtts(itup, rel) : 0;
- Assert(tupnatts <= IndexRelationGetNumberOfAttributes(rel));
-
/*
* We'll execute search using scan key constructed on key columns.
* Truncated attributes and non-key attributes are omitted from the final
@@ -108,12 +106,25 @@ _bt_mkscankey(Relation rel, IndexTuple itup)
key = palloc(offsetof(BTScanInsertData, scankeys) +
sizeof(ScanKeyData) * indnkeyatts);
key->heapkeyspace = itup == NULL || _bt_heapkeyspace(rel);
+ key->safededup = itup == NULL ? _bt_opclasses_support_dedup(rel) :
+ _bt_safededup(rel);
key->anynullkeys = false; /* initial assumption */
key->nextkey = false;
key->pivotsearch = false;
+ key->scantid = NULL;
key->keysz = Min(indnkeyatts, tupnatts);
- key->scantid = key->heapkeyspace && itup ?
- BTreeTupleGetHeapTID(itup) : NULL;
+
+ Assert(tupnatts <= IndexRelationGetNumberOfAttributes(rel));
+ Assert(!itup || !BTreeTupleIsPosting(itup) || key->heapkeyspace);
+
+ /*
+ * When caller passes a tuple with a heap TID, use it to set scantid. Note
+ * that this handles posting list tuples by setting scantid to the lowest
+ * heap TID in the posting list.
+ */
+ if (itup && key->heapkeyspace)
+ key->scantid = BTreeTupleGetHeapTID(itup);
+
skey = key->scankeys;
for (i = 0; i < indnkeyatts; i++)
{
@@ -1373,6 +1384,7 @@ _bt_checkkeys(IndexScanDesc scan, IndexTuple tuple, int tupnatts,
* attribute passes the qual.
*/
Assert(ScanDirectionIsForward(dir));
+ Assert(BTreeTupleIsPivot(tuple));
continue;
}
@@ -1534,6 +1546,7 @@ _bt_check_rowcompare(ScanKey skey, IndexTuple tuple, int tupnatts,
* attribute passes the qual.
*/
Assert(ScanDirectionIsForward(dir));
+ Assert(BTreeTupleIsPivot(tuple));
cmpresult = 0;
if (subkey->sk_flags & SK_ROW_END)
break;
@@ -1773,10 +1786,35 @@ _bt_killitems(IndexScanDesc scan)
{
ItemId iid = PageGetItemId(page, offnum);
IndexTuple ituple = (IndexTuple) PageGetItem(page, iid);
+ bool killtuple = false;
- if (ItemPointerEquals(&ituple->t_tid, &kitem->heapTid))
+ if (BTreeTupleIsPosting(ituple))
{
- /* found the item */
+ int pi = i + 1;
+ int nposting = BTreeTupleGetNPosting(ituple);
+ int j;
+
+ for (j = 0; j < nposting; j++)
+ {
+ ItemPointer item = BTreeTupleGetPostingN(ituple, j);
+
+ if (!ItemPointerEquals(item, &kitem->heapTid))
+ break; /* out of posting list loop */
+
+ /* Read-ahead to later kitems */
+ if (pi < numKilled)
+ kitem = &so->currPos.items[so->killedItems[pi++]];
+ }
+
+ if (j == nposting)
+ killtuple = true;
+ }
+ else if (ItemPointerEquals(&ituple->t_tid, &kitem->heapTid))
+ killtuple = true;
+
+ if (killtuple)
+ {
+ /* found the item/all posting list items */
ItemIdMarkDead(iid);
killedsomething = true;
break; /* out of inner search loop */
@@ -2014,7 +2052,31 @@ BTreeShmemInit(void)
bytea *
btoptions(Datum reloptions, bool validate)
{
- return default_reloptions(reloptions, validate, RELOPT_KIND_BTREE);
+ relopt_value *options;
+ BtreeOptions *rdopts;
+ int numoptions;
+ static const relopt_parse_elt tab[] = {
+ {"fillfactor", RELOPT_TYPE_INT, offsetof(BtreeOptions, fillfactor)},
+ {"vacuum_cleanup_index_scale_factor", RELOPT_TYPE_REAL,
+ offsetof(BtreeOptions, vacuum_cleanup_index_scale_factor)},
+ {"deduplication", RELOPT_TYPE_BOOL,
+ offsetof(BtreeOptions, deduplication)}
+ };
+
+ options = parseRelOptions(reloptions, validate, RELOPT_KIND_BTREE,
+ &numoptions);
+
+ /* if none set, we're done */
+ if (numoptions == 0)
+ return NULL;
+
+ rdopts = allocateReloptStruct(sizeof(BtreeOptions), options, numoptions);
+
+ fillRelOptions((void *) rdopts, sizeof(BtreeOptions), options, numoptions,
+ validate, tab, lengthof(tab));
+
+ pfree(options);
+ return (bytea *) rdopts;
}
/*
@@ -2127,6 +2189,24 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
pivot = index_truncate_tuple(itupdesc, firstright, keepnatts);
+ if (BTreeTupleIsPosting(firstright))
+ {
+ BTreeTupleClearBtIsPosting(pivot);
+ BTreeTupleSetNAtts(pivot, keepnatts);
+ if (keepnatts == natts)
+ {
+ /*
+ * index_truncate_tuple() just returned a copy of the
+ * original, so make sure that the size of the new pivot tuple
+ * doesn't have posting list overhead
+ */
+ pivot->t_info &= ~INDEX_SIZE_MASK;
+ pivot->t_info |= MAXALIGN(BTreeTupleGetPostingOffset(firstright));
+ }
+ }
+
+ Assert(!BTreeTupleIsPosting(pivot));
+
/*
* If there is a distinguishing key attribute within new pivot tuple,
* there is no need to add an explicit heap TID attribute
@@ -2143,6 +2223,8 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
* attribute to the new pivot tuple.
*/
Assert(natts != nkeyatts);
+ Assert(!BTreeTupleIsPosting(lastleft) &&
+ !BTreeTupleIsPosting(firstright));
newsize = IndexTupleSize(pivot) + MAXALIGN(sizeof(ItemPointerData));
tidpivot = palloc0(newsize);
memcpy(tidpivot, pivot, IndexTupleSize(pivot));
@@ -2150,6 +2232,24 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
pfree(pivot);
pivot = tidpivot;
}
+ else if (BTreeTupleIsPosting(firstright))
+ {
+ /*
+ * No truncation was possible, since key attributes are all equal. We
+ * can always truncate away a posting list, though.
+ *
+ * It's necessary to add a heap TID attribute to the new pivot tuple.
+ */
+ newsize = MAXALIGN(BTreeTupleGetPostingOffset(firstright)) +
+ MAXALIGN(sizeof(ItemPointerData));
+ pivot = palloc0(newsize);
+ memcpy(pivot, firstright, BTreeTupleGetPostingOffset(firstright));
+
+ pivot->t_info &= ~INDEX_SIZE_MASK;
+ pivot->t_info |= newsize;
+ BTreeTupleClearBtIsPosting(pivot);
+ BTreeTupleSetAltHeapTID(pivot);
+ }
else
{
/*
@@ -2157,7 +2257,8 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
* It's necessary to add a heap TID attribute to the new pivot tuple.
*/
Assert(natts == nkeyatts);
- newsize = IndexTupleSize(firstright) + MAXALIGN(sizeof(ItemPointerData));
+ newsize = MAXALIGN(IndexTupleSize(firstright)) +
+ MAXALIGN(sizeof(ItemPointerData));
pivot = palloc0(newsize);
memcpy(pivot, firstright, IndexTupleSize(firstright));
}
@@ -2175,6 +2276,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
* nbtree (e.g., there is no pg_attribute entry).
*/
Assert(itup_key->heapkeyspace);
+ Assert(!BTreeTupleIsPosting(pivot));
pivot->t_info &= ~INDEX_SIZE_MASK;
pivot->t_info |= newsize;
@@ -2187,7 +2289,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
*/
pivotheaptid = (ItemPointer) ((char *) pivot + newsize -
sizeof(ItemPointerData));
- ItemPointerCopy(&lastleft->t_tid, pivotheaptid);
+ ItemPointerCopy(BTreeTupleGetMaxHeapTID(lastleft), pivotheaptid);
/*
* Lehman and Yao require that the downlink to the right page, which is to
@@ -2198,9 +2300,12 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
* tiebreaker.
*/
#ifndef DEBUG_NO_TRUNCATE
- Assert(ItemPointerCompare(&lastleft->t_tid, &firstright->t_tid) < 0);
- Assert(ItemPointerCompare(pivotheaptid, &lastleft->t_tid) >= 0);
- Assert(ItemPointerCompare(pivotheaptid, &firstright->t_tid) < 0);
+ Assert(ItemPointerCompare(BTreeTupleGetMaxHeapTID(lastleft),
+ BTreeTupleGetHeapTID(firstright)) < 0);
+ Assert(ItemPointerCompare(pivotheaptid,
+ BTreeTupleGetHeapTID(lastleft)) >= 0);
+ Assert(ItemPointerCompare(pivotheaptid,
+ BTreeTupleGetHeapTID(firstright)) < 0);
#else
/*
@@ -2213,7 +2318,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
* attribute values along with lastleft's heap TID value when lastleft's
* TID happens to be greater than firstright's TID.
*/
- ItemPointerCopy(&firstright->t_tid, pivotheaptid);
+ ItemPointerCopy(BTreeTupleGetHeapTID(firstright), pivotheaptid);
/*
* Pivot heap TID should never be fully equal to firstright. Note that
@@ -2222,7 +2327,8 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
*/
ItemPointerSetOffsetNumber(pivotheaptid,
OffsetNumberPrev(ItemPointerGetOffsetNumber(pivotheaptid)));
- Assert(ItemPointerCompare(pivotheaptid, &firstright->t_tid) < 0);
+ Assert(ItemPointerCompare(pivotheaptid,
+ BTreeTupleGetHeapTID(firstright)) < 0);
#endif
BTreeTupleSetNAtts(pivot, nkeyatts);
@@ -2303,15 +2409,22 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
* The approach taken here usually provides the same answer as _bt_keep_natts
* will (for the same pair of tuples from a heapkeyspace index), since the
* majority of btree opclasses can never indicate that two datums are equal
- * unless they're bitwise equal (once detoasted). Similarly, result may
- * differ from the _bt_keep_natts result when either tuple has TOASTed datums,
- * though this is barely possible in practice.
+ * unless they're bitwise equal after detoasting.
*
* These issues must be acceptable to callers, typically because they're only
* concerned about making suffix truncation as effective as possible without
* leaving excessive amounts of free space on either side of page split.
* Callers can rely on the fact that attributes considered equal here are
* definitely also equal according to _bt_keep_natts.
+ *
+ * When an index only uses opclasses where _bt_opclasses_support_dedup()
+ * reports that deduplication is safe, this function is guaranteed to give the
+ * same result as _bt_keep_natts().
+ *
+ * FIXME: Actually invent the needed "equality-is-precise" opclass
+ * infrastructure. See dedicated -hackers thread:
+ *
+ * https://postgr.es/m/CAH2-Wzn3Ee49Gmxb7V1VJ3-AC8fWn-Fr8pfWQebHe8rYRxt5OQ@mail.gmail.com
*/
int
_bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
@@ -2337,7 +2450,7 @@ _bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
break;
if (!isNull1 &&
- !datumIsEqual(datum1, datum2, att->attbyval, att->attlen))
+ !datum_image_eq(datum1, datum2, att->attbyval, att->attlen))
break;
keepnatts++;
@@ -2389,22 +2502,30 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
tupnatts = BTreeTupleGetNAtts(itup, rel);
+ /* !heapkeyspace indexes do not support deduplication */
+ if (!heapkeyspace && BTreeTupleIsPosting(itup))
+ return false;
+
+ /* INCLUDE indexes do not support deduplication */
+ if (natts != nkeyatts && BTreeTupleIsPosting(itup))
+ return false;
+
if (P_ISLEAF(opaque))
{
if (offnum >= P_FIRSTDATAKEY(opaque))
{
/*
- * Non-pivot tuples currently never use alternative heap TID
- * representation -- even those within heapkeyspace indexes
+ * Non-pivot tuple should never be explicitly marked as a pivot
+ * tuple
*/
- if ((itup->t_info & INDEX_ALT_TID_MASK) != 0)
+ if (BTreeTupleIsPivot(itup))
return false;
/*
* Leaf tuples that are not the page high key (non-pivot tuples)
* should never be truncated. (Note that tupnatts must have been
- * inferred, rather than coming from an explicit on-disk
- * representation.)
+ * inferred, even with a posting list tuple, because only pivot
+ * tuples store tupnatts directly.)
*/
return tupnatts == natts;
}
@@ -2448,12 +2569,12 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
* non-zero, or when there is no explicit representation and the
* tuple is evidently not a pre-pg_upgrade tuple.
*
- * Prior to v11, downlinks always had P_HIKEY as their offset. Use
- * that to decide if the tuple is a pre-v11 tuple.
+ * Prior to v11, downlinks always had P_HIKEY as their offset.
+ * Accept that as an alternative indication of a valid
+ * !heapkeyspace negative infinity tuple.
*/
return tupnatts == 0 ||
- ((itup->t_info & INDEX_ALT_TID_MASK) == 0 &&
- ItemPointerGetOffsetNumber(&(itup->t_tid)) == P_HIKEY);
+ ItemPointerGetOffsetNumber(&(itup->t_tid)) == P_HIKEY;
}
else
{
@@ -2479,7 +2600,11 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
* heapkeyspace index pivot tuples, regardless of whether or not there are
* non-key attributes.
*/
- if ((itup->t_info & INDEX_ALT_TID_MASK) == 0)
+ if (!BTreeTupleIsPivot(itup))
+ return false;
+
+ /* Pivot tuple should not use posting list representation (redundant) */
+ if (BTreeTupleIsPosting(itup))
return false;
/*
@@ -2549,11 +2674,44 @@ _bt_check_third_page(Relation rel, Relation heap, bool needheaptidspace,
BTMaxItemSizeNoHeapTid(page),
RelationGetRelationName(rel)),
errdetail("Index row references tuple (%u,%u) in relation \"%s\".",
- ItemPointerGetBlockNumber(&newtup->t_tid),
- ItemPointerGetOffsetNumber(&newtup->t_tid),
+ ItemPointerGetBlockNumber(BTreeTupleGetHeapTID(newtup)),
+ ItemPointerGetOffsetNumber(BTreeTupleGetHeapTID(newtup)),
RelationGetRelationName(heap)),
errhint("Values larger than 1/3 of a buffer page cannot be indexed.\n"
"Consider a function index of an MD5 hash of the value, "
"or use full text indexing."),
errtableconstraint(heap, RelationGetRelationName(rel))));
}
+
+/*
+ * Is it safe to perform deduplication for an index, given the opclasses and
+ * collations used?
+ *
+ * Returned value is stored in index metapage during index builds.
+ *
+ * Note: This does not account for pg_upgrade'd !heapkeyspace indexes
+ */
+bool
+_bt_opclasses_support_dedup(Relation index)
+{
+ /* INCLUDE indexes don't support deduplication */
+ if (IndexRelationGetNumberOfAttributes(index) !=
+ IndexRelationGetNumberOfKeyAttributes(index))
+ return false;
+
+ for (int i = 0; i < IndexRelationGetNumberOfKeyAttributes(index); i++)
+ {
+ Oid opfamily = index->rd_opfamily[i];
+ Oid collation = index->rd_indcollation[i];
+
+ /* TODO add adequate check of opclasses and collations */
+ elog(DEBUG4, "index %s column i %d opfamilyOid %u collationOid %u",
+ RelationGetRelationName(index), i, opfamily, collation);
+
+ /* NUMERIC btree opfamily OID is 1988 */
+ if (opfamily == 1988)
+ return false;
+ }
+
+ return true;
+}
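The _bt_killitems() hunk in the nbtutils.c diff above only sets LP_DEAD on a posting list tuple when every heap TID in its posting list is matched by a killed item. The sketch below shows that all-or-nothing rule in isolation. It is a simplified variant, not the patch's code: SketchTid stands in for ItemPointerData, the killed TIDs are assumed to be in the same order as the posting list, and posting_all_killed() is a hypothetical helper that bounds-checks the killed array instead of reading ahead as the patch does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-in for ItemPointerData (assumption for illustration) */
typedef struct SketchTid
{
	uint32_t	block;
	uint16_t	offset;
} SketchTid;

static bool
sketch_tid_equal(const SketchTid *a, const SketchTid *b)
{
	return a->block == b->block && a->offset == b->offset;
}

/*
 * Return true only when every TID in posting[0..nposting) matches the run
 * of killed TIDs that starts at killed[start].  A single mismatch, or
 * running out of killed TIDs, means the tuple must not be marked dead.
 */
static bool
posting_all_killed(const SketchTid *posting, int nposting,
				   const SketchTid *killed, int nkilled, int start)
{
	for (int j = 0; j < nposting; j++)
	{
		if (start + j >= nkilled)
			return false;
		if (!sketch_tid_equal(&posting[j], &killed[start + j]))
			return false;
	}

	return true;
}
```

A posting list with TIDs (1,1), (1,2), (1,3) qualifies only if all three appear among the killed items; if (1,2) is still live, the whole tuple stays unmarked.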
diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index dd5315c1aa..27694246e2 100644
--- a/src/backend/access/nbtree/nbtxlog.c
+++ b/src/backend/access/nbtree/nbtxlog.c
@@ -21,8 +21,11 @@
#include "access/xlog.h"
#include "access/xlogutils.h"
#include "storage/procarray.h"
+#include "utils/memutils.h"
#include "miscadmin.h"
+static MemoryContext opCtx; /* working memory for operations */
+
/*
* _bt_restore_page -- re-enter all the index tuples on a page
*
@@ -111,6 +114,7 @@ _bt_restore_meta(XLogReaderState *record, uint8 block_id)
Assert(md->btm_version >= BTREE_NOVAC_VERSION);
md->btm_oldest_btpo_xact = xlrec->oldest_btpo_xact;
md->btm_last_cleanup_num_heap_tuples = xlrec->last_cleanup_num_heap_tuples;
+ md->btm_safededup = xlrec->btm_safededup;
pageop = (BTPageOpaque) PageGetSpecialPointer(metapg);
pageop->btpo_flags = BTP_META;
@@ -181,9 +185,45 @@ btree_xlog_insert(bool isleaf, bool ismeta, XLogReaderState *record)
page = BufferGetPage(buffer);
- if (PageAddItem(page, (Item) datapos, datalen, xlrec->offnum,
- false, false) == InvalidOffsetNumber)
- elog(PANIC, "btree_xlog_insert: failed to add item");
+ if (xlrec->postingoff == InvalidOffsetNumber)
+ {
+ /* Simple retail insertion */
+ if (PageAddItem(page, (Item) datapos, datalen, xlrec->offnum,
+ false, false) == InvalidOffsetNumber)
+ elog(PANIC, "btree_xlog_insert: failed to add item");
+ }
+ else
+ {
+ ItemId itemid;
+ IndexTuple oposting,
+ newitem,
+ nposting;
+
+ /*
+ * A posting list split occurred during insertion.
+ *
+ * Use _bt_swap_posting() to repeat posting list split steps from
+ * primary. Note that newitem from WAL record is 'orignewitem',
+ * not the final version of newitem that is actually inserted on
+ * page.
+ */
+ Assert(isleaf);
+ itemid = PageGetItemId(page, OffsetNumberPrev(xlrec->offnum));
+ oposting = (IndexTuple) PageGetItem(page, itemid);
+
+ /* newitem must be mutable copy for _bt_swap_posting() */
+ newitem = CopyIndexTuple((IndexTuple) datapos);
+ nposting = _bt_swap_posting(newitem, oposting, xlrec->postingoff);
+
+ /* Replace existing posting list with post-split version */
+ memcpy(oposting, nposting, MAXALIGN(IndexTupleSize(nposting)));
+
+ /* insert new item */
+ Assert(IndexTupleSize(newitem) == datalen);
+ if (PageAddItem(page, (Item) newitem, datalen, xlrec->offnum,
+ false, false) == InvalidOffsetNumber)
+ elog(PANIC, "btree_xlog_insert: failed to add posting split new item");
+ }
PageSetLSN(page, lsn);
MarkBufferDirty(buffer);
@@ -265,20 +305,38 @@ btree_xlog_split(bool onleft, XLogReaderState *record)
BTPageOpaque lopaque = (BTPageOpaque) PageGetSpecialPointer(lpage);
OffsetNumber off;
IndexTuple newitem = NULL,
- left_hikey = NULL;
+ left_hikey = NULL,
+ nposting = NULL;
Size newitemsz = 0,
left_hikeysz = 0;
Page newlpage;
- OffsetNumber leftoff;
+ OffsetNumber leftoff,
+ replacepostingoff = InvalidOffsetNumber;
datapos = XLogRecGetBlockData(record, 0, &datalen);
- if (onleft)
+ if (onleft || xlrec->postingoff != 0)
{
newitem = (IndexTuple) datapos;
newitemsz = MAXALIGN(IndexTupleSize(newitem));
datapos += newitemsz;
datalen -= newitemsz;
+
+ if (xlrec->postingoff != 0)
+ {
+ ItemId itemid;
+ IndexTuple oposting;
+
+ /* Posting list must be at offset number before new item's */
+ replacepostingoff = OffsetNumberPrev(xlrec->newitemoff);
+
+ /* newitem must be mutable copy for _bt_swap_posting() */
+ newitem = CopyIndexTuple(newitem);
+ itemid = PageGetItemId(lpage, replacepostingoff);
+ oposting = (IndexTuple) PageGetItem(lpage, itemid);
+ nposting = _bt_swap_posting(newitem, oposting,
+ xlrec->postingoff);
+ }
}
/* Extract left hikey and its size (assuming 16-bit alignment) */
@@ -304,8 +362,20 @@ btree_xlog_split(bool onleft, XLogReaderState *record)
Size itemsz;
IndexTuple item;
+ /* Add replacement posting list when required */
+ if (off == replacepostingoff)
+ {
+ Assert(onleft || xlrec->firstright == xlrec->newitemoff);
+ if (PageAddItem(newlpage, (Item) nposting,
+ MAXALIGN(IndexTupleSize(nposting)), leftoff,
+ false, false) == InvalidOffsetNumber)
+ elog(ERROR, "failed to add new posting list item to left page after split");
+ leftoff = OffsetNumberNext(leftoff);
+ continue;
+ }
+
/* add the new item if it was inserted on left page */
- if (onleft && off == xlrec->newitemoff)
+ else if (onleft && off == xlrec->newitemoff)
{
if (PageAddItem(newlpage, (Item) newitem, newitemsz, leftoff,
false, false) == InvalidOffsetNumber)
@@ -379,6 +449,84 @@ btree_xlog_split(bool onleft, XLogReaderState *record)
}
}
+static void
+btree_xlog_dedup(XLogReaderState *record)
+{
+ XLogRecPtr lsn = record->EndRecPtr;
+ Buffer buf;
+ xl_btree_dedup *xlrec = (xl_btree_dedup *) XLogRecGetData(record);
+
+ if (XLogReadBufferForRedo(record, 0, &buf) == BLK_NEEDS_REDO)
+ {
+ /*
+ * Set up deduplication state, then merge the tuples in the logged
+ * interval into a single posting list.
+ */
+ Page page = (Page) BufferGetPage(buf);
+ OffsetNumber offnum;
+ BTDedupState *state;
+
+ state = (BTDedupState *) palloc(sizeof(BTDedupState));
+
+ state->maxitemsize = BTMaxItemSize(page);
+ state->checkingunique = false; /* unused */
+ state->skippedbase = InvalidOffsetNumber;
+ state->newitem = NULL;
+ /* Metadata about current pending posting list */
+ state->htids = NULL;
+ state->nhtids = 0;
+ state->nitems = 0;
+ state->alltupsize = 0;
+ state->overlap = false;
+ /* Metadata about base tuple of current pending posting list */
+ state->base = NULL;
+ state->baseoff = InvalidOffsetNumber;
+ state->basetupsize = 0;
+
+ /* Conservatively size array */
+ state->htids = palloc(state->maxitemsize);
+
+ /*
+ * Iterate over tuples on the page belonging to the interval to
+ * deduplicate them into a posting list.
+ */
+ for (offnum = xlrec->baseoff;
+ offnum < xlrec->baseoff + xlrec->nitems;
+ offnum = OffsetNumberNext(offnum))
+ {
+ ItemId itemid = PageGetItemId(page, offnum);
+ IndexTuple itup = (IndexTuple) PageGetItem(page, itemid);
+
+ Assert(!ItemIdIsDead(itemid));
+
+ if (offnum == xlrec->baseoff)
+ {
+ /*
+ * No previous/base tuple for first data item -- use first
+ * data item as base tuple of first pending posting list
+ */
+ _bt_dedup_start_pending(state, itup, offnum);
+ }
+ else
+ {
+ /* Heap TID(s) for itup will be saved in state */
+ if (!_bt_dedup_save_htid(state, itup))
+ elog(ERROR, "could not add heap tid to pending posting list");
+ }
+ }
+
+ Assert(state->nitems == xlrec->nitems);
+ /* Handle the last item */
+ _bt_dedup_finish_pending(buf, state, false);
+
+ PageSetLSN(page, lsn);
+ MarkBufferDirty(buf);
+ }
+
+ if (BufferIsValid(buf))
+ UnlockReleaseBuffer(buf);
+}
+
static void
btree_xlog_vacuum(XLogReaderState *record)
{
@@ -386,8 +534,8 @@ btree_xlog_vacuum(XLogReaderState *record)
Buffer buffer;
Page page;
BTPageOpaque opaque;
-#ifdef UNUSED
xl_btree_vacuum *xlrec = (xl_btree_vacuum *) XLogRecGetData(record);
+#ifdef UNUSED
/*
* This section of code is thought to be no longer needed, after analysis
@@ -478,14 +626,34 @@ btree_xlog_vacuum(XLogReaderState *record)
if (len > 0)
{
- OffsetNumber *unused;
- OffsetNumber *unend;
+ if (xlrec->nupdated > 0)
+ {
+ OffsetNumber *updatedoffsets;
+ IndexTuple updated;
+ Size itemsz;
- unused = (OffsetNumber *) ptr;
- unend = (OffsetNumber *) ((char *) ptr + len);
+ updatedoffsets = (OffsetNumber *)
+ (ptr + xlrec->ndeleted * sizeof(OffsetNumber));
+ updated = (IndexTuple) ((char *) updatedoffsets +
+ xlrec->nupdated * sizeof(OffsetNumber));
- if ((unend - unused) > 0)
- PageIndexMultiDelete(page, unused, unend - unused);
+ /* Handle posting tuples */
+ for (int i = 0; i < xlrec->nupdated; i++)
+ {
+ PageIndexTupleDelete(page, updatedoffsets[i]);
+
+ itemsz = MAXALIGN(IndexTupleSize(updated));
+
+ if (PageAddItem(page, (Item) updated, itemsz, updatedoffsets[i],
+ false, false) == InvalidOffsetNumber)
+ elog(PANIC, "btree_xlog_vacuum: failed to add updated posting list item");
+
+ updated = (IndexTuple) ((char *) updated + itemsz);
+ }
+ }
+
+ if (xlrec->ndeleted)
+ PageIndexMultiDelete(page, (OffsetNumber *) ptr, xlrec->ndeleted);
}
/*
@@ -820,7 +988,9 @@ void
btree_redo(XLogReaderState *record)
{
uint8 info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+ MemoryContext oldCtx;
+ oldCtx = MemoryContextSwitchTo(opCtx);
switch (info)
{
case XLOG_BTREE_INSERT_LEAF:
@@ -838,6 +1008,9 @@ btree_redo(XLogReaderState *record)
case XLOG_BTREE_SPLIT_R:
btree_xlog_split(false, record);
break;
+ case XLOG_BTREE_DEDUP_PAGE:
+ btree_xlog_dedup(record);
+ break;
case XLOG_BTREE_VACUUM:
btree_xlog_vacuum(record);
break;
@@ -863,6 +1036,23 @@ btree_redo(XLogReaderState *record)
default:
elog(PANIC, "btree_redo: unknown op code %u", info);
}
+ MemoryContextSwitchTo(oldCtx);
+ MemoryContextReset(opCtx);
+}
+
+void
+btree_xlog_startup(void)
+{
+ opCtx = AllocSetContextCreate(CurrentMemoryContext,
+ "Btree recovery temporary context",
+ ALLOCSET_DEFAULT_SIZES);
+}
+
+void
+btree_xlog_cleanup(void)
+{
+ MemoryContextDelete(opCtx);
+ opCtx = NULL;
}
/*
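The btree_xlog_vacuum() hunk in the nbtxlog.c diff above reads the block data as three consecutive regions: ndeleted offset numbers, then nupdated offset numbers, then the updated posting list tuples themselves. The sketch below only restates that pointer arithmetic as byte offsets. SketchOffsetNumber is a stand-in for OffsetNumber (assumed to be 16 bits wide), and both helper functions are hypothetical names, not part of the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for OffsetNumber, assumed to be a 16-bit integer */
typedef uint16_t SketchOffsetNumber;

/* Byte offset of the updated-offsets array within the block data */
static size_t
updated_offsets_start(uint32_t ndeleted)
{
	return (size_t) ndeleted * sizeof(SketchOffsetNumber);
}

/* Byte offset of the first updated tuple within the block data */
static size_t
updated_tuples_start(uint32_t ndeleted, uint32_t nupdated)
{
	return updated_offsets_start(ndeleted) +
		(size_t) nupdated * sizeof(SketchOffsetNumber);
}
```

Each updated tuple after the first is then found by advancing by the MAXALIGNed size of the previous one, as the redo loop in the hunk does.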
diff --git a/src/backend/access/rmgrdesc/nbtdesc.c b/src/backend/access/rmgrdesc/nbtdesc.c
index 4ee6d04a68..1dde2da285 100644
--- a/src/backend/access/rmgrdesc/nbtdesc.c
+++ b/src/backend/access/rmgrdesc/nbtdesc.c
@@ -30,7 +30,8 @@ btree_desc(StringInfo buf, XLogReaderState *record)
{
xl_btree_insert *xlrec = (xl_btree_insert *) rec;
- appendStringInfo(buf, "off %u", xlrec->offnum);
+ appendStringInfo(buf, "off %u; postingoff %u",
+ xlrec->offnum, xlrec->postingoff);
break;
}
case XLOG_BTREE_SPLIT_L:
@@ -38,16 +39,30 @@ btree_desc(StringInfo buf, XLogReaderState *record)
{
xl_btree_split *xlrec = (xl_btree_split *) rec;
- appendStringInfo(buf, "level %u, firstright %d, newitemoff %d",
- xlrec->level, xlrec->firstright, xlrec->newitemoff);
+ appendStringInfo(buf, "level %u, firstright %d, newitemoff %d, postingoff %d",
+ xlrec->level,
+ xlrec->firstright,
+ xlrec->newitemoff,
+ xlrec->postingoff);
+ break;
+ }
+ case XLOG_BTREE_DEDUP_PAGE:
+ {
+ xl_btree_dedup *xlrec = (xl_btree_dedup *) rec;
+
+ appendStringInfo(buf, "baseoff %u; nitems %u",
+ xlrec->baseoff,
+ xlrec->nitems);
break;
}
case XLOG_BTREE_VACUUM:
{
xl_btree_vacuum *xlrec = (xl_btree_vacuum *) rec;
- appendStringInfo(buf, "lastBlockVacuumed %u",
- xlrec->lastBlockVacuumed);
+ appendStringInfo(buf, "lastBlockVacuumed %u; nupdated %u; ndeleted %u",
+ xlrec->lastBlockVacuumed,
+ xlrec->nupdated,
+ xlrec->ndeleted);
break;
}
case XLOG_BTREE_DELETE:
@@ -131,6 +146,9 @@ btree_identify(uint8 info)
case XLOG_BTREE_SPLIT_R:
id = "SPLIT_R";
break;
+ case XLOG_BTREE_DEDUP_PAGE:
+ id = "DEDUPLICATE";
+ break;
case XLOG_BTREE_VACUUM:
id = "VACUUM";
break;
diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c
index 2b1e3cda4a..bf4a27ab75 100644
--- a/src/bin/psql/tab-complete.c
+++ b/src/bin/psql/tab-complete.c
@@ -1677,14 +1677,14 @@ psql_completion(const char *text, int start, int end)
/* ALTER INDEX <foo> SET|RESET ( */
else if (Matches("ALTER", "INDEX", MatchAny, "RESET", "("))
COMPLETE_WITH("fillfactor",
- "vacuum_cleanup_index_scale_factor", /* BTREE */
+ "vacuum_cleanup_index_scale_factor", "deduplication", /* BTREE */
"fastupdate", "gin_pending_list_limit", /* GIN */
"buffering", /* GiST */
"pages_per_range", "autosummarize" /* BRIN */
);
else if (Matches("ALTER", "INDEX", MatchAny, "SET", "("))
COMPLETE_WITH("fillfactor =",
- "vacuum_cleanup_index_scale_factor =", /* BTREE */
+ "vacuum_cleanup_index_scale_factor =", "deduplication =", /* BTREE */
"fastupdate =", "gin_pending_list_limit =", /* GIN */
"buffering =", /* GiST */
"pages_per_range =", "autosummarize =" /* BRIN */
diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index 05e7d678ed..ebbbae137a 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -145,6 +145,7 @@ static void bt_tuple_present_callback(Relation index, HeapTuple htup,
bool tupleIsAlive, void *checkstate);
static IndexTuple bt_normalize_tuple(BtreeCheckState *state,
IndexTuple itup);
+static inline IndexTuple bt_posting_logical_tuple(IndexTuple itup, int n);
static bool bt_rootdescend(BtreeCheckState *state, IndexTuple itup);
static inline bool offset_is_negative_infinity(BTPageOpaque opaque,
OffsetNumber offset);
@@ -419,12 +420,13 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
/*
* Size Bloom filter based on estimated number of tuples in index,
* while conservatively assuming that each block must contain at least
- * MaxIndexTuplesPerPage / 5 non-pivot tuples. (Non-leaf pages cannot
- * contain non-pivot tuples. That's okay because they generally make
- * up no more than about 1% of all pages in the index.)
+ * MaxBTreeIndexTuplesPerPage / 3 "logical" tuples. heapallindexed
+ * verification fingerprints posting list heap TIDs as plain non-pivot
+ * tuples, complete with index keys. This allows its heap scan to
+ * behave as if posting lists do not exist.
*/
total_pages = RelationGetNumberOfBlocks(rel);
- total_elems = Max(total_pages * (MaxIndexTuplesPerPage / 5),
+ total_elems = Max(total_pages * (MaxBTreeIndexTuplesPerPage / 3),
(int64) state->rel->rd_rel->reltuples);
/* Random seed relies on backend srandom() call to avoid repetition */
seed = random();
@@ -924,6 +926,7 @@ bt_target_page_check(BtreeCheckState *state)
size_t tupsize;
BTScanInsert skey;
bool lowersizelimit;
+ ItemPointer scantid;
CHECK_FOR_INTERRUPTS();
@@ -994,29 +997,73 @@ bt_target_page_check(BtreeCheckState *state)
/*
* Readonly callers may optionally verify that non-pivot tuples can
- * each be found by an independent search that starts from the root
+ * each be found by an independent search that starts from the root.
+ * Note that we deliberately don't do individual searches for each
+ * "logical" posting list tuple, since the posting list itself is
+ * validated by other checks.
*/
if (state->rootdescend && P_ISLEAF(topaque) &&
!bt_rootdescend(state, itup))
{
char *itid,
*htid;
+ ItemPointer tid = BTreeTupleGetHeapTID(itup);
itid = psprintf("(%u,%u)", state->targetblock, offset);
htid = psprintf("(%u,%u)",
- ItemPointerGetBlockNumber(&(itup->t_tid)),
- ItemPointerGetOffsetNumber(&(itup->t_tid)));
+ ItemPointerGetBlockNumber(tid),
+ ItemPointerGetOffsetNumber(tid));
ereport(ERROR,
(errcode(ERRCODE_INDEX_CORRUPTED),
errmsg("could not find tuple using search from root page in index \"%s\"",
RelationGetRelationName(state->rel)),
- errdetail_internal("Index tid=%s points to heap tid=%s page lsn=%X/%X.",
+ errdetail_internal("Index tid=%s min heap tid=%s page lsn=%X/%X.",
itid, htid,
(uint32) (state->targetlsn >> 32),
(uint32) state->targetlsn)));
}
+ /*
+ * If tuple is actually a posting list, make sure posting list TIDs
+ * are in order.
+ */
+ if (BTreeTupleIsPosting(itup))
+ {
+ ItemPointerData last;
+ ItemPointer current;
+
+ ItemPointerCopy(BTreeTupleGetHeapTID(itup), &last);
+
+ for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
+ {
+
+ current = BTreeTupleGetPostingN(itup, i);
+
+ if (ItemPointerCompare(current, &last) <= 0)
+ {
+ char *itid,
+ *htid;
+
+ itid = psprintf("(%u,%u)", state->targetblock, offset);
+ htid = psprintf("(%u,%u)",
+ ItemPointerGetBlockNumberNoCheck(current),
+ ItemPointerGetOffsetNumberNoCheck(current));
+
+ ereport(ERROR,
+ (errcode(ERRCODE_INDEX_CORRUPTED),
+ errmsg("posting list heap TIDs out of order in index \"%s\"",
+ RelationGetRelationName(state->rel)),
+ errdetail_internal("Index tid=%s min heap tid=%s page lsn=%X/%X.",
+ itid, htid,
+ (uint32) (state->targetlsn >> 32),
+ (uint32) state->targetlsn)));
+ }
+
+ ItemPointerCopy(current, &last);
+ }
+ }
+
/* Build insertion scankey for current page offset */
skey = bt_mkscankey_pivotsearch(state->rel, itup);
@@ -1074,12 +1121,32 @@ bt_target_page_check(BtreeCheckState *state)
{
IndexTuple norm;
- norm = bt_normalize_tuple(state, itup);
- bloom_add_element(state->filter, (unsigned char *) norm,
- IndexTupleSize(norm));
- /* Be tidy */
- if (norm != itup)
- pfree(norm);
+ if (BTreeTupleIsPosting(itup))
+ {
+ /* Fingerprint all elements as distinct "logical" tuples */
+ for (int i = 0; i < BTreeTupleGetNPosting(itup); i++)
+ {
+ IndexTuple logtuple;
+
+ logtuple = bt_posting_logical_tuple(itup, i);
+ norm = bt_normalize_tuple(state, logtuple);
+ bloom_add_element(state->filter, (unsigned char *) norm,
+ IndexTupleSize(norm));
+ /* Be tidy */
+ if (norm != logtuple)
+ pfree(norm);
+ pfree(logtuple);
+ }
+ }
+ else
+ {
+ norm = bt_normalize_tuple(state, itup);
+ bloom_add_element(state->filter, (unsigned char *) norm,
+ IndexTupleSize(norm));
+ /* Be tidy */
+ if (norm != itup)
+ pfree(norm);
+ }
}
/*
@@ -1087,7 +1154,8 @@ bt_target_page_check(BtreeCheckState *state)
*
* If there is a high key (if this is not the rightmost page on its
* entire level), check that high key actually is upper bound on all
- * page items.
+ * page items. If this is a posting list tuple, we'll need to set
+ * scantid to be highest TID in posting list.
*
* We prefer to check all items against high key rather than checking
* just the last and trusting that the operator class obeys the
@@ -1127,6 +1195,9 @@ bt_target_page_check(BtreeCheckState *state)
* tuple. (See also: "Notes About Data Representation" in the nbtree
* README.)
*/
+ scantid = skey->scantid;
+ if (state->heapkeyspace && !BTreeTupleIsPivot(itup))
+ skey->scantid = BTreeTupleGetMaxHeapTID(itup);
if (!P_RIGHTMOST(topaque) &&
!(P_ISLEAF(topaque) ? invariant_leq_offset(state, skey, P_HIKEY) :
invariant_l_offset(state, skey, P_HIKEY)))
@@ -1150,6 +1221,7 @@ bt_target_page_check(BtreeCheckState *state)
(uint32) (state->targetlsn >> 32),
(uint32) state->targetlsn)));
}
+ skey->scantid = scantid;
/*
* * Item order check *
@@ -1164,11 +1236,13 @@ bt_target_page_check(BtreeCheckState *state)
*htid,
*nitid,
*nhtid;
+ ItemPointer tid;
itid = psprintf("(%u,%u)", state->targetblock, offset);
+ tid = BTreeTupleGetHeapTID(itup);
htid = psprintf("(%u,%u)",
- ItemPointerGetBlockNumberNoCheck(&(itup->t_tid)),
- ItemPointerGetOffsetNumberNoCheck(&(itup->t_tid)));
+ ItemPointerGetBlockNumberNoCheck(tid),
+ ItemPointerGetOffsetNumberNoCheck(tid));
nitid = psprintf("(%u,%u)", state->targetblock,
OffsetNumberNext(offset));
@@ -1177,9 +1251,11 @@ bt_target_page_check(BtreeCheckState *state)
state->target,
OffsetNumberNext(offset));
itup = (IndexTuple) PageGetItem(state->target, itemid);
+
+ tid = BTreeTupleGetHeapTID(itup);
nhtid = psprintf("(%u,%u)",
- ItemPointerGetBlockNumberNoCheck(&(itup->t_tid)),
- ItemPointerGetOffsetNumberNoCheck(&(itup->t_tid)));
+ ItemPointerGetBlockNumberNoCheck(tid),
+ ItemPointerGetOffsetNumberNoCheck(tid));
ereport(ERROR,
(errcode(ERRCODE_INDEX_CORRUPTED),
@@ -1189,10 +1265,10 @@ bt_target_page_check(BtreeCheckState *state)
"higher index tid=%s (points to %s tid=%s) "
"page lsn=%X/%X.",
itid,
- P_ISLEAF(topaque) ? "heap" : "index",
+ P_ISLEAF(topaque) ? "min heap" : "index",
htid,
nitid,
- P_ISLEAF(topaque) ? "heap" : "index",
+ P_ISLEAF(topaque) ? "min heap" : "index",
nhtid,
(uint32) (state->targetlsn >> 32),
(uint32) state->targetlsn)));
@@ -1953,10 +2029,10 @@ bt_tuple_present_callback(Relation index, HeapTuple htup, Datum *values,
* verification. In particular, it won't try to normalize opclass-equal
* datums with potentially distinct representations (e.g., btree/numeric_ops
* index datums will not get their display scale normalized-away here).
- * Normalization may need to be expanded to handle more cases in the future,
- * though. For example, it's possible that non-pivot tuples could in the
- * future have alternative logically equivalent representations due to using
- * the INDEX_ALT_TID_MASK bit to implement intelligent deduplication.
+ * Caller does normalization for non-pivot tuples that have a posting list,
+ * since dummy CREATE INDEX callback code generates new tuples with the same
+ * normalized representation. Deduplication is performed opportunistically,
+ * and in general there is no guarantee about how or when it will be applied.
*/
static IndexTuple
bt_normalize_tuple(BtreeCheckState *state, IndexTuple itup)
@@ -1969,6 +2045,9 @@ bt_normalize_tuple(BtreeCheckState *state, IndexTuple itup)
IndexTuple reformed;
int i;
+ /* Caller should only pass "logical" non-pivot tuples here */
+ Assert(!BTreeTupleIsPosting(itup) && !BTreeTupleIsPivot(itup));
+
/* Easy case: It's immediately clear that tuple has no varlena datums */
if (!IndexTupleHasVarwidths(itup))
return itup;
@@ -2031,6 +2110,30 @@ bt_normalize_tuple(BtreeCheckState *state, IndexTuple itup)
return reformed;
}
+/*
+ * Produce palloc()'d "logical" tuple for nth posting list entry.
+ *
+ * In general, deduplication is not supposed to change the logical contents of
+ * an index. Multiple logical index tuples are folded together into one
+ * physical posting list index tuple when convenient.
+ *
+ * heapallindexed verification must normalize-away this variation in
+ * representation by converting posting list tuples into two or more "logical"
+ * tuples. Each logical tuple must be fingerprinted separately -- there must
+ * be one logical tuple for each corresponding Bloom filter probe during the
+ * heap scan.
+ *
+ * Note: Caller needs to call bt_normalize_tuple() with returned tuple.
+ */
+static inline IndexTuple
+bt_posting_logical_tuple(IndexTuple itup, int n)
+{
+ Assert(BTreeTupleIsPosting(itup));
+
+ /* Returns non-posting-list tuple */
+ return _bt_form_posting(itup, BTreeTupleGetPostingN(itup, n), 1);
+}
+
/*
* Search for itup in index, starting from fast root page. itup must be a
* non-pivot tuple. This is only supported with heapkeyspace indexes, since
@@ -2087,6 +2190,7 @@ bt_rootdescend(BtreeCheckState *state, IndexTuple itup)
insertstate.itup = itup;
insertstate.itemsz = MAXALIGN(IndexTupleSize(itup));
insertstate.itup_key = key;
+ insertstate.postingoff = 0;
insertstate.bounds_valid = false;
insertstate.buf = lbuf;
@@ -2094,7 +2198,9 @@ bt_rootdescend(BtreeCheckState *state, IndexTuple itup)
offnum = _bt_binsrch_insert(state->rel, &insertstate);
/* Compare first >= matching item on leaf page, if any */
page = BufferGetPage(lbuf);
+ /* Should match on first heap TID when tuple has a posting list */
if (offnum <= PageGetMaxOffsetNumber(page) &&
+ insertstate.postingoff <= 0 &&
_bt_compare(state->rel, key, page, offnum) == 0)
exists = true;
_bt_relbuf(state->rel, lbuf);
@@ -2548,26 +2654,29 @@ PageGetItemIdCareful(BtreeCheckState *state, BlockNumber block, Page page,
}
/*
- * BTreeTupleGetHeapTID() wrapper that lets caller enforce that a heap TID must
- * be present in cases where that is mandatory.
- *
- * This doesn't add much as of BTREE_VERSION 4, since the INDEX_ALT_TID_MASK
- * bit is effectively a proxy for whether or not the tuple is a pivot tuple.
- * It may become more useful in the future, when non-pivot tuples support their
- * own alternative INDEX_ALT_TID_MASK representation.
+ * BTreeTupleGetHeapTID() wrapper that enforces that a heap TID is present in
+ * cases where that is mandatory (i.e. for non-pivot tuples).
*/
static inline ItemPointer
BTreeTupleGetHeapTIDCareful(BtreeCheckState *state, IndexTuple itup,
bool nonpivot)
{
- ItemPointer result = BTreeTupleGetHeapTID(itup);
+ ItemPointer result;
BlockNumber targetblock = state->targetblock;
- if (result == NULL && nonpivot)
+ Assert(state->heapkeyspace);
+
+ /*
+ * Make sure that tuple type (pivot vs non-pivot) matches caller's
+ * expectation
+ */
+ if (BTreeTupleIsPivot(itup) == nonpivot)
ereport(ERROR,
(errcode(ERRCODE_INDEX_CORRUPTED),
errmsg("block %u or its right sibling block or child block in index \"%s\" contains non-pivot tuple that lacks a heap TID",
targetblock, RelationGetRelationName(state->rel))));
+ result = BTreeTupleGetHeapTID(itup);
+
return result;
}
--
2.17.1
v22-0001-Teach-datum_image_eq-about-cstring-datums.patch (application/octet-stream)
From 2697341e50a43ee544f496189a6180eff9713a78 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Mon, 4 Nov 2019 09:07:13 -0800
Subject: [PATCH v22 1/3] Teach datum_image_eq() about cstring datums.
An upcoming patch to add deduplication to nbtree indexes needs to be
able to use datum_image_eq() as a drop-in replacement for opclass
equality in certain contexts. This includes comparisons of TOASTable
datatypes such as text (at least when deterministic collations are in
use), and cstring datums in system catalog indexes. cstring is used as
the storage type of "name" columns when indexed by nbtree, despite the
fact that cstring is a pseudo-type.
Discussion: https://postgr.es/m/CAH2-Wzn3Ee49Gmxb7V1VJ3-AC8fWn-Fr8pfWQebHe8rYRxt5OQ@mail.gmail.com
---
src/backend/utils/adt/datum.c | 19 ++++++++++++++++---
1 file changed, 16 insertions(+), 3 deletions(-)
diff --git a/src/backend/utils/adt/datum.c b/src/backend/utils/adt/datum.c
index 73703efe05..b20d0640ea 100644
--- a/src/backend/utils/adt/datum.c
+++ b/src/backend/utils/adt/datum.c
@@ -263,6 +263,8 @@ datumIsEqual(Datum value1, Datum value2, bool typByVal, int typLen)
bool
datum_image_eq(Datum value1, Datum value2, bool typByVal, int typLen)
{
+ Size len1,
+ len2;
bool result = true;
if (typByVal)
@@ -277,9 +279,6 @@ datum_image_eq(Datum value1, Datum value2, bool typByVal, int typLen)
}
else if (typLen == -1)
{
- Size len1,
- len2;
-
len1 = toast_raw_datum_size(value1);
len2 = toast_raw_datum_size(value2);
/* No need to de-toast if lengths don't match. */
@@ -304,6 +303,20 @@ datum_image_eq(Datum value1, Datum value2, bool typByVal, int typLen)
pfree(arg2val);
}
}
+ else if (typLen == -2)
+ {
+ char *s1,
+ *s2;
+
+ /* Compare cstring datums */
+ s1 = DatumGetCString(value1);
+ s2 = DatumGetCString(value2);
+ len1 = strlen(s1) + 1;
+ len2 = strlen(s2) + 1;
+ if (len1 != len2)
+ return false;
+ result = (memcmp(s1, s2, len1) == 0);
+ }
else
elog(ERROR, "unexpected typLen: %d", typLen);
--
2.17.1
On Fri, Nov 8, 2019 at 10:35 AM Peter Geoghegan <pg@bowt.ie> wrote:
There is more bitrot, so I attach v22.
The patch has stopped applying once again, so I attach v23.
One reason for the bitrot is that I pushed preparatory commits,
including today's "Make _bt_keep_natts_fast() use datum_image_eq()"
commit. Good to get that out of the way.
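For anyone who has not looked at the datum_image_eq() preparatory work, the cstring branch boils down to the following standalone sketch. It assumes plain char pointers in place of Datum values, and cstring_image_eq is a made-up name for illustration, not a function in the tree:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Sketch of the cstring case of datum_image_eq(), outside of PostgreSQL.
 * cstring_image_eq is a hypothetical name; plain C strings stand in for
 * cstring Datums.
 */
bool
cstring_image_eq(const char *s1, const char *s2)
{
    size_t      len1;
    size_t      len2;

    assert(s1 != NULL && s2 != NULL);

    /* Lengths include the terminating NUL byte */
    len1 = strlen(s1) + 1;
    len2 = strlen(s2) + 1;

    /* Strings of different length can never be bitwise equal */
    if (len1 != len2)
        return false;

    return memcmp(s1, s2, len1) == 0;
}
```

The comparison is purely bytewise, so it is only a valid substitute for opclass equality where equal values always have identical representations.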
Other changes:
* Decided to go back to turning deduplication on by default with
non-unique indexes, and off by default with unique indexes.
Unique indexes regressed enough with INSERT-heavy workloads that I was
put off, despite my initial enthusiasm for enabling deduplication
everywhere.
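Roughly, the default is resolved as in the sketch below. This is a simplified, standalone rendering of what the BtreeGetDoDedupOption() macro in the patch does; ToyIndex, ToyBtreeOptions and toy_do_dedup are hypothetical names that stand in for the real relcache structures:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the parsed storage parameters of an index */
typedef struct ToyBtreeOptions
{
    bool        deduplication;  /* value of the storage parameter */
} ToyBtreeOptions;

/* Hypothetical stand-in for the relcache entry of an index */
typedef struct ToyIndex
{
    bool        indisunique;    /* is this a unique index? */
    const ToyBtreeOptions *options; /* NULL when no storage parameters set */
} ToyIndex;

/*
 * Should deduplication be attempted for this index?
 *
 * Storage parameters win when present.  Otherwise the default is on for
 * non-unique indexes and off for unique indexes.
 */
bool
toy_do_dedup(const ToyIndex *index)
{
    assert(index != NULL);

    if (index->options != NULL)
        return index->options->deduplication;

    return !index->indisunique;
}
```

The sketch leaves out the separate question of whether deduplication is safe for the index at all, which the patch tracks in the metapage.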
* Disabled deduplication in system catalog indexes by deeming it
generally unsafe.
I realized that it would be impossible to provide a way to disable
deduplication in system catalog indexes if it was enabled at all. The
reason for this is simple: in general, it's not possible to set
storage parameters for system catalog indexes.
While I think that deduplication should work with system catalog
indexes on general principle, this is about an existing limitation.
Deduplication in catalog indexes can be revisited if and when somebody
figures out a way to make storage parameters work with system catalog
indexes.
* Basic user documentation -- this still needs work, but the basic
shape is now in place. I think that we should outline how the feature
works by describing the internals, including details of the data
structures. This provides guidance to users on when they should
disable or enable the feature.
This is discussed in the existing chapter on B-Tree internals. This
felt natural because it's similar to how GIN explains its compression
related features -- the discussion of the storage parameters in the
CREATE INDEX page of the docs links to a description of GIN internals
from "66.4. Implementation [of GIN]".
* The nbtdedup.c "single value" strategy now accounts for the page high
key when deciding how to deduplicate, so that nbtsplitloc.c's "single
value" strategy has a usable split point that helps it to hit its
target free space. Not a very important detail, but it's nice to be
consistent with the corresponding code within nbtsplitloc.c.
* Worked through all remaining XXX/TODO/FIXME comments, except one:
The one that talks about the need for opclass infrastructure to deal
with cases like btree/numeric_ops, or text with a nondeterministic
collation.
The user docs now reference the BITWISE opclass stuff that we're
discussing over on the other thread. That's the only really notable
open item now IMV.
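To illustrate why this matters, here is a toy sketch of a datatype where opclass equality and bitwise ("image") equality disagree, in the same way that numeric values with different display scales do. ToyDecimal and both functions are hypothetical and have nothing to do with the actual on-disk numeric format:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy decimal: the value is digits / 10^scale (hypothetical format) */
typedef struct ToyDecimal
{
    long        digits;
    int         scale;          /* display scale, must be >= 0 */
} ToyDecimal;

/* 10^n for small n; no overflow checking, this is only a sketch */
static long
pow10_long(int n)
{
    long        result = 1;

    while (n-- > 0)
        result *= 10;
    return result;
}

/* "Opclass" equality: compares the numeric values, ignoring display scale */
bool
toy_opclass_eq(ToyDecimal a, ToyDecimal b)
{
    assert(a.scale >= 0 && b.scale >= 0);

    /* Cross-multiply so that both sides use a common scale */
    return a.digits * pow10_long(b.scale) == b.digits * pow10_long(a.scale);
}

/* "Image" equality: every stored field must match exactly */
bool
toy_image_eq(ToyDecimal a, ToyDecimal b)
{
    return a.digits == b.digits && a.scale == b.scale;
}
```

With 1.0 stored as {10, 1} and 1.00 stored as {100, 2}, toy_opclass_eq() reports equality while toy_image_eq() does not. Folding two such values into one posting list would throw away one of the two representations, which is the kind of case the opclass infrastructure needs to rule out.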
--
Peter Geoghegan
Attachments:
v23-0002-DEBUG-Add-pageinspect-instrumentation.patch (application/octet-stream)
From f30417cc9d917d85d5a36d64984bab0096034dc9 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Mon, 10 Sep 2018 19:53:51 -0700
Subject: [PATCH v23 2/2] DEBUG: Add pageinspect instrumentation.
Have pageinspect display user-visible attribute values, heap TID, max
heap TID, and the number of TIDs in a tuple (can be > 1 in the case of
posting list tuples). Also adds a column that shows whether or not the
LP_DEAD bit has been set.
This patch is not proposed for inclusion in PostgreSQL; it's included
for the convenience of reviewers.
The following query can be used with this hacked pageinspect, which
visualizes the internal pages:
"""
with recursive index_details as (
select
'my_test_index'::text idx
),
size_in_pages_index as (
select
(pg_relation_size(idx::regclass) / (2^13))::int4 size_pages
from
index_details
),
page_stats as (
select
index_details.*,
stats.*
from
index_details,
size_in_pages_index,
lateral (select i from generate_series(1, size_pages - 1) i) series,
lateral (select * from bt_page_stats(idx, i)) stats),
internal_page_stats as (
select
*
from
page_stats
where
type != 'l'),
meta_stats as (
select
*
from
index_details s,
lateral (select * from bt_metap(s.idx)) meta),
internal_items as (
select
*
from
internal_page_stats
order by
btpo desc),
-- XXX: Note ordering dependency within this CTE, on internal_items
ordered_internal_items(item, blk, level) as (
select
1,
blkno,
btpo
from
internal_items
where
btpo_prev = 0
and btpo = (select level from meta_stats)
union
select
case when level = btpo then o.item + 1 else 1 end,
blkno,
btpo
from
internal_items i,
ordered_internal_items o
where
i.btpo_prev = o.blk or (btpo_prev = 0 and btpo = o.level - 1)
)
select
--idx,
btpo as level,
item as l_item,
blkno,
--btpo_prev,
--btpo_next,
btpo_flags,
type,
live_items,
dead_items,
avg_item_size,
page_size,
free_size,
-- Only non-rightmost pages have high key. Show heap TID for both pivot and non-pivot tuples here.
case when btpo_next != 0 then (select data || coalesce(', (htid)=(''' || htid || ''')', '')
from bt_page_items(idx, blkno) where itemoffset = 1) end as highkey
from
ordered_internal_items o
join internal_items i on o.blk = i.blkno
order by btpo desc, item;
"""
---
contrib/pageinspect/btreefuncs.c | 92 ++++++++++++++++---
contrib/pageinspect/expected/btree.out | 6 +-
contrib/pageinspect/pageinspect--1.6--1.7.sql | 25 +++++
3 files changed, 109 insertions(+), 14 deletions(-)
diff --git a/contrib/pageinspect/btreefuncs.c b/contrib/pageinspect/btreefuncs.c
index 78cdc69ec7..435e71ae22 100644
--- a/contrib/pageinspect/btreefuncs.c
+++ b/contrib/pageinspect/btreefuncs.c
@@ -27,6 +27,7 @@
#include "postgres.h"
+#include "access/genam.h"
#include "access/nbtree.h"
#include "access/relation.h"
#include "catalog/namespace.h"
@@ -241,6 +242,7 @@ bt_page_stats(PG_FUNCTION_ARGS)
*/
struct user_args
{
+ Relation rel;
Page page;
OffsetNumber offset;
};
@@ -252,9 +254,9 @@ struct user_args
* ------------------------------------------------------
*/
static Datum
-bt_page_print_tuples(FuncCallContext *fctx, Page page, OffsetNumber offset)
+bt_page_print_tuples(FuncCallContext *fctx, Page page, OffsetNumber offset, Relation rel)
{
- char *values[6];
+ char *values[10];
HeapTuple tuple;
ItemId id;
IndexTuple itup;
@@ -263,6 +265,8 @@ bt_page_print_tuples(FuncCallContext *fctx, Page page, OffsetNumber offset)
int dlen;
char *dump;
char *ptr;
+ ItemPointer min_htid,
+ max_htid;
id = PageGetItemId(page, offset);
@@ -281,16 +285,77 @@ bt_page_print_tuples(FuncCallContext *fctx, Page page, OffsetNumber offset)
values[j++] = psprintf("%c", IndexTupleHasVarwidths(itup) ? 't' : 'f');
ptr = (char *) itup + IndexInfoFindDataOffset(itup->t_info);
- dlen = IndexTupleSize(itup) - IndexInfoFindDataOffset(itup->t_info);
- dump = palloc0(dlen * 3 + 1);
- values[j] = dump;
- for (off = 0; off < dlen; off++)
+ if (rel)
{
- if (off > 0)
- *dump++ = ' ';
- sprintf(dump, "%02x", *(ptr + off) & 0xff);
- dump += 2;
+ TupleDesc itupdesc = RelationGetDescr(rel);
+ Datum datvalues[INDEX_MAX_KEYS];
+ bool isnull[INDEX_MAX_KEYS];
+ int natts;
+ int indnkeyatts = rel->rd_index->indnkeyatts;
+
+ natts = BTreeTupleGetNAtts(itup, rel);
+
+ itupdesc->natts = Min(indnkeyatts, natts);
+ memset(&isnull, 0xFF, sizeof(isnull));
+ index_deform_tuple(itup, itupdesc, datvalues, isnull);
+ rel->rd_index->indnkeyatts = natts;
+ values[j++] = BuildIndexValueDescription(rel, datvalues, isnull);
+ itupdesc->natts = IndexRelationGetNumberOfAttributes(rel);
+ rel->rd_index->indnkeyatts = indnkeyatts;
}
+ else
+ {
+ dlen = IndexTupleSize(itup) - IndexInfoFindDataOffset(itup->t_info);
+ dump = palloc0(dlen * 3 + 1);
+ values[j++] = dump;
+ for (off = 0; off < dlen; off++)
+ {
+ if (off > 0)
+ *dump++ = ' ';
+ sprintf(dump, "%02x", *(ptr + off) & 0xff);
+ dump += 2;
+ }
+ }
+
+ if (rel && !_bt_heapkeyspace(rel))
+ {
+ min_htid = NULL;
+ max_htid = NULL;
+ }
+ else
+ {
+ min_htid = BTreeTupleGetHeapTID(itup);
+ if (BTreeTupleIsPosting(itup))
+ max_htid = BTreeTupleGetMaxHeapTID(itup);
+ else
+ max_htid = NULL;
+ }
+
+ if (min_htid)
+ values[j++] = psprintf("(%u,%u)",
+ ItemPointerGetBlockNumberNoCheck(min_htid),
+ ItemPointerGetOffsetNumberNoCheck(min_htid));
+ else
+ values[j++] = NULL;
+
+ if (max_htid)
+ values[j++] = psprintf("(%u,%u)",
+ ItemPointerGetBlockNumberNoCheck(max_htid),
+ ItemPointerGetOffsetNumberNoCheck(max_htid));
+ else
+ values[j++] = NULL;
+
+ if (min_htid == NULL)
+ values[j++] = psprintf("0");
+ else if (!BTreeTupleIsPosting(itup))
+ values[j++] = psprintf("1");
+ else
+ values[j++] = psprintf("%d", (int) BTreeTupleGetNPosting(itup));
+
+ if (!ItemIdIsDead(id))
+ values[j++] = psprintf("f");
+ else
+ values[j++] = psprintf("t");
tuple = BuildTupleFromCStrings(fctx->attinmeta, values);
@@ -364,11 +429,11 @@ bt_page_items(PG_FUNCTION_ARGS)
uargs = palloc(sizeof(struct user_args));
+ uargs->rel = rel;
uargs->page = palloc(BLCKSZ);
memcpy(uargs->page, BufferGetPage(buffer), BLCKSZ);
UnlockReleaseBuffer(buffer);
- relation_close(rel, AccessShareLock);
uargs->offset = FirstOffsetNumber;
@@ -395,12 +460,13 @@ bt_page_items(PG_FUNCTION_ARGS)
if (fctx->call_cntr < fctx->max_calls)
{
- result = bt_page_print_tuples(fctx, uargs->page, uargs->offset);
+ result = bt_page_print_tuples(fctx, uargs->page, uargs->offset, uargs->rel);
uargs->offset++;
SRF_RETURN_NEXT(fctx, result);
}
else
{
+ relation_close(uargs->rel, AccessShareLock);
pfree(uargs->page);
pfree(uargs);
SRF_RETURN_DONE(fctx);
@@ -480,7 +546,7 @@ bt_page_items_bytea(PG_FUNCTION_ARGS)
if (fctx->call_cntr < fctx->max_calls)
{
- result = bt_page_print_tuples(fctx, uargs->page, uargs->offset);
+ result = bt_page_print_tuples(fctx, uargs->page, uargs->offset, NULL);
uargs->offset++;
SRF_RETURN_NEXT(fctx, result);
}
diff --git a/contrib/pageinspect/expected/btree.out b/contrib/pageinspect/expected/btree.out
index 07c2dcd771..0f6dccaadc 100644
--- a/contrib/pageinspect/expected/btree.out
+++ b/contrib/pageinspect/expected/btree.out
@@ -40,7 +40,11 @@ ctid | (0,1)
itemlen | 16
nulls | f
vars | f
-data | 01 00 00 00 00 00 00 01
+data | (a)=(72057594037927937)
+htid | (0,1)
+max_htid |
+nheap_tids | 1
+isdead | f
SELECT * FROM bt_page_items('test1_a_idx', 2);
ERROR: block number out of range
diff --git a/contrib/pageinspect/pageinspect--1.6--1.7.sql b/contrib/pageinspect/pageinspect--1.6--1.7.sql
index 2433a21af2..00473da938 100644
--- a/contrib/pageinspect/pageinspect--1.6--1.7.sql
+++ b/contrib/pageinspect/pageinspect--1.6--1.7.sql
@@ -24,3 +24,28 @@ CREATE FUNCTION bt_metap(IN relname text,
OUT last_cleanup_num_tuples real)
AS 'MODULE_PATHNAME', 'bt_metap'
LANGUAGE C STRICT PARALLEL SAFE;
+
+--
+-- bt_page_items()
+--
+DROP FUNCTION bt_page_items(IN relname text, IN blkno int4,
+ OUT itemoffset smallint,
+ OUT ctid tid,
+ OUT itemlen smallint,
+ OUT nulls bool,
+ OUT vars bool,
+ OUT data text);
+CREATE FUNCTION bt_page_items(IN relname text, IN blkno int4,
+ OUT itemoffset smallint,
+ OUT ctid tid,
+ OUT itemlen smallint,
+ OUT nulls bool,
+ OUT vars bool,
+ OUT data text,
+ OUT htid tid,
+ OUT max_htid tid,
+ OUT nheap_tids int4,
+ OUT isdead boolean)
+RETURNS SETOF record
+AS 'MODULE_PATHNAME', 'bt_page_items'
+LANGUAGE C STRICT PARALLEL SAFE;
--
2.17.1
v23-0001-Add-deduplication-to-nbtree.patch (application/octet-stream)
From 8f059dc694460832417ed70e512bdef274ff84a7 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Wed, 25 Sep 2019 10:08:53 -0700
Subject: [PATCH v23 1/2] Add deduplication to nbtree
---
src/include/access/nbtree.h | 328 +++++++++--
src/include/access/nbtxlog.h | 68 ++-
src/include/access/rmgrlist.h | 2 +-
src/backend/access/common/reloptions.c | 11 +-
src/backend/access/index/genam.c | 4 +
src/backend/access/nbtree/Makefile | 1 +
src/backend/access/nbtree/README | 74 ++-
src/backend/access/nbtree/nbtdedup.c | 710 ++++++++++++++++++++++++
src/backend/access/nbtree/nbtinsert.c | 321 +++++++++--
src/backend/access/nbtree/nbtpage.c | 211 ++++++-
src/backend/access/nbtree/nbtree.c | 174 +++++-
src/backend/access/nbtree/nbtsearch.c | 249 ++++++++-
src/backend/access/nbtree/nbtsort.c | 209 ++++++-
src/backend/access/nbtree/nbtsplitloc.c | 38 +-
src/backend/access/nbtree/nbtutils.c | 218 +++++++-
src/backend/access/nbtree/nbtxlog.c | 218 +++++++-
src/backend/access/rmgrdesc/nbtdesc.c | 28 +-
src/bin/psql/tab-complete.c | 4 +-
contrib/amcheck/verify_nbtree.c | 177 ++++--
doc/src/sgml/btree.sgml | 48 +-
doc/src/sgml/charset.sgml | 9 +-
doc/src/sgml/ref/create_index.sgml | 43 +-
doc/src/sgml/ref/reindex.sgml | 5 +-
23 files changed, 2921 insertions(+), 229 deletions(-)
create mode 100644 src/backend/access/nbtree/nbtdedup.c
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 4a80e84aa7..d59d1dd574 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -23,6 +23,36 @@
#include "storage/bufmgr.h"
#include "storage/shm_toc.h"
+/*
+ * Storage type for Btree's reloptions
+ */
+typedef struct BtreeOptions
+{
+ int32 vl_len_; /* varlena header (do not touch directly!) */
+ int fillfactor; /* leaf fillfactor */
+ double vacuum_cleanup_index_scale_factor;
+ bool deduplication; /* Use deduplication where safe? */
+} BtreeOptions;
+
+/*
+ * Deduplication is enabled for non-unique indexes and disabled for unique
+ * indexes by default
+ */
+#define BtreeDefaultDoDedup(relation) \
+ (relation->rd_index->indisunique ? false : true)
+
+#define BtreeGetDoDedupOption(relation) \
+ ((relation)->rd_options ? \
+ ((BtreeOptions *) (relation)->rd_options)->deduplication : \
+ BtreeDefaultDoDedup(relation))
+
+#define BtreeGetFillFactor(relation, defaultff) \
+ ((relation)->rd_options ? \
+ ((BtreeOptions *) (relation)->rd_options)->fillfactor : (defaultff))
+
+#define BtreeGetTargetPageFreeSpace(relation, defaultff) \
+ (BLCKSZ * (100 - BtreeGetFillFactor(relation, defaultff)) / 100)
+
/* There's room for a 16-bit vacuum cycle ID in BTPageOpaqueData */
typedef uint16 BTCycleId;
@@ -107,6 +137,7 @@ typedef struct BTMetaPageData
* pages */
float8 btm_last_cleanup_num_heap_tuples; /* number of heap tuples
* during last cleanup */
+ bool btm_safededup; /* deduplication known to be safe? */
} BTMetaPageData;
#define BTPageGetMeta(p) \
@@ -114,7 +145,8 @@ typedef struct BTMetaPageData
/*
* The current Btree version is 4. That's what you'll get when you create
- * a new index.
+ * a new index. The btm_safededup field can only be set if this happened
+ * on Postgres 13, but it's safe to read with version 3 indexes.
*
* Btree version 3 was used in PostgreSQL v11. It is mostly the same as
* version 4, but heap TIDs were not part of the keyspace. Index tuples
@@ -131,8 +163,8 @@ typedef struct BTMetaPageData
#define BTREE_METAPAGE 0 /* first page is meta */
#define BTREE_MAGIC 0x053162 /* magic number in metapage */
#define BTREE_VERSION 4 /* current version number */
-#define BTREE_MIN_VERSION 2 /* minimal supported version number */
-#define BTREE_NOVAC_VERSION 3 /* minimal version with all meta fields */
+#define BTREE_MIN_VERSION 2 /* minimum supported version */
+#define BTREE_NOVAC_VERSION 3 /* version with all meta fields set */
/*
* Maximum size of a btree index entry, including its tuple header.
@@ -154,6 +186,26 @@ typedef struct BTMetaPageData
MAXALIGN_DOWN((PageGetPageSize(page) - \
MAXALIGN(SizeOfPageHeaderData + 3*sizeof(ItemIdData)) - \
MAXALIGN(sizeof(BTPageOpaqueData))) / 3)
+/*
+ * MaxBTreeIndexTuplesPerPage is an upper bound on the number of "logical"
+ * tuples that may be stored on a btree leaf page. This is comparable to
+ * the generic/physical MaxIndexTuplesPerPage upper bound. A separate
+ * upper bound is needed in certain contexts due to posting list tuples,
+ * which only use a single physical page entry to store many logical
+ * tuples. (MaxBTreeIndexTuplesPerPage is used to size the per-page
+ * temporary buffers used by index scans.)
+ *
+ * Note: we don't bother considering per-physical-tuple overheads here to
+ * keep things simple (value is based on how many elements a single array
+ * of heap TIDs must have to fill the space between the page header and
+ * special area). The value is slightly higher (i.e. more conservative)
+ * than necessary as a result, which is considered acceptable. There will
+ * only be three (very large) physical posting list tuples in leaf pages
+ * that have the largest possible number of heap TIDs/logical tuples.
+ */
+#define MaxBTreeIndexTuplesPerPage \
+ (int) ((BLCKSZ - SizeOfPageHeaderData - sizeof(BTPageOpaqueData)) / \
+ sizeof(ItemPointerData))
/*
* The leaf-page fillfactor defaults to 90% but is user-adjustable.
@@ -234,8 +286,7 @@ typedef struct BTMetaPageData
* t_tid | t_info | key values | INCLUDE columns, if any
*
* t_tid points to the heap TID, which is a tiebreaker key column as of
- * BTREE_VERSION 4. Currently, the INDEX_ALT_TID_MASK status bit is never
- * set for non-pivot tuples.
+ * BTREE_VERSION 4.
*
* All other types of index tuples ("pivot" tuples) only have key columns,
* since pivot tuples only exist to represent how the key space is
@@ -282,20 +333,104 @@ typedef struct BTMetaPageData
* future use. BT_N_KEYS_OFFSET_MASK should be large enough to store any
* number of columns/attributes <= INDEX_MAX_KEYS.
*
+ * Sometimes non-pivot tuples also use a representation that repurposes
+ * t_tid to store metadata rather than a TID. Postgres 13 introduced a new
+ * non-pivot tuple format in order to fold together multiple equal and
+ * equivalent non-pivot tuples into a single logically equivalent, space
+ * efficient representation - a posting list tuple. A posting list is an
+ * array of ItemPointerData elements (there must be at least two elements
+ * when the posting list tuple format is used). Posting list tuples are
+ * created dynamically by deduplication, at the point where we'd otherwise
+ * have to split a leaf page.
+ *
+ * Posting tuple format (alternative non-pivot tuple representation):
+ *
+ * t_tid | t_info | key values | posting list (TID array)
+ *
+ * Posting list tuples are recognized as such by having the
+ * INDEX_ALT_TID_MASK status bit set in t_info and the BT_IS_POSTING status
+ * bit set in t_tid. These flags redefine the content of the posting
+ * tuple's t_tid to store an offset to the posting list, as well as the
+ * total number of posting list array elements.
+ *
+ * The 12 least significant offset bits from t_tid are used to represent
+ * the number of posting items present in the tuple, leaving 4 status
+ * bits (BT_RESERVED_OFFSET_MASK bits), 3 of which are reserved for
+ * future use. Like any non-pivot tuple, the number of columns stored is
+ * always implicitly the total number in the index (in practice there can
+ * never be non-key columns stored, since deduplication is not supported
+ * with INCLUDE indexes).
+ *
* Note well: The macros that deal with the number of attributes in tuples
- * assume that a tuple with INDEX_ALT_TID_MASK set must be a pivot tuple,
- * and that a tuple without INDEX_ALT_TID_MASK set must be a non-pivot
- * tuple (or must have the same number of attributes as the index has
- * generally in the case of !heapkeyspace indexes). They will need to be
- * updated if non-pivot tuples ever get taught to use INDEX_ALT_TID_MASK
- * for something else.
+ * assume that a tuple with INDEX_ALT_TID_MASK set must be a pivot tuple or
+ * non-pivot posting tuple, and that a tuple without INDEX_ALT_TID_MASK set
+ * must be a non-pivot tuple (or must have the same number of attributes as
+ * the index has generally in the case of !heapkeyspace indexes).
*/
#define INDEX_ALT_TID_MASK INDEX_AM_RESERVED_BIT
/* Item pointer offset bits */
#define BT_RESERVED_OFFSET_MASK 0xF000
#define BT_N_KEYS_OFFSET_MASK 0x0FFF
+#define BT_N_POSTING_OFFSET_MASK 0x0FFF
#define BT_HEAP_TID_ATTR 0x1000
+#define BT_IS_POSTING 0x2000
+
+/*
+ * N.B.: BTreeTupleIsPivot() should only be used in code that deals with
+ * heapkeyspace indexes specifically. BTreeTupleIsPosting() works with all
+ * nbtree indexes, though.
+ */
+#define BTreeTupleIsPivot(itup) \
+ ( \
+ ((itup)->t_info & INDEX_ALT_TID_MASK && \
+ ((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) == 0))\
+ )
+#define BTreeTupleIsPosting(itup) \
+ ( \
+ ((itup)->t_info & INDEX_ALT_TID_MASK && \
+ ((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) != 0))\
+ )
+
+#define BTreeTupleClearBtIsPosting(itup) \
+ do { \
+ ItemPointerSetOffsetNumber(&(itup)->t_tid, \
+ ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & ~BT_IS_POSTING); \
+ } while(0)
+
+#define BTreeTupleGetNPosting(itup) \
+ ( \
+ AssertMacro(BTreeTupleIsPosting(itup)), \
+ ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_N_POSTING_OFFSET_MASK \
+ )
+#define BTreeTupleSetNPosting(itup, n) \
+ do { \
+ ItemPointerSetOffsetNumber(&(itup)->t_tid, (n) & BT_N_POSTING_OFFSET_MASK); \
+ Assert((itup)->t_info & INDEX_ALT_TID_MASK); \
+ Assert(!((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) != 0)); \
+ ItemPointerSetOffsetNumber(&(itup)->t_tid, \
+ ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) | BT_IS_POSTING); \
+ } while(0)
+
+/*
+ * If tuple is posting, t_tid.ip_blkid contains offset of the posting list
+ */
+#define BTreeTupleGetPostingOffset(itup) \
+ ( \
+ AssertMacro(BTreeTupleIsPosting(itup)), \
+ ItemPointerGetBlockNumberNoCheck(&((itup)->t_tid)) \
+ )
+#define BTreeSetPostingMeta(itup, nposting, off) \
+ do { \
+ BTreeTupleSetNPosting(itup, nposting); \
+ Assert(BTreeTupleIsPosting(itup)); \
+ ItemPointerSetBlockNumber(&((itup)->t_tid), (off)); \
+ } while(0)
+
+#define BTreeTupleGetPosting(itup) \
+ (ItemPointer) ((char*) (itup) + BTreeTupleGetPostingOffset(itup))
+#define BTreeTupleGetPostingN(itup,n) \
+ (BTreeTupleGetPosting(itup) + (n))
/* Get/set downlink block number */
#define BTreeInnerTupleGetDownLink(itup) \
@@ -326,40 +461,71 @@ typedef struct BTMetaPageData
*/
#define BTreeTupleGetNAtts(itup, rel) \
( \
- (itup)->t_info & INDEX_ALT_TID_MASK ? \
+ ((itup)->t_info & INDEX_ALT_TID_MASK && \
+ ((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) == 0)) ? \
( \
ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_N_KEYS_OFFSET_MASK \
) \
: \
IndexRelationGetNumberOfAttributes(rel) \
)
-#define BTreeTupleSetNAtts(itup, n) \
- do { \
- (itup)->t_info |= INDEX_ALT_TID_MASK; \
- ItemPointerSetOffsetNumber(&(itup)->t_tid, (n) & BT_N_KEYS_OFFSET_MASK); \
- } while(0)
+
+static inline void
+BTreeTupleSetNAtts(IndexTuple itup, int n)
+{
+ Assert(!BTreeTupleIsPosting(itup));
+ itup->t_info |= INDEX_ALT_TID_MASK;
+ ItemPointerSetOffsetNumber(&itup->t_tid, n & BT_N_KEYS_OFFSET_MASK);
+}
/*
- * Get tiebreaker heap TID attribute, if any. Macro works with both pivot
- * and non-pivot tuples, despite differences in how heap TID is represented.
+ * Get tiebreaker heap TID attribute, if any.
+ *
+ * This returns the first/lowest heap TID in the case of a posting list tuple.
*/
-#define BTreeTupleGetHeapTID(itup) \
- ( \
- (itup)->t_info & INDEX_ALT_TID_MASK && \
- (ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_HEAP_TID_ATTR) != 0 ? \
- ( \
- (ItemPointer) (((char *) (itup) + IndexTupleSize(itup)) - \
- sizeof(ItemPointerData)) \
- ) \
- : (itup)->t_info & INDEX_ALT_TID_MASK ? NULL : (ItemPointer) &((itup)->t_tid) \
- )
+static inline ItemPointer
+BTreeTupleGetHeapTID(IndexTuple itup)
+{
+ if (BTreeTupleIsPivot(itup))
+ {
+ /* Pivot tuple heap TID representation? */
+ if ((ItemPointerGetOffsetNumberNoCheck(&itup->t_tid) &
+ BT_HEAP_TID_ATTR) != 0)
+ return (ItemPointer) ((char *) itup + IndexTupleSize(itup) -
+ sizeof(ItemPointerData));
+
+ /* Heap TID attribute was truncated */
+ return NULL;
+ }
+ else if (BTreeTupleIsPosting(itup))
+ return BTreeTupleGetPosting(itup);
+
+ return &(itup->t_tid);
+}
+
+/*
+ * Get maximum heap TID attribute, which could be the only TID in the case of
+ * a non-pivot tuple that does not have a posting list. Works with
+ * non-pivot tuples only.
+ */
+static inline ItemPointer
+BTreeTupleGetMaxHeapTID(IndexTuple itup)
+{
+ Assert(!BTreeTupleIsPivot(itup));
+
+ if (BTreeTupleIsPosting(itup))
+ return BTreeTupleGetPosting(itup) + (BTreeTupleGetNPosting(itup) - 1);
+
+ return &(itup->t_tid);
+}
+
/*
* Set the heap TID attribute for a tuple that uses the INDEX_ALT_TID_MASK
- * representation (currently limited to pivot tuples)
+ * representation
*/
#define BTreeTupleSetAltHeapTID(itup) \
do { \
- Assert((itup)->t_info & INDEX_ALT_TID_MASK); \
+ Assert(BTreeTupleIsPivot(itup)); \
ItemPointerSetOffsetNumber(&(itup)->t_tid, \
ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) | BT_HEAP_TID_ATTR); \
} while(0)
@@ -434,6 +600,11 @@ typedef BTStackData *BTStack;
* indexes whose version is >= version 4. It's convenient to keep this close
* by, rather than accessing the metapage repeatedly.
*
+ * safededup is set to indicate that index may use dynamic deduplication
+ * safely (index storage parameter separately indicates if deduplication is
+ * currently in use). This is also a property of the index relation rather
+ * than an indexscan that is kept around for convenience.
+ *
* anynullkeys indicates if any of the keys had NULL value when scankey was
* built from index tuple (note that already-truncated tuple key attributes
* set NULL as a placeholder key value, which also affects value of
@@ -469,6 +640,7 @@ typedef BTStackData *BTStack;
typedef struct BTScanInsertData
{
bool heapkeyspace;
+ bool safededup;
bool anynullkeys;
bool nextkey;
bool pivotsearch;
@@ -507,6 +679,13 @@ typedef struct BTInsertStateData
bool bounds_valid;
OffsetNumber low;
OffsetNumber stricthigh;
+
+ /*
+ * if _bt_binsrch_insert() found the location inside existing posting
+ * list, save the position inside the list. This will be -1 in rare cases
+ * where the overlapping posting list is LP_DEAD.
+ */
+ int postingoff;
} BTInsertStateData;
typedef BTInsertStateData *BTInsertState;
@@ -534,7 +713,10 @@ typedef BTInsertStateData *BTInsertState;
* If we are doing an index-only scan, we save the entire IndexTuple for each
* matched item, otherwise only its heap TID and offset. The IndexTuples go
* into a separate workspace array; each BTScanPosItem stores its tuple's
- * offset within that array.
+ * offset within that array. Posting list tuples store a "base" tuple once,
+ * allowing the same key to be returned for each logical tuple associated
+ * with the physical posting list tuple (i.e. for each TID from the posting
+ * list).
*/
typedef struct BTScanPosItem /* what we remember about each match */
@@ -567,6 +749,12 @@ typedef struct BTScanPosData
*/
int nextTupleOffset;
+ /*
+ * Posting list tuples use postingTupleOffset to store the current
+ * location of the tuple that is returned multiple times.
+ */
+ int postingTupleOffset;
+
/*
* The items array is always ordered in index order (ie, increasing
* indexoffset). When scanning backwards it is convenient to fill the
@@ -578,7 +766,7 @@ typedef struct BTScanPosData
int lastItem; /* last valid index in items[] */
int itemIndex; /* current index in items[] */
- BTScanPosItem items[MaxIndexTuplesPerPage]; /* MUST BE LAST */
+ BTScanPosItem items[MaxBTreeIndexTuplesPerPage]; /* MUST BE LAST */
} BTScanPosData;
typedef BTScanPosData *BTScanPos;
@@ -680,6 +868,57 @@ typedef BTScanOpaqueData *BTScanOpaque;
#define SK_BT_DESC (INDOPTION_DESC << SK_BT_INDOPTION_SHIFT)
#define SK_BT_NULLS_FIRST (INDOPTION_NULLS_FIRST << SK_BT_INDOPTION_SHIFT)
+/*
+ * State used to represent a pending posting list during deduplication.
+ *
+ * Each entry represents a group of consecutive items from the page, starting
+ * from page offset number 'baseoff', which is the offset number of the "base"
+ * tuple on the page undergoing deduplication. 'nitems' is the total number
+ * of items from the page that will be merged to make a new posting tuple.
+ *
+ * Note: 'nitems' means the number of physical index tuples/line pointers on
+ * the page, starting with and including the item at offset number 'baseoff'
+ * (so nitems should be at least 2 when interval is used). These existing
+ * tuples may be posting list tuples or regular tuples.
+ */
+typedef struct BTDedupInterval
+{
+ OffsetNumber baseoff;
+ OffsetNumber nitems;
+} BTDedupInterval;
+
+/*
+ * Btree-private state used to deduplicate items on a leaf page
+ */
+typedef struct BTDedupState
+{
+ Relation rel;
+ /* Deduplication status info for entire page/operation */
+ Size maxitemsize; /* Limit on size of final tuple */
+ IndexTuple newitem;
+ bool checkingunique; /* Use unique index strategy? */
+ OffsetNumber skippedbase; /* First offset skipped by checkingunique */
+
+ /* Metadata about current pending posting list */
+ ItemPointer htids; /* Heap TIDs in pending posting list */
+ int nhtids; /* # heap TIDs in htids array */
+ int nitems; /* See BTDedupInterval definition */
+ Size alltupsize; /* Includes line pointer overhead */
+ bool overlap; /* Avoid overlapping posting lists? */
+
+ /* Metadata about base tuple of current pending posting list */
+ IndexTuple base; /* Use to form new posting list */
+ OffsetNumber baseoff; /* page offset of base */
+ Size basetupsize; /* base size without posting list */
+
+ /*
+ * Pending posting list. Contains information about a group of
+ * consecutive items that will be deduplicated by creating a new posting
+ * list tuple.
+ */
+ BTDedupInterval interval;
+} BTDedupState;
+
/*
* Constant definition for progress reporting. Phase numbers must match
* btbuildphasename.
@@ -725,6 +964,22 @@ extern void _bt_parallel_release(IndexScanDesc scan, BlockNumber scan_page);
extern void _bt_parallel_done(IndexScanDesc scan);
extern void _bt_parallel_advance_array_keys(IndexScanDesc scan);
+/*
+ * prototypes for functions in nbtdedup.c
+ */
+extern void _bt_dedup_one_page(Relation rel, Buffer buffer, Relation heapRel,
+ IndexTuple newitem, Size newitemsz,
+ bool checkingunique);
+extern void _bt_dedup_start_pending(BTDedupState *state, IndexTuple base,
+ OffsetNumber base_off);
+extern bool _bt_dedup_save_htid(BTDedupState *state, IndexTuple itup);
+extern Size _bt_dedup_finish_pending(Buffer buffer, BTDedupState *state,
+ bool need_wal);
+extern IndexTuple _bt_form_posting(IndexTuple tuple, ItemPointer htids,
+ int nhtids);
+extern IndexTuple _bt_swap_posting(IndexTuple newitem, IndexTuple oposting,
+ int postingoff);
+
/*
* prototypes for functions in nbtinsert.c
*/
@@ -743,7 +998,8 @@ extern OffsetNumber _bt_findsplitloc(Relation rel, Page page,
/*
* prototypes for functions in nbtpage.c
*/
-extern void _bt_initmetapage(Page page, BlockNumber rootbknum, uint32 level);
+extern void _bt_initmetapage(Page page, BlockNumber rootbknum, uint32 level,
+ bool safededup);
extern void _bt_update_meta_cleanup_info(Relation rel,
TransactionId oldestBtpoXact, float8 numHeapTuples);
extern void _bt_upgrademetapage(Page page);
@@ -751,6 +1007,7 @@ extern Buffer _bt_getroot(Relation rel, int access);
extern Buffer _bt_gettrueroot(Relation rel);
extern int _bt_getrootheight(Relation rel);
extern bool _bt_heapkeyspace(Relation rel);
+extern bool _bt_safededup(Relation rel);
extern void _bt_checkpage(Relation rel, Buffer buf);
extern Buffer _bt_getbuf(Relation rel, BlockNumber blkno, int access);
extern Buffer _bt_relandgetbuf(Relation rel, Buffer obuf,
@@ -762,6 +1019,8 @@ extern void _bt_delitems_delete(Relation rel, Buffer buf,
OffsetNumber *itemnos, int nitems, Relation heapRel);
extern void _bt_delitems_vacuum(Relation rel, Buffer buf,
OffsetNumber *itemnos, int nitems,
+ OffsetNumber *updateitemnos,
+ IndexTuple *updated, int nupdateable,
BlockNumber lastBlockVacuumed);
extern int _bt_pagedel(Relation rel, Buffer buf);
@@ -812,6 +1071,7 @@ extern bool _bt_check_natts(Relation rel, bool heapkeyspace, Page page,
OffsetNumber offnum);
extern void _bt_check_third_page(Relation rel, Relation heap,
bool needheaptidspace, Page page, IndexTuple newtup);
+extern bool _bt_opclasses_support_dedup(Relation index);
/*
* prototypes for functions in nbtvalidate.c
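
A note on the offset-field encoding used by the posting tuple macros in the nbtree.h changes above: the following is a minimal standalone sketch, not code from the patch. The three mask values are copied from the diff; the `sketch_*` helpers and the plain `uint16_t` stand-in for the t_tid offset field are assumptions made only for illustration. The real BTreeTupleIsPosting() additionally requires INDEX_ALT_TID_MASK to be set in t_info, which this sketch does not model.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Mask values copied from the patch's nbtree.h changes */
#define BT_RESERVED_OFFSET_MASK		0xF000
#define BT_N_POSTING_OFFSET_MASK	0x0FFF
#define BT_IS_POSTING				0x2000

/*
 * Hypothetical helper: pack a posting item count and the posting status
 * bit into a 16-bit offset field.  The low 12 bits hold the count, one of
 * the 4 status bits marks the tuple as a posting list tuple.
 */
static uint16_t
sketch_pack_posting_offset(unsigned nposting)
{
	return (uint16_t) ((nposting & BT_N_POSTING_OFFSET_MASK) | BT_IS_POSTING);
}

/* Hypothetical helper: test the posting status bit only */
static bool
sketch_offset_is_posting(uint16_t off)
{
	return (off & BT_IS_POSTING) != 0;
}

/* Hypothetical helper: extract the posting item count */
static unsigned
sketch_offset_nposting(uint16_t off)
{
	return off & BT_N_POSTING_OFFSET_MASK;
}
```

With 12 count bits, a single posting tuple in this encoding can describe at most 4095 heap TIDs, independent of the BTMaxItemSize() limit.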
diff --git a/src/include/access/nbtxlog.h b/src/include/access/nbtxlog.h
index 91b9ee00cf..b21e6f8082 100644
--- a/src/include/access/nbtxlog.h
+++ b/src/include/access/nbtxlog.h
@@ -28,7 +28,8 @@
#define XLOG_BTREE_INSERT_META 0x20 /* same, plus update metapage */
#define XLOG_BTREE_SPLIT_L 0x30 /* add index tuple with split */
#define XLOG_BTREE_SPLIT_R 0x40 /* as above, new item on right */
-/* 0x50 and 0x60 are unused */
+#define XLOG_BTREE_DEDUP_PAGE 0x50 /* deduplicate tuples on leaf page */
+/* 0x60 is unused */
#define XLOG_BTREE_DELETE 0x70 /* delete leaf index tuples for a page */
#define XLOG_BTREE_UNLINK_PAGE 0x80 /* delete a half-dead page */
#define XLOG_BTREE_UNLINK_PAGE_META 0x90 /* same, and update metapage */
@@ -53,6 +54,7 @@ typedef struct xl_btree_metadata
uint32 fastlevel;
TransactionId oldest_btpo_xact;
float8 last_cleanup_num_heap_tuples;
+ bool btm_safededup;
} xl_btree_metadata;
/*
@@ -61,16 +63,21 @@ typedef struct xl_btree_metadata
* This data record is used for INSERT_LEAF, INSERT_UPPER, INSERT_META.
* Note that INSERT_META implies it's not a leaf page.
*
- * Backup Blk 0: original page (data contains the inserted tuple)
+ * Backup Blk 0: original page (data contains the inserted tuple);
+ * if postingoff is set, this started out as an insertion
+ * into an existing posting tuple at the offset before
+ * offnum (i.e. it's a posting list split). (REDO will
+ * have to update split posting list, too.)
* Backup Blk 1: child's left sibling, if INSERT_UPPER or INSERT_META
* Backup Blk 2: xl_btree_metadata, if INSERT_META
*/
typedef struct xl_btree_insert
{
OffsetNumber offnum;
+ OffsetNumber postingoff;
} xl_btree_insert;
-#define SizeOfBtreeInsert (offsetof(xl_btree_insert, offnum) + sizeof(OffsetNumber))
+#define SizeOfBtreeInsert (offsetof(xl_btree_insert, postingoff) + sizeof(OffsetNumber))
/*
* On insert with split, we save all the items going into the right sibling
@@ -91,9 +98,19 @@ typedef struct xl_btree_insert
*
* Backup Blk 0: original page / new left page
*
- * The left page's data portion contains the new item, if it's the _L variant.
- * An IndexTuple representing the high key of the left page must follow with
- * either variant.
+ * The left page's data portion contains the new item, if it's the _L variant
+ * (though _R variant page split records with a posting list split sometimes
+ * need to include newitem). An IndexTuple representing the high key of the
+ * left page must follow in all cases.
+ *
+ * The newitem is actually an "original" newitem when a posting list split
+ * occurs that requires that the original posting list be updated in passing.
+ * Recovery recognizes this case when postingoff is set, and must use the
+ * posting offset to do an in-place update of the existing posting list that
+ * was actually split, and change the newitem to the "final" newitem. This
+ * corresponds to the xl_btree_insert postingoff-is-set case. postingoff
+ * won't be set when a posting list split occurs where both original posting
+ * list and newitem go on the right page.
*
* Backup Blk 1: new right page
*
@@ -111,10 +128,26 @@ typedef struct xl_btree_split
{
uint32 level; /* tree level of page being split */
OffsetNumber firstright; /* first item moved to right page */
- OffsetNumber newitemoff; /* new item's offset (useful for _L variant) */
+ OffsetNumber newitemoff; /* new item's offset */
+ OffsetNumber postingoff; /* offset inside orig posting tuple */
} xl_btree_split;
-#define SizeOfBtreeSplit (offsetof(xl_btree_split, newitemoff) + sizeof(OffsetNumber))
+#define SizeOfBtreeSplit (offsetof(xl_btree_split, postingoff) + sizeof(OffsetNumber))
+
+/*
+ * When page is deduplicated, consecutive groups of tuples with equal keys are
+ * merged together into posting list tuples.
+ *
+ * The WAL record represents the interval that describes the posting tuple
+ * that should be added to the page.
+ */
+typedef struct xl_btree_dedup
+{
+ OffsetNumber baseoff;
+ OffsetNumber nitems;
+} xl_btree_dedup;
+
+#define SizeOfBtreeDedup (offsetof(xl_btree_dedup, nitems) + sizeof(OffsetNumber))
/*
* This is what we need to know about delete of individual leaf index tuples.
@@ -166,16 +199,27 @@ typedef struct xl_btree_reuse_page
* block numbers aren't given.
*
* Note that the *last* WAL record in any vacuum of an index is allowed to
- * have a zero length array of offsets. Earlier records must have at least one.
+ * have a zero length array of target offsets (i.e. no deletes or updates).
+ * Earlier records must have at least one.
*/
typedef struct xl_btree_vacuum
{
BlockNumber lastBlockVacuumed;
- /* TARGET OFFSET NUMBERS FOLLOW */
+ /*
+ * This field helps us to find beginning of the updated versions of tuples
+ * which follow array of offset numbers, needed when a posting list is
+ * vacuumed without killing all of its logical tuples.
+ */
+ uint32 nupdated;
+ uint32 ndeleted;
+
+ /* UPDATED TARGET OFFSET NUMBERS FOLLOW (if any) */
+ /* UPDATED TUPLES TO ADD BACK FOLLOW (if any) */
+ /* DELETED TARGET OFFSET NUMBERS FOLLOW (if any) */
} xl_btree_vacuum;
-#define SizeOfBtreeVacuum (offsetof(xl_btree_vacuum, lastBlockVacuumed) + sizeof(BlockNumber))
+#define SizeOfBtreeVacuum (offsetof(xl_btree_vacuum, ndeleted) + sizeof(uint32))
/*
* This is what we need to know about marking an empty branch for deletion.
@@ -256,6 +300,8 @@ typedef struct xl_btree_newroot
extern void btree_redo(XLogReaderState *record);
extern void btree_desc(StringInfo buf, XLogReaderState *record);
extern const char *btree_identify(uint8 info);
+extern void btree_xlog_startup(void);
+extern void btree_xlog_cleanup(void);
extern void btree_mask(char *pagedata, BlockNumber blkno);
#endif /* NBTXLOG_H */
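
To illustrate what a (baseoff, nitems) interval, as carried by BTDedupInterval and xl_btree_dedup above, describes, here is a rough standalone sketch. It is an assumption-laden simplification: keys are plain ints rather than index tuples, page offsets are taken to be 1-based array positions, and the BTMaxItemSize() limit and the checkingunique strategy are ignored. `SketchInterval` and `sketch_find_intervals` are hypothetical names, not part of the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for BTDedupInterval / xl_btree_dedup */
typedef struct SketchInterval
{
	uint16_t	baseoff;	/* 1-based offset of the first ("base") item */
	uint16_t	nitems;		/* number of consecutive items in the group */
} SketchInterval;

/*
 * Scan a sorted array of keys and report every run of two or more equal
 * keys as an interval.  Returns the number of intervals written to 'out',
 * which must have room for nkeys / 2 entries.
 */
static size_t
sketch_find_intervals(const int *keys, size_t nkeys, SketchInterval *out)
{
	size_t		nout = 0;
	size_t		i = 0;

	while (i < nkeys)
	{
		size_t		j = i + 1;

		/* extend the run while the next key equals the base key */
		while (j < nkeys && keys[j] == keys[i])
			j++;

		/* a single item is not worth merging, so nitems is at least 2 */
		if (j - i >= 2)
		{
			out[nout].baseoff = (uint16_t) (i + 1);
			out[nout].nitems = (uint16_t) (j - i);
			nout++;
		}
		i = j;
	}
	return nout;
}
```

For the keys 7, 7, 7, 8, 9, 9 this reports two intervals: (baseoff 1, nitems 3) and (baseoff 5, nitems 2).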
diff --git a/src/include/access/rmgrlist.h b/src/include/access/rmgrlist.h
index 3c0db2ccf5..2b8c6c7fc8 100644
--- a/src/include/access/rmgrlist.h
+++ b/src/include/access/rmgrlist.h
@@ -36,7 +36,7 @@ PG_RMGR(RM_RELMAP_ID, "RelMap", relmap_redo, relmap_desc, relmap_identify, NULL,
PG_RMGR(RM_STANDBY_ID, "Standby", standby_redo, standby_desc, standby_identify, NULL, NULL, NULL)
PG_RMGR(RM_HEAP2_ID, "Heap2", heap2_redo, heap2_desc, heap2_identify, NULL, NULL, heap_mask)
PG_RMGR(RM_HEAP_ID, "Heap", heap_redo, heap_desc, heap_identify, NULL, NULL, heap_mask)
-PG_RMGR(RM_BTREE_ID, "Btree", btree_redo, btree_desc, btree_identify, NULL, NULL, btree_mask)
+PG_RMGR(RM_BTREE_ID, "Btree", btree_redo, btree_desc, btree_identify, btree_xlog_startup, btree_xlog_cleanup, btree_mask)
PG_RMGR(RM_HASH_ID, "Hash", hash_redo, hash_desc, hash_identify, NULL, NULL, hash_mask)
PG_RMGR(RM_GIN_ID, "Gin", gin_redo, gin_desc, gin_identify, gin_xlog_startup, gin_xlog_cleanup, gin_mask)
PG_RMGR(RM_GIST_ID, "Gist", gist_redo, gist_desc, gist_identify, gist_xlog_startup, gist_xlog_cleanup, gist_mask)
diff --git a/src/backend/access/common/reloptions.c b/src/backend/access/common/reloptions.c
index d8790ad7a3..d69402c08d 100644
--- a/src/backend/access/common/reloptions.c
+++ b/src/backend/access/common/reloptions.c
@@ -158,6 +158,15 @@ static relopt_bool boolRelOpts[] =
},
true
},
+ {
+ {
+ "deduplication",
+ "Enables deduplication on btree index leaf pages",
+ RELOPT_KIND_BTREE,
+ ShareUpdateExclusiveLock
+ },
+ true
+ },
/* list terminator */
{{NULL}}
};
@@ -1510,8 +1519,6 @@ default_reloptions(Datum reloptions, bool validate, relopt_kind kind)
offsetof(StdRdOptions, user_catalog_table)},
{"parallel_workers", RELOPT_TYPE_INT,
offsetof(StdRdOptions, parallel_workers)},
- {"vacuum_cleanup_index_scale_factor", RELOPT_TYPE_REAL,
- offsetof(StdRdOptions, vacuum_cleanup_index_scale_factor)},
{"vacuum_index_cleanup", RELOPT_TYPE_BOOL,
offsetof(StdRdOptions, vacuum_index_cleanup)},
{"vacuum_truncate", RELOPT_TYPE_BOOL,
diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
index 2599b5d342..6e1dc596e1 100644
--- a/src/backend/access/index/genam.c
+++ b/src/backend/access/index/genam.c
@@ -276,6 +276,10 @@ BuildIndexValueDescription(Relation indexRelation,
/*
* Get the latestRemovedXid from the table entries pointed at by the index
* tuples being deleted.
+ *
+ * Note: index access methods that don't consistently use the standard
+ * IndexTuple + heap TID item pointer representation will need to provide
+ * their own version of this function.
*/
TransactionId
index_compute_xid_horizon_for_tuples(Relation irel,
diff --git a/src/backend/access/nbtree/Makefile b/src/backend/access/nbtree/Makefile
index bf245f5dab..d69808e78c 100644
--- a/src/backend/access/nbtree/Makefile
+++ b/src/backend/access/nbtree/Makefile
@@ -14,6 +14,7 @@ include $(top_builddir)/src/Makefile.global
OBJS = \
nbtcompare.o \
+ nbtdedup.o \
nbtinsert.o \
nbtpage.o \
nbtree.o \
diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index 6db203e75c..54cb9db49d 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -432,7 +432,10 @@ because we allow LP_DEAD to be set with only a share lock (it's exactly
like a hint bit for a heap tuple), but physically removing tuples requires
exclusive lock. In the current code we try to remove LP_DEAD tuples when
we are otherwise faced with having to split a page to do an insertion (and
-hence have exclusive lock on it already).
+hence have exclusive lock on it already). Deduplication can also prevent
+a page split, but removing LP_DEAD tuples is the preferred approach.
+(Note that posting list tuples can only have their LP_DEAD bit set when
+every "logical" tuple represented within the posting list is known dead.)
This leaves the index in a state where it has no entry for a dead tuple
that still exists in the heap. This is not a problem for the current
@@ -710,6 +713,75 @@ the fallback strategy assumes that duplicates are mostly inserted in
ascending heap TID order. The page is split in a way that leaves the left
half of the page mostly full, and the right half of the page mostly empty.
+Notes about deduplication
+-------------------------
+
+We deduplicate non-pivot tuples in non-unique indexes to reduce storage
+overhead, and to avoid or at least delay page splits. Deduplication alters
+the physical representation of tuples without changing the logical contents
+of the index, and without adding overhead to read queries. Non-pivot
+tuples are folded together into a single physical tuple with a posting list
+(a simple array of heap TIDs with the standard item pointer format).
+Deduplication is always applied lazily, at the point where it would
+otherwise be necessary to perform a page split. It occurs only after
+LP_DEAD items have been removed, as our last line of defense against
+splitting a leaf page. We can set the LP_DEAD bit with posting list
+tuples, though only when all table tuples are known dead. (Bitmap scans
+cannot perform LP_DEAD bit setting, and are the common case with indexes
+that contain lots of duplicates, so this downside is considered
+acceptable.)
+
+Large groups of logical duplicates tend to appear together on the same leaf
+page due to the special duplicate logic used when choosing a split point.
+This facilitates lazy/dynamic deduplication. Deduplication can reliably
+deduplicate a large localized group of duplicates before it can span
+multiple leaf pages. Posting list tuples are subject to the same 1/3 of a
+page restriction as any other tuple.
+
+Lazy deduplication allows the page space accounting used during page splits
+to have absolutely minimal special case logic for posting lists. A posting
+list can be thought of as extra payload that suffix truncation will
+reliably truncate away as needed during page splits, just like non-key
+columns from an INCLUDE index tuple. An incoming tuple (which might cause
+a page split) can always be thought of as a non-posting-list tuple that
+must be inserted alongside existing items, without needing to consider
+deduplication. Most of the time, that's what actually happens: incoming
+tuples are either not duplicates, or are duplicates with a heap TID that
+doesn't overlap with any existing posting list tuple. When the incoming
+tuple really does overlap with an existing posting list, a posting list
+split is performed. Posting list splits work in a way that more or less
+preserves the illusion that all incoming tuples do not need to be merged
+with any existing posting list tuple.
+
+Posting list splits work by "overriding" the details of the incoming tuple.
+The heap TID of the incoming tuple is altered to make it match the
+rightmost heap TID from the existing/originally overlapping posting list.
+The offset number that the new/incoming tuple is to be inserted at is
+incremented so that it will be inserted to the right of the existing
+posting list. The insertion (or page split) operation that completes the
+insert does one extra step: an in-place update of the posting list. The
+update changes the posting list such that the "true" heap TID from the
+original incoming tuple is now contained in the posting list. We make
+space in the posting list by removing the heap TID that became the new
+item. The size of the posting list won't change, and so the page split
+space accounting does not need to care about posting lists. Also, overall
+space utilization is improved by keeping existing posting lists large.
+
+The representation of posting lists is identical to the posting lists used
+by GIN, so it would be straightforward to apply GIN's varbyte encoding
+compression scheme to individual posting lists. Posting list compression
+would break the assumptions made by posting list splits about page space
+accounting, though, so it's not clear how compression could be integrated
+with nbtree. Besides, posting list compression does not offer a compelling
+trade-off for nbtree, since in general nbtree is optimized for consistent
+performance with many concurrent readers and writers. A major goal of
+nbtree's lazy approach to deduplication is to limit the performance impact
+of deduplication with random updates. Even concurrent append-only inserts
+of the same key value will tend to have inserts of individual index tuples
+in an order that doesn't quite match heap TID order. In general, delaying
+deduplication avoids many unnecessary posting list splits, and minimizes
+page level fragmentation.
+
Notes About Data Representation
-------------------------------
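
The posting list split described in the README text above can be sketched in isolation as follows. This is only an illustration of the TID swap under simplifying assumptions, not the patch's _bt_swap_posting(): `SketchTid`, `sketch_tid_cmp` and `sketch_posting_split` are hypothetical names, TIDs are a bare (block, offset) pair, and the posting list is a plain sorted array rather than part of an index tuple.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical simplified heap TID */
typedef struct SketchTid
{
	uint32_t	block;
	uint16_t	offset;
} SketchTid;

/* Three-way comparison of two TIDs: block first, then offset */
static int
sketch_tid_cmp(const SketchTid *a, const SketchTid *b)
{
	if (a->block != b->block)
		return a->block < b->block ? -1 : 1;
	if (a->offset != b->offset)
		return a->offset < b->offset ? -1 : 1;
	return 0;
}

/*
 * 'posting' is a sorted array of 'nposting' TIDs and 'newtid' belongs at
 * position 'postingoff' (0 < postingoff < nposting).  On return the array
 * has the same length and contains the original new TID in sorted order,
 * while *newtid holds the former rightmost TID, to be inserted as a plain
 * tuple immediately to the right of the posting list.
 */
static void
sketch_posting_split(SketchTid *posting, int nposting, int postingoff,
					 SketchTid *newtid)
{
	SketchTid	oldmax = posting[nposting - 1];

	assert(postingoff > 0 && postingoff < nposting);
	assert(sketch_tid_cmp(&posting[postingoff - 1], newtid) < 0);
	assert(sketch_tid_cmp(newtid, &posting[postingoff]) < 0);

	/* shift the tail right by one slot, overwriting the old rightmost TID */
	memmove(&posting[postingoff + 1], &posting[postingoff],
			(size_t) (nposting - postingoff - 1) * sizeof(SketchTid));
	posting[postingoff] = *newtid;

	/* the incoming tuple takes over the old rightmost TID */
	*newtid = oldmax;
}
```

Because the array length never changes, the space accounting for the page is the same as for an ordinary insertion of one non-posting tuple, which is the property the README relies on.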
diff --git a/src/backend/access/nbtree/nbtdedup.c b/src/backend/access/nbtree/nbtdedup.c
new file mode 100644
index 0000000000..dde1d68d6f
--- /dev/null
+++ b/src/backend/access/nbtree/nbtdedup.c
@@ -0,0 +1,710 @@
+/*-------------------------------------------------------------------------
+ *
+ * nbtdedup.c
+ * Deduplicate items in Lehman and Yao btrees for Postgres.
+ *
+ * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ * src/backend/access/nbtree/nbtdedup.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/nbtree.h"
+#include "access/nbtxlog.h"
+#include "miscadmin.h"
+#include "utils/rel.h"
+
+
+/*
+ * Try to deduplicate items to free at least enough space to avoid a page
+ * split. This function should be called during insertion, only after LP_DEAD
+ * items were removed by _bt_vacuum_one_page() to prevent a page split.
+ * (We'll have to kill LP_DEAD items here when the page's BTP_HAS_GARBAGE hint
+ * was not set, but that should be rare.)
+ *
+ * The strategy for !checkingunique callers is to perform as much
+ * deduplication as possible to free as much space as possible now, since
+ * making it harder to set LP_DEAD bits is considered an acceptable price for
+ * not having to deduplicate the same page many times. It is unlikely that
+ * the items on the page will have their LP_DEAD bit set in the future, since
+ * that hasn't happened before now (besides, entire posting lists can still
+ * have their LP_DEAD bit set).
+ *
+ * The strategy for checkingunique callers is rather different, since the
+ * overall goal is different. Deduplication cooperates with and enhances
+ * garbage collection, especially the LP_DEAD bit setting that takes place in
+ * _bt_check_unique(). Deduplication does as little as possible while still
+ * preventing a page split for caller, since it's less likely that posting
+ * lists will have their LP_DEAD bit set. Deduplication avoids creating new
+ * posting lists with only two heap TIDs, and also avoids creating new posting
+ * lists from an existing posting list. Deduplication is only useful when it
+ * delays a page split long enough for garbage collection to prevent the page
+ * split altogether. checkingunique deduplication can make all the difference
+ * in cases where VACUUM keeps up with dead index tuples, but "recently dead"
+ * index tuples are still numerous enough to cause page splits that are truly
+ * unnecessary.
+ *
+ * Note: If newitem contains NULL values in key attributes, caller will be
+ * !checkingunique even when rel is a unique index. The page in question will
+ * usually have many existing items with NULLs.
+ */
+void
+_bt_dedup_one_page(Relation rel, Buffer buffer, Relation heapRel,
+ IndexTuple newitem, Size newitemsz, bool checkingunique)
+{
+ OffsetNumber offnum,
+ minoff,
+ maxoff;
+ Page page = BufferGetPage(buffer);
+ BTPageOpaque oopaque;
+ BTDedupState *state = NULL;
+ int natts = IndexRelationGetNumberOfAttributes(rel);
+ OffsetNumber deletable[MaxIndexTuplesPerPage];
+ bool minimal = checkingunique;
+ int ndeletable = 0;
+ Size pagesaving = 0;
+ int count = 0;
+ bool singlevalue = false;
+
+ oopaque = (BTPageOpaque) PageGetSpecialPointer(page);
+ /* init deduplication state needed to build posting tuples */
+ state = (BTDedupState *) palloc(sizeof(BTDedupState));
+ state->rel = rel;
+
+ state->maxitemsize = BTMaxItemSize(page);
+ state->newitem = newitem;
+ state->checkingunique = checkingunique;
+ state->skippedbase = InvalidOffsetNumber;
+ /* Metadata about current pending posting list */
+ state->htids = NULL;
+ state->nhtids = 0;
+ state->nitems = 0;
+ state->alltupsize = 0;
+ state->overlap = false;
+ /* Metadata about base tuple of current pending posting list */
+ state->base = NULL;
+ state->baseoff = InvalidOffsetNumber;
+ state->basetupsize = 0;
+
+ minoff = P_FIRSTDATAKEY(oopaque);
+ maxoff = PageGetMaxOffsetNumber(page);
+
+ /*
+ * Delete dead tuples if any. We cannot simply skip them in the cycle
+ * below, because it's necessary to generate special Xlog record
+ * containing such tuples to compute latestRemovedXid on a standby server
+ * later.
+ *
+ * This should not affect performance, since it only can happen in a rare
+ * situation when BTP_HAS_GARBAGE flag was not set and _bt_vacuum_one_page
+ * was not called, or _bt_vacuum_one_page didn't remove all dead items.
+ */
+ for (offnum = minoff;
+ offnum <= maxoff;
+ offnum = OffsetNumberNext(offnum))
+ {
+ ItemId itemid = PageGetItemId(page, offnum);
+
+ if (ItemIdIsDead(itemid))
+ deletable[ndeletable++] = offnum;
+ }
+
+ if (ndeletable > 0)
+ {
+ /*
+ * Skip deduplication in rare cases where there were LP_DEAD items
+ * encountered here when that frees sufficient space for caller to
+ * avoid a page split
+ */
+ _bt_delitems_delete(rel, buffer, deletable, ndeletable, heapRel);
+ if (PageGetFreeSpace(page) >= newitemsz)
+ {
+ pfree(state);
+ return;
+ }
+
+ /* Continue with deduplication */
+ minoff = P_FIRSTDATAKEY(oopaque);
+ maxoff = PageGetMaxOffsetNumber(page);
+ }
+
+ /* Make sure that new page won't have garbage flag set */
+ oopaque->btpo_flags &= ~BTP_HAS_GARBAGE;
+
+ /* Passed-in newitemsz is MAXALIGNED but does not include line pointer */
+ newitemsz += sizeof(ItemIdData);
+ /* Conservatively size array */
+ state->htids = palloc(state->maxitemsize);
+
+ /*
+ * Determine if a "single value" strategy page split is likely to occur
+ * shortly after deduplication finishes. It should be possible for the
+ * single value split to find a split point that packs the left half of
+ * the split BTREE_SINGLEVAL_FILLFACTOR% full.
+ */
+ if (!checkingunique)
+ {
+ ItemId itemid;
+ IndexTuple itup;
+
+ itemid = PageGetItemId(page, minoff);
+ itup = (IndexTuple) PageGetItem(page, itemid);
+
+ if (_bt_keep_natts_fast(rel, newitem, itup) > natts)
+ {
+ itemid = PageGetItemId(page, PageGetMaxOffsetNumber(page));
+ itup = (IndexTuple) PageGetItem(page, itemid);
+
+ /*
+ * Use different strategy if future page split likely to need to
+ * use "single value" strategy
+ */
+ if (_bt_keep_natts_fast(rel, newitem, itup) > natts)
+ singlevalue = true;
+ }
+ }
+
+ /*
+ * Iterate over tuples on the page, try to deduplicate them into posting
+ * lists, and update the page in place. NOTE: It's essential to reassess
+ * the max offset on each iteration, since it will change as items are
+ * deduplicated.
+ */
+ offnum = minoff;
+retry:
+ while (offnum <= PageGetMaxOffsetNumber(page))
+ {
+ ItemId itemid = PageGetItemId(page, offnum);
+ IndexTuple itup = (IndexTuple) PageGetItem(page, itemid);
+
+ Assert(!ItemIdIsDead(itemid));
+
+ if (state->nitems == 0)
+ {
+ /*
+ * No previous/base tuple for the data item -- use the data item
+ * as base tuple of pending posting list
+ */
+ _bt_dedup_start_pending(state, itup, offnum);
+ }
+ else if (_bt_keep_natts_fast(rel, state->base, itup) > natts &&
+ _bt_dedup_save_htid(state, itup))
+ {
+ /*
+ * Tuple is equal to base tuple of pending posting list. Heap
+ * TID(s) for itup have been saved in state. The next iteration
+ * will also end up here if it's possible to merge the next tuple
+ * into the same pending posting list.
+ */
+ }
+ else
+ {
+ /*
+ * Tuple is not equal to pending posting list tuple, or
+ * _bt_dedup_save_htid() opted to not merge current item into
+ * pending posting list for some other reason (e.g., adding more
+ * TIDs would have caused posting list to exceed BTMaxItemSize()
+ * limit).
+ *
+ * If state contains pending posting list with more than one item,
+ * form new posting tuple, and update the page. Otherwise, reset
+ * the state and move on.
+ */
+ pagesaving += _bt_dedup_finish_pending(buffer, state,
+ RelationNeedsWAL(rel));
+
+ count++;
+
+ /*
+ * When caller is a checkingunique caller and we have deduplicated
+ * enough to avoid a page split, do minimal deduplication in case
+ * the remaining items are about to be marked dead within
+ * _bt_check_unique().
+ */
+ if (minimal && pagesaving >= newitemsz)
+ break;
+
+ /*
+ * Consider special steps when a future page split of the leaf
+ * page is likely to occur using nbtsplitloc.c's "single value"
+ * strategy
+ */
+ if (singlevalue)
+ {
+ /*
+ * Adjust maxitemsize so that there isn't a third and final
+ * 1/3 of a page width tuple that fills the page to capacity.
+ * The third tuple produced should be smaller than the first
+ * two by an amount equal to the free space that nbtsplitloc.c
+ * is likely to want to leave behind when the page is split.
+ * When there are 3 posting lists on the page, then we end
+ * deduplication. Remaining tuples on the page can be
+ * deduplicated later, when they're on the new right sibling
+ * of this page, and the new sibling page needs to be split in
+ * turn.
+ *
+ * Note that it doesn't matter if there are items on the page
+ * that were already 1/3 of a page during current pass;
+ * they'll still count as the first two posting list tuples.
+ */
+ if (count == 2)
+ {
+ Size leftfree;
+
+ /* This calculation needs to match nbtsplitloc.c */
+ leftfree = PageGetPageSize(page) - SizeOfPageHeaderData -
+ MAXALIGN(sizeof(BTPageOpaqueData));
+ /* Subtract predicted size of new high key */
+ leftfree -= newitemsz + MAXALIGN(sizeof(ItemPointerData));
+
+ /*
+ * Reduce maxitemsize by an amount equal to target free
+ * space on left half of page
+ */
+ state->maxitemsize -= leftfree *
+ ((100 - BTREE_SINGLEVAL_FILLFACTOR) / 100.0);
+ }
+ else if (count == 3)
+ break;
+ }
+
+ /*
+ * Next iteration starts immediately after base tuple offset (this
+ * will be the next offset on the page when we didn't modify the
+ * page)
+ */
+ offnum = state->baseoff;
+ }
+
+ offnum = OffsetNumberNext(offnum);
+ }
+
+ /* Handle the last item when pending posting list is not empty */
+ if (state->nitems != 0)
+ {
+ pagesaving += _bt_dedup_finish_pending(buffer, state,
+ RelationNeedsWAL(rel));
+ count++;
+ }
+
+ if (pagesaving < newitemsz && state->skippedbase != InvalidOffsetNumber)
+ {
+ /*
+ * Didn't free enough space for new item in first checkingunique pass.
+ * Try making a second pass over the page, this time starting from the
+ * first candidate posting list base offset that was skipped over in
+ * the first pass (only do a second pass when this actually happened).
+ *
+ * The second pass over the page may deduplicate items that were
+ * initially passed over due to concerns about limiting the
+ * effectiveness of LP_DEAD bit setting within _bt_check_unique().
+ * Note that the second pass will still stop deduplicating as soon as
+ * enough space has been freed to avoid an immediate page split.
+ */
+ Assert(state->checkingunique);
+ offnum = state->skippedbase;
+
+ state->checkingunique = false;
+ state->skippedbase = InvalidOffsetNumber;
+ state->alltupsize = 0;
+ state->nitems = 0;
+ state->base = NULL;
+ state->baseoff = InvalidOffsetNumber;
+ state->basetupsize = 0;
+ goto retry;
+ }
+
+ /* Local space accounting should agree with page accounting */
+ Assert(pagesaving < newitemsz || PageGetExactFreeSpace(page) >= newitemsz);
+
+ /* be tidy */
+ pfree(state->htids);
+ pfree(state);
+}
+
+/*
+ * Create a new pending posting list tuple based on caller's tuple.
+ *
+ * Every tuple processed by the deduplication routines either becomes the base
+ * tuple for a posting list, or gets its heap TID(s) accepted into a pending
+ * posting list. A tuple that starts out as the base tuple for a posting list
+ * will only actually be rewritten within _bt_dedup_finish_pending() when
+ * there was at least one successful call to _bt_dedup_save_htid().
+ */
+void
+_bt_dedup_start_pending(BTDedupState *state, IndexTuple base,
+ OffsetNumber baseoff)
+{
+ Assert(state->nhtids == 0);
+ Assert(state->nitems == 0);
+
+ /*
+ * Copy heap TIDs from new base tuple for new candidate posting list into
+ * htids array. Assume that we'll eventually create a new posting tuple
+ * by merging later tuples with this existing one, though we may not.
+ */
+ if (!BTreeTupleIsPosting(base))
+ {
+ memcpy(state->htids, &base->t_tid, sizeof(ItemPointerData));
+ state->nhtids = 1;
+ /* Save size of tuple without any posting list */
+ state->basetupsize = IndexTupleSize(base);
+ }
+ else
+ {
+ int nposting;
+
+ nposting = BTreeTupleGetNPosting(base);
+ memcpy(state->htids, BTreeTupleGetPosting(base),
+ sizeof(ItemPointerData) * nposting);
+ state->nhtids = nposting;
+ /* Save size of tuple without any posting list */
+ state->basetupsize = BTreeTupleGetPostingOffset(base);
+ }
+
+ /*
+ * Save new base tuple itself -- it'll be needed if we actually create a
+ * new posting list from new pending posting list.
+ *
+ * Must maintain size of all tuples (including line pointer overhead) to
+ * calculate space savings on page within _bt_dedup_finish_pending().
+ * Also, save number of base tuple logical tuples so that we can save
+ * cycles in the common case where an existing posting list can't or won't
+ * be merged with other tuples on the page.
+ */
+ state->nitems = 1;
+ state->base = base;
+ state->baseoff = baseoff;
+ state->alltupsize = MAXALIGN(IndexTupleSize(base)) + sizeof(ItemIdData);
+ /* Also save baseoff in pending state for interval */
+ state->interval.baseoff = state->baseoff;
+ state->overlap = false;
+ if (state->newitem)
+ {
+ /* Might overlap with new item -- compare TID values, not addresses */
+ if (ItemPointerCompare(BTreeTupleGetHeapTID(base),
+ BTreeTupleGetHeapTID(state->newitem)) < 0)
+ state->overlap = true;
+ }
+}
+
+/*
+ * Save itup heap TID(s) into pending posting list where possible.
+ *
+ * Returns bool indicating if the pending posting list managed by state has
+ * itup's heap TID(s) saved. When this is false, itup could not be merged
+ * (most often because enlarging the pending posting list by the required
+ * amount would exceed the maxitemsize limit), so caller must finish the
+ * pending posting list tuple. (Generally itup becomes the base tuple of
+ * caller's new pending posting list).
+ */
+bool
+_bt_dedup_save_htid(BTDedupState *state, IndexTuple itup)
+{
+ int nhtids;
+ ItemPointer htids;
+ Size mergedtupsz;
+
+ if (!BTreeTupleIsPosting(itup))
+ {
+ nhtids = 1;
+ htids = &itup->t_tid;
+ }
+ else
+ {
+ nhtids = BTreeTupleGetNPosting(itup);
+ htids = BTreeTupleGetPosting(itup);
+ }
+
+ /*
+ * Don't append (have caller finish pending posting list as-is) if
+ * appending heap TID(s) from itup would put us over limit
+ */
+ mergedtupsz = MAXALIGN(state->basetupsize +
+ (state->nhtids + nhtids) *
+ sizeof(ItemPointerData));
+
+ if (mergedtupsz > state->maxitemsize)
+ return false;
+
+ /* Don't merge existing posting lists with checkingunique */
+ if (state->checkingunique &&
+ (BTreeTupleIsPosting(state->base) || nhtids > 1))
+ {
+ /* May begin here if second pass over page is required */
+ if (state->skippedbase == InvalidOffsetNumber)
+ state->skippedbase = state->baseoff;
+ return false;
+ }
+
+ if (state->overlap)
+ {
+ if (ItemPointerCompare(BTreeTupleGetMaxHeapTID(itup),
+ BTreeTupleGetHeapTID(state->newitem)) > 0)
+ {
+ /*
+ * newitem has heap TID in the range of the would-be new posting
+ * list. Avoid an immediate posting list split for caller.
+ */
+ if (_bt_keep_natts_fast(state->rel, state->newitem, itup) >
+ IndexRelationGetNumberOfAttributes(state->rel))
+ {
+ state->newitem = NULL; /* avoid unnecessary comparisons */
+ return false;
+ }
+ }
+ }
+
+ /*
+ * Save heap TIDs to pending posting list tuple -- itup can be merged into
+ * pending posting list
+ */
+ state->nitems++;
+ memcpy(state->htids + state->nhtids, htids,
+ sizeof(ItemPointerData) * nhtids);
+ state->nhtids += nhtids;
+ state->alltupsize += MAXALIGN(IndexTupleSize(itup)) + sizeof(ItemIdData);
+
+ return true;
+}
+
+/*
+ * Finalize pending posting list tuple, and add it to the page. Final tuple
+ * is based on saved base tuple, and saved list of heap TIDs.
+ *
+ * Returns space saving from deduplicating to make a new posting list tuple.
+ * Note that this includes line pointer overhead. This is zero in the case
+ * where no deduplication was possible.
+ */
+Size
+_bt_dedup_finish_pending(Buffer buffer, BTDedupState *state, bool need_wal)
+{
+ Size spacesaving = 0;
+ Page page = BufferGetPage(buffer);
+ int minimum = 2;
+
+ Assert(state->nitems > 0);
+ Assert(state->nitems <= state->nhtids);
+ Assert(state->interval.baseoff == state->baseoff);
+
+ /*
+ * Only create a posting list when at least 3 heap TIDs will appear in the
+ * checkingunique case (checkingunique strategy won't merge existing
+ * posting list tuples, so we know that the number of items here must also
+ * be the total number of heap TIDs). Creating a new posting list with
+ * only two heap TIDs won't even save enough space to fit another
+ * duplicate with the same key as the posting list. This is a bad
+ * trade-off if there is a chance that the LP_DEAD bit can be set for
+ * either existing tuple by putting off deduplication.
+ *
+ * (Note that a second pass over the page can deduplicate the item if that
+ * is truly the only way to avoid a page split for checkingunique caller)
+ */
+ Assert(!state->checkingunique || state->nitems == 1 ||
+ state->nhtids == state->nitems);
+ if (state->checkingunique)
+ {
+ minimum = 3;
+ /* May begin here if second pass over page is required */
+ if (state->nitems == 2 && state->skippedbase == InvalidOffsetNumber)
+ state->skippedbase = state->baseoff;
+ }
+
+ if (state->nitems >= minimum)
+ {
+ IndexTuple final;
+ Size finalsz;
+ OffsetNumber offnum;
+ OffsetNumber deletable[MaxOffsetNumber];
+ int ndeletable = 0;
+
+ /* find all tuples that will be replaced with this new posting tuple */
+ for (offnum = state->baseoff;
+ offnum < state->baseoff + state->nitems;
+ offnum = OffsetNumberNext(offnum))
+ deletable[ndeletable++] = offnum;
+
+ /* Form a tuple with a posting list */
+ final = _bt_form_posting(state->base, state->htids, state->nhtids);
+ finalsz = IndexTupleSize(final);
+ spacesaving = state->alltupsize - (finalsz + sizeof(ItemIdData));
+ /* Must have saved some space */
+ Assert(spacesaving > 0 && spacesaving < BLCKSZ);
+
+ /* Save final number of items for posting list */
+ state->interval.nitems = state->nitems;
+
+ Assert(finalsz <= state->maxitemsize);
+ Assert(finalsz == MAXALIGN(IndexTupleSize(final)));
+
+ START_CRIT_SECTION();
+
+ /* Delete items to replace */
+ PageIndexMultiDelete(page, deletable, ndeletable);
+ /* Insert posting tuple */
+ if (PageAddItem(page, (Item) final, finalsz, state->baseoff, false,
+ false) == InvalidOffsetNumber)
+ elog(ERROR, "deduplication failed to add tuple to page");
+
+ MarkBufferDirty(buffer);
+
+ /* Log deduplicated items */
+ if (need_wal)
+ {
+ XLogRecPtr recptr;
+ xl_btree_dedup xlrec_dedup;
+
+ xlrec_dedup.baseoff = state->interval.baseoff;
+ xlrec_dedup.nitems = state->interval.nitems;
+
+ XLogBeginInsert();
+ XLogRegisterBuffer(0, buffer, REGBUF_STANDARD);
+ XLogRegisterData((char *) &xlrec_dedup, SizeOfBtreeDedup);
+
+ recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_DEDUP_PAGE);
+
+ PageSetLSN(page, recptr);
+ }
+
+ END_CRIT_SECTION();
+
+ pfree(final);
+ }
+
+ /* Reset state for next pending posting list */
+ state->nhtids = 0;
+ state->nitems = 0;
+ state->alltupsize = 0;
+
+ return spacesaving;
+}
+
+/*
+ * Build a posting list tuple from a "base" index tuple and a list of heap
+ * TIDs for posting list.
+ *
+ * Caller's "htids" array must be sorted in ascending order. Any heap TIDs
+ * from caller's base tuple will not appear in returned posting list.
+ *
+ * If nhtids == 1, builds a non-posting tuple (posting list tuples can never
+ * have a single heap TID).
+ */
+IndexTuple
+_bt_form_posting(IndexTuple tuple, ItemPointer htids, int nhtids)
+{
+ uint32 keysize,
+ newsize = 0;
+ IndexTuple itup;
+
+ /* We only need key part of the tuple */
+ if (BTreeTupleIsPosting(tuple))
+ keysize = BTreeTupleGetPostingOffset(tuple);
+ else
+ keysize = IndexTupleSize(tuple);
+
+ Assert(nhtids > 0);
+
+ /* Add space needed for posting list */
+ if (nhtids > 1)
+ newsize = SHORTALIGN(keysize) + sizeof(ItemPointerData) * nhtids;
+ else
+ newsize = keysize;
+
+ newsize = MAXALIGN(newsize);
+ itup = palloc0(newsize);
+ memcpy(itup, tuple, keysize);
+ itup->t_info &= ~INDEX_SIZE_MASK;
+ itup->t_info |= newsize;
+
+ if (nhtids > 1)
+ {
+ /* Form posting tuple, fill posting fields */
+
+ itup->t_info |= INDEX_ALT_TID_MASK;
+ BTreeSetPostingMeta(itup, nhtids, SHORTALIGN(keysize));
+ /* Copy posting list into the posting tuple */
+ memcpy(BTreeTupleGetPosting(itup), htids,
+ sizeof(ItemPointerData) * nhtids);
+
+#ifdef USE_ASSERT_CHECKING
+ {
+ /* Assert that htid array is sorted and has unique TIDs */
+ ItemPointerData last;
+ ItemPointer current;
+
+ ItemPointerCopy(BTreeTupleGetHeapTID(itup), &last);
+
+ for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
+ {
+ current = BTreeTupleGetPostingN(itup, i);
+ Assert(ItemPointerCompare(current, &last) > 0);
+ ItemPointerCopy(current, &last);
+ }
+ }
+#endif
+ }
+ else
+ {
+ /* To finish building of a non-posting tuple, copy TID from htids */
+ itup->t_info &= ~INDEX_ALT_TID_MASK;
+ ItemPointerCopy(htids, &itup->t_tid);
+ }
+
+ return itup;
+}
+
+/*
+ * Prepare for a posting list split by swapping heap TID in newitem with heap
+ * TID from original posting list (the 'oposting' heap TID located at offset
+ * 'postingoff').
+ *
+ * Returns new posting list tuple, which is palloc()'d in caller's context.
+ * This is guaranteed to be the same size as 'oposting'. Modified version of
+ * newitem is what caller actually inserts inside the critical section that
+ * also performs an in-place update of posting list.
+ *
+ * Explicit WAL-logging of newitem must use the original version of newitem in
+ * order to make it possible for our nbtxlog.c callers to correctly REDO
+ * original steps. (This approach avoids any explicit WAL-logging of a
+ * posting list tuple. This is important because posting lists are often much
+ * larger than plain tuples.)
+ */
+IndexTuple
+_bt_swap_posting(IndexTuple newitem, IndexTuple oposting, int postingoff)
+{
+ int nhtids;
+ char *replacepos;
+ char *rightpos;
+ Size nbytes;
+ IndexTuple nposting;
+
+ nhtids = BTreeTupleGetNPosting(oposting);
+ Assert(postingoff > 0 && postingoff < nhtids);
+
+ nposting = CopyIndexTuple(oposting);
+ replacepos = (char *) BTreeTupleGetPostingN(nposting, postingoff);
+ rightpos = replacepos + sizeof(ItemPointerData);
+ nbytes = (nhtids - postingoff - 1) * sizeof(ItemPointerData);
+
+ /*
+ * Move item pointers in posting list to make a gap for the new item's
+ * heap TID (shift TIDs one place to the right, losing original rightmost
+ * TID)
+ */
+ memmove(rightpos, replacepos, nbytes);
+
+ /* Fill the gap with the TID of the new item */
+ ItemPointerCopy(&newitem->t_tid, (ItemPointer) replacepos);
+
+ /* Copy original posting list's rightmost TID into new item */
+ ItemPointerCopy(BTreeTupleGetPostingN(oposting, nhtids - 1),
+ &newitem->t_tid);
+ Assert(ItemPointerCompare(BTreeTupleGetMaxHeapTID(nposting),
+ BTreeTupleGetHeapTID(newitem)) < 0);
+ Assert(BTreeTupleGetNPosting(oposting) == BTreeTupleGetNPosting(nposting));
+
+ return nposting;
+}
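
To make the TID swap done by _bt_swap_posting() above easier to follow, here
is a minimal standalone sketch of the same idea on a plain sorted array. The
names DemoTid, demo_tid_cmp() and demo_swap_posting() are made up for
illustration and are not part of the patch; the real code operates on
ItemPointerData stored inside an index tuple.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for ItemPointerData: a (block, offset) heap address */
typedef struct DemoTid
{
	uint32_t	block;
	uint16_t	offset;
} DemoTid;

/* Order TIDs by block, then offset; returns <0, 0 or >0 */
static int
demo_tid_cmp(const DemoTid *a, const DemoTid *b)
{
	if (a->block != b->block)
		return (a->block < b->block) ? -1 : 1;
	if (a->offset != b->offset)
		return (a->offset < b->offset) ? -1 : 1;
	return 0;
}

/*
 * Place *newtid at position postingoff of the sorted array tids[0..ntids-1]
 * without growing the array: shift the tail one slot to the right, which
 * pushes the old rightmost TID out, and hand that TID back through *newtid.
 */
static void
demo_swap_posting(DemoTid *tids, int ntids, int postingoff, DemoTid *newtid)
{
	DemoTid		rightmost = tids[ntids - 1];

	/* Same precondition as the Assert in _bt_swap_posting() */
	assert(postingoff > 0 && postingoff < ntids);

	/* Open a gap at postingoff; the last element is overwritten */
	memmove(&tids[postingoff + 1], &tids[postingoff],
			(size_t) (ntids - postingoff - 1) * sizeof(DemoTid));

	/* Fill the gap with the new TID, give caller the displaced one */
	tids[postingoff] = *newtid;
	*newtid = rightmost;
}
```

For example, with the posting list {(1,1), (1,3), (1,5), (2,1)}, a new TID of
(1,4) and postingoff = 2, the array becomes {(1,1), (1,3), (1,4), (1,5)} and
the caller is left holding (2,1). That TID sorts after everything in the list,
so it can be inserted as a plain tuple immediately after the posting list,
and the posting list itself keeps its original size.
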
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index b84bf1c3df..e5f6023ad0 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -47,10 +47,12 @@ static void _bt_insertonpg(Relation rel, BTScanInsert itup_key,
BTStack stack,
IndexTuple itup,
OffsetNumber newitemoff,
+ int postingoff,
bool split_only_page);
static Buffer _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf,
Buffer cbuf, OffsetNumber newitemoff, Size newitemsz,
- IndexTuple newitem);
+ IndexTuple newitem, IndexTuple orignewitem,
+ IndexTuple nposting, OffsetNumber postingoff);
static void _bt_insert_parent(Relation rel, Buffer buf, Buffer rbuf,
BTStack stack, bool is_root, bool is_only);
static bool _bt_pgaddtup(Page page, Size itemsize, IndexTuple itup,
@@ -61,7 +63,8 @@ static void _bt_vacuum_one_page(Relation rel, Buffer buffer, Relation heapRel);
* _bt_doinsert() -- Handle insertion of a single index tuple in the tree.
*
* This routine is called by the public interface routine, btinsert.
- * By here, itup is filled in, including the TID.
+ * By here, itup is filled in, including the TID. Caller should be
+ * prepared for us to scribble on 'itup'.
*
* If checkUnique is UNIQUE_CHECK_NO or UNIQUE_CHECK_PARTIAL, this
* will allow duplicates. Otherwise (UNIQUE_CHECK_YES or
@@ -125,6 +128,7 @@ _bt_doinsert(Relation rel, IndexTuple itup,
insertstate.itup_key = itup_key;
insertstate.bounds_valid = false;
insertstate.buf = InvalidBuffer;
+ insertstate.postingoff = 0;
/*
* It's very common to have an index on an auto-incremented or
@@ -300,7 +304,7 @@ top:
newitemoff = _bt_findinsertloc(rel, &insertstate, checkingunique,
stack, heapRel);
_bt_insertonpg(rel, itup_key, insertstate.buf, InvalidBuffer, stack,
- itup, newitemoff, false);
+ itup, newitemoff, insertstate.postingoff, false);
}
else
{
@@ -353,6 +357,9 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
BTPageOpaque opaque;
Buffer nbuf = InvalidBuffer;
bool found = false;
+ bool inposting = false;
+ bool prev_all_dead = true;
+ int curposti = 0;
/* Assume unique until we find a duplicate */
*is_unique = true;
@@ -374,6 +381,11 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
/*
* Scan over all equal tuples, looking for live conflicts.
+ *
+ * Note that each iteration of the loop processes one heap TID, not one
+ * index tuple. The page offset number won't be advanced for iterations
+ * which process heap TIDs from posting list tuples until the last such
+ * heap TID for the posting list (curposti will be advanced instead).
*/
Assert(!insertstate->bounds_valid || insertstate->low == offset);
Assert(!itup_key->anynullkeys);
@@ -435,7 +447,27 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
/* okay, we gotta fetch the heap tuple ... */
curitup = (IndexTuple) PageGetItem(page, curitemid);
- htid = curitup->t_tid;
+
+ /*
+ * decide if this is the first heap TID in tuple we'll
+ * process, or if we should continue to process current
+ * posting list
+ */
+ if (!BTreeTupleIsPosting(curitup))
+ {
+ htid = curitup->t_tid;
+ inposting = false;
+ }
+ else if (!inposting)
+ {
+ /* First heap TID in posting list */
+ inposting = true;
+ prev_all_dead = true;
+ curposti = 0;
+ }
+
+ if (inposting)
+ htid = *BTreeTupleGetPostingN(curitup, curposti);
/*
* If we are doing a recheck, we expect to find the tuple we
@@ -511,8 +543,7 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
* not part of this chain because it had a different index
* entry.
*/
- htid = itup->t_tid;
- if (table_index_fetch_tuple_check(heapRel, &htid,
+ if (table_index_fetch_tuple_check(heapRel, &itup->t_tid,
SnapshotSelf, NULL))
{
/* Normal case --- it's still live */
@@ -570,12 +601,14 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
RelationGetRelationName(rel))));
}
}
- else if (all_dead)
+ else if (all_dead && (!inposting ||
+ (prev_all_dead &&
+ curposti == BTreeTupleGetNPosting(curitup) - 1)))
{
/*
- * The conflicting tuple (or whole HOT chain) is dead to
- * everyone, so we may as well mark the index entry
- * killed.
+ * The conflicting tuple (or all HOT chains pointed to by
+ * all posting list TIDs) is dead to everyone, so mark the
+ * index entry killed.
*/
ItemIdMarkDead(curitemid);
opaque->btpo_flags |= BTP_HAS_GARBAGE;
@@ -589,14 +622,29 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
else
MarkBufferDirtyHint(insertstate->buf, true);
}
+
+ /*
+ * Remember if posting list tuple has even a single HOT chain
+ * whose members are not all dead
+ */
+ if (!all_dead && inposting)
+ prev_all_dead = false;
}
}
- /*
- * Advance to next tuple to continue checking.
- */
- if (offset < maxoff)
+ if (inposting && curposti < BTreeTupleGetNPosting(curitup) - 1)
+ {
+ /* Advance to next TID in same posting list */
+ curposti++;
+ continue;
+ }
+ else if (offset < maxoff)
+ {
+ /* Advance to next tuple */
+ curposti = 0;
+ inposting = false;
offset = OffsetNumberNext(offset);
+ }
else
{
int highkeycmp;
@@ -621,6 +669,8 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
elog(ERROR, "fell off the end of index \"%s\"",
RelationGetRelationName(rel));
}
+ curposti = 0;
+ inposting = false;
maxoff = PageGetMaxOffsetNumber(page);
offset = P_FIRSTDATAKEY(opaque);
/* Don't invalidate binary search bounds */
@@ -689,6 +739,7 @@ _bt_findinsertloc(Relation rel,
BTScanInsert itup_key = insertstate->itup_key;
Page page = BufferGetPage(insertstate->buf);
BTPageOpaque lpageop;
+ OffsetNumber location;
lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -751,13 +802,26 @@ _bt_findinsertloc(Relation rel,
/*
* If the target page is full, see if we can obtain enough space by
- * erasing LP_DEAD items
+ * erasing LP_DEAD items. If that doesn't work out, and if index
+ * deduplication is both possible and enabled, try deduplication.
*/
- if (PageGetFreeSpace(page) < insertstate->itemsz &&
- P_HAS_GARBAGE(lpageop))
+ if (PageGetFreeSpace(page) < insertstate->itemsz)
{
- _bt_vacuum_one_page(rel, insertstate->buf, heapRel);
- insertstate->bounds_valid = false;
+ if (P_HAS_GARBAGE(lpageop))
+ {
+ _bt_vacuum_one_page(rel, insertstate->buf, heapRel);
+ insertstate->bounds_valid = false;
+ }
+
+ if (insertstate->itup_key->safededup &&
+ BtreeGetDoDedupOption(rel) &&
+ PageGetFreeSpace(page) < insertstate->itemsz)
+ {
+ _bt_dedup_one_page(rel, insertstate->buf, heapRel,
+ insertstate->itup, insertstate->itemsz,
+ checkingunique);
+ insertstate->bounds_valid = false;
+ }
}
}
else
@@ -839,7 +903,38 @@ _bt_findinsertloc(Relation rel,
Assert(P_RIGHTMOST(lpageop) ||
_bt_compare(rel, itup_key, page, P_HIKEY) <= 0);
- return _bt_binsrch_insert(rel, insertstate);
+ location = _bt_binsrch_insert(rel, insertstate);
+
+ /*
+ * Insertion is not prepared for the case where an LP_DEAD posting list
+ * tuple must be split. In the unlikely event that this happens, call
+ * _bt_dedup_one_page() to force it to kill all LP_DEAD items.
+ */
+ if (unlikely(insertstate->postingoff == -1))
+ {
+ Assert(insertstate->itup_key->safededup);
+
+ /*
+ * Don't check if the option is enabled, since no actual deduplication
+ * will be done, just cleanup.
+ */
+ _bt_dedup_one_page(rel, insertstate->buf, heapRel, insertstate->itup,
+ 0, checkingunique);
+ Assert(!P_HAS_GARBAGE(lpageop));
+
+ /* Must reset insertstate ahead of new _bt_binsrch_insert() call */
+ insertstate->bounds_valid = false;
+ insertstate->postingoff = 0;
+ location = _bt_binsrch_insert(rel, insertstate);
+
+ /*
+ * Might still have to split some other posting list now, but that
+ * should never be LP_DEAD
+ */
+ Assert(insertstate->postingoff >= 0);
+ }
+
+ return location;
}
/*
@@ -905,10 +1000,12 @@ _bt_stepright(Relation rel, BTInsertState insertstate, BTStack stack)
*
* This recursive procedure does the following things:
*
+ * + if necessary, splits an existing posting list on page.
+ * This is only needed when 'postingoff' is non-zero.
* + if necessary, splits the target page, using 'itup_key' for
* suffix truncation on leaf pages (caller passes NULL for
* non-leaf pages).
- * + inserts the tuple.
+ * + inserts the new tuple (could be from split posting list).
* + if the page was split, pops the parent stack, and finds the
* right place to insert the new child pointer (by walking
* right using information stored in the parent stack).
@@ -918,7 +1015,8 @@ _bt_stepright(Relation rel, BTInsertState insertstate, BTStack stack)
*
* On entry, we must have the correct buffer in which to do the
* insertion, and the buffer must be pinned and write-locked. On return,
- * we will have dropped both the pin and the lock on the buffer.
+ * we will have dropped both the pin and the lock on the buffer. Caller
+ * should be prepared for us to scribble on 'itup'.
*
* This routine only performs retail tuple insertions. 'itup' should
* always be either a non-highkey leaf item, or a downlink (new high
@@ -936,11 +1034,15 @@ _bt_insertonpg(Relation rel,
BTStack stack,
IndexTuple itup,
OffsetNumber newitemoff,
+ int postingoff,
bool split_only_page)
{
Page page;
BTPageOpaque lpageop;
Size itemsz;
+ IndexTuple oposting;
+ IndexTuple origitup = NULL;
+ IndexTuple nposting = NULL;
page = BufferGetPage(buf);
lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -954,6 +1056,8 @@ _bt_insertonpg(Relation rel,
Assert(P_ISLEAF(lpageop) ||
BTreeTupleGetNAtts(itup, rel) <=
IndexRelationGetNumberOfKeyAttributes(rel));
+ /* retail insertions of posting list tuples are disallowed */
+ Assert(!BTreeTupleIsPosting(itup));
/* The caller should've finished any incomplete splits already. */
if (P_INCOMPLETE_SPLIT(lpageop))
@@ -964,6 +1068,39 @@ _bt_insertonpg(Relation rel,
itemsz = MAXALIGN(itemsz); /* be safe, PageAddItem will do this but we
* need to be consistent */
+ /*
+ * Do we need to split an existing posting list item?
+ */
+ if (postingoff != 0)
+ {
+ ItemId itemid = PageGetItemId(page, newitemoff);
+
+ /*
+ * The new tuple is a duplicate with a heap TID that falls inside the
+ * range of an existing posting list tuple on a leaf page. Prepare to
+ * split an existing posting list by swapping new item's heap TID with
+ * the rightmost heap TID from original posting list, and generating a
+ * new version of the posting list that has new item's heap TID.
+ *
+ * Posting list splits work by modifying the overlapping posting list
+ * as part of the same atomic operation that inserts the "new item".
+ * The space accounting is kept simple, since it does not need to
+ * consider posting list splits at all (this is particularly important
+ * for the case where we also have to split the page). Overwriting
+ * the posting list with its post-split version is treated as an extra
+ * step in either the insert or page split critical section.
+ */
+ Assert(P_ISLEAF(lpageop) && !ItemIdIsDead(itemid));
+ oposting = (IndexTuple) PageGetItem(page, itemid);
+
+ /* save a copy of itup with unchanged TID for xlog record */
+ origitup = CopyIndexTuple(itup);
+ nposting = _bt_swap_posting(itup, oposting, postingoff);
+
+ /* Alter offset so that it goes after existing posting list */
+ newitemoff = OffsetNumberNext(newitemoff);
+ }
+
/*
* Do we need to split the page to fit the item on it?
*
@@ -996,7 +1133,8 @@ _bt_insertonpg(Relation rel,
BlockNumberIsValid(RelationGetTargetBlock(rel))));
/* split the buffer into left and right halves */
- rbuf = _bt_split(rel, itup_key, buf, cbuf, newitemoff, itemsz, itup);
+ rbuf = _bt_split(rel, itup_key, buf, cbuf, newitemoff, itemsz, itup,
+ origitup, nposting, postingoff);
PredicateLockPageSplit(rel,
BufferGetBlockNumber(buf),
BufferGetBlockNumber(rbuf));
@@ -1075,6 +1213,13 @@ _bt_insertonpg(Relation rel,
elog(PANIC, "failed to add new item to block %u in index \"%s\"",
itup_blkno, RelationGetRelationName(rel));
+ /*
+ * Posting list split requires an in-place update of the existing
+ * posting list
+ */
+ if (nposting)
+ memcpy(oposting, nposting, MAXALIGN(IndexTupleSize(nposting)));
+
MarkBufferDirty(buf);
if (BufferIsValid(metabuf))
@@ -1116,6 +1261,7 @@ _bt_insertonpg(Relation rel,
XLogRecPtr recptr;
xlrec.offnum = itup_off;
+ xlrec.postingoff = postingoff;
XLogBeginInsert();
XLogRegisterData((char *) &xlrec, SizeOfBtreeInsert);
@@ -1144,6 +1290,7 @@ _bt_insertonpg(Relation rel,
xlmeta.oldest_btpo_xact = metad->btm_oldest_btpo_xact;
xlmeta.last_cleanup_num_heap_tuples =
metad->btm_last_cleanup_num_heap_tuples;
+ xlmeta.btm_safededup = metad->btm_safededup;
XLogRegisterBuffer(2, metabuf, REGBUF_WILL_INIT | REGBUF_STANDARD);
XLogRegisterBufData(2, (char *) &xlmeta, sizeof(xl_btree_metadata));
@@ -1152,7 +1299,19 @@ _bt_insertonpg(Relation rel,
}
XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
- XLogRegisterBufData(0, (char *) itup, IndexTupleSize(itup));
+
+ /*
+ * We always write newitem to the page, but when there is an
+ * original newitem due to a posting list split then we log the
+ * original item instead. REDO routine must reconstruct the final
+ * newitem at the same time it reconstructs nposting.
+ */
+ if (postingoff == 0)
+ XLogRegisterBufData(0, (char *) itup,
+ IndexTupleSize(itup));
+ else
+ XLogRegisterBufData(0, (char *) origitup,
+ IndexTupleSize(origitup));
recptr = XLogInsert(RM_BTREE_ID, xlinfo);
@@ -1194,6 +1353,13 @@ _bt_insertonpg(Relation rel,
_bt_getrootheight(rel) >= BTREE_FASTPATH_MIN_LEVEL)
RelationSetTargetBlock(rel, cachedBlock);
}
+
+ /* be tidy */
+ if (postingoff != 0)
+ {
+ pfree(nposting);
+ pfree(origitup);
+ }
}
/*
@@ -1209,12 +1375,25 @@ _bt_insertonpg(Relation rel,
* This function will clear the INCOMPLETE_SPLIT flag on it, and
* release the buffer.
*
+ * orignewitem, nposting, and postingoff are needed when an insert of
+ * orignewitem results in both a posting list split and a page split.
+ * newitem and nposting are replacements for orignewitem and the
+ * existing posting list on the page respectively. These extra
+ * posting list split details are used here in the same way as they
+ * are used in the more common case where a posting list split does
+ * not coincide with a page split. We need to deal with posting list
+ * splits directly in order to ensure that everything that follows
+ * from the insert of orignewitem is handled as a single atomic
+ * operation (though caller's insert of a new pivot/downlink into
+ * parent page will still be a separate operation).
+ *
* Returns the new right sibling of buf, pinned and write-locked.
* The pin and lock on buf are maintained.
*/
static Buffer
_bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
- OffsetNumber newitemoff, Size newitemsz, IndexTuple newitem)
+ OffsetNumber newitemoff, Size newitemsz, IndexTuple newitem,
+ IndexTuple orignewitem, IndexTuple nposting, OffsetNumber postingoff)
{
Buffer rbuf;
Page origpage;
@@ -1236,12 +1415,23 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
OffsetNumber firstright;
OffsetNumber maxoff;
OffsetNumber i;
+ OffsetNumber replacepostingoff = InvalidOffsetNumber;
bool newitemonleft,
isleaf;
IndexTuple lefthikey;
int indnatts = IndexRelationGetNumberOfAttributes(rel);
int indnkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
+ /*
+ * Determine offset number of existing posting list on page when a split
+ * of a posting list needs to take place as the page is split
+ */
+ if (nposting != NULL)
+ {
+ Assert(itup_key->heapkeyspace);
+ replacepostingoff = OffsetNumberPrev(newitemoff);
+ }
+
/*
* origpage is the original page to be split. leftpage is a temporary
* buffer that receives the left-sibling data, which will be copied back
@@ -1273,6 +1463,13 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
* newitemoff == firstright. In all other cases it's clear which side of
* the split every tuple goes on from context. newitemonleft is usually
* (but not always) redundant information.
+ *
+ * Note: In theory, the split point choice logic should operate against a
+ * version of the page that already replaced the posting list at offset
+ * replacepostingoff with nposting where applicable. We don't bother with
+ * that, though. Both versions of the posting list must be the same size,
+ * and both will have the same base tuple key values, so split point
+ * choice is never affected.
*/
firstright = _bt_findsplitloc(rel, origpage, newitemoff, newitemsz,
newitem, &newitemonleft);
@@ -1340,6 +1537,9 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
itemid = PageGetItemId(origpage, firstright);
itemsz = ItemIdGetLength(itemid);
item = (IndexTuple) PageGetItem(origpage, itemid);
+ /* Behave as if origpage posting list has already been swapped */
+ if (firstright == replacepostingoff)
+ item = nposting;
}
/*
@@ -1373,6 +1573,9 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
Assert(lastleftoff >= P_FIRSTDATAKEY(oopaque));
itemid = PageGetItemId(origpage, lastleftoff);
lastleft = (IndexTuple) PageGetItem(origpage, itemid);
+ /* Behave as if origpage posting list has already been swapped */
+ if (lastleftoff == replacepostingoff)
+ lastleft = nposting;
}
Assert(lastleft != item);
@@ -1480,8 +1683,23 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
itemsz = ItemIdGetLength(itemid);
item = (IndexTuple) PageGetItem(origpage, itemid);
+ /*
+ * did caller pass new replacement posting list tuple due to posting
+ * list split?
+ */
+ if (i == replacepostingoff)
+ {
+ /*
+ * swap origpage posting list with post-posting-list-split version
+ * from caller
+ */
+ Assert(isleaf);
+ Assert(itemsz == MAXALIGN(IndexTupleSize(nposting)));
+ item = nposting;
+ }
+
/* does new item belong before this one? */
- if (i == newitemoff)
+ else if (i == newitemoff)
{
if (newitemonleft)
{
@@ -1650,8 +1868,12 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
XLogRecPtr recptr;
xlrec.level = ropaque->btpo.level;
+ /* See comments below on newitem, orignewitem, and posting lists */
xlrec.firstright = firstright;
xlrec.newitemoff = newitemoff;
+ xlrec.postingoff = InvalidOffsetNumber;
+ if (replacepostingoff < firstright)
+ xlrec.postingoff = postingoff;
XLogBeginInsert();
XLogRegisterData((char *) &xlrec, SizeOfBtreeSplit);
@@ -1670,11 +1892,45 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
* because it's included with all the other items on the right page.)
* Show the new item as belonging to the left page buffer, so that it
* is not stored if XLogInsert decides it needs a full-page image of
- * the left page. We store the offset anyway, though, to support
- * archive compression of these records.
+ * the left page. We always store newitemoff in the record, though.
+ *
+ * The details are sometimes slightly different for page splits that
+ * coincide with a posting list split. If both the replacement
+ * posting list and newitem go on the right page, then we don't need
+ * to log anything extra, just like the simple !newitemonleft
+ * no-posting-split case (postingoff isn't set in the WAL record, so
+ * recovery doesn't need to process a posting list split at all).
+ * Otherwise, we set postingoff and log orignewitem instead of
+ * newitem, despite having actually inserted newitem. Recovery must
+ * reconstruct nposting and newitem by calling _bt_swap_posting().
+ *
+ * Note: It's possible that our page split point is the point that
+ * makes the posting list lastleft and newitem firstright. This is
+ * the only case where we log orignewitem despite newitem going on the
+ * right page. If XLogInsert decides that it can omit orignewitem due
+ * to logging a full-page image of the left page, everything still
+ * works out, since recovery only needs orignewitem for items on the
+ * left page (just like the regular newitem-logged case).
*/
- if (newitemonleft)
- XLogRegisterBufData(0, (char *) newitem, MAXALIGN(newitemsz));
+ if (newitemonleft || xlrec.postingoff != InvalidOffsetNumber)
+ {
+ if (xlrec.postingoff == InvalidOffsetNumber)
+ {
+ /* Must WAL-log newitem, since it's on left page */
+ Assert(newitemonleft);
+ Assert(orignewitem == NULL && nposting == NULL);
+ XLogRegisterBufData(0, (char *) newitem, MAXALIGN(newitemsz));
+ }
+ else
+ {
+ /* Must WAL-log orignewitem following posting list split */
+ Assert(newitemonleft || firstright == newitemoff);
+ Assert(ItemPointerCompare(&orignewitem->t_tid,
+ &newitem->t_tid) < 0);
+ XLogRegisterBufData(0, (char *) orignewitem,
+ MAXALIGN(IndexTupleSize(orignewitem)));
+ }
+ }
/* Log the left page's new high key */
itemid = PageGetItemId(origpage, P_HIKEY);
@@ -1834,7 +2090,7 @@ _bt_insert_parent(Relation rel,
/* Recursively insert into the parent */
_bt_insertonpg(rel, NULL, pbuf, buf, stack->bts_parent,
- new_item, stack->bts_offset + 1,
+ new_item, stack->bts_offset + 1, 0,
is_only);
/* be tidy */
@@ -2190,6 +2446,7 @@ _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf)
md.fastlevel = metad->btm_level;
md.oldest_btpo_xact = metad->btm_oldest_btpo_xact;
md.last_cleanup_num_heap_tuples = metad->btm_last_cleanup_num_heap_tuples;
+ md.btm_safededup = metad->btm_safededup;
XLogRegisterBufData(2, (char *) &md, sizeof(xl_btree_metadata));
@@ -2304,6 +2561,6 @@ _bt_vacuum_one_page(Relation rel, Buffer buffer, Relation heapRel)
* Note: if we didn't find any LP_DEAD items, then the page's
* BTP_HAS_GARBAGE hint bit is falsely set. We do not bother expending a
* separate write to clear it, however. We will clear it when we split
- * the page.
+ * the page (or when deduplication runs).
*/
}
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 268f869a36..77f443f7a9 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -24,6 +24,7 @@
#include "access/nbtree.h"
#include "access/nbtxlog.h"
+#include "access/tableam.h"
#include "access/transam.h"
#include "access/xlog.h"
#include "access/xloginsert.h"
@@ -42,12 +43,18 @@ static bool _bt_lock_branch_parent(Relation rel, BlockNumber child,
BlockNumber *target, BlockNumber *rightsib);
static void _bt_log_reuse_page(Relation rel, BlockNumber blkno,
TransactionId latestRemovedXid);
+static TransactionId _bt_compute_xid_horizon_for_tuples(Relation rel,
+ Relation heapRel,
+ Buffer buf,
+ OffsetNumber *itemnos,
+ int nitems);
/*
* _bt_initmetapage() -- Fill a page buffer with a correct metapage image
*/
void
-_bt_initmetapage(Page page, BlockNumber rootbknum, uint32 level)
+_bt_initmetapage(Page page, BlockNumber rootbknum, uint32 level,
+ bool safededup)
{
BTMetaPageData *metad;
BTPageOpaque metaopaque;
@@ -63,6 +70,7 @@ _bt_initmetapage(Page page, BlockNumber rootbknum, uint32 level)
metad->btm_fastlevel = level;
metad->btm_oldest_btpo_xact = InvalidTransactionId;
metad->btm_last_cleanup_num_heap_tuples = -1.0;
+ metad->btm_safededup = safededup;
metaopaque = (BTPageOpaque) PageGetSpecialPointer(page);
metaopaque->btpo_flags = BTP_META;
@@ -102,6 +110,9 @@ _bt_upgrademetapage(Page page)
metad->btm_version = BTREE_NOVAC_VERSION;
metad->btm_oldest_btpo_xact = InvalidTransactionId;
metad->btm_last_cleanup_num_heap_tuples = -1.0;
+ /* Only a REINDEX can set this field */
+ Assert(!metad->btm_safededup);
+ metad->btm_safededup = false;
/* Adjust pd_lower (see _bt_initmetapage() for details) */
((PageHeader) page)->pd_lower =
@@ -213,6 +224,7 @@ _bt_update_meta_cleanup_info(Relation rel, TransactionId oldestBtpoXact,
md.fastlevel = metad->btm_fastlevel;
md.oldest_btpo_xact = oldestBtpoXact;
md.last_cleanup_num_heap_tuples = numHeapTuples;
+ md.btm_safededup = metad->btm_safededup;
XLogRegisterBufData(0, (char *) &md, sizeof(xl_btree_metadata));
@@ -274,6 +286,8 @@ _bt_getroot(Relation rel, int access)
Assert(metad->btm_magic == BTREE_MAGIC);
Assert(metad->btm_version >= BTREE_MIN_VERSION);
Assert(metad->btm_version <= BTREE_VERSION);
+ Assert(!metad->btm_safededup ||
+ metad->btm_version > BTREE_NOVAC_VERSION);
Assert(metad->btm_root != P_NONE);
rootblkno = metad->btm_fastroot;
@@ -394,6 +408,7 @@ _bt_getroot(Relation rel, int access)
md.fastlevel = 0;
md.oldest_btpo_xact = InvalidTransactionId;
md.last_cleanup_num_heap_tuples = -1.0;
+ md.btm_safededup = metad->btm_safededup;
XLogRegisterBufData(2, (char *) &md, sizeof(xl_btree_metadata));
@@ -618,6 +633,7 @@ _bt_getrootheight(Relation rel)
Assert(metad->btm_magic == BTREE_MAGIC);
Assert(metad->btm_version >= BTREE_MIN_VERSION);
Assert(metad->btm_version <= BTREE_VERSION);
+ Assert(!metad->btm_safededup || metad->btm_version > BTREE_NOVAC_VERSION);
Assert(metad->btm_fastroot != P_NONE);
return metad->btm_fastlevel;
@@ -683,6 +699,56 @@ _bt_heapkeyspace(Relation rel)
return metad->btm_version > BTREE_NOVAC_VERSION;
}
+/*
+ * _bt_safededup() -- can deduplication safely be used by index?
+ *
+ * Uses field from index relation's metapage/cached metapage.
+ */
+bool
+_bt_safededup(Relation rel)
+{
+ BTMetaPageData *metad;
+
+ if (rel->rd_amcache == NULL)
+ {
+ Buffer metabuf;
+
+ metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_READ);
+ metad = _bt_getmeta(rel, metabuf);
+
+ /*
+ * If there's no root page yet, _bt_getroot() doesn't expect a cache
+ * to be made, so just stop here. (XXX perhaps _bt_getroot() should
+ * be changed to allow this case.)
+ *
+ * Note that we rely on the assumption that this field will be zero'ed
+ * on indexes that were pg_upgrade'd.
+ */
+ if (metad->btm_root == P_NONE)
+ {
+ _bt_relbuf(rel, metabuf);
+ return metad->btm_safededup;
+ }
+
+ /* Cache the metapage data for next time */
+ rel->rd_amcache = MemoryContextAlloc(rel->rd_indexcxt,
+ sizeof(BTMetaPageData));
+ memcpy(rel->rd_amcache, metad, sizeof(BTMetaPageData));
+ _bt_relbuf(rel, metabuf);
+ }
+
+ /* Get cached page */
+ metad = (BTMetaPageData *) rel->rd_amcache;
+ /* We shouldn't have cached it if any of these fail */
+ Assert(metad->btm_magic == BTREE_MAGIC);
+ Assert(metad->btm_version >= BTREE_MIN_VERSION);
+ Assert(metad->btm_version <= BTREE_VERSION);
+ Assert(!metad->btm_safededup || metad->btm_version > BTREE_NOVAC_VERSION);
+ Assert(metad->btm_fastroot != P_NONE);
+
+ return metad->btm_safededup;
+}
+
/*
* _bt_checkpage() -- Verify that a freshly-read page looks sane.
*/
@@ -983,14 +1049,52 @@ _bt_page_recyclable(Page page)
void
_bt_delitems_vacuum(Relation rel, Buffer buf,
OffsetNumber *itemnos, int nitems,
+ OffsetNumber *updateitemnos,
+ IndexTuple *updated, int nupdatable,
BlockNumber lastBlockVacuumed)
{
Page page = BufferGetPage(buf);
BTPageOpaque opaque;
+ Size itemsz;
+ Size updated_sz = 0;
+ char *updated_buf = NULL;
+
+ /* XLOG stuff: build a buffer holding the updated tuples */
+ if (nupdatable > 0 && RelationNeedsWAL(rel))
+ {
+ Size offset = 0;
+
+ for (int i = 0; i < nupdatable; i++)
+ updated_sz += MAXALIGN(IndexTupleSize(updated[i]));
+
+ updated_buf = palloc(updated_sz);
+ for (int i = 0; i < nupdatable; i++)
+ {
+ itemsz = IndexTupleSize(updated[i]);
+ memcpy(updated_buf + offset, (char *) updated[i], itemsz);
+ offset += MAXALIGN(itemsz);
+ }
+ Assert(offset == updated_sz);
+ }
/* No ereport(ERROR) until changes are logged */
START_CRIT_SECTION();
+ /* Handle posting tuples here */
+ for (int i = 0; i < nupdatable; i++)
+ {
+ /* First, delete the old tuple */
+ PageIndexTupleDelete(page, updateitemnos[i]);
+
+ itemsz = IndexTupleSize(updated[i]);
+ itemsz = MAXALIGN(itemsz);
+
+ /* Add tuple with updated ItemPointers to the page. */
+ if (PageAddItem(page, (Item) updated[i], itemsz, updateitemnos[i],
+ false, false) == InvalidOffsetNumber)
+ elog(ERROR, "failed to rewrite posting list item in index while doing vacuum");
+ }
+
/* Fix the page */
if (nitems > 0)
PageIndexMultiDelete(page, itemnos, nitems);
@@ -1020,6 +1124,8 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
xl_btree_vacuum xlrec_vacuum;
xlrec_vacuum.lastBlockVacuumed = lastBlockVacuumed;
+ xlrec_vacuum.nupdated = nupdatable;
+ xlrec_vacuum.ndeleted = nitems;
XLogBeginInsert();
XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
@@ -1033,6 +1139,19 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
if (nitems > 0)
XLogRegisterBufData(0, (char *) itemnos, nitems * sizeof(OffsetNumber));
+ /*
+ * Save the offset numbers and the updated tuples themselves. It's
+ * important to restore them in the correct order: updated tuples must
+ * be handled first, and only then the other deleted items.
+ */
+ if (nupdatable > 0)
+ {
+ Assert(updated_buf != NULL);
+ XLogRegisterBufData(0, (char *) updateitemnos,
+ nupdatable * sizeof(OffsetNumber));
+ XLogRegisterBufData(0, updated_buf, updated_sz);
+ }
+
recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_VACUUM);
PageSetLSN(page, recptr);
@@ -1041,6 +1160,91 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
END_CRIT_SECTION();
}
+/*
+ * Get the latestRemovedXid from the table entries pointed at by the index
+ * tuples being deleted.
+ *
+ * This is a version of index_compute_xid_horizon_for_tuples() specialized to
+ * nbtree, which can handle posting lists.
+ */
+static TransactionId
+_bt_compute_xid_horizon_for_tuples(Relation rel, Relation heapRel,
+ Buffer buf, OffsetNumber *itemnos,
+ int nitems)
+{
+ ItemPointer htids;
+ TransactionId latestRemovedXid = InvalidTransactionId;
+ Page page = BufferGetPage(buf);
+ int arraynitems;
+ int finalnitems;
+
+ /*
+ * Initial size of array can fit everything when it turns out that there
+ * are no posting lists
+ */
+ arraynitems = nitems;
+ htids = (ItemPointer) palloc(sizeof(ItemPointerData) * arraynitems);
+
+ finalnitems = 0;
+ /* identify what the index tuples about to be deleted point to */
+ for (int i = 0; i < nitems; i++)
+ {
+ ItemId itemid;
+ IndexTuple itup;
+
+ itemid = PageGetItemId(page, itemnos[i]);
+ itup = (IndexTuple) PageGetItem(page, itemid);
+
+ Assert(ItemIdIsDead(itemid));
+
+ if (!BTreeTupleIsPosting(itup))
+ {
+ /* Make sure that we have space for additional heap TID */
+ if (finalnitems + 1 > arraynitems)
+ {
+ arraynitems = arraynitems * 2;
+ htids = (ItemPointer)
+ repalloc(htids, sizeof(ItemPointerData) * arraynitems);
+ }
+
+ Assert(ItemPointerIsValid(&itup->t_tid));
+ ItemPointerCopy(&itup->t_tid, &htids[finalnitems]);
+ finalnitems++;
+ }
+ else
+ {
+ int nposting = BTreeTupleGetNPosting(itup);
+
+ /* Make sure that we have space for additional heap TIDs */
+ if (finalnitems + nposting > arraynitems)
+ {
+ arraynitems = Max(arraynitems * 2, finalnitems + nposting);
+ htids = (ItemPointer)
+ repalloc(htids, sizeof(ItemPointerData) * arraynitems);
+ }
+
+ for (int j = 0; j < nposting; j++)
+ {
+ ItemPointer htid = BTreeTupleGetPostingN(itup, j);
+
+ Assert(ItemPointerIsValid(htid));
+ ItemPointerCopy(htid, &htids[finalnitems]);
+ finalnitems++;
+ }
+ }
+ }
+
+ Assert(finalnitems >= nitems);
+
+ /* determine the actual xid horizon */
+ latestRemovedXid =
+ table_compute_xid_horizon_for_tuples(heapRel, htids, finalnitems);
+
+ pfree(htids);
+
+ return latestRemovedXid;
+}
+
/*
* Delete item(s) from a btree page during single-page cleanup.
*
@@ -1067,8 +1271,8 @@ _bt_delitems_delete(Relation rel, Buffer buf,
if (XLogStandbyInfoActive() && RelationNeedsWAL(rel))
latestRemovedXid =
- index_compute_xid_horizon_for_tuples(rel, heapRel, buf,
- itemnos, nitems);
+ _bt_compute_xid_horizon_for_tuples(rel, heapRel, buf,
+ itemnos, nitems);
/* No ereport(ERROR) until changes are logged */
START_CRIT_SECTION();
@@ -2066,6 +2270,7 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, bool *rightsib_empty)
xlmeta.fastlevel = metad->btm_fastlevel;
xlmeta.oldest_btpo_xact = metad->btm_oldest_btpo_xact;
xlmeta.last_cleanup_num_heap_tuples = metad->btm_last_cleanup_num_heap_tuples;
+ xlmeta.btm_safededup = metad->btm_safededup;
XLogRegisterBufData(4, (char *) &xlmeta, sizeof(xl_btree_metadata));
xlinfo = XLOG_BTREE_UNLINK_PAGE_META;
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 4cfd5289ad..2cdc3d499f 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -97,6 +97,8 @@ static void btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
BTCycleId cycleid, TransactionId *oldestBtpoXact);
static void btvacuumpage(BTVacState *vstate, BlockNumber blkno,
BlockNumber orig_blkno);
+static ItemPointer btreevacuumposting(BTVacState *vstate, IndexTuple itup,
+ int *nremaining);
/*
@@ -160,7 +162,7 @@ btbuildempty(Relation index)
/* Construct metapage. */
metapage = (Page) palloc(BLCKSZ);
- _bt_initmetapage(metapage, P_NONE, 0);
+ _bt_initmetapage(metapage, P_NONE, 0, _bt_opclasses_support_dedup(index));
/*
* Write the page and log it. It might seem that an immediate sync would
@@ -263,8 +265,8 @@ btgettuple(IndexScanDesc scan, ScanDirection dir)
*/
if (so->killedItems == NULL)
so->killedItems = (int *)
- palloc(MaxIndexTuplesPerPage * sizeof(int));
- if (so->numKilled < MaxIndexTuplesPerPage)
+ palloc(MaxBTreeIndexTuplesPerPage * sizeof(int));
+ if (so->numKilled < MaxBTreeIndexTuplesPerPage)
so->killedItems[so->numKilled++] = so->currPos.itemIndex;
}
@@ -816,7 +818,7 @@ _bt_vacuum_needs_cleanup(IndexVacuumInfo *info)
}
else
{
- StdRdOptions *relopts;
+ BtreeOptions *relopts;
float8 cleanup_scale_factor;
float8 prev_num_heap_tuples;
@@ -827,7 +829,7 @@ _bt_vacuum_needs_cleanup(IndexVacuumInfo *info)
* tuples exceeds vacuum_cleanup_index_scale_factor fraction of
* original tuples count.
*/
- relopts = (StdRdOptions *) info->index->rd_options;
+ relopts = (BtreeOptions *) info->index->rd_options;
cleanup_scale_factor = (relopts &&
relopts->vacuum_cleanup_index_scale_factor >= 0)
? relopts->vacuum_cleanup_index_scale_factor
@@ -1069,7 +1071,8 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
RBM_NORMAL, info->strategy);
LockBufferForCleanup(buf);
_bt_checkpage(rel, buf);
- _bt_delitems_vacuum(rel, buf, NULL, 0, vstate.lastBlockVacuumed);
+ _bt_delitems_vacuum(rel, buf, NULL, 0, NULL, NULL, 0,
+ vstate.lastBlockVacuumed);
_bt_relbuf(rel, buf);
}
@@ -1188,8 +1191,17 @@ restart:
}
else if (P_ISLEAF(opaque))
{
+ /* Deletable item state */
OffsetNumber deletable[MaxOffsetNumber];
int ndeletable;
+ int nhtidsdead;
+ int nhtidslive;
+
+ /* Updatable item state (for posting lists) */
+ IndexTuple updated[MaxOffsetNumber];
+ OffsetNumber updatable[MaxOffsetNumber];
+ int nupdatable;
+
OffsetNumber offnum,
minoff,
maxoff;
@@ -1229,6 +1241,10 @@ restart:
* callback function.
*/
ndeletable = 0;
+ nupdatable = 0;
+ /* Maintain stats counters for index tuple versions/heap TIDs */
+ nhtidsdead = 0;
+ nhtidslive = 0;
minoff = P_FIRSTDATAKEY(opaque);
maxoff = PageGetMaxOffsetNumber(page);
if (callback)
@@ -1238,11 +1254,9 @@ restart:
offnum = OffsetNumberNext(offnum))
{
IndexTuple itup;
- ItemPointer htup;
itup = (IndexTuple) PageGetItem(page,
PageGetItemId(page, offnum));
- htup = &(itup->t_tid);
/*
* During Hot Standby we currently assume that
@@ -1265,8 +1279,71 @@ restart:
* applies to *any* type of index that marks index tuples as
* killed.
*/
- if (callback(htup, callback_state))
- deletable[ndeletable++] = offnum;
+ if (!BTreeTupleIsPosting(itup))
+ {
+ /* Regular tuple, standard heap TID representation */
+ ItemPointer htid = &(itup->t_tid);
+
+ if (callback(htid, callback_state))
+ {
+ deletable[ndeletable++] = offnum;
+ nhtidsdead++;
+ }
+ else
+ nhtidslive++;
+ }
+ else
+ {
+ ItemPointer newhtids;
+ int nremaining;
+
+ /*
+ * Posting list tuple, a physical tuple that represents
+ * two or more logical tuples, any of which could be an
+ * index row version that must be removed
+ */
+ newhtids = btreevacuumposting(vstate, itup, &nremaining);
+ if (newhtids == NULL)
+ {
+ /*
+ * All TIDs/logical tuples from the posting tuple
+ * remain, so no update or delete required
+ */
+ Assert(nremaining == BTreeTupleGetNPosting(itup));
+ }
+ else if (nremaining > 0)
+ {
+ IndexTuple updatedtuple;
+
+ /*
+ * Form new tuple that contains only remaining TIDs.
+ * Remember this tuple and the offset of the old tuple
+ * for when we update it in place
+ */
+ Assert(nremaining < BTreeTupleGetNPosting(itup));
+ updatedtuple = _bt_form_posting(itup, newhtids,
+ nremaining);
+ updated[nupdatable] = updatedtuple;
+ updatable[nupdatable++] = offnum;
+ nhtidsdead += BTreeTupleGetNPosting(itup) - nremaining;
+ pfree(newhtids);
+ }
+ else
+ {
+ /*
+ * All TIDs/logical tuples from the posting list must
+ * be deleted. We'll delete the physical tuple
+ * completely.
+ */
+ deletable[ndeletable++] = offnum;
+ nhtidsdead += BTreeTupleGetNPosting(itup);
+
+ /* Free empty array of live items */
+ pfree(newhtids);
+ }
+
+ nhtidslive += nremaining;
+ }
}
}
@@ -1274,7 +1351,7 @@ restart:
* Apply any needed deletes. We issue just one _bt_delitems_vacuum()
* call per page, so as to minimize WAL traffic.
*/
- if (ndeletable > 0)
+ if (ndeletable > 0 || nupdatable > 0)
{
/*
* Notice that the issued XLOG_BTREE_VACUUM WAL record includes
@@ -1290,7 +1367,8 @@ restart:
* doesn't seem worth the amount of bookkeeping it'd take to avoid
* that.
*/
- _bt_delitems_vacuum(rel, buf, deletable, ndeletable,
+ _bt_delitems_vacuum(rel, buf, deletable, ndeletable, updatable,
+ updated, nupdatable,
vstate->lastBlockVacuumed);
/*
@@ -1300,7 +1378,7 @@ restart:
if (blkno > vstate->lastBlockVacuumed)
vstate->lastBlockVacuumed = blkno;
- stats->tuples_removed += ndeletable;
+ stats->tuples_removed += nhtidsdead;
/* must recompute maxoff */
maxoff = PageGetMaxOffsetNumber(page);
}
@@ -1315,6 +1393,7 @@ restart:
* We treat this like a hint-bit update because there's no need to
* WAL-log it.
*/
+ Assert(nhtidsdead == 0);
if (vstate->cycleid != 0 &&
opaque->btpo_cycleid == vstate->cycleid)
{
@@ -1324,15 +1403,16 @@ restart:
}
/*
- * If it's now empty, try to delete; else count the live tuples. We
- * don't delete when recursing, though, to avoid putting entries into
+ * If it's now empty, try to delete; else count the live tuples (live
+ * heap TIDs in posting lists are counted as live tuples). We don't
+ * delete when recursing, though, to avoid putting entries into
* freePages out-of-order (doesn't seem worth any extra code to handle
* the case).
*/
if (minoff > maxoff)
delete_now = (blkno == orig_blkno);
else
- stats->num_index_tuples += maxoff - minoff + 1;
+ stats->num_index_tuples += nhtidslive;
}
if (delete_now)
@@ -1375,6 +1455,68 @@ restart:
}
}
+/*
+ * btreevacuumposting() -- determines which logical tuples must remain when
+ * VACUUMing a posting list tuple.
+ *
+ * Returns new palloc'd array of item pointers needed to build replacement
+ * posting list without the index row versions that are to be deleted.
+ *
+ * Note that returned array is NULL in the common case where there is nothing
+ * to delete in caller's posting list tuple. The number of TIDs that should
+ * remain in the posting list tuple is set for caller in *nremaining. This is
+ * also the size of the returned array (though only when array isn't just
+ * NULL).
+ */
+static ItemPointer
+btreevacuumposting(BTVacState *vstate, IndexTuple itup, int *nremaining)
+{
+ int live = 0;
+ int nitem = BTreeTupleGetNPosting(itup);
+ ItemPointer tmpitems = NULL,
+ items = BTreeTupleGetPosting(itup);
+
+ Assert(BTreeTupleIsPosting(itup));
+
+ /*
+ * Check each tuple in the posting list. Save live tuples into tmpitems,
+ * though try to avoid memory allocation as an optimization.
+ */
+ for (int i = 0; i < nitem; i++)
+ {
+ if (!vstate->callback(items + i, vstate->callback_state))
+ {
+ /*
+ * Live heap TID.
+ *
+ * Only save live TID when we know that we're going to have to
+ * kill at least one TID, and have already allocated memory.
+ */
+ if (tmpitems)
+ tmpitems[live] = items[i];
+ live++;
+ }
+
+ /* Dead heap TID */
+ else if (tmpitems == NULL)
+ {
+ /*
+ * Turns out we need to delete one or more dead heap TIDs, so
+ * start maintaining an array of live TIDs for caller to
+ * reconstruct smaller replacement posting list tuple
+ */
+ tmpitems = palloc(sizeof(ItemPointerData) * nitem);
+
+ /* Copy live heap TIDs from previous loop iterations */
+ if (live > 0)
+ memcpy(tmpitems, items, sizeof(ItemPointerData) * live);
+ }
+ }
+
+ *nremaining = live;
+ return tmpitems;
+}
+
/*
* btcanreturn() -- Check whether btree indexes support index-only scans.
*
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 8e512461a0..23621cdd37 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -26,10 +26,18 @@
static void _bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp);
static OffsetNumber _bt_binsrch(Relation rel, BTScanInsert key, Buffer buf);
+static int _bt_binsrch_posting(BTScanInsert key, Page page,
+ OffsetNumber offnum);
static bool _bt_readpage(IndexScanDesc scan, ScanDirection dir,
OffsetNumber offnum);
static void _bt_saveitem(BTScanOpaque so, int itemIndex,
OffsetNumber offnum, IndexTuple itup);
+static void _bt_setuppostingitems(BTScanOpaque so, int itemIndex,
+ OffsetNumber offnum, ItemPointer heapTid,
+ IndexTuple itup);
+static inline void _bt_savepostingitem(BTScanOpaque so, int itemIndex,
+ OffsetNumber offnum,
+ ItemPointer heapTid);
static bool _bt_steppage(IndexScanDesc scan, ScanDirection dir);
static bool _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir);
static bool _bt_parallel_readpage(IndexScanDesc scan, BlockNumber blkno,
@@ -434,7 +442,10 @@ _bt_binsrch(Relation rel,
* low) makes bounds invalid.
*
* Caller is responsible for invalidating bounds when it modifies the page
- * before calling here a second time.
+ * before calling here a second time, and for dealing with posting list
+ * tuple matches (callers can use insertstate's postingoff field to
+ * determine which existing heap TID will need to be replaced by their
+ * scantid/new heap TID).
*/
OffsetNumber
_bt_binsrch_insert(Relation rel, BTInsertState insertstate)
@@ -453,6 +464,7 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
Assert(P_ISLEAF(opaque));
Assert(!key->nextkey);
+ Assert(insertstate->postingoff == 0);
if (!insertstate->bounds_valid)
{
@@ -509,6 +521,16 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
if (result != 0)
stricthigh = high;
}
+
+ /*
+ * If tuple at offset located by binary search is a posting list whose
+ * TID range overlaps with caller's scantid, perform posting list
+ * binary search to set postingoff for caller. Caller must split the
+ * posting list when postingoff is set. This should happen
+ * infrequently.
+ */
+ if (unlikely(result == 0 && key->scantid != NULL))
+ insertstate->postingoff = _bt_binsrch_posting(key, page, mid);
}
/*
@@ -528,6 +550,68 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
return low;
}
+/*----------
+ * _bt_binsrch_posting() -- posting list binary search.
+ *
+ * Returns offset into posting list where caller's scantid belongs.
+ *----------
+ */
+static int
+_bt_binsrch_posting(BTScanInsert key, Page page, OffsetNumber offnum)
+{
+ IndexTuple itup;
+ ItemId itemid;
+ int low,
+ high,
+ mid,
+ res;
+
+ /*
+ * If this isn't a posting tuple, then the index must be corrupt (if it is
+ * an ordinary non-pivot tuple then there must be an existing tuple with a
+ * heap TID that equals inserter's new heap TID/scantid). Defensively
+ * check that tuple is a posting list tuple whose posting list range
+ * includes caller's scantid.
+ *
+ * (This is also needed because contrib/amcheck's rootdescend option needs
+ * to be able to relocate a non-pivot tuple using _bt_binsrch_insert().)
+ */
+ itemid = PageGetItemId(page, offnum);
+ itup = (IndexTuple) PageGetItem(page, itemid);
+ if (!BTreeTupleIsPosting(itup))
+ return 0;
+
+ /*
+ * In the unlikely event that posting list tuple has LP_DEAD bit set,
+ * signal to caller that it should kill the item and restart its binary
+ * search.
+ */
+ if (ItemIdIsDead(itemid))
+ return -1;
+
+ /* "high" is past end of posting list for loop invariant */
+ low = 0;
+ high = BTreeTupleGetNPosting(itup);
+ Assert(high >= 2);
+
+ while (high > low)
+ {
+ mid = low + ((high - low) / 2);
+ res = ItemPointerCompare(key->scantid,
+ BTreeTupleGetPostingN(itup, mid));
+
+ if (res > 0)
+ low = mid + 1;
+ else if (res < 0)
+ high = mid;
+ else
+ return mid;
+ }
+
+ /* Exact match not found */
+ return low;
+}
+
/*----------
* _bt_compare() -- Compare insertion-type scankey to tuple on a page.
*
@@ -537,9 +621,18 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
* <0 if scankey < tuple at offnum;
* 0 if scankey == tuple at offnum;
* >0 if scankey > tuple at offnum.
- * NULLs in the keys are treated as sortable values. Therefore
- * "equality" does not necessarily mean that the item should be
- * returned to the caller as a matching key!
+ *
+ * NULLs in the keys are treated as sortable values. Therefore
+ * "equality" does not necessarily mean that the item should be returned
+ * to the caller as a matching key. Similarly, an insertion scankey
+ * with its scantid set is treated as equal to a posting tuple whose TID
+ * range overlaps with its scantid. There generally won't be a
+ * matching TID in the posting tuple, which caller must handle
+ * itself (e.g., by splitting the posting list tuple).
+ *
+ * It is generally guaranteed that any possible scankey with scantid set
+ * will have zero or one tuples in the index that are considered equal
+ * here.
*
* CRUCIAL NOTE: on a non-leaf page, the first data key is assumed to be
* "minus infinity": this routine will always claim it is less than the
@@ -563,6 +656,7 @@ _bt_compare(Relation rel,
ScanKey scankey;
int ncmpkey;
int ntupatts;
+ int32 result;
Assert(_bt_check_natts(rel, key->heapkeyspace, page, offnum));
Assert(key->keysz <= IndexRelationGetNumberOfKeyAttributes(rel));
@@ -597,7 +691,6 @@ _bt_compare(Relation rel,
{
Datum datum;
bool isNull;
- int32 result;
datum = index_getattr(itup, scankey->sk_attno, itupdesc, &isNull);
@@ -713,8 +806,25 @@ _bt_compare(Relation rel,
if (heapTid == NULL)
return 1;
+ /*
+ * scankey must be treated as equal to a posting list tuple if its scantid
+ * value falls within the range of the posting list. In all other cases
+ * there can only be a single heap TID value, which is compared directly
+ * as a simple scalar value.
+ */
Assert(ntupatts >= IndexRelationGetNumberOfKeyAttributes(rel));
- return ItemPointerCompare(key->scantid, heapTid);
+ result = ItemPointerCompare(key->scantid, heapTid);
+ if (!BTreeTupleIsPosting(itup) || result <= 0)
+ return result;
+ else
+ {
+ result = ItemPointerCompare(key->scantid,
+ BTreeTupleGetMaxHeapTID(itup));
+ if (result > 0)
+ return 1;
+ }
+
+ return 0;
}
/*
@@ -1230,6 +1340,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
/* Initialize remaining insertion scan key fields */
inskey.heapkeyspace = _bt_heapkeyspace(rel);
+ inskey.safededup = false; /* unused */
inskey.anynullkeys = false; /* unused */
inskey.nextkey = nextkey;
inskey.pivotsearch = false;
@@ -1451,6 +1562,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
/* initialize tuple workspace to empty */
so->currPos.nextTupleOffset = 0;
+ so->currPos.postingTupleOffset = 0;
/*
* Now that the current page has been made consistent, the macro should be
@@ -1485,8 +1597,29 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
if (_bt_checkkeys(scan, itup, indnatts, dir, &continuescan))
{
/* tuple passes all scan key conditions, so remember it */
- _bt_saveitem(so, itemIndex, offnum, itup);
- itemIndex++;
+ if (!BTreeTupleIsPosting(itup))
+ {
+ _bt_saveitem(so, itemIndex, offnum, itup);
+ itemIndex++;
+ }
+ else
+ {
+ /*
+ * Set up state to return posting list, and save the first
+ * "logical" tuple
+ */
+ _bt_setuppostingitems(so, itemIndex, offnum,
+ BTreeTupleGetPostingN(itup, 0),
+ itup);
+ itemIndex++;
+ /* Save additional posting list "logical" tuples */
+ for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
+ {
+ _bt_savepostingitem(so, itemIndex, offnum,
+ BTreeTupleGetPostingN(itup, i));
+ itemIndex++;
+ }
+ }
}
/* When !continuescan, there can't be any more matches, so stop */
if (!continuescan)
@@ -1519,7 +1652,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
if (!continuescan)
so->currPos.moreRight = false;
- Assert(itemIndex <= MaxIndexTuplesPerPage);
+ Assert(itemIndex <= MaxBTreeIndexTuplesPerPage);
so->currPos.firstItem = 0;
so->currPos.lastItem = itemIndex - 1;
so->currPos.itemIndex = 0;
@@ -1527,7 +1660,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
else
{
/* load items[] in descending order */
- itemIndex = MaxIndexTuplesPerPage;
+ itemIndex = MaxBTreeIndexTuplesPerPage;
offnum = Min(offnum, maxoff);
@@ -1569,8 +1702,36 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
if (passes_quals && tuple_alive)
{
/* tuple passes all scan key conditions, so remember it */
- itemIndex--;
- _bt_saveitem(so, itemIndex, offnum, itup);
+ if (!BTreeTupleIsPosting(itup))
+ {
+ itemIndex--;
+ _bt_saveitem(so, itemIndex, offnum, itup);
+ }
+ else
+ {
+ int i = BTreeTupleGetNPosting(itup) - 1;
+
+ /*
+ * Set up state to return posting list, and save the last
+ * "logical" tuple from posting list (since it's the first
+ * that will be returned to the scan).
+ */
+ itemIndex--;
+ _bt_setuppostingitems(so, itemIndex, offnum,
+ BTreeTupleGetPostingN(itup, i--),
+ itup);
+
+ /*
+ * Return posting list "logical" tuples -- do this in
+ * descending order, to match overall scan order
+ */
+ for (; i >= 0; i--)
+ {
+ itemIndex--;
+ _bt_savepostingitem(so, itemIndex, offnum,
+ BTreeTupleGetPostingN(itup, i));
+ }
+ }
}
if (!continuescan)
{
@@ -1584,8 +1745,8 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
Assert(itemIndex >= 0);
so->currPos.firstItem = itemIndex;
- so->currPos.lastItem = MaxIndexTuplesPerPage - 1;
- so->currPos.itemIndex = MaxIndexTuplesPerPage - 1;
+ so->currPos.lastItem = MaxBTreeIndexTuplesPerPage - 1;
+ so->currPos.itemIndex = MaxBTreeIndexTuplesPerPage - 1;
}
return (so->currPos.firstItem <= so->currPos.lastItem);
@@ -1598,6 +1759,8 @@ _bt_saveitem(BTScanOpaque so, int itemIndex,
{
BTScanPosItem *currItem = &so->currPos.items[itemIndex];
+ Assert(!BTreeTupleIsPosting(itup));
+
currItem->heapTid = itup->t_tid;
currItem->indexOffset = offnum;
if (so->currTuples)
@@ -1610,6 +1773,64 @@ _bt_saveitem(BTScanOpaque so, int itemIndex,
}
}
+/*
+ * Set up state to save posting items from a single posting list tuple. In
+ * passing, saves the logical tuple that will be returned to the scan first.
+ *
+ * Saves an index item into so->currPos.items[itemIndex] for the logical tuple
+ * that is returned to the scan first. The second and subsequent heap TIDs of
+ * the posting list should be saved by calling _bt_savepostingitem().
+ */
+static void
+_bt_setuppostingitems(BTScanOpaque so, int itemIndex, OffsetNumber offnum,
+ ItemPointer heapTid, IndexTuple itup)
+{
+ BTScanPosItem *currItem = &so->currPos.items[itemIndex];
+
+ currItem->heapTid = *heapTid;
+ currItem->indexOffset = offnum;
+
+ if (so->currTuples)
+ {
+ /* Save base IndexTuple (truncate posting list) */
+ IndexTuple base;
+ Size itupsz = BTreeTupleGetPostingOffset(itup);
+
+ itupsz = MAXALIGN(itupsz);
+ currItem->tupleOffset = so->currPos.nextTupleOffset;
+ base = (IndexTuple) (so->currTuples + so->currPos.nextTupleOffset);
+ memcpy(base, itup, itupsz);
+ /* Defensively reduce work area index tuple header size */
+ base->t_info &= ~INDEX_SIZE_MASK;
+ base->t_info |= itupsz;
+ so->currPos.nextTupleOffset += itupsz;
+ so->currPos.postingTupleOffset = currItem->tupleOffset;
+ }
+}
+
+/*
+ * Save an index item into so->currPos.items[itemIndex] for posting tuple.
+ *
+ * Assumes that _bt_setuppostingitems() has already been called for current
+ * posting list tuple.
+ */
+static inline void
+_bt_savepostingitem(BTScanOpaque so, int itemIndex, OffsetNumber offnum,
+ ItemPointer heapTid)
+{
+ BTScanPosItem *currItem = &so->currPos.items[itemIndex];
+
+ currItem->heapTid = *heapTid;
+ currItem->indexOffset = offnum;
+
+ /*
+ * Have index-only scans return the same base IndexTuple for every logical
+ * tuple that originates from the same posting list
+ */
+ if (so->currTuples)
+ currItem->tupleOffset = so->currPos.postingTupleOffset;
+}
+
/*
* _bt_steppage() -- Step to next page containing valid data for scan
*
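For readers following the _bt_compare() change above, here is a minimal standalone sketch of the rule it implements: a scantid that falls anywhere within a posting list's TID range compares as equal to that tuple. This is not code from the patch. DemoTid, demo_tid_cmp() and demo_scantid_cmp() are hypothetical names, and the TID is modelled as a plain (block, offset) pair rather than a real ItemPointerData.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a heap TID: (block, offset) pair */
typedef struct DemoTid
{
	uint32_t	block;
	uint16_t	offset;
} DemoTid;

/* Three-way comparison of two TIDs: block number first, then offset */
static int
demo_tid_cmp(DemoTid a, DemoTid b)
{
	if (a.block != b.block)
		return (a.block < b.block) ? -1 : 1;
	if (a.offset != b.offset)
		return (a.offset < b.offset) ? -1 : 1;
	return 0;
}

/*
 * Compare scantid against a tuple whose heap TIDs are htids[0..nhtids-1],
 * sorted in ascending order.  nhtids == 1 models a plain (non-posting)
 * tuple, which is compared as a simple scalar.
 */
static int
demo_scantid_cmp(DemoTid scantid, const DemoTid *htids, int nhtids)
{
	int			result;

	assert(nhtids > 0);

	/* Compare against the lowest TID first */
	result = demo_tid_cmp(scantid, htids[0]);
	if (nhtids == 1 || result <= 0)
		return result;

	/* Above the lowest TID: greater only if also above the highest TID */
	if (demo_tid_cmp(scantid, htids[nhtids - 1]) > 0)
		return 1;

	/* scantid falls within the posting list's range */
	return 0;
}
```

Note that a result of 0 only says that scantid is inside the range, not that an exact match exists in the list. The caller still has to locate (or split at) the exact position, as the header comment of the function points out.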
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index c11a3fb570..84bee940b3 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -243,6 +243,7 @@ typedef struct BTPageState
BlockNumber btps_blkno; /* block # to write this page at */
IndexTuple btps_lowkey; /* page's strict lower bound pivot tuple */
OffsetNumber btps_lastoff; /* last item offset loaded */
+ Size btps_lastextra; /* last item's extra posting list space */
uint32 btps_level; /* tree level (0 = leaf) */
Size btps_full; /* "full" if less than this much free space */
struct BTPageState *btps_next; /* link to parent level, if any */
@@ -277,7 +278,10 @@ static void _bt_slideleft(Page page);
static void _bt_sortaddtup(Page page, Size itemsize,
IndexTuple itup, OffsetNumber itup_off);
static void _bt_buildadd(BTWriteState *wstate, BTPageState *state,
- IndexTuple itup);
+ IndexTuple itup, Size truncextra);
+static void _bt_sort_dedup_finish_pending(BTWriteState *wstate,
+ BTPageState *state,
+ BTDedupState *dstate);
static void _bt_uppershutdown(BTWriteState *wstate, BTPageState *state);
static void _bt_load(BTWriteState *wstate,
BTSpool *btspool, BTSpool *btspool2);
@@ -711,13 +715,14 @@ _bt_pagestate(BTWriteState *wstate, uint32 level)
state->btps_lowkey = NULL;
/* initialize lastoff so first item goes into P_FIRSTKEY */
state->btps_lastoff = P_HIKEY;
+ state->btps_lastextra = 0;
state->btps_level = level;
/* set "full" threshold based on level. See notes at head of file. */
if (level > 0)
state->btps_full = (BLCKSZ * (100 - BTREE_NONLEAF_FILLFACTOR) / 100);
else
- state->btps_full = RelationGetTargetPageFreeSpace(wstate->index,
- BTREE_DEFAULT_FILLFACTOR);
+ state->btps_full = BtreeGetTargetPageFreeSpace(wstate->index,
+ BTREE_DEFAULT_FILLFACTOR);
/* no parent level, yet */
state->btps_next = NULL;
@@ -790,7 +795,8 @@ _bt_sortaddtup(Page page,
}
/*----------
- * Add an item to a disk page from the sort output.
+ * Add an item to a disk page from the sort output (or add a posting list
+ * item formed from the sort output).
*
* We must be careful to observe the page layout conventions of nbtsearch.c:
* - rightmost pages start data items at P_HIKEY instead of at P_FIRSTKEY.
@@ -822,14 +828,27 @@ _bt_sortaddtup(Page page,
* the truncated high key at offset 1.
*
* 'last' pointer indicates the last offset added to the page.
+ *
+ * 'truncextra' is the size of the posting list in itup, if any. This
+ * information is stashed for the next call here, when we may benefit
+ * from considering the impact of truncating away the posting list on
+ * the page before deciding to finish the page off. Posting lists are
+ * often relatively large, so it is worth going to the trouble of
+ * accounting for the saving from truncating away the posting list of
+ * the tuple that becomes the high key (that may be the only way to
+ * get close to target free space on the page). Note that this is
+ * only used for the soft fillfactor-wise limit, not the critical hard
+ * limit.
*----------
*/
static void
-_bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
+_bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup,
+ Size truncextra)
{
Page npage;
BlockNumber nblkno;
OffsetNumber last_off;
+ Size last_truncextra;
Size pgspc;
Size itupsz;
bool isleaf;
@@ -843,6 +862,8 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
npage = state->btps_page;
nblkno = state->btps_blkno;
last_off = state->btps_lastoff;
+ last_truncextra = state->btps_lastextra;
+ state->btps_lastextra = truncextra;
pgspc = PageGetFreeSpace(npage);
itupsz = IndexTupleSize(itup);
@@ -884,10 +905,10 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
* page. Disregard fillfactor and insert on "full" current page if we
* don't have the minimum number of items yet. (Note that we deliberately
* assume that suffix truncation neither enlarges nor shrinks new high key
- * when applying soft limit.)
+ * when applying soft limit, except when last tuple had a posting list.)
*/
if (pgspc < itupsz + (isleaf ? MAXALIGN(sizeof(ItemPointerData)) : 0) ||
- (pgspc < state->btps_full && last_off > P_FIRSTKEY))
+ (pgspc + last_truncextra < state->btps_full && last_off > P_FIRSTKEY))
{
/*
* Finish off the page and write it out.
@@ -945,11 +966,11 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
* We don't try to bias our choice of split point to make it more
* likely that _bt_truncate() can truncate away more attributes,
* whereas the split point used within _bt_split() is chosen much
- * more delicately. Suffix truncation is mostly useful because it
- * improves space utilization for workloads with random
- * insertions. It doesn't seem worthwhile to add logic for
- * choosing a split point here for a benefit that is bound to be
- * much smaller.
+ * more delicately. On the other hand, non-unique index builds
+ * usually deduplicate, which often results in every "physical"
+ * tuple on the page having distinct key values. When that
+ * happens, _bt_truncate() will never need to include a heap TID
+ * in the new high key.
*
* Overwrite the old item with new truncated high key directly.
* oitup is already located at the physical beginning of tuple
@@ -984,7 +1005,7 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
Assert(BTreeTupleGetNAtts(state->btps_lowkey, wstate->index) == 0 ||
!P_LEFTMOST((BTPageOpaque) PageGetSpecialPointer(opage)));
BTreeInnerTupleSetDownLink(state->btps_lowkey, oblkno);
- _bt_buildadd(wstate, state->btps_next, state->btps_lowkey);
+ _bt_buildadd(wstate, state->btps_next, state->btps_lowkey, 0);
pfree(state->btps_lowkey);
/*
@@ -1046,6 +1067,47 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
state->btps_lastoff = last_off;
}
+/*
+ * Finalize pending posting list tuple, and add it to the index. Final tuple
+ * is based on saved base tuple, and saved list of heap TIDs.
+ *
+ * This is almost like _bt_dedup_finish_pending(), but it adds a new tuple
+ * using _bt_buildadd() and does not maintain the intervals array.
+ */
+static void
+_bt_sort_dedup_finish_pending(BTWriteState *wstate, BTPageState *state,
+ BTDedupState *dstate)
+{
+ IndexTuple final;
+ Size truncextra;
+
+ Assert(dstate->nitems > 0);
+ truncextra = 0;
+ if (dstate->nitems == 1)
+ final = dstate->base;
+ else
+ {
+ IndexTuple postingtuple;
+
+ /* form a tuple with a posting list */
+ postingtuple = _bt_form_posting(dstate->base,
+ dstate->htids,
+ dstate->nhtids);
+ final = postingtuple;
+ /* Determine size of posting list */
+ truncextra = IndexTupleSize(final) -
+ BTreeTupleGetPostingOffset(final);
+ }
+
+ _bt_buildadd(wstate, state, final, truncextra);
+
+ if (dstate->nitems > 1)
+ pfree(final);
+ /* Don't maintain dedup_intervals array, or alltupsize */
+ dstate->nhtids = 0;
+ dstate->nitems = 0;
+}
+
/*
* Finish writing out the completed btree.
*/
@@ -1091,7 +1153,7 @@ _bt_uppershutdown(BTWriteState *wstate, BTPageState *state)
Assert(BTreeTupleGetNAtts(s->btps_lowkey, wstate->index) == 0 ||
!P_LEFTMOST(opaque));
BTreeInnerTupleSetDownLink(s->btps_lowkey, blkno);
- _bt_buildadd(wstate, s->btps_next, s->btps_lowkey);
+ _bt_buildadd(wstate, s->btps_next, s->btps_lowkey, 0);
pfree(s->btps_lowkey);
s->btps_lowkey = NULL;
}
@@ -1112,7 +1174,8 @@ _bt_uppershutdown(BTWriteState *wstate, BTPageState *state)
* by filling in a valid magic number in the metapage.
*/
metapage = (Page) palloc(BLCKSZ);
- _bt_initmetapage(metapage, rootblkno, rootlevel);
+ _bt_initmetapage(metapage, rootblkno, rootlevel,
+ wstate->inskey->safededup);
_bt_blwritepage(wstate, metapage, BTREE_METAPAGE);
}
@@ -1133,6 +1196,10 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
keysz = IndexRelationGetNumberOfKeyAttributes(wstate->index);
SortSupport sortKeys;
int64 tuples_done = 0;
+ bool deduplicate;
+
+ deduplicate = wstate->inskey->safededup &&
+ BtreeGetDoDedupOption(wstate->index);
if (merge)
{
@@ -1229,12 +1296,12 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
if (load1)
{
- _bt_buildadd(wstate, state, itup);
+ _bt_buildadd(wstate, state, itup, 0);
itup = tuplesort_getindextuple(btspool->sortstate, true);
}
else
{
- _bt_buildadd(wstate, state, itup2);
+ _bt_buildadd(wstate, state, itup2, 0);
itup2 = tuplesort_getindextuple(btspool2->sortstate, true);
}
@@ -1244,9 +1311,113 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
}
pfree(sortKeys);
}
+ else if (deduplicate)
+ {
+ /* merge is unnecessary, deduplicate into posting lists */
+ BTDedupState *dstate;
+ IndexTuple newbase;
+
+ dstate = (BTDedupState *) palloc(sizeof(BTDedupState));
+ dstate->maxitemsize = 0; /* set later */
+ dstate->checkingunique = false; /* unused */
+ dstate->skippedbase = InvalidOffsetNumber;
+ dstate->newitem = NULL;
+ /* Metadata about current pending posting list */
+ dstate->htids = NULL;
+ dstate->nhtids = 0;
+ dstate->nitems = 0;
+ dstate->overlap = false;
+ dstate->alltupsize = 0; /* unused */
+ /* Metadata about base tuple of current pending posting list */
+ dstate->base = NULL;
+ dstate->baseoff = InvalidOffsetNumber; /* unused */
+ dstate->basetupsize = 0;
+
+ while ((itup = tuplesort_getindextuple(btspool->sortstate,
+ true)) != NULL)
+ {
+ /* When we see first tuple, create first index page */
+ if (state == NULL)
+ {
+ state = _bt_pagestate(wstate, 0);
+
+ /*
+ * Limit size of posting list tuples to the size of the free
+ * space we want to leave behind on the page, plus space for
+ * final item's line pointer (but make sure that posting list
+ * tuple size won't exceed the generic 1/3 of a page limit).
+ *
+ * This is more conservative than the approach taken in the
+ * retail insert path, but it allows us to get most of the
+ * space savings deduplication provides without noticeably
+ * impacting how much free space is left behind on each leaf
+ * page.
+ */
+ dstate->maxitemsize =
+ Min(BTMaxItemSize(state->btps_page),
+ MAXALIGN_DOWN(state->btps_full) - sizeof(ItemIdData));
+ /* Minimum posting tuple size used here is arbitrary: */
+ dstate->maxitemsize = Max(dstate->maxitemsize, 100);
+ dstate->htids = palloc(dstate->maxitemsize);
+
+ /*
+ * No previous/base tuple, since itup is the first item
+ * returned by the tuplesort -- use itup as base tuple of
+ * first pending posting list for entire index build
+ */
+ newbase = CopyIndexTuple(itup);
+ _bt_dedup_start_pending(dstate, newbase, InvalidOffsetNumber);
+ }
+ else if (_bt_keep_natts_fast(wstate->index, dstate->base,
+ itup) > keysz &&
+ _bt_dedup_save_htid(dstate, itup))
+ {
+ /*
+ * Tuple is equal to base tuple of pending posting list, and
+ * merging itup into pending posting list won't exceed the
+ * maxitemsize limit. Heap TID(s) for itup have been saved in
+ * state. The next iteration will also end up here if it's
+ * possible to merge the next tuple into the same pending
+ * posting list.
+ */
+ }
+ else
+ {
+ /*
+ * Tuple is not equal to pending posting list tuple, or
+ * maxitemsize limit was reached
+ */
+ _bt_sort_dedup_finish_pending(wstate, state, dstate);
+ /* Base tuple is always a copy */
+ pfree(dstate->base);
+
+ /* itup starts new pending posting list */
+ newbase = CopyIndexTuple(itup);
+ _bt_dedup_start_pending(dstate, newbase, InvalidOffsetNumber);
+ }
+
+ /* Report progress */
+ pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
+ ++tuples_done);
+ }
+
+ /*
+ * Handle the last item (there must be a last item when the tuplesort
+ * returned one or more tuples)
+ */
+ if (state)
+ {
+ _bt_sort_dedup_finish_pending(wstate, state, dstate);
+ /* Base tuple is always a copy */
+ pfree(dstate->base);
+ pfree(dstate->htids);
+ }
+
+ pfree(dstate);
+ }
else
{
- /* merge is unnecessary */
+ /* merging and deduplication are both unnecessary */
while ((itup = tuplesort_getindextuple(btspool->sortstate,
true)) != NULL)
{
@@ -1254,7 +1425,7 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
if (state == NULL)
state = _bt_pagestate(wstate, 0);
- _bt_buildadd(wstate, state, itup);
+ _bt_buildadd(wstate, state, itup, 0);
/* Report progress */
pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
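The deduplicating branch of _bt_load() above is a single pass over sorted input that keeps one pending posting list and flushes it when the key changes or the list is full. The sketch below shows only that control flow and is not taken from the patch. DemoEntry and demo_dedup_count() are hypothetical names, keys are plain integers, and the size limit is simplified to a maximum number of TIDs per list, whereas the patch limits the tuple size in bytes (maxitemsize).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sorted input item: an integer key plus an opaque heap TID */
typedef struct DemoEntry
{
	int			key;
	int			tid;
} DemoEntry;

/*
 * Walk entries (sorted by key, then tid) and return the number of physical
 * tuples that would be written when equal keys are merged into posting
 * lists of at most maxtids TIDs each.  If groupsizes is not NULL, it must
 * have room for n elements and receives the TID count of each tuple.
 */
static size_t
demo_dedup_count(const DemoEntry *entries, size_t n, size_t maxtids,
				 size_t *groupsizes)
{
	size_t		ngroups = 0;
	size_t		pending = 0;	/* TIDs in the pending posting list */
	int			basekey = 0;	/* key of the pending list's base tuple */

	assert(maxtids > 0);

	for (size_t i = 0; i < n; i++)
	{
		/* Same key and still room: merge into the pending posting list */
		if (pending > 0 && entries[i].key == basekey && pending < maxtids)
		{
			pending++;
			continue;
		}

		/* Key changed or limit reached: finish the pending list, if any */
		if (pending > 0)
		{
			if (groupsizes)
				groupsizes[ngroups] = pending;
			ngroups++;
		}

		/* Current entry becomes the base of a new pending posting list */
		basekey = entries[i].key;
		pending = 1;
	}

	/* Handle the last pending list (present whenever n > 0) */
	if (pending > 0)
	{
		if (groupsizes)
			groupsizes[ngroups] = pending;
		ngroups++;
	}

	return ngroups;
}
```

As in the patch, the final flush after the loop is easy to forget: the last pending list is never followed by a differing key, so it has to be written out explicitly.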
diff --git a/src/backend/access/nbtree/nbtsplitloc.c b/src/backend/access/nbtree/nbtsplitloc.c
index a04d4e25d6..8078522b5c 100644
--- a/src/backend/access/nbtree/nbtsplitloc.c
+++ b/src/backend/access/nbtree/nbtsplitloc.c
@@ -51,6 +51,7 @@ typedef struct
Size newitemsz; /* size of newitem (includes line pointer) */
bool is_leaf; /* T if splitting a leaf page */
bool is_rightmost; /* T if splitting rightmost page on level */
+ bool is_deduped; /* T if posting list truncation expected */
OffsetNumber newitemoff; /* where the new item is to be inserted */
int leftspace; /* space available for items on left page */
int rightspace; /* space available for items on right page */
@@ -167,7 +168,7 @@ _bt_findsplitloc(Relation rel,
/* Count up total space in data items before actually scanning 'em */
olddataitemstotal = rightspace - (int) PageGetExactFreeSpace(page);
- leaffillfactor = RelationGetFillFactor(rel, BTREE_DEFAULT_FILLFACTOR);
+ leaffillfactor = BtreeGetFillFactor(rel, BTREE_DEFAULT_FILLFACTOR);
/* Passed-in newitemsz is MAXALIGNED but does not include line pointer */
newitemsz += sizeof(ItemIdData);
@@ -177,12 +178,16 @@ _bt_findsplitloc(Relation rel,
state.newitemsz = newitemsz;
state.is_leaf = P_ISLEAF(opaque);
state.is_rightmost = P_RIGHTMOST(opaque);
+ state.is_deduped = state.is_leaf && BtreeGetDoDedupOption(rel);
state.leftspace = leftspace;
state.rightspace = rightspace;
state.olddataitemstotal = olddataitemstotal;
state.minfirstrightsz = SIZE_MAX;
state.newitemoff = newitemoff;
+ /* newitem cannot be a posting list item */
+ Assert(!BTreeTupleIsPosting(newitem));
+
/*
* maxsplits should never exceed maxoff because there will be at most as
* many candidate split points as there are points _between_ tuples, once
@@ -459,6 +464,7 @@ _bt_recsplitloc(FindSplitData *state,
int16 leftfree,
rightfree;
Size firstrightitemsz;
+ Size postingsz = 0;
bool newitemisfirstonright;
/* Is the new item going to be the first item on the right page? */
@@ -468,8 +474,31 @@ _bt_recsplitloc(FindSplitData *state,
if (newitemisfirstonright)
firstrightitemsz = state->newitemsz;
else
+ {
firstrightitemsz = firstoldonrightsz;
+ /*
+ * Calculate suffix truncation space saving when firstright is a
+ * posting list tuple.
+ *
+ * Individual posting lists often take up a significant fraction of
+ * all space on a page. Failing to consider that the new high key
+ * won't need to store the posting list a second time really matters.
+ */
+ if (state->is_leaf && state->is_deduped)
+ {
+ ItemId itemid;
+ IndexTuple newhighkey;
+
+ itemid = PageGetItemId(state->page, firstoldonright);
+ newhighkey = (IndexTuple) PageGetItem(state->page, itemid);
+
+ if (BTreeTupleIsPosting(newhighkey))
+ postingsz = IndexTupleSize(newhighkey) -
+ BTreeTupleGetPostingOffset(newhighkey);
+ }
+ }
+
/* Account for all the old tuples */
leftfree = state->leftspace - olddataitemstoleft;
rightfree = state->rightspace -
@@ -492,9 +521,11 @@ _bt_recsplitloc(FindSplitData *state,
* adding a heap TID to the left half's new high key when splitting at the
* leaf level. In practice the new high key will often be smaller and
* will rarely be larger, but conservatively assume the worst case.
+ * Truncation always truncates away any posting list that appears in the
+ * first right tuple, though, so it's safe to subtract that overhead.
*/
if (state->is_leaf)
- leftfree -= (int16) (firstrightitemsz +
+ leftfree -= (int16) ((firstrightitemsz - postingsz) +
MAXALIGN(sizeof(ItemPointerData)));
else
leftfree -= (int16) firstrightitemsz;
@@ -691,7 +722,8 @@ _bt_afternewitemoff(FindSplitData *state, OffsetNumber maxoff,
itemid = PageGetItemId(state->page, OffsetNumberPrev(state->newitemoff));
tup = (IndexTuple) PageGetItem(state->page, itemid);
/* Do cheaper test first */
- if (!_bt_adjacenthtid(&tup->t_tid, &state->newitem->t_tid))
+ if (BTreeTupleIsPosting(tup) ||
+ !_bt_adjacenthtid(&tup->t_tid, &state->newitem->t_tid))
return false;
/* Check same conditions as rightmost item case, too */
keepnatts = _bt_keep_natts_fast(state->rel, tup, state->newitem);
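The _bt_recsplitloc() change above amounts to a small piece of arithmetic: the space charged to the left page for its new high key is the size of the first right tuple, minus that tuple's posting list, plus room for one heap TID. A rough sketch of that calculation follows. It is not code from the patch: the DEMO_ names are hypothetical, and the 8-byte maximum alignment and 6-byte TID size are assumptions that match common 64-bit builds.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed values: 8-byte maximum alignment, 6-byte heap TID */
#define DEMO_ALIGNOF	8
#define DEMO_TID_SIZE	6
#define DEMO_MAXALIGN(len) \
	(((size_t) (len) + (DEMO_ALIGNOF - 1)) & ~((size_t) (DEMO_ALIGNOF - 1)))

/* Size of a posting list: everything in the tuple after the posting offset */
static size_t
demo_posting_size(size_t tuplesize, size_t postingoffset)
{
	assert(postingoffset <= tuplesize);
	return tuplesize - postingoffset;
}

/*
 * Worst-case space charged to the left page for its new high key at the
 * leaf level.  The posting list is always truncated away, but a heap TID
 * may have to be appended as a tiebreaker.
 */
static size_t
demo_highkey_charge(size_t firstrightitemsz, size_t postingsz)
{
	assert(postingsz <= firstrightitemsz);
	return (firstrightitemsz - postingsz) + DEMO_MAXALIGN(DEMO_TID_SIZE);
}
```

With a 616-byte posting list tuple whose posting list starts at byte 16, and a 4-byte line pointer included in firstrightitemsz, the charge drops from 628 bytes to 28 bytes. That difference is why the patch bothers to account for it when scoring candidate split points.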
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 7669a1a66f..2601b59f29 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -20,6 +20,7 @@
#include "access/nbtree.h"
#include "access/reloptions.h"
#include "access/relscan.h"
+#include "catalog/catalog.h"
#include "commands/progress.h"
#include "lib/qunique.h"
#include "miscadmin.h"
@@ -98,8 +99,6 @@ _bt_mkscankey(Relation rel, IndexTuple itup)
indoption = rel->rd_indoption;
tupnatts = itup ? BTreeTupleGetNAtts(itup, rel) : 0;
- Assert(tupnatts <= IndexRelationGetNumberOfAttributes(rel));
-
/*
* We'll execute search using scan key constructed on key columns.
* Truncated attributes and non-key attributes are omitted from the final
@@ -108,12 +107,25 @@ _bt_mkscankey(Relation rel, IndexTuple itup)
key = palloc(offsetof(BTScanInsertData, scankeys) +
sizeof(ScanKeyData) * indnkeyatts);
key->heapkeyspace = itup == NULL || _bt_heapkeyspace(rel);
+ key->safededup = itup == NULL ? _bt_opclasses_support_dedup(rel) :
+ _bt_safededup(rel);
key->anynullkeys = false; /* initial assumption */
key->nextkey = false;
key->pivotsearch = false;
+ key->scantid = NULL;
key->keysz = Min(indnkeyatts, tupnatts);
- key->scantid = key->heapkeyspace && itup ?
- BTreeTupleGetHeapTID(itup) : NULL;
+
+ Assert(tupnatts <= IndexRelationGetNumberOfAttributes(rel));
+ Assert(!itup || !BTreeTupleIsPosting(itup) || key->heapkeyspace);
+
+ /*
+ * When caller passes a tuple with a heap TID, use it to set scantid. Note
+ * that this handles posting list tuples by setting scantid to the lowest
+ * heap TID in the posting list.
+ */
+ if (itup && key->heapkeyspace)
+ key->scantid = BTreeTupleGetHeapTID(itup);
+
skey = key->scankeys;
for (i = 0; i < indnkeyatts; i++)
{
@@ -1373,6 +1385,7 @@ _bt_checkkeys(IndexScanDesc scan, IndexTuple tuple, int tupnatts,
* attribute passes the qual.
*/
Assert(ScanDirectionIsForward(dir));
+ Assert(BTreeTupleIsPivot(tuple));
continue;
}
@@ -1534,6 +1547,7 @@ _bt_check_rowcompare(ScanKey skey, IndexTuple tuple, int tupnatts,
* attribute passes the qual.
*/
Assert(ScanDirectionIsForward(dir));
+ Assert(BTreeTupleIsPivot(tuple));
cmpresult = 0;
if (subkey->sk_flags & SK_ROW_END)
break;
@@ -1773,10 +1787,35 @@ _bt_killitems(IndexScanDesc scan)
{
ItemId iid = PageGetItemId(page, offnum);
IndexTuple ituple = (IndexTuple) PageGetItem(page, iid);
+ bool killtuple = false;
- if (ItemPointerEquals(&ituple->t_tid, &kitem->heapTid))
+ if (BTreeTupleIsPosting(ituple))
{
- /* found the item */
+ int pi = i + 1;
+ int nposting = BTreeTupleGetNPosting(ituple);
+ int j;
+
+ for (j = 0; j < nposting; j++)
+ {
+ ItemPointer item = BTreeTupleGetPostingN(ituple, j);
+
+ if (!ItemPointerEquals(item, &kitem->heapTid))
+ break; /* out of posting list loop */
+
+ /* Read-ahead to later kitems */
+ if (pi < numKilled)
+ kitem = &so->currPos.items[so->killedItems[pi++]];
+ }
+
+ if (j == nposting)
+ killtuple = true;
+ }
+ else if (ItemPointerEquals(&ituple->t_tid, &kitem->heapTid))
+ killtuple = true;
+
+ if (killtuple)
+ {
+ /* found the item/all posting list items */
ItemIdMarkDead(iid);
killedsomething = true;
break; /* out of inner search loop */
@@ -2014,7 +2053,31 @@ BTreeShmemInit(void)
bytea *
btoptions(Datum reloptions, bool validate)
{
- return default_reloptions(reloptions, validate, RELOPT_KIND_BTREE);
+ relopt_value *options;
+ BtreeOptions *rdopts;
+ int numoptions;
+ static const relopt_parse_elt tab[] = {
+ {"fillfactor", RELOPT_TYPE_INT, offsetof(BtreeOptions, fillfactor)},
+ {"vacuum_cleanup_index_scale_factor", RELOPT_TYPE_REAL,
+ offsetof(BtreeOptions, vacuum_cleanup_index_scale_factor)},
+ {"deduplication", RELOPT_TYPE_BOOL,
+ offsetof(BtreeOptions, deduplication)}
+ };
+
+ options = parseRelOptions(reloptions, validate, RELOPT_KIND_BTREE,
+ &numoptions);
+
+ /* if none set, we're done */
+ if (numoptions == 0)
+ return NULL;
+
+ rdopts = allocateReloptStruct(sizeof(BtreeOptions), options, numoptions);
+
+ fillRelOptions((void *) rdopts, sizeof(BtreeOptions), options, numoptions,
+ validate, tab, lengthof(tab));
+
+ pfree(options);
+ return (bytea *) rdopts;
}
/*
@@ -2127,6 +2190,24 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
pivot = index_truncate_tuple(itupdesc, firstright, keepnatts);
+ if (BTreeTupleIsPosting(firstright))
+ {
+ BTreeTupleClearBtIsPosting(pivot);
+ BTreeTupleSetNAtts(pivot, keepnatts);
+ if (keepnatts == natts)
+ {
+ /*
+ * index_truncate_tuple() just returned a copy of the
+ * original, so make sure that the size of the new pivot tuple
+ * doesn't have posting list overhead
+ */
+ pivot->t_info &= ~INDEX_SIZE_MASK;
+ pivot->t_info |= MAXALIGN(BTreeTupleGetPostingOffset(firstright));
+ }
+ }
+
+ Assert(!BTreeTupleIsPosting(pivot));
+
/*
* If there is a distinguishing key attribute within new pivot tuple,
* there is no need to add an explicit heap TID attribute
@@ -2143,6 +2224,8 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
* attribute to the new pivot tuple.
*/
Assert(natts != nkeyatts);
+ Assert(!BTreeTupleIsPosting(lastleft) &&
+ !BTreeTupleIsPosting(firstright));
newsize = IndexTupleSize(pivot) + MAXALIGN(sizeof(ItemPointerData));
tidpivot = palloc0(newsize);
memcpy(tidpivot, pivot, IndexTupleSize(pivot));
@@ -2150,6 +2233,24 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
pfree(pivot);
pivot = tidpivot;
}
+ else if (BTreeTupleIsPosting(firstright))
+ {
+ /*
+ * No truncation was possible, since key attributes are all equal. We
+ * can always truncate away a posting list, though.
+ *
+ * It's necessary to add a heap TID attribute to the new pivot tuple.
+ */
+ newsize = MAXALIGN(BTreeTupleGetPostingOffset(firstright)) +
+ MAXALIGN(sizeof(ItemPointerData));
+ pivot = palloc0(newsize);
+ memcpy(pivot, firstright, BTreeTupleGetPostingOffset(firstright));
+
+ pivot->t_info &= ~INDEX_SIZE_MASK;
+ pivot->t_info |= newsize;
+ BTreeTupleClearBtIsPosting(pivot);
+ BTreeTupleSetAltHeapTID(pivot);
+ }
else
{
/*
@@ -2157,7 +2258,8 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
* It's necessary to add a heap TID attribute to the new pivot tuple.
*/
Assert(natts == nkeyatts);
- newsize = IndexTupleSize(firstright) + MAXALIGN(sizeof(ItemPointerData));
+ newsize = MAXALIGN(IndexTupleSize(firstright)) +
+ MAXALIGN(sizeof(ItemPointerData));
pivot = palloc0(newsize);
memcpy(pivot, firstright, IndexTupleSize(firstright));
}
@@ -2175,6 +2277,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
* nbtree (e.g., there is no pg_attribute entry).
*/
Assert(itup_key->heapkeyspace);
+ Assert(!BTreeTupleIsPosting(pivot));
pivot->t_info &= ~INDEX_SIZE_MASK;
pivot->t_info |= newsize;
@@ -2187,7 +2290,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
*/
pivotheaptid = (ItemPointer) ((char *) pivot + newsize -
sizeof(ItemPointerData));
- ItemPointerCopy(&lastleft->t_tid, pivotheaptid);
+ ItemPointerCopy(BTreeTupleGetMaxHeapTID(lastleft), pivotheaptid);
/*
* Lehman and Yao require that the downlink to the right page, which is to
@@ -2198,9 +2301,12 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
* tiebreaker.
*/
#ifndef DEBUG_NO_TRUNCATE
- Assert(ItemPointerCompare(&lastleft->t_tid, &firstright->t_tid) < 0);
- Assert(ItemPointerCompare(pivotheaptid, &lastleft->t_tid) >= 0);
- Assert(ItemPointerCompare(pivotheaptid, &firstright->t_tid) < 0);
+ Assert(ItemPointerCompare(BTreeTupleGetMaxHeapTID(lastleft),
+ BTreeTupleGetHeapTID(firstright)) < 0);
+ Assert(ItemPointerCompare(pivotheaptid,
+ BTreeTupleGetHeapTID(lastleft)) >= 0);
+ Assert(ItemPointerCompare(pivotheaptid,
+ BTreeTupleGetHeapTID(firstright)) < 0);
#else
/*
@@ -2213,7 +2319,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
* attribute values along with lastleft's heap TID value when lastleft's
* TID happens to be greater than firstright's TID.
*/
- ItemPointerCopy(&firstright->t_tid, pivotheaptid);
+ ItemPointerCopy(BTreeTupleGetHeapTID(firstright), pivotheaptid);
/*
* Pivot heap TID should never be fully equal to firstright. Note that
@@ -2222,7 +2328,8 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
*/
ItemPointerSetOffsetNumber(pivotheaptid,
OffsetNumberPrev(ItemPointerGetOffsetNumber(pivotheaptid)));
- Assert(ItemPointerCompare(pivotheaptid, &firstright->t_tid) < 0);
+ Assert(ItemPointerCompare(pivotheaptid,
+ BTreeTupleGetHeapTID(firstright)) < 0);
#endif
BTreeTupleSetNAtts(pivot, nkeyatts);
@@ -2310,6 +2417,10 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
* leaving excessive amounts of free space on either side of page split.
* Callers can rely on the fact that attributes considered equal here are
* definitely also equal according to _bt_keep_natts.
+ *
+ * When an index only uses opclasses where _bt_opclasses_support_dedup()
+ * reports that deduplication is safe, this function is guaranteed to give the
+ * same result as _bt_keep_natts().
*/
int
_bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
@@ -2387,22 +2498,30 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
tupnatts = BTreeTupleGetNAtts(itup, rel);
+ /* !heapkeyspace indexes do not support deduplication */
+ if (!heapkeyspace && BTreeTupleIsPosting(itup))
+ return false;
+
+ /* INCLUDE indexes do not support deduplication */
+ if (natts != nkeyatts && BTreeTupleIsPosting(itup))
+ return false;
+
if (P_ISLEAF(opaque))
{
if (offnum >= P_FIRSTDATAKEY(opaque))
{
/*
- * Non-pivot tuples currently never use alternative heap TID
- * representation -- even those within heapkeyspace indexes
+ * Non-pivot tuple should never be explicitly marked as a pivot
+ * tuple
*/
- if ((itup->t_info & INDEX_ALT_TID_MASK) != 0)
+ if (BTreeTupleIsPivot(itup))
return false;
/*
* Leaf tuples that are not the page high key (non-pivot tuples)
* should never be truncated. (Note that tupnatts must have been
- * inferred, rather than coming from an explicit on-disk
- * representation.)
+ * inferred, even with a posting list tuple, because only pivot
+ * tuples store tupnatts directly.)
*/
return tupnatts == natts;
}
@@ -2446,12 +2565,12 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
* non-zero, or when there is no explicit representation and the
* tuple is evidently not a pre-pg_upgrade tuple.
*
- * Prior to v11, downlinks always had P_HIKEY as their offset. Use
- * that to decide if the tuple is a pre-v11 tuple.
+ * Prior to v11, downlinks always had P_HIKEY as their offset.
+ * Accept that as an alternative indication of a valid
+ * !heapkeyspace negative infinity tuple.
*/
return tupnatts == 0 ||
- ((itup->t_info & INDEX_ALT_TID_MASK) == 0 &&
- ItemPointerGetOffsetNumber(&(itup->t_tid)) == P_HIKEY);
+ ItemPointerGetOffsetNumber(&(itup->t_tid)) == P_HIKEY;
}
else
{
@@ -2477,7 +2596,11 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
* heapkeyspace index pivot tuples, regardless of whether or not there are
* non-key attributes.
*/
- if ((itup->t_info & INDEX_ALT_TID_MASK) == 0)
+ if (!BTreeTupleIsPivot(itup))
+ return false;
+
+ /* Pivot tuple should not use posting list representation (redundant) */
+ if (BTreeTupleIsPosting(itup))
return false;
/*
@@ -2547,11 +2670,54 @@ _bt_check_third_page(Relation rel, Relation heap, bool needheaptidspace,
BTMaxItemSizeNoHeapTid(page),
RelationGetRelationName(rel)),
errdetail("Index row references tuple (%u,%u) in relation \"%s\".",
- ItemPointerGetBlockNumber(&newtup->t_tid),
- ItemPointerGetOffsetNumber(&newtup->t_tid),
+ ItemPointerGetBlockNumber(BTreeTupleGetHeapTID(newtup)),
+ ItemPointerGetOffsetNumber(BTreeTupleGetHeapTID(newtup)),
RelationGetRelationName(heap)),
errhint("Values larger than 1/3 of a buffer page cannot be indexed.\n"
"Consider a function index of an MD5 hash of the value, "
"or use full text indexing."),
errtableconstraint(heap, RelationGetRelationName(rel))));
}
+
+/*
+ * Is it safe to perform deduplication for an index, given the opclasses and
+ * collations used?
+ *
+ * Returned value is stored in index metapage during index builds. Function
+ * does not account for incompatibilities caused by index being on an earlier
+ * nbtree version.
+ */
+bool
+_bt_opclasses_support_dedup(Relation index)
+{
+ /* INCLUDE indexes don't support deduplication */
+ if (IndexRelationGetNumberOfAttributes(index) !=
+ IndexRelationGetNumberOfKeyAttributes(index))
+ return false;
+
+ /*
+ * There is no reason why deduplication cannot be used with system catalog
+ * indexes. However, we deem it generally unsafe because it's not clear
+ * how it could be disabled. (ALTER INDEX is not supported with system
+ * catalog indexes, so users have no way to set the "deduplicate" storage
+ * parameter.)
+ */
+ if (IsCatalogRelation(index))
+ return false;
+
+ for (int i = 0; i < IndexRelationGetNumberOfKeyAttributes(index); i++)
+ {
+ Oid opfamily = index->rd_opfamily[i];
+ Oid collation = index->rd_indcollation[i];
+
+ /* TODO add adequate check of opclasses and collations */
+ elog(DEBUG4, "index %s column i %d opfamilyOid %u collationOid %u",
+ RelationGetRelationName(index), i, opfamily, collation);
+
+ /* NUMERIC btree opfamily OID is 1988 */
+ if (opfamily == 1988)
+ return false;
+ }
+
+ return true;
+}
diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index 44f6283950..d36d31c758 100644
--- a/src/backend/access/nbtree/nbtxlog.c
+++ b/src/backend/access/nbtree/nbtxlog.c
@@ -22,6 +22,9 @@
#include "access/xlogutils.h"
#include "miscadmin.h"
#include "storage/procarray.h"
+#include "utils/memutils.h"
+
+static MemoryContext opCtx; /* working memory for operations */
/*
* _bt_restore_page -- re-enter all the index tuples on a page
@@ -111,6 +114,7 @@ _bt_restore_meta(XLogReaderState *record, uint8 block_id)
Assert(md->btm_version >= BTREE_NOVAC_VERSION);
md->btm_oldest_btpo_xact = xlrec->oldest_btpo_xact;
md->btm_last_cleanup_num_heap_tuples = xlrec->last_cleanup_num_heap_tuples;
+ md->btm_safededup = xlrec->btm_safededup;
pageop = (BTPageOpaque) PageGetSpecialPointer(metapg);
pageop->btpo_flags = BTP_META;
@@ -181,9 +185,45 @@ btree_xlog_insert(bool isleaf, bool ismeta, XLogReaderState *record)
page = BufferGetPage(buffer);
- if (PageAddItem(page, (Item) datapos, datalen, xlrec->offnum,
- false, false) == InvalidOffsetNumber)
- elog(PANIC, "btree_xlog_insert: failed to add item");
+ if (xlrec->postingoff == InvalidOffsetNumber)
+ {
+ /* Simple retail insertion */
+ if (PageAddItem(page, (Item) datapos, datalen, xlrec->offnum,
+ false, false) == InvalidOffsetNumber)
+ elog(PANIC, "btree_xlog_insert: failed to add item");
+ }
+ else
+ {
+ ItemId itemid;
+ IndexTuple oposting,
+ newitem,
+ nposting;
+
+ /*
+ * A posting list split occurred during insertion.
+ *
+ * Use _bt_swap_posting() to repeat posting list split steps from
+ * primary. Note that newitem from WAL record is 'orignewitem',
+ * not the final version of newitem that is actually inserted on
+ * page.
+ */
+ Assert(isleaf);
+ itemid = PageGetItemId(page, OffsetNumberPrev(xlrec->offnum));
+ oposting = (IndexTuple) PageGetItem(page, itemid);
+
+ /* newitem must be mutable copy for _bt_swap_posting() */
+ newitem = CopyIndexTuple((IndexTuple) datapos);
+ nposting = _bt_swap_posting(newitem, oposting, xlrec->postingoff);
+
+ /* Replace existing posting list with post-split version */
+ memcpy(oposting, nposting, MAXALIGN(IndexTupleSize(nposting)));
+
+ /* insert new item */
+ Assert(IndexTupleSize(newitem) == datalen);
+ if (PageAddItem(page, (Item) newitem, datalen, xlrec->offnum,
+ false, false) == InvalidOffsetNumber)
+ elog(PANIC, "btree_xlog_insert: failed to add posting split new item");
+ }
PageSetLSN(page, lsn);
MarkBufferDirty(buffer);
@@ -265,20 +305,38 @@ btree_xlog_split(bool onleft, XLogReaderState *record)
BTPageOpaque lopaque = (BTPageOpaque) PageGetSpecialPointer(lpage);
OffsetNumber off;
IndexTuple newitem = NULL,
- left_hikey = NULL;
+ left_hikey = NULL,
+ nposting = NULL;
Size newitemsz = 0,
left_hikeysz = 0;
Page newlpage;
- OffsetNumber leftoff;
+ OffsetNumber leftoff,
+ replacepostingoff = InvalidOffsetNumber;
datapos = XLogRecGetBlockData(record, 0, &datalen);
- if (onleft)
+ if (onleft || xlrec->postingoff != 0)
{
newitem = (IndexTuple) datapos;
newitemsz = MAXALIGN(IndexTupleSize(newitem));
datapos += newitemsz;
datalen -= newitemsz;
+
+ if (xlrec->postingoff != 0)
+ {
+ ItemId itemid;
+ IndexTuple oposting;
+
+ /* Posting list must be at offset number before new item's */
+ replacepostingoff = OffsetNumberPrev(xlrec->newitemoff);
+
+ /* newitem must be mutable copy for _bt_swap_posting() */
+ newitem = CopyIndexTuple(newitem);
+ itemid = PageGetItemId(lpage, replacepostingoff);
+ oposting = (IndexTuple) PageGetItem(lpage, itemid);
+ nposting = _bt_swap_posting(newitem, oposting,
+ xlrec->postingoff);
+ }
}
/* Extract left hikey and its size (assuming 16-bit alignment) */
@@ -304,8 +362,20 @@ btree_xlog_split(bool onleft, XLogReaderState *record)
Size itemsz;
IndexTuple item;
+ /* Add replacement posting list when required */
+ if (off == replacepostingoff)
+ {
+ Assert(onleft || xlrec->firstright == xlrec->newitemoff);
+ if (PageAddItem(newlpage, (Item) nposting,
+ MAXALIGN(IndexTupleSize(nposting)), leftoff,
+ false, false) == InvalidOffsetNumber)
+ elog(ERROR, "failed to add new posting list item to left page after split");
+ leftoff = OffsetNumberNext(leftoff);
+ continue;
+ }
+
/* add the new item if it was inserted on left page */
- if (onleft && off == xlrec->newitemoff)
+ else if (onleft && off == xlrec->newitemoff)
{
if (PageAddItem(newlpage, (Item) newitem, newitemsz, leftoff,
false, false) == InvalidOffsetNumber)
@@ -379,6 +449,84 @@ btree_xlog_split(bool onleft, XLogReaderState *record)
}
}
+static void
+btree_xlog_dedup(XLogReaderState *record)
+{
+ XLogRecPtr lsn = record->EndRecPtr;
+ Buffer buf;
+ xl_btree_dedup *xlrec = (xl_btree_dedup *) XLogRecGetData(record);
+
+ if (XLogReadBufferForRedo(record, 0, &buf) == BLK_NEEDS_REDO)
+ {
+ /*
+ * Redo deduplication in place: merge the interval of items described
+ * by the WAL record into a single posting list tuple on this page.
+ */
+ Page page = (Page) BufferGetPage(buf);
+ OffsetNumber offnum;
+ BTDedupState *state;
+
+ state = (BTDedupState *) palloc(sizeof(BTDedupState));
+
+ state->maxitemsize = BTMaxItemSize(page);
+ state->checkingunique = false; /* unused */
+ state->skippedbase = InvalidOffsetNumber;
+ state->newitem = NULL;
+ /* Metadata about current pending posting list */
+ state->htids = NULL;
+ state->nhtids = 0;
+ state->nitems = 0;
+ state->alltupsize = 0;
+ state->overlap = false;
+ /* Metadata about base tuple of current pending posting list */
+ state->base = NULL;
+ state->baseoff = InvalidOffsetNumber;
+ state->basetupsize = 0;
+
+ /* Conservatively size array */
+ state->htids = palloc(state->maxitemsize);
+
+ /*
+ * Iterate over tuples on the page belonging to the interval to
+ * deduplicate them into a posting list.
+ */
+ for (offnum = xlrec->baseoff;
+ offnum < xlrec->baseoff + xlrec->nitems;
+ offnum = OffsetNumberNext(offnum))
+ {
+ ItemId itemid = PageGetItemId(page, offnum);
+ IndexTuple itup = (IndexTuple) PageGetItem(page, itemid);
+
+ Assert(!ItemIdIsDead(itemid));
+
+ if (offnum == xlrec->baseoff)
+ {
+ /*
+ * No previous/base tuple for first data item -- use first
+ * data item as base tuple of first pending posting list
+ */
+ _bt_dedup_start_pending(state, itup, offnum);
+ }
+ else
+ {
+ /* Heap TID(s) for itup will be saved in state */
+ if (!_bt_dedup_save_htid(state, itup))
+ elog(ERROR, "could not add heap tid to pending posting list");
+ }
+ }
+
+ Assert(state->nitems == xlrec->nitems);
+ /* Handle the last item */
+ _bt_dedup_finish_pending(buf, state, false);
+
+ PageSetLSN(page, lsn);
+ MarkBufferDirty(buf);
+ }
+
+ if (BufferIsValid(buf))
+ UnlockReleaseBuffer(buf);
+}
+
static void
btree_xlog_vacuum(XLogReaderState *record)
{
@@ -386,8 +534,8 @@ btree_xlog_vacuum(XLogReaderState *record)
Buffer buffer;
Page page;
BTPageOpaque opaque;
-#ifdef UNUSED
xl_btree_vacuum *xlrec = (xl_btree_vacuum *) XLogRecGetData(record);
+#ifdef UNUSED
/*
* This section of code is thought to be no longer needed, after analysis
@@ -478,14 +626,34 @@ btree_xlog_vacuum(XLogReaderState *record)
if (len > 0)
{
- OffsetNumber *unused;
- OffsetNumber *unend;
+ if (xlrec->nupdated > 0)
+ {
+ OffsetNumber *updatedoffsets;
+ IndexTuple updated;
+ Size itemsz;
- unused = (OffsetNumber *) ptr;
- unend = (OffsetNumber *) ((char *) ptr + len);
+ updatedoffsets = (OffsetNumber *)
+ (ptr + xlrec->ndeleted * sizeof(OffsetNumber));
+ updated = (IndexTuple) ((char *) updatedoffsets +
+ xlrec->nupdated * sizeof(OffsetNumber));
- if ((unend - unused) > 0)
- PageIndexMultiDelete(page, unused, unend - unused);
+ /* Handle posting tuples */
+ for (int i = 0; i < xlrec->nupdated; i++)
+ {
+ PageIndexTupleDelete(page, updatedoffsets[i]);
+
+ itemsz = MAXALIGN(IndexTupleSize(updated));
+
+ if (PageAddItem(page, (Item) updated, itemsz, updatedoffsets[i],
+ false, false) == InvalidOffsetNumber)
+ elog(PANIC, "btree_xlog_vacuum: failed to add updated posting list item");
+
+ updated = (IndexTuple) ((char *) updated + itemsz);
+ }
+ }
+
+ if (xlrec->ndeleted)
+ PageIndexMultiDelete(page, (OffsetNumber *) ptr, xlrec->ndeleted);
}
/*
@@ -820,7 +988,9 @@ void
btree_redo(XLogReaderState *record)
{
uint8 info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+ MemoryContext oldCtx;
+ oldCtx = MemoryContextSwitchTo(opCtx);
switch (info)
{
case XLOG_BTREE_INSERT_LEAF:
@@ -838,6 +1008,9 @@ btree_redo(XLogReaderState *record)
case XLOG_BTREE_SPLIT_R:
btree_xlog_split(false, record);
break;
+ case XLOG_BTREE_DEDUP_PAGE:
+ btree_xlog_dedup(record);
+ break;
case XLOG_BTREE_VACUUM:
btree_xlog_vacuum(record);
break;
@@ -863,6 +1036,23 @@ btree_redo(XLogReaderState *record)
default:
elog(PANIC, "btree_redo: unknown op code %u", info);
}
+ MemoryContextSwitchTo(oldCtx);
+ MemoryContextReset(opCtx);
+}
+
+void
+btree_xlog_startup(void)
+{
+ opCtx = AllocSetContextCreate(CurrentMemoryContext,
+ "Btree recovery temporary context",
+ ALLOCSET_DEFAULT_SIZES);
+}
+
+void
+btree_xlog_cleanup(void)
+{
+ MemoryContextDelete(opCtx);
+ opCtx = NULL;
}
/*
diff --git a/src/backend/access/rmgrdesc/nbtdesc.c b/src/backend/access/rmgrdesc/nbtdesc.c
index 4ee6d04a68..1dde2da285 100644
--- a/src/backend/access/rmgrdesc/nbtdesc.c
+++ b/src/backend/access/rmgrdesc/nbtdesc.c
@@ -30,7 +30,8 @@ btree_desc(StringInfo buf, XLogReaderState *record)
{
xl_btree_insert *xlrec = (xl_btree_insert *) rec;
- appendStringInfo(buf, "off %u", xlrec->offnum);
+ appendStringInfo(buf, "off %u; postingoff %u",
+ xlrec->offnum, xlrec->postingoff);
break;
}
case XLOG_BTREE_SPLIT_L:
@@ -38,16 +39,30 @@ btree_desc(StringInfo buf, XLogReaderState *record)
{
xl_btree_split *xlrec = (xl_btree_split *) rec;
- appendStringInfo(buf, "level %u, firstright %d, newitemoff %d",
- xlrec->level, xlrec->firstright, xlrec->newitemoff);
+ appendStringInfo(buf, "level %u, firstright %d, newitemoff %d, postingoff %d",
+ xlrec->level,
+ xlrec->firstright,
+ xlrec->newitemoff,
+ xlrec->postingoff);
+ break;
+ }
+ case XLOG_BTREE_DEDUP_PAGE:
+ {
+ xl_btree_dedup *xlrec = (xl_btree_dedup *) rec;
+
+ appendStringInfo(buf, "baseoff %u; nitems %u",
+ xlrec->baseoff,
+ xlrec->nitems);
break;
}
case XLOG_BTREE_VACUUM:
{
xl_btree_vacuum *xlrec = (xl_btree_vacuum *) rec;
- appendStringInfo(buf, "lastBlockVacuumed %u",
- xlrec->lastBlockVacuumed);
+ appendStringInfo(buf, "lastBlockVacuumed %u; nupdated %u; ndeleted %u",
+ xlrec->lastBlockVacuumed,
+ xlrec->nupdated,
+ xlrec->ndeleted);
break;
}
case XLOG_BTREE_DELETE:
@@ -131,6 +146,9 @@ btree_identify(uint8 info)
case XLOG_BTREE_SPLIT_R:
id = "SPLIT_R";
break;
+ case XLOG_BTREE_DEDUP_PAGE:
+ id = "DEDUPLICATE";
+ break;
case XLOG_BTREE_VACUUM:
id = "VACUUM";
break;
diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c
index 2b1e3cda4a..bf4a27ab75 100644
--- a/src/bin/psql/tab-complete.c
+++ b/src/bin/psql/tab-complete.c
@@ -1677,14 +1677,14 @@ psql_completion(const char *text, int start, int end)
/* ALTER INDEX <foo> SET|RESET ( */
else if (Matches("ALTER", "INDEX", MatchAny, "RESET", "("))
COMPLETE_WITH("fillfactor",
- "vacuum_cleanup_index_scale_factor", /* BTREE */
+ "vacuum_cleanup_index_scale_factor", "deduplication", /* BTREE */
"fastupdate", "gin_pending_list_limit", /* GIN */
"buffering", /* GiST */
"pages_per_range", "autosummarize" /* BRIN */
);
else if (Matches("ALTER", "INDEX", MatchAny, "SET", "("))
COMPLETE_WITH("fillfactor =",
- "vacuum_cleanup_index_scale_factor =", /* BTREE */
+ "vacuum_cleanup_index_scale_factor =", "deduplication =", /* BTREE */
"fastupdate =", "gin_pending_list_limit =", /* GIN */
"buffering =", /* GiST */
"pages_per_range =", "autosummarize =" /* BRIN */
diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index 3542545de5..cfdc968c6d 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -145,6 +145,7 @@ static void bt_tuple_present_callback(Relation index, ItemPointer tid,
bool tupleIsAlive, void *checkstate);
static IndexTuple bt_normalize_tuple(BtreeCheckState *state,
IndexTuple itup);
+static inline IndexTuple bt_posting_logical_tuple(IndexTuple itup, int n);
static bool bt_rootdescend(BtreeCheckState *state, IndexTuple itup);
static inline bool offset_is_negative_infinity(BTPageOpaque opaque,
OffsetNumber offset);
@@ -419,12 +420,13 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
/*
* Size Bloom filter based on estimated number of tuples in index,
* while conservatively assuming that each block must contain at least
- * MaxIndexTuplesPerPage / 5 non-pivot tuples. (Non-leaf pages cannot
- * contain non-pivot tuples. That's okay because they generally make
- * up no more than about 1% of all pages in the index.)
+ * MaxBTreeIndexTuplesPerPage / 3 "logical" tuples. heapallindexed
+ * verification fingerprints posting list heap TIDs as plain non-pivot
+ * tuples, complete with index keys. This allows its heap scan to
+ * behave as if posting lists do not exist.
*/
total_pages = RelationGetNumberOfBlocks(rel);
- total_elems = Max(total_pages * (MaxIndexTuplesPerPage / 5),
+ total_elems = Max(total_pages * (MaxBTreeIndexTuplesPerPage / 3),
(int64) state->rel->rd_rel->reltuples);
/* Random seed relies on backend srandom() call to avoid repetition */
seed = random();
@@ -924,6 +926,7 @@ bt_target_page_check(BtreeCheckState *state)
size_t tupsize;
BTScanInsert skey;
bool lowersizelimit;
+ ItemPointer scantid;
CHECK_FOR_INTERRUPTS();
@@ -994,29 +997,73 @@ bt_target_page_check(BtreeCheckState *state)
/*
* Readonly callers may optionally verify that non-pivot tuples can
- * each be found by an independent search that starts from the root
+ * each be found by an independent search that starts from the root.
+ * Note that we deliberately don't do individual searches for each
+ * "logical" posting list tuple, since the posting list itself is
+ * validated by other checks.
*/
if (state->rootdescend && P_ISLEAF(topaque) &&
!bt_rootdescend(state, itup))
{
char *itid,
*htid;
+ ItemPointer tid = BTreeTupleGetHeapTID(itup);
itid = psprintf("(%u,%u)", state->targetblock, offset);
htid = psprintf("(%u,%u)",
- ItemPointerGetBlockNumber(&(itup->t_tid)),
- ItemPointerGetOffsetNumber(&(itup->t_tid)));
+ ItemPointerGetBlockNumber(tid),
+ ItemPointerGetOffsetNumber(tid));
ereport(ERROR,
(errcode(ERRCODE_INDEX_CORRUPTED),
errmsg("could not find tuple using search from root page in index \"%s\"",
RelationGetRelationName(state->rel)),
- errdetail_internal("Index tid=%s points to heap tid=%s page lsn=%X/%X.",
+ errdetail_internal("Index tid=%s min heap tid=%s page lsn=%X/%X.",
itid, htid,
(uint32) (state->targetlsn >> 32),
(uint32) state->targetlsn)));
}
+ /*
+ * If tuple is actually a posting list, make sure posting list TIDs
+ * are in order.
+ */
+ if (BTreeTupleIsPosting(itup))
+ {
+ ItemPointerData last;
+ ItemPointer current;
+
+ ItemPointerCopy(BTreeTupleGetHeapTID(itup), &last);
+
+ for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
+ {
+
+ current = BTreeTupleGetPostingN(itup, i);
+
+ if (ItemPointerCompare(current, &last) <= 0)
+ {
+ char *itid,
+ *htid;
+
+ itid = psprintf("(%u,%u)", state->targetblock, offset);
+ htid = psprintf("(%u,%u)",
+ ItemPointerGetBlockNumberNoCheck(current),
+ ItemPointerGetOffsetNumberNoCheck(current));
+
+ ereport(ERROR,
+ (errcode(ERRCODE_INDEX_CORRUPTED),
+ errmsg("posting list heap TIDs out of order in index \"%s\"",
+ RelationGetRelationName(state->rel)),
+ errdetail_internal("Index tid=%s min heap tid=%s page lsn=%X/%X.",
+ itid, htid,
+ (uint32) (state->targetlsn >> 32),
+ (uint32) state->targetlsn)));
+ }
+
+ ItemPointerCopy(current, &last);
+ }
+ }
+
/* Build insertion scankey for current page offset */
skey = bt_mkscankey_pivotsearch(state->rel, itup);
@@ -1074,12 +1121,32 @@ bt_target_page_check(BtreeCheckState *state)
{
IndexTuple norm;
- norm = bt_normalize_tuple(state, itup);
- bloom_add_element(state->filter, (unsigned char *) norm,
- IndexTupleSize(norm));
- /* Be tidy */
- if (norm != itup)
- pfree(norm);
+ if (BTreeTupleIsPosting(itup))
+ {
+ /* Fingerprint all elements as distinct "logical" tuples */
+ for (int i = 0; i < BTreeTupleGetNPosting(itup); i++)
+ {
+ IndexTuple logtuple;
+
+ logtuple = bt_posting_logical_tuple(itup, i);
+ norm = bt_normalize_tuple(state, logtuple);
+ bloom_add_element(state->filter, (unsigned char *) norm,
+ IndexTupleSize(norm));
+ /* Be tidy */
+ if (norm != logtuple)
+ pfree(norm);
+ pfree(logtuple);
+ }
+ }
+ else
+ {
+ norm = bt_normalize_tuple(state, itup);
+ bloom_add_element(state->filter, (unsigned char *) norm,
+ IndexTupleSize(norm));
+ /* Be tidy */
+ if (norm != itup)
+ pfree(norm);
+ }
}
/*
@@ -1087,7 +1154,8 @@ bt_target_page_check(BtreeCheckState *state)
*
* If there is a high key (if this is not the rightmost page on its
* entire level), check that high key actually is upper bound on all
- * page items.
+ * page items. If this is a posting list tuple, we'll need to set
+ * scantid to be highest TID in posting list.
*
* We prefer to check all items against high key rather than checking
* just the last and trusting that the operator class obeys the
@@ -1127,6 +1195,9 @@ bt_target_page_check(BtreeCheckState *state)
* tuple. (See also: "Notes About Data Representation" in the nbtree
* README.)
*/
+ scantid = skey->scantid;
+ if (state->heapkeyspace && !BTreeTupleIsPivot(itup))
+ skey->scantid = BTreeTupleGetMaxHeapTID(itup);
if (!P_RIGHTMOST(topaque) &&
!(P_ISLEAF(topaque) ? invariant_leq_offset(state, skey, P_HIKEY) :
invariant_l_offset(state, skey, P_HIKEY)))
@@ -1150,6 +1221,7 @@ bt_target_page_check(BtreeCheckState *state)
(uint32) (state->targetlsn >> 32),
(uint32) state->targetlsn)));
}
+ skey->scantid = scantid;
/*
* * Item order check *
@@ -1164,11 +1236,13 @@ bt_target_page_check(BtreeCheckState *state)
*htid,
*nitid,
*nhtid;
+ ItemPointer tid;
itid = psprintf("(%u,%u)", state->targetblock, offset);
+ tid = BTreeTupleGetHeapTID(itup);
htid = psprintf("(%u,%u)",
- ItemPointerGetBlockNumberNoCheck(&(itup->t_tid)),
- ItemPointerGetOffsetNumberNoCheck(&(itup->t_tid)));
+ ItemPointerGetBlockNumberNoCheck(tid),
+ ItemPointerGetOffsetNumberNoCheck(tid));
nitid = psprintf("(%u,%u)", state->targetblock,
OffsetNumberNext(offset));
@@ -1177,9 +1251,11 @@ bt_target_page_check(BtreeCheckState *state)
state->target,
OffsetNumberNext(offset));
itup = (IndexTuple) PageGetItem(state->target, itemid);
+
+ tid = BTreeTupleGetHeapTID(itup);
nhtid = psprintf("(%u,%u)",
- ItemPointerGetBlockNumberNoCheck(&(itup->t_tid)),
- ItemPointerGetOffsetNumberNoCheck(&(itup->t_tid)));
+ ItemPointerGetBlockNumberNoCheck(tid),
+ ItemPointerGetOffsetNumberNoCheck(tid));
ereport(ERROR,
(errcode(ERRCODE_INDEX_CORRUPTED),
@@ -1189,10 +1265,10 @@ bt_target_page_check(BtreeCheckState *state)
"higher index tid=%s (points to %s tid=%s) "
"page lsn=%X/%X.",
itid,
- P_ISLEAF(topaque) ? "heap" : "index",
+ P_ISLEAF(topaque) ? "min heap" : "index",
htid,
nitid,
- P_ISLEAF(topaque) ? "heap" : "index",
+ P_ISLEAF(topaque) ? "min heap" : "index",
nhtid,
(uint32) (state->targetlsn >> 32),
(uint32) state->targetlsn)));
@@ -1953,10 +2029,10 @@ bt_tuple_present_callback(Relation index, ItemPointer tid, Datum *values,
* verification. In particular, it won't try to normalize opclass-equal
* datums with potentially distinct representations (e.g., btree/numeric_ops
* index datums will not get their display scale normalized-away here).
- * Normalization may need to be expanded to handle more cases in the future,
- * though. For example, it's possible that non-pivot tuples could in the
- * future have alternative logically equivalent representations due to using
- * the INDEX_ALT_TID_MASK bit to implement intelligent deduplication.
+ * Caller does normalization for non-pivot tuples that have a posting list,
+ * since dummy CREATE INDEX callback code generates new tuples with the same
+ * normalized representation. Deduplication is performed opportunistically,
+ * and in general there is no guarantee about how or when it will be applied.
*/
static IndexTuple
bt_normalize_tuple(BtreeCheckState *state, IndexTuple itup)
@@ -1969,6 +2045,9 @@ bt_normalize_tuple(BtreeCheckState *state, IndexTuple itup)
IndexTuple reformed;
int i;
+ /* Caller should only pass "logical" non-pivot tuples here */
+ Assert(!BTreeTupleIsPosting(itup) && !BTreeTupleIsPivot(itup));
+
/* Easy case: It's immediately clear that tuple has no varlena datums */
if (!IndexTupleHasVarwidths(itup))
return itup;
@@ -2031,6 +2110,30 @@ bt_normalize_tuple(BtreeCheckState *state, IndexTuple itup)
return reformed;
}
+/*
+ * Produce palloc()'d "logical" tuple for nth posting list entry.
+ *
+ * In general, deduplication is not supposed to change the logical contents of
+ * an index. Multiple logical index tuples are folded together into one
+ * physical posting list index tuple when convenient.
+ *
+ * heapallindexed verification must normalize-away this variation in
+ * representation by converting posting list tuples into two or more "logical"
+ * tuples. Each logical tuple must be fingerprinted separately -- there must
+ * be one logical tuple for each corresponding Bloom filter probe during the
+ * heap scan.
+ *
+ * Note: Caller needs to call bt_normalize_tuple() with returned tuple.
+ */
+static inline IndexTuple
+bt_posting_logical_tuple(IndexTuple itup, int n)
+{
+ Assert(BTreeTupleIsPosting(itup));
+
+ /* Returns non-posting-list tuple */
+ return _bt_form_posting(itup, BTreeTupleGetPostingN(itup, n), 1);
+}
+
/*
* Search for itup in index, starting from fast root page. itup must be a
* non-pivot tuple. This is only supported with heapkeyspace indexes, since
@@ -2087,6 +2190,7 @@ bt_rootdescend(BtreeCheckState *state, IndexTuple itup)
insertstate.itup = itup;
insertstate.itemsz = MAXALIGN(IndexTupleSize(itup));
insertstate.itup_key = key;
+ insertstate.postingoff = 0;
insertstate.bounds_valid = false;
insertstate.buf = lbuf;
@@ -2094,7 +2198,9 @@ bt_rootdescend(BtreeCheckState *state, IndexTuple itup)
offnum = _bt_binsrch_insert(state->rel, &insertstate);
/* Compare first >= matching item on leaf page, if any */
page = BufferGetPage(lbuf);
+ /* Should match on first heap TID when tuple has a posting list */
if (offnum <= PageGetMaxOffsetNumber(page) &&
+ insertstate.postingoff <= 0 &&
_bt_compare(state->rel, key, page, offnum) == 0)
exists = true;
_bt_relbuf(state->rel, lbuf);
@@ -2548,26 +2654,29 @@ PageGetItemIdCareful(BtreeCheckState *state, BlockNumber block, Page page,
}
/*
- * BTreeTupleGetHeapTID() wrapper that lets caller enforce that a heap TID must
- * be present in cases where that is mandatory.
- *
- * This doesn't add much as of BTREE_VERSION 4, since the INDEX_ALT_TID_MASK
- * bit is effectively a proxy for whether or not the tuple is a pivot tuple.
- * It may become more useful in the future, when non-pivot tuples support their
- * own alternative INDEX_ALT_TID_MASK representation.
+ * BTreeTupleGetHeapTID() wrapper that enforces that a heap TID is present in
+ * cases where that is mandatory (i.e. for non-pivot tuples).
*/
static inline ItemPointer
BTreeTupleGetHeapTIDCareful(BtreeCheckState *state, IndexTuple itup,
bool nonpivot)
{
- ItemPointer result = BTreeTupleGetHeapTID(itup);
+ ItemPointer result;
BlockNumber targetblock = state->targetblock;
- if (result == NULL && nonpivot)
+ Assert(state->heapkeyspace);
+
+ /*
+ * Make sure that tuple type (pivot vs non-pivot) matches caller's
+ * expectation
+ */
+ if (BTreeTupleIsPivot(itup) == nonpivot)
ereport(ERROR,
(errcode(ERRCODE_INDEX_CORRUPTED),
errmsg("block %u or its right sibling block or child block in index \"%s\" contains non-pivot tuple that lacks a heap TID",
targetblock, RelationGetRelationName(state->rel))));
+ result = BTreeTupleGetHeapTID(itup);
+
return result;
}
diff --git a/doc/src/sgml/btree.sgml b/doc/src/sgml/btree.sgml
index 5881ea5dd6..a231bbe1f2 100644
--- a/doc/src/sgml/btree.sgml
+++ b/doc/src/sgml/btree.sgml
@@ -433,11 +433,55 @@ returns bool
<sect1 id="btree-implementation">
<title>Implementation</title>
+ <para>
+ Internally, a B-tree index consists of a tree structure with leaf
+ pages. Each leaf page contains tuples that point to table entries
+ using a heap item pointer. Each tuple's key is unique, since the
+ item pointer is treated as part of the key.
+ </para>
+ <para>
+ An introduction to the btree index implementation can be found in
+ <filename>src/backend/access/nbtree/README</filename>.
+ </para>
+
+ <sect2 id="btree-deduplication">
+ <title>Deduplication</title>
<para>
- An introduction to the btree index implementation can be found in
- <filename>src/backend/access/nbtree/README</filename>.
+ B-Tree supports <firstterm>deduplication</firstterm>. Existing
+ leaf page tuples with fully equal keys prior to the heap item
+ pointer are folded together into a compressed representation called
+ a <quote>posting list</quote>. The user-visible keys appear only
+ once, followed by a simple list of heap item pointers. Posting
+ lists are formed at the point where an insertion would otherwise
+ have to split the page. This can greatly increase index space
+ efficiency with data sets where each distinct key appears a few
+ times on average. Cases that don't benefit will incur a small
+ performance penalty.
+ </para>
+ <para>
+ Deduplication can only be used with indexes that use B-Tree
+ operator classes that were declared <literal>BITWISE</literal>.
+ Deduplication is not supported with nondeterministic collations,
+ nor is it supported with <literal>INCLUDE</literal> indexes. The
+ deduplication storage parameter must be set to
+ <literal>ON</literal> for new posting lists to be formed
+ (deduplication is enabled by default in the case of non-unique
+ indexes).
+ </para>
+ </sect2>
+
+ <sect2 id="btree-deduplication-unique">
+ <title>Unique indexes and deduplication</title>
+
+ <para>
+ Unique indexes can also use deduplication. This can be useful with
+ unique indexes that are prone to becoming bloated despite
+ aggressive vacuuming. Deduplication may delay leaf page splits for
+ long enough that vacuuming can prevent unnecessary page splits
+ altogether.
</para>
+ </sect2>
</sect1>
</chapter>
diff --git a/doc/src/sgml/charset.sgml b/doc/src/sgml/charset.sgml
index 55669b5cad..9f371d3e3a 100644
--- a/doc/src/sgml/charset.sgml
+++ b/doc/src/sgml/charset.sgml
@@ -928,10 +928,11 @@ CREATE COLLATION ignore_accents (provider = icu, locale = 'und-u-ks-level1-kc-tr
nondeterministic collations give a more <quote>correct</quote> behavior,
especially when considering the full power of Unicode and its many
special cases, they also have some drawbacks. Foremost, their use leads
- to a performance penalty. Also, certain operations are not possible with
- nondeterministic collations, such as pattern matching operations.
- Therefore, they should be used only in cases where they are specifically
- wanted.
+ to a performance penalty. Note, in particular, that B-tree cannot use
+ deduplication with indexes that use a nondeterministic collation. Also,
+ certain operations are not possible with nondeterministic collations,
+ such as pattern matching operations. Therefore, they should be used
+ only in cases where they are specifically wanted.
</para>
</sect3>
</sect2>
diff --git a/doc/src/sgml/ref/create_index.sgml b/doc/src/sgml/ref/create_index.sgml
index 629a31ef79..2261226965 100644
--- a/doc/src/sgml/ref/create_index.sgml
+++ b/doc/src/sgml/ref/create_index.sgml
@@ -166,6 +166,8 @@ CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ [ IF NOT EXISTS ] <replaceable class=
maximum size allowed for the index type, data insertion will fail.
In any case, non-key columns duplicate data from the index's table
and bloat the size of the index, thus potentially slowing searches.
+ Moreover, B-tree deduplication is never used with indexes that
+ have a non-key column.
</para>
<para>
@@ -388,10 +390,38 @@ CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ [ IF NOT EXISTS ] <replaceable class=
</variablelist>
<para>
- B-tree indexes additionally accept this parameter:
+ B-tree indexes also accept these parameters:
</para>
<variablelist>
+ <varlistentry id="index-reloption-deduplication" xreflabel="deduplication">
+ <term><literal>deduplication</literal>
+ <indexterm>
+ <primary><varname>deduplication</varname></primary>
+ <secondary>storage parameter</secondary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ This setting controls usage of the B-tree deduplication
+ technique described in <xref linkend="btree-deduplication"/>.
+ Defaults to <literal>ON</literal> for non-unique indexes, and
+ <literal>OFF</literal> for unique indexes. (Alternative
+ spellings of <literal>ON</literal> and <literal>OFF</literal>
+ are allowed as described in <xref linkend="config-setting"/>.)
+ </para>
+
+ <note>
+ <para>
+ Turning <literal>deduplication</literal> off via <command>ALTER
+ INDEX</command> prevents future insertions from triggering
+ deduplication, but does not in itself make existing posting list
+ tuples use the standard tuple representation.
+ </para>
+ </note>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="index-reloption-vacuum-cleanup-index-scale-factor" xreflabel="vacuum_cleanup_index_scale_factor">
<term><literal>vacuum_cleanup_index_scale_factor</literal>
<indexterm>
@@ -446,9 +476,7 @@ CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ [ IF NOT EXISTS ] <replaceable class=
This setting controls usage of the fast update technique described in
<xref linkend="gin-fast-update"/>. It is a Boolean parameter:
<literal>ON</literal> enables fast update, <literal>OFF</literal> disables it.
- (Alternative spellings of <literal>ON</literal> and <literal>OFF</literal> are
- allowed as described in <xref linkend="config-setting"/>.) The
- default is <literal>ON</literal>.
+ The default is <literal>ON</literal>.
</para>
<note>
@@ -831,6 +859,13 @@ CREATE UNIQUE INDEX title_idx ON films (title) WITH (fillfactor = 70);
</programlisting>
</para>
+ <para>
+ To create a unique index with deduplication enabled:
+<programlisting>
+CREATE UNIQUE INDEX title_idx ON films (title) WITH (deduplication = on);
+</programlisting>
+ </para>
+
<para>
To create a <acronym>GIN</acronym> index with fast updates disabled:
<programlisting>
diff --git a/doc/src/sgml/ref/reindex.sgml b/doc/src/sgml/ref/reindex.sgml
index 10881ab03a..c9a5349019 100644
--- a/doc/src/sgml/ref/reindex.sgml
+++ b/doc/src/sgml/ref/reindex.sgml
@@ -58,8 +58,9 @@ REINDEX [ ( VERBOSE ) ] { INDEX | TABLE | SCHEMA | DATABASE | SYSTEM } [ CONCURR
<listitem>
<para>
- You have altered a storage parameter (such as fillfactor)
- for an index, and wish to ensure that the change has taken full effect.
+ You have altered a storage parameter (such as fillfactor or
+ deduplication) for an index, and wish to ensure that the change has
+ taken full effect.
</para>
</listitem>
--
2.17.1
On Tue, Nov 12, 2019 at 6:22 PM Peter Geoghegan <pg@bowt.ie> wrote:
* Disabled deduplication in system catalog indexes by deeming it
generally unsafe.
I (continue to) think that deduplication is a terrible name, because
you're not getting rid of the duplicates. You are using a compressed
representation of the duplicates.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, Nov 13, 2019 at 11:33 AM Robert Haas <robertmhaas@gmail.com> wrote:
On Tue, Nov 12, 2019 at 6:22 PM Peter Geoghegan <pg@bowt.ie> wrote:
* Disabled deduplication in system catalog indexes by deeming it
generally unsafe.
I (continue to) think that deduplication is a terrible name, because
you're not getting rid of the duplicates. You are using a compressed
representation of the duplicates.
"Deduplication" never means that you get rid of duplicates. According
to Wikipedia's deduplication article: "Whereas compression algorithms
identify redundant data inside individual files and encodes this
redundant data more efficiently, the intent of deduplication is to
inspect large volumes of data and identify large sections – such as
entire files or large sections of files – that are identical, and
replace them with a shared copy".
This seemed like it fit what this patch does. We're concerned with a
specific, simple kind of redundancy. Also:
* From the user's point of view, we're merging together what they'd
call duplicates. They don't really think of the heap TID as part of
the key.
* The term "compression" suggests a decompression penalty when
reading, which is not the case here.
* The term "compression" confuses the feature added by the patch with
TOAST compression. Now we may have two very different varieties of
compression in the same index.
Can you suggest an alternative?
--
Peter Geoghegan
On Wed, Nov 13, 2019 at 2:51 PM Peter Geoghegan <pg@bowt.ie> wrote:
"Deduplication" never means that you get rid of duplicates. According
to Wikipedia's deduplication article: "Whereas compression algorithms
identify redundant data inside individual files and encodes this
redundant data more efficiently, the intent of deduplication is to
inspect large volumes of data and identify large sections – such as
entire files or large sections of files – that are identical, and
replace them with a shared copy".
Hmm. Well, maybe I'm just behind the times. But that same wikipedia
article also says that deduplication works on large chunks "such as
entire files or large sections of files" thus differentiating it from
compression algorithms which work on the byte level, so it seems to me
that what you are doing still sounds more like ad-hoc compression.
Can you suggest an alternative?
My instinct is to pick a name that somehow involves compression and
just put enough other words in there to make it clear e.g. duplicate
value compression, or something of that sort.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, Nov 15, 2019 at 5:16 AM Robert Haas <robertmhaas@gmail.com> wrote:
Hmm. Well, maybe I'm just behind the times. But that same wikipedia
article also says that deduplication works on large chunks "such as
entire files or large sections of files" thus differentiating it from
compression algorithms which work on the byte level, so it seems to me
that what you are doing still sounds more like ad-hoc compression.
I see your point.
One reason for my avoiding the word "compression" is that other DB
systems that have something similar don't use the word compression
either. Actually, they don't really call it *anything*. Posting lists
are simply the way that secondary indexes work. The "Modern B-Tree
techniques" book/survey paper mentions the idea of using a TID list in
its "3.7 Duplicate Key Values" section, not in the two related
sections that follow ("Bitmap Indexes", and "Data Compression").
That doesn't seem like a very good argument, now that I've typed it
out. The patch applies deduplication/compression/whatever at the point
where we'd otherwise have to split the page, unlike GIN. GIN eagerly
maintains posting lists (doing in-place updates for most insertions
seems pretty bad to me). My argument could reasonably be made about
GIN, which really does consider posting lists the natural way to store
duplicate tuples. I cannot really make that argument about nbtree with
this patch, though -- delaying a page split by re-encoding tuples
(changing their physical representation without changing their logical
contents) justifies using the word "compression" in the name.
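To illustrate what "changing their physical representation without changing their logical contents" could look like, here is a minimal sketch that folds runs of equal keys into one entry per key plus a TID count. It assumes the input is already sorted by key and then by TID. The names DemoTid, DemoItem, DemoPosting and demo_dedup are hypothetical and far simpler than the real code in nbtdedup.c.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for illustration; these are not the nbtree structs. */
typedef struct { uint32_t block; uint16_t offset; } DemoTid;
typedef struct { int32_t key; DemoTid tid; } DemoItem;

/* One merged entry: the key, plus which slice of items[] holds its TIDs. */
typedef struct { int32_t key; size_t first; size_t ntids; } DemoPosting;

/*
 * Fold runs of equal keys in items[] (sorted by key, then TID) into
 * posting-list descriptors. out[] must have room for nitems entries.
 * Returns the number of descriptors written. Every input TID stays
 * reachable through exactly one descriptor, so the logical contents
 * are unchanged.
 */
size_t
demo_dedup(const DemoItem *items, size_t nitems, DemoPosting *out)
{
    size_t nout = 0;

    assert(items != NULL || nitems == 0);

    for (size_t i = 0; i < nitems; i++)
    {
        if (nout > 0 && out[nout - 1].key == items[i].key)
        {
            /* Same key as the previous item: extend the current list. */
            out[nout - 1].ntids++;
        }
        else
        {
            /* New key: start a new posting list at position i. */
            out[nout].key = items[i].key;
            out[nout].first = i;
            out[nout].ntids = 1;
            nout++;
        }
    }
    return nout;
}
```

In this sketch the key is stored once per run, which is where the space saving would come from. The patch does this work lazily, at the point where a page split would otherwise be needed.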
Can you suggest an alternative?
My instinct is to pick a name that somehow involves compression and
just put enough other words in there to make it clear e.g. duplicate
value compression, or something of that sort.
Does anyone else want to weigh in on this? Anastasia?
I will go along with whatever the consensus is. I'm very close to the
problem we're trying to solve, which probably isn't helping me here.
--
Peter Geoghegan
On Wed, Sep 11, 2019 at 2:04 PM Peter Geoghegan <pg@bowt.ie> wrote:
I haven't measured how these changes affect WAL size yet.
Do you have any suggestions on how to automate testing of new WAL records?
Is there any suitable place in regression tests?
I don't know about the regression tests (I doubt that there is a
natural place for such a test), but I came up with a rough test case.
I more or less copied the approach that you took with the index build
WAL reduction patches, though I also figured out a way of subtracting
heapam WAL overhead to get a real figure. I attach the test case --
note that you'll need to use the "land" database with this. (This test
case might need to be improved, but it's a good start.)
Today I used a test script similar to the "nbtree_wal_test.sql" test
script I posted on September 11th. I am concerned about the WAL
overhead for cases that don't benefit from the patch (usually because
they turn off deduplication altogether). The details of the index
tested were different this time, though. I used an index that had the
smallest possible tuple size: 16 bytes (this is the smallest possible
size on 64-bit systems, but that's what almost everybody uses these
days). So any index with one or two int4 columns (or one int8 column)
will generally have 16 byte IndexTuples, at least when there are no
NULLs in the index. In general, 16 byte wide tuples are very, very
common.
What I saw suggests that we will need to remove the new "postingoff"
field from xl_btree_insert. (We can create a new XLog record for leaf
page inserts that also need to split a posting list, without changing
much else.)
The way that *alignment* of WAL records affects these common 16 byte
IndexTuple cases is the real problem. Adding "postingoff" to
xl_btree_insert increases the WAL required for INSERT_LEAF records by
two bytes (sizeof(OffsetNumber)), as you'd expect -- pg_waldump output
shows that they're 66 bytes, whereas they're only 64 bytes on the
master branch. That doesn't sound that bad, but once you consider the
alignment of whole records, it's really an extra 8 bytes. That is
totally unacceptable. The vast majority of nbtree WAL records are
bound to be INSERT_LEAF records, so as things stand we have added
(almost) 12.5% space overhead to nbtree for these common cases, which
don't benefit.
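The alignment arithmetic can be sketched as follows, assuming 8-byte maximum alignment as on typical 64-bit platforms. The helpers maxalign8 and aligned_growth are made-up names for illustration only; they mimic the rounding that PostgreSQL's MAXALIGN macro performs and are not part of the patch.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Round len up to the next multiple of 8. This mirrors MAXALIGN under the
 * assumption that the platform's maximum alignment is 8 bytes.
 */
size_t
maxalign8(size_t len)
{
    return (len + 7) & ~(size_t) 7;
}

/*
 * Extra space consumed per record, after alignment, when the raw record
 * size grows from old_len to new_len.
 */
size_t
aligned_growth(size_t old_len, size_t new_len)
{
    assert(new_len >= old_len);
    return maxalign8(new_len) - maxalign8(old_len);
}
```

With the sizes reported by pg_waldump above, a 64 byte record is already aligned, while a 66 byte record rounds up to 72 bytes, so aligned_growth(64, 66) is 8. Eight extra bytes on a 64 byte record is 8 / 64 = 12.5%.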
I haven't really looked into other types of WAL record just yet. The
real world overhead that we're adding to xl_btree_vacuum records is
something that I will have to look into separately. I'm already pretty
sure that adding two bytes to xl_btree_split is okay, though, because
they're far less numerous than xl_btree_insert records, and aren't
affected by alignment in the same way (they're already several hundred
bytes in almost all cases).
I also noticed something positive: The overhead of xl_btree_dedup WAL
records seems to be very low with indexes that have hundreds of
logical tuples for each distinct integer value. We don't seem to have
a problem with "deduplication thrashing".
--
Peter Geoghegan
On 11/13/19 11:51 AM, Peter Geoghegan wrote:
Can you suggest an alternative?
Dupression
--
Mark Dilger
On Fri, Nov 15, 2019 at 5:43 PM Mark Dilger <hornschnorter@gmail.com> wrote:
On 11/13/19 11:51 AM, Peter Geoghegan wrote:
Can you suggest an alternative?
Dupression
This suggestion makes me feel better about "deduplication".
--
Peter Geoghegan
On Sun, Sep 15, 2019 at 3:47 AM Oleg Bartunov <obartunov@postgrespro.ru> wrote:
Is it worth making provision for an ability to control how
duplicates are sorted?
Duplicates will continue to be sorted based on TID, in effect. We want
to preserve the ability to perform retail index tuple deletion. I
believe that that will become important in the future.
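One way to see why TID order helps with locating a single tuple: if the TIDs for a key are kept sorted, a specific TID can be found by binary search. The sketch below is only an illustration of that idea under simplified assumptions. SketchTid, sketch_tid_cmp and sketch_find_tid are hypothetical names, and this is not how the patch or any future retail deletion code is implemented.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical simplified TID: heap block number plus offset within it. */
typedef struct { uint32_t block; uint16_t offset; } SketchTid;

/* Order TIDs by block number first, then by offset. */
int
sketch_tid_cmp(const SketchTid *a, const SketchTid *b)
{
    if (a->block != b->block)
        return a->block < b->block ? -1 : 1;
    if (a->offset != b->offset)
        return a->offset < b->offset ? -1 : 1;
    return 0;
}

/*
 * Binary search for target in tids[0..ntids), which must be sorted with
 * sketch_tid_cmp. Returns the matching index, or -1 if it is not present.
 */
long
sketch_find_tid(const SketchTid *tids, size_t ntids, SketchTid target)
{
    size_t lo = 0;
    size_t hi = ntids;

    assert(tids != NULL || ntids == 0);

    while (lo < hi)
    {
        size_t mid = lo + (hi - lo) / 2;
        int    cmp = sketch_tid_cmp(&tids[mid], &target);

        if (cmp == 0)
            return (long) mid;
        if (cmp < 0)
            lo = mid + 1;   /* target sorts after tids[mid] */
        else
            hi = mid;       /* target sorts before tids[mid] */
    }
    return -1;
}
```

A user-controlled sort order for duplicates would break the sortedness that this kind of lookup relies on.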
If we speak about GIN, why not take into
account our experiments with RUM (https://github.com/postgrespro/rum)?
FWIW, I think that it's confusing that RUM almost shares its name with
the "RUM conjecture":
http://daslab.seas.harvard.edu/rum-conjecture/
--
Peter Geoghegan
Moin,
On 2019-11-16 01:04, Peter Geoghegan wrote:
On Fri, Nov 15, 2019 at 5:16 AM Robert Haas <robertmhaas@gmail.com>
wrote:
Hmm. Well, maybe I'm just behind the times. But that same wikipedia
article also says that deduplication works on large chunks "such as
entire files or large sections of files" thus differentiating it from
compression algorithms which work on the byte level, so it seems to me
that what you are doing still sounds more like ad-hoc compression.
I see your point.
One reason for my avoiding the word "compression" is that other DB
systems that have something similar don't use the word compression
either. Actually, they don't really call it *anything*. Posting lists
are simply the way that secondary indexes work. The "Modern B-Tree
techniques" book/survey paper mentions the idea of using a TID list in
its "3.7 Duplicate Key Values" section, not in the two related
sections that follow ("Bitmap Indexes", and "Data Compression").
That doesn't seem like a very good argument, now that I've typed it
out. The patch applies deduplication/compression/whatever at the point
where we'd otherwise have to split the page, unlike GIN. GIN eagerly
maintains posting lists (doing in-place updates for most insertions
seems pretty bad to me). My argument could reasonably be made about
GIN, which really does consider posting lists the natural way to store
duplicate tuples. I cannot really make that argument about nbtree with
this patch, though -- delaying a page split by re-encoding tuples
(changing their physical representation without changing their logical
contents) justifies using the word "compression" in the name.
Can you suggest an alternative?
My instinct is to pick a name that somehow involves compression and
just put enough other words in there to make it clear e.g. duplicate
value compression, or something of that sort.
Does anyone else want to weigh in on this? Anastasia?
I will go along with whatever the consensus is. I'm very close to the
problem we're trying to solve, which probably isn't helping me here.
I'm in favor of deduplication and not compression. Compression is a more
generic term and can involve deduplication, but it doesn't have to. (It
could for instance just encode things in a more compact form.)
Deduplication, by contrast, does not involve compression; it just means
storing multiple identical things once, which by coincidence also amounts
to using less space, like compression can.
ZFS also follows this by having both deduplication (store the same
blocks only once with references) and compression (compress block
contents, regardless of whether they are stored once or many times).
So my vote is for deduplication (if I understand the thread correctly
this is what the code now does, by storing the exact same key not that
many times but only once with references or a count?).
best regards,
Tels
On Fri, Nov 15, 2019 at 5:02 PM Peter Geoghegan <pg@bowt.ie> wrote:
What I saw suggests that we will need to remove the new "postingoff"
field from xl_btree_insert. (We can create a new XLog record for leaf
page inserts that also need to split a posting list, without changing
much else.)
Attached is v24. This revision doesn't fix the problem with
xl_btree_insert record bloat, but it does fix the bitrot against the
master branch that was caused by commit 50d22de9. (This patch has had
a surprisingly large number of conflicts against the master branch
recently.)
Other changes:
* The pageinspect patch has been cleaned up. I now propose that it be
committed alongside the main patch.
The big change here is that posting lists are represented as an array
of TIDs within bt_page_items(), much like gin_leafpage_items(). Also
added documentation that goes into the ways in which ctid can be used
to encode information (arguably some of this should have been included
with the Postgres 12 B-Tree work).
* Basic tests that cover deduplication within unique indexes. We ought
to have code coverage of the case where _bt_check_unique() has to step
right (actually, we don't have that on the master branch either).
--
Peter Geoghegan
Attachments:
v24-0002-Teach-pageinspect-about-nbtree-posting-lists.patch (application/octet-stream)
From b9835f1bf8426b50cbd0fc0b0804101f91efc9a6 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Mon, 10 Sep 2018 19:53:51 -0700
Subject: [PATCH v24 2/2] Teach pageinspect about nbtree posting lists.
Add a column for posting list TIDs to bt_page_items(). Also add a
column that displays a single heap TID value for each tuple, regardless
of whether or not "ctid" is used for heap TID. In the case of posting
list tuples, the value is the lowest heap TID in the posting list.
Arguably I should have done this when commit dd299df8 went in, since
that added a pivot tuple representation that could have a heap TID but
didn't use ctid for that purpose.
Also add a boolean column that displays the LP_DEAD bit value for each
non-pivot tuple.
No version bump for the pageinspect extension, since there hasn't been a
stable release since the last version bump (see commit 58b4cb30).
---
contrib/pageinspect/btreefuncs.c | 110 +++++++++++++++---
contrib/pageinspect/expected/btree.out | 6 +
contrib/pageinspect/pageinspect--1.7--1.8.sql | 36 ++++++
doc/src/sgml/pageinspect.sgml | 80 +++++++------
4 files changed, 180 insertions(+), 52 deletions(-)
diff --git a/contrib/pageinspect/btreefuncs.c b/contrib/pageinspect/btreefuncs.c
index 78cdc69ec7..418eef032d 100644
--- a/contrib/pageinspect/btreefuncs.c
+++ b/contrib/pageinspect/btreefuncs.c
@@ -31,9 +31,11 @@
#include "access/relation.h"
#include "catalog/namespace.h"
#include "catalog/pg_am.h"
+#include "catalog/pg_type.h"
#include "funcapi.h"
#include "miscadmin.h"
#include "pageinspect.h"
+#include "utils/array.h"
#include "utils/builtins.h"
#include "utils/rel.h"
#include "utils/varlena.h"
@@ -45,6 +47,8 @@ PG_FUNCTION_INFO_V1(bt_page_stats);
#define IS_INDEX(r) ((r)->rd_rel->relkind == RELKIND_INDEX)
#define IS_BTREE(r) ((r)->rd_rel->relam == BTREE_AM_OID)
+#define DatumGetItemPointer(X) ((ItemPointer) DatumGetPointer(X))
+#define ItemPointerGetDatum(X) PointerGetDatum(X)
/* note: BlockNumber is unsigned, hence can't be negative */
#define CHECK_RELATION_BLOCK_RANGE(rel, blkno) { \
@@ -243,6 +247,9 @@ struct user_args
{
Page page;
OffsetNumber offset;
+ bool leafpage;
+ bool rightmost;
+ TupleDesc tupd;
};
/*-------------------------------------------------------
@@ -252,17 +259,24 @@ struct user_args
* ------------------------------------------------------
*/
static Datum
-bt_page_print_tuples(FuncCallContext *fctx, Page page, OffsetNumber offset)
+bt_page_print_tuples(FuncCallContext *fctx, struct user_args *uargs)
{
- char *values[6];
+ Page page = uargs->page;
+ OffsetNumber offset = uargs->offset;
+ bool leafpage = uargs->leafpage;
+ bool rightmost = uargs->rightmost;
+ bool pivotoffset;
+ Datum values[9];
+ bool nulls[9];
HeapTuple tuple;
ItemId id;
IndexTuple itup;
int j;
int off;
int dlen;
- char *dump;
+ char *dump, *datacstring;
char *ptr;
+ ItemPointer htid;
id = PageGetItemId(page, offset);
@@ -272,18 +286,27 @@ bt_page_print_tuples(FuncCallContext *fctx, Page page, OffsetNumber offset)
itup = (IndexTuple) PageGetItem(page, id);
j = 0;
- values[j++] = psprintf("%d", offset);
- values[j++] = psprintf("(%u,%u)",
- ItemPointerGetBlockNumberNoCheck(&itup->t_tid),
- ItemPointerGetOffsetNumberNoCheck(&itup->t_tid));
- values[j++] = psprintf("%d", (int) IndexTupleSize(itup));
- values[j++] = psprintf("%c", IndexTupleHasNulls(itup) ? 't' : 'f');
- values[j++] = psprintf("%c", IndexTupleHasVarwidths(itup) ? 't' : 'f');
+ memset(nulls, 0, sizeof(nulls));
+ values[j++] = Int16GetDatum(offset);
+ values[j++] = ItemPointerGetDatum(&itup->t_tid);
+ values[j++] = Int32GetDatum((int) IndexTupleSize(itup));
+ values[j++] = BoolGetDatum(IndexTupleHasNulls(itup));
+ values[j++] = BoolGetDatum(IndexTupleHasVarwidths(itup));
ptr = (char *) itup + IndexInfoFindDataOffset(itup->t_info);
dlen = IndexTupleSize(itup) - IndexInfoFindDataOffset(itup->t_info);
+
+ /*
+ * Make sure that "data" column does not include posting list or pivot
+ * heap tuple representation
+ */
+ if (BTreeTupleIsPosting(itup))
+ dlen -= IndexTupleSize(itup) - BTreeTupleGetPostingOffset(itup);
+ else if (BTreeTupleIsPivot(itup) && BTreeTupleGetHeapTID(itup) != NULL)
+ dlen -= MAXALIGN(sizeof(ItemPointerData));
+
dump = palloc0(dlen * 3 + 1);
- values[j] = dump;
+ datacstring = dump;
for (off = 0; off < dlen; off++)
{
if (off > 0)
@@ -291,8 +314,57 @@ bt_page_print_tuples(FuncCallContext *fctx, Page page, OffsetNumber offset)
sprintf(dump, "%02x", *(ptr + off) & 0xff);
dump += 2;
}
+ values[j++] = CStringGetTextDatum(datacstring);
+ pfree(datacstring);
- tuple = BuildTupleFromCStrings(fctx->attinmeta, values);
+ /*
+ * Avoid indicating that pivot tuple from !heapkeyspace index (which won't
+ * have v4+ status bit set) is dead or has a heap TID -- that can only
+ * happen with non-pivot tuples. (Most backend code can use the
+ * heapkeyspace field from the metapage to figure out which representation
+ * to use, but we have to be a bit creative here.)
+ */
+ pivotoffset = (!leafpage || (!rightmost && offset == P_HIKEY));
+
+ /* LP_DEAD status bit */
+ if (!pivotoffset)
+ values[j++] = BoolGetDatum(ItemIdIsDead(id));
+ else
+ nulls[j++] = true;
+
+ htid = BTreeTupleGetHeapTID(itup);
+ if (pivotoffset && !BTreeTupleIsPivot(itup))
+ htid = NULL;
+
+ if (htid)
+ values[j++] = ItemPointerGetDatum(htid);
+ else
+ nulls[j++] = true;
+
+ if (BTreeTupleIsPosting(itup))
+ {
+ /* build an array of item pointers */
+ ItemPointer tids;
+ Datum *tids_datum;
+ int nposting;
+
+ tids = BTreeTupleGetPosting(itup);
+ nposting = BTreeTupleGetNPosting(itup);
+ tids_datum = (Datum *) palloc(nposting * sizeof(Datum));
+ for (int i = 0; i < nposting; i++)
+ tids_datum[i] = ItemPointerGetDatum(&tids[i]);
+ values[j++] = PointerGetDatum(construct_array(tids_datum,
+ nposting,
+ TIDOID,
+ sizeof(ItemPointerData),
+ false, 's'));
+ pfree(tids_datum);
+ }
+ else
+ nulls[j++] = true;
+
+ /* Build and return the result tuple */
+ tuple = heap_form_tuple(uargs->tupd, values, nulls);
return HeapTupleGetDatum(tuple);
}
@@ -378,12 +450,13 @@ bt_page_items(PG_FUNCTION_ARGS)
elog(NOTICE, "page is deleted");
fctx->max_calls = PageGetMaxOffsetNumber(uargs->page);
+ uargs->leafpage = P_ISLEAF(opaque);
+ uargs->rightmost = P_RIGHTMOST(opaque);
/* Build a tuple descriptor for our result type */
if (get_call_result_type(fcinfo, NULL, &tupleDesc) != TYPEFUNC_COMPOSITE)
elog(ERROR, "return type must be a row type");
-
- fctx->attinmeta = TupleDescGetAttInMetadata(tupleDesc);
+ uargs->tupd = tupleDesc;
fctx->user_fctx = uargs;
@@ -395,7 +468,7 @@ bt_page_items(PG_FUNCTION_ARGS)
if (fctx->call_cntr < fctx->max_calls)
{
- result = bt_page_print_tuples(fctx, uargs->page, uargs->offset);
+ result = bt_page_print_tuples(fctx, uargs);
uargs->offset++;
SRF_RETURN_NEXT(fctx, result);
}
@@ -463,12 +536,13 @@ bt_page_items_bytea(PG_FUNCTION_ARGS)
elog(NOTICE, "page is deleted");
fctx->max_calls = PageGetMaxOffsetNumber(uargs->page);
+ uargs->leafpage = P_ISLEAF(opaque);
+ uargs->rightmost = P_RIGHTMOST(opaque);
/* Build a tuple descriptor for our result type */
if (get_call_result_type(fcinfo, NULL, &tupleDesc) != TYPEFUNC_COMPOSITE)
elog(ERROR, "return type must be a row type");
-
- fctx->attinmeta = TupleDescGetAttInMetadata(tupleDesc);
+ uargs->tupd = tupleDesc;
fctx->user_fctx = uargs;
@@ -480,7 +554,7 @@ bt_page_items_bytea(PG_FUNCTION_ARGS)
if (fctx->call_cntr < fctx->max_calls)
{
- result = bt_page_print_tuples(fctx, uargs->page, uargs->offset);
+ result = bt_page_print_tuples(fctx, uargs);
uargs->offset++;
SRF_RETURN_NEXT(fctx, result);
}
diff --git a/contrib/pageinspect/expected/btree.out b/contrib/pageinspect/expected/btree.out
index 07c2dcd771..1d45cd5c1e 100644
--- a/contrib/pageinspect/expected/btree.out
+++ b/contrib/pageinspect/expected/btree.out
@@ -41,6 +41,9 @@ itemlen | 16
nulls | f
vars | f
data | 01 00 00 00 00 00 00 01
+dead | f
+htid | (0,1)
+tids |
SELECT * FROM bt_page_items('test1_a_idx', 2);
ERROR: block number out of range
@@ -54,6 +57,9 @@ itemlen | 16
nulls | f
vars | f
data | 01 00 00 00 00 00 00 01
+dead | f
+htid | (0,1)
+tids |
SELECT * FROM bt_page_items(get_raw_page('test1_a_idx', 2));
ERROR: block number 2 is out of range for relation "test1_a_idx"
diff --git a/contrib/pageinspect/pageinspect--1.7--1.8.sql b/contrib/pageinspect/pageinspect--1.7--1.8.sql
index 2a7c4b3516..70f1ab0467 100644
--- a/contrib/pageinspect/pageinspect--1.7--1.8.sql
+++ b/contrib/pageinspect/pageinspect--1.7--1.8.sql
@@ -14,3 +14,39 @@ CREATE FUNCTION heap_tuple_infomask_flags(
RETURNS record
AS 'MODULE_PATHNAME', 'heap_tuple_infomask_flags'
LANGUAGE C STRICT PARALLEL SAFE;
+
+--
+-- bt_page_items(text, int4)
+--
+DROP FUNCTION bt_page_items(text, int4);
+CREATE FUNCTION bt_page_items(IN relname text, IN blkno int4,
+ OUT itemoffset smallint,
+ OUT ctid tid,
+ OUT itemlen smallint,
+ OUT nulls bool,
+ OUT vars bool,
+ OUT data text,
+ OUT dead boolean,
+ OUT htid tid,
+ OUT tids tid[])
+RETURNS SETOF record
+AS 'MODULE_PATHNAME', 'bt_page_items'
+LANGUAGE C STRICT PARALLEL SAFE;
+
+--
+-- bt_page_items(bytea)
+--
+DROP FUNCTION bt_page_items(bytea);
+CREATE FUNCTION bt_page_items(IN page bytea,
+ OUT itemoffset smallint,
+ OUT ctid tid,
+ OUT itemlen smallint,
+ OUT nulls bool,
+ OUT vars bool,
+ OUT data text,
+ OUT dead boolean,
+ OUT htid tid,
+ OUT tids tid[])
+RETURNS SETOF record
+AS 'MODULE_PATHNAME', 'bt_page_items_bytea'
+LANGUAGE C STRICT PARALLEL SAFE;
diff --git a/doc/src/sgml/pageinspect.sgml b/doc/src/sgml/pageinspect.sgml
index 7e2e1487d7..1763e9c6f0 100644
--- a/doc/src/sgml/pageinspect.sgml
+++ b/doc/src/sgml/pageinspect.sgml
@@ -329,11 +329,11 @@ test=# SELECT * FROM bt_page_stats('pg_cast_oid_index', 1);
-[ RECORD 1 ]-+-----
blkno | 1
type | l
-live_items | 256
+live_items | 224
dead_items | 0
-avg_item_size | 12
+avg_item_size | 16
page_size | 8192
-free_size | 4056
+free_size | 3668
btpo_prev | 0
btpo_next | 0
btpo | 0
@@ -356,33 +356,45 @@ btpo_flags | 3
<function>bt_page_items</function> returns detailed information about
all of the items on a B-tree index page. For example:
<screen>
-test=# SELECT * FROM bt_page_items('pg_cast_oid_index', 1);
- itemoffset | ctid | itemlen | nulls | vars | data
-------------+---------+---------+-------+------+-------------
- 1 | (0,1) | 12 | f | f | 23 27 00 00
- 2 | (0,2) | 12 | f | f | 24 27 00 00
- 3 | (0,3) | 12 | f | f | 25 27 00 00
- 4 | (0,4) | 12 | f | f | 26 27 00 00
- 5 | (0,5) | 12 | f | f | 27 27 00 00
- 6 | (0,6) | 12 | f | f | 28 27 00 00
- 7 | (0,7) | 12 | f | f | 29 27 00 00
- 8 | (0,8) | 12 | f | f | 2a 27 00 00
+regression=# SELECT * FROM bt_page_items('tenk2_unique1', 5);
+ itemoffset | ctid | itemlen | nulls | vars | data | dead | htid | tids
+------------+----------+---------+-------+------+-------------------------+------+----------+------
+ 1 | (40,1) | 16 | f | f | b8 05 00 00 00 00 00 00 | | |
+ 2 | (58,11) | 16 | f | f | 4a 04 00 00 00 00 00 00 | f | (58,11) |
+ 3 | (266,4) | 16 | f | f | 4b 04 00 00 00 00 00 00 | f | (266,4) |
+ 4 | (279,25) | 16 | f | f | 4c 04 00 00 00 00 00 00 | f | (279,25) |
+ 5 | (333,11) | 16 | f | f | 4d 04 00 00 00 00 00 00 | f | (333,11) |
+ 6 | (87,24) | 16 | f | f | 4e 04 00 00 00 00 00 00 | f | (87,24) |
+ 7 | (38,22) | 16 | f | f | 4f 04 00 00 00 00 00 00 | f | (38,22) |
+ 8 | (272,17) | 16 | f | f | 50 04 00 00 00 00 00 00 | f | (272,17) |
</screen>
- In a B-tree leaf page, <structfield>ctid</structfield> points to a heap tuple.
- In an internal page, the block number part of <structfield>ctid</structfield>
- points to another page in the index itself, while the offset part
- (the second number) is ignored and is usually 1.
+ In a B-tree leaf page, <structfield>ctid</structfield> usually
+ points to a heap tuple, and <structfield>dead</structfield> may
+ indicate that the item has its <literal>LP_DEAD</literal> bit
+ set. In an internal page, the block number part of
+ <structfield>ctid</structfield> points to another page in the
+ index itself, while the offset part (the second number) encodes
+ metadata about the tuple. Posting list tuples on leaf pages
+ also use <structfield>ctid</structfield> for metadata.
+ <structfield>htid</structfield> always shows a single heap TID
+ for the tuple, regardless of how it is represented (internal
+ page tuples may need to store a heap TID when there are many
+ duplicate tuples on descendent leaf pages).
+ <structfield>tids</structfield> is a list of TIDs that is stored
+ within posting list tuples (tuples created by deduplication).
</para>
<para>
Note that the first item on any non-rightmost page (any page with
a non-zero value in the <structfield>btpo_next</structfield> field) is the
page's <quote>high key</quote>, meaning its <structfield>data</structfield>
serves as an upper bound on all items appearing on the page, while
- its <structfield>ctid</structfield> field is meaningless. Also, on non-leaf
- pages, the first real data item (the first item that is not a high
- key) is a <quote>minus infinity</quote> item, with no actual value
- in its <structfield>data</structfield> field. Such an item does have a valid
- downlink in its <structfield>ctid</structfield> field, however.
+ its <structfield>ctid</structfield> field does not point to
+ another block. Also, on non-leaf pages, the first real data item
+ (the first item that is not a high key) is a <quote>minus
+ infinity</quote> item, with no actual value in its
+ <structfield>data</structfield> field. Such an item does have a
+ valid downlink in its <structfield>ctid</structfield> field,
+ however.
</para>
</listitem>
</varlistentry>
@@ -402,17 +414,17 @@ test=# SELECT * FROM bt_page_items('pg_cast_oid_index', 1);
with <function>get_raw_page</function> should be passed as argument. So
the last example could also be rewritten like this:
<screen>
-test=# SELECT * FROM bt_page_items(get_raw_page('pg_cast_oid_index', 1));
- itemoffset | ctid | itemlen | nulls | vars | data
-------------+---------+---------+-------+------+-------------
- 1 | (0,1) | 12 | f | f | 23 27 00 00
- 2 | (0,2) | 12 | f | f | 24 27 00 00
- 3 | (0,3) | 12 | f | f | 25 27 00 00
- 4 | (0,4) | 12 | f | f | 26 27 00 00
- 5 | (0,5) | 12 | f | f | 27 27 00 00
- 6 | (0,6) | 12 | f | f | 28 27 00 00
- 7 | (0,7) | 12 | f | f | 29 27 00 00
- 8 | (0,8) | 12 | f | f | 2a 27 00 00
+regression=# SELECT * FROM bt_page_items(get_raw_page('tenk2_unique1', 5));
+ itemoffset | ctid | itemlen | nulls | vars | data | dead | htid | tids
+------------+----------+---------+-------+------+-------------------------+------+----------+------
+ 1 | (40,1) | 16 | f | f | b8 05 00 00 00 00 00 00 | | |
+ 2 | (58,11) | 16 | f | f | 4a 04 00 00 00 00 00 00 | f | (58,11) |
+ 3 | (266,4) | 16 | f | f | 4b 04 00 00 00 00 00 00 | f | (266,4) |
+ 4 | (279,25) | 16 | f | f | 4c 04 00 00 00 00 00 00 | f | (279,25) |
+ 5 | (333,11) | 16 | f | f | 4d 04 00 00 00 00 00 00 | f | (333,11) |
+ 6 | (87,24) | 16 | f | f | 4e 04 00 00 00 00 00 00 | f | (87,24) |
+ 7 | (38,22) | 16 | f | f | 4f 04 00 00 00 00 00 00 | f | (38,22) |
+ 8 | (272,17) | 16 | f | f | 50 04 00 00 00 00 00 00 | f | (272,17) |
</screen>
All the other details are the same as explained in the previous item.
</para>
--
2.17.1
v24-0001-Add-deduplication-to-nbtree.patch (application/octet-stream)
From 7c77d41afd91d2021948fd03be82129b9452b9a5 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Wed, 25 Sep 2019 10:08:53 -0700
Subject: [PATCH v24 1/2] Add deduplication to nbtree
---
src/include/access/nbtree.h | 333 ++++++++--
src/include/access/nbtxlog.h | 68 ++-
src/include/access/rmgrlist.h | 2 +-
src/backend/access/common/reloptions.c | 11 +-
src/backend/access/index/genam.c | 4 +
src/backend/access/nbtree/Makefile | 1 +
src/backend/access/nbtree/README | 74 ++-
src/backend/access/nbtree/nbtdedup.c | 710 ++++++++++++++++++++++
src/backend/access/nbtree/nbtinsert.c | 321 +++++++++-
src/backend/access/nbtree/nbtpage.c | 211 ++++++-
src/backend/access/nbtree/nbtree.c | 174 +++++-
src/backend/access/nbtree/nbtsearch.c | 250 +++++++-
src/backend/access/nbtree/nbtsort.c | 209 ++++++-
src/backend/access/nbtree/nbtsplitloc.c | 38 +-
src/backend/access/nbtree/nbtutils.c | 216 ++++++-
src/backend/access/nbtree/nbtxlog.c | 218 ++++++-
src/backend/access/rmgrdesc/nbtdesc.c | 28 +-
src/bin/psql/tab-complete.c | 4 +-
contrib/amcheck/verify_nbtree.c | 180 ++++--
doc/src/sgml/btree.sgml | 48 +-
doc/src/sgml/charset.sgml | 9 +-
doc/src/sgml/ref/create_index.sgml | 43 +-
doc/src/sgml/ref/reindex.sgml | 5 +-
src/test/regress/expected/btree_index.out | 16 +
src/test/regress/sql/btree_index.sql | 17 +
25 files changed, 2945 insertions(+), 245 deletions(-)
create mode 100644 src/backend/access/nbtree/nbtdedup.c
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 4a80e84aa7..1c82357e0d 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -23,6 +23,36 @@
#include "storage/bufmgr.h"
#include "storage/shm_toc.h"
+/*
+ * Storage type for Btree's reloptions
+ */
+typedef struct BtreeOptions
+{
+ int32 vl_len_; /* varlena header (do not touch directly!) */
+ int fillfactor; /* leaf fillfactor */
+ double vacuum_cleanup_index_scale_factor;
+ bool deduplication; /* Use deduplication where safe? */
+} BtreeOptions;
+
+/*
+ * Deduplication is enabled for non-unique indexes and disabled for unique
+ * indexes by default
+ */
+#define BtreeDefaultDoDedup(relation) \
+ ((relation)->rd_index->indisunique ? false : true)
+
+#define BtreeGetDoDedupOption(relation) \
+ ((relation)->rd_options ? \
+ ((BtreeOptions *) (relation)->rd_options)->deduplication : \
+ BtreeDefaultDoDedup(relation))
+
+#define BtreeGetFillFactor(relation, defaultff) \
+ ((relation)->rd_options ? \
+ ((BtreeOptions *) (relation)->rd_options)->fillfactor : (defaultff))
+
+#define BtreeGetTargetPageFreeSpace(relation, defaultff) \
+ (BLCKSZ * (100 - BtreeGetFillFactor(relation, defaultff)) / 100)
+
/* There's room for a 16-bit vacuum cycle ID in BTPageOpaqueData */
typedef uint16 BTCycleId;
@@ -107,6 +137,7 @@ typedef struct BTMetaPageData
* pages */
float8 btm_last_cleanup_num_heap_tuples; /* number of heap tuples
* during last cleanup */
+ bool btm_safededup; /* deduplication known to be safe? */
} BTMetaPageData;
#define BTPageGetMeta(p) \
@@ -114,7 +145,8 @@ typedef struct BTMetaPageData
/*
* The current Btree version is 4. That's what you'll get when you create
- * a new index.
+ * a new index. The btm_safededup field can only be set if this happened
+ * on Postgres 13, but it's safe to read with version 3 indexes.
*
* Btree version 3 was used in PostgreSQL v11. It is mostly the same as
* version 4, but heap TIDs were not part of the keyspace. Index tuples
@@ -131,8 +163,8 @@ typedef struct BTMetaPageData
#define BTREE_METAPAGE 0 /* first page is meta */
#define BTREE_MAGIC 0x053162 /* magic number in metapage */
#define BTREE_VERSION 4 /* current version number */
-#define BTREE_MIN_VERSION 2 /* minimal supported version number */
-#define BTREE_NOVAC_VERSION 3 /* minimal version with all meta fields */
+#define BTREE_MIN_VERSION 2 /* minimum supported version */
+#define BTREE_NOVAC_VERSION 3 /* version with all meta fields set */
/*
* Maximum size of a btree index entry, including its tuple header.
@@ -154,6 +186,26 @@ typedef struct BTMetaPageData
MAXALIGN_DOWN((PageGetPageSize(page) - \
MAXALIGN(SizeOfPageHeaderData + 3*sizeof(ItemIdData)) - \
MAXALIGN(sizeof(BTPageOpaqueData))) / 3)
+/*
+ * MaxBTreeIndexTuplesPerPage is an upper bound on the number of "logical"
+ * tuples that may be stored on a btree leaf page. This is comparable to
+ * the generic/physical MaxIndexTuplesPerPage upper bound. A separate
+ * upper bound is needed in certain contexts due to posting list tuples,
+ * which only use a single physical page entry to store many logical
+ * tuples. (MaxBTreeIndexTuplesPerPage is used to size the per-page
+ * temporary buffers used by index scans.)
+ *
+ * Note: we don't bother considering per-physical-tuple overheads here to
+ * keep things simple (value is based on how many elements a single array
+ * of heap TIDs must have to fill the space between the page header and
+ * special area). The value is slightly higher (i.e. more conservative)
+ * than necessary as a result, which is considered acceptable. There will
+ * only be three (very large) physical posting list tuples in leaf pages
+ * that have the largest possible number of heap TIDs/logical tuples.
+ */
+#define MaxBTreeIndexTuplesPerPage \
+ (int) ((BLCKSZ - SizeOfPageHeaderData - sizeof(BTPageOpaqueData)) / \
+ sizeof(ItemPointerData))
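
To make the MaxBTreeIndexTuplesPerPage bound above concrete, here is a small standalone sketch of the same arithmetic. The sizes passed in the usage note (8192-byte BLCKSZ, 24-byte page header, 16-byte BTPageOpaqueData, 6-byte ItemPointerData) are assumptions about a typical default build, not values stated in the patch, and sketch_max_logical_tuples is a hypothetical helper rather than anything in nbtree.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper mirroring the MaxBTreeIndexTuplesPerPage formula:
 * how many heap TIDs fit between the page header and the special area
 * if the whole space were one array of TIDs (per-tuple overhead ignored).
 */
static int
sketch_max_logical_tuples(size_t blcksz, size_t page_header_sz,
                          size_t special_sz, size_t tid_sz)
{
    assert(tid_sz > 0);
    assert(blcksz > page_header_sz + special_sz);

    return (int) ((blcksz - page_header_sz - special_sz) / tid_sz);
}
```

Under those assumed sizes, sketch_max_logical_tuples(8192, 24, 16, 6) works out to 1358, since (8192 - 24 - 16) / 6 rounds down to 1358. That is the number of BTScanPosItem slots a scan would need per page in the worst case.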
/*
* The leaf-page fillfactor defaults to 90% but is user-adjustable.
@@ -229,16 +281,15 @@ typedef struct BTMetaPageData
* tuples (non-pivot tuples). _bt_check_natts() enforces the rules
* described here.
*
- * Non-pivot tuple format:
+ * Non-pivot tuple format (plain/non-posting variant):
*
* t_tid | t_info | key values | INCLUDE columns, if any
*
* t_tid points to the heap TID, which is a tiebreaker key column as of
- * BTREE_VERSION 4. Currently, the INDEX_ALT_TID_MASK status bit is never
- * set for non-pivot tuples.
+ * BTREE_VERSION 4.
*
- * All other types of index tuples ("pivot" tuples) only have key columns,
- * since pivot tuples only exist to represent how the key space is
+ * Non-pivot tuples complement pivot tuples, which only have key columns.
+ * The sole purpose of pivot tuples is to represent how the key space is
* separated. In general, any B-Tree index that has more than one level
* (i.e. any index that does not just consist of a metapage and a single
* leaf root page) must have some number of pivot tuples, since pivot
@@ -282,20 +333,103 @@ typedef struct BTMetaPageData
* future use. BT_N_KEYS_OFFSET_MASK should be large enough to store any
* number of columns/attributes <= INDEX_MAX_KEYS.
*
+ * Sometimes non-pivot tuples also use a representation that repurposes
+ * t_tid to store metadata rather than a TID. Postgres 13 introduced a new
+ * non-pivot tuple format to support deduplication: posting list tuples.
+ * Deduplication folds together multiple equal non-pivot tuples into a
+ * logically equivalent, space efficient representation. A posting list is
+ * an array of ItemPointerData elements. Regular non-pivot tuples are
+ * merged together to form posting list tuples lazily, at the point where
+ * we'd otherwise have to split a leaf page.
+ *
+ * Posting tuple format (alternative non-pivot tuple representation):
+ *
+ * t_tid | t_info | key values | posting list (TID array)
+ *
+ * Posting list tuples are recognized as such by having the
+ * INDEX_ALT_TID_MASK status bit set in t_info and the BT_IS_POSTING status
+ * bit set in t_tid. These flags redefine the content of the posting
+ * tuple's t_tid to store an offset to the posting list, as well as the
+ * total number of posting list array elements.
+ *
+ * The 12 least significant offset bits from t_tid are used to represent
+ * the number of posting items present in the tuple, leaving 4 status
+ * bits (BT_RESERVED_OFFSET_MASK bits), 3 of which are reserved for
+ * future use. Like any non-pivot tuple, the number of columns stored is
+ * always implicitly the total number in the index (in practice there can
+ * never be non-key columns stored, since deduplication is not supported
+ * with INCLUDE indexes).
+ *
* Note well: The macros that deal with the number of attributes in tuples
- * assume that a tuple with INDEX_ALT_TID_MASK set must be a pivot tuple,
- * and that a tuple without INDEX_ALT_TID_MASK set must be a non-pivot
- * tuple (or must have the same number of attributes as the index has
- * generally in the case of !heapkeyspace indexes). They will need to be
- * updated if non-pivot tuples ever get taught to use INDEX_ALT_TID_MASK
- * for something else.
+ * assume that a tuple with INDEX_ALT_TID_MASK set must be a pivot tuple or
+ * non-pivot posting tuple, and that a tuple without INDEX_ALT_TID_MASK set
+ * must be a non-pivot tuple (or must have the same number of attributes as
+ * the index has generally in the case of !heapkeyspace indexes).
*/
#define INDEX_ALT_TID_MASK INDEX_AM_RESERVED_BIT
/* Item pointer offset bits */
#define BT_RESERVED_OFFSET_MASK 0xF000
#define BT_N_KEYS_OFFSET_MASK 0x0FFF
+#define BT_N_POSTING_OFFSET_MASK 0x0FFF
#define BT_HEAP_TID_ATTR 0x1000
+#define BT_IS_POSTING 0x2000
+
+/*
+ * N.B.: BTreeTupleIsPivot() should only be used in code that deals with
+ * heapkeyspace indexes specifically. BTreeTupleIsPosting() works with all
+ * nbtree indexes, though.
+ */
+#define BTreeTupleIsPivot(itup) \
+ ( \
+ ((itup)->t_info & INDEX_ALT_TID_MASK && \
+ ((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) == 0))\
+ )
+#define BTreeTupleIsPosting(itup) \
+ ( \
+ ((itup)->t_info & INDEX_ALT_TID_MASK && \
+ ((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) != 0))\
+ )
+
+#define BTreeTupleClearBtIsPosting(itup) \
+ do { \
+ ItemPointerSetOffsetNumber(&(itup)->t_tid, \
+ ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & ~BT_IS_POSTING); \
+ } while(0)
+
+#define BTreeTupleGetNPosting(itup) \
+ ( \
+ AssertMacro(BTreeTupleIsPosting(itup)), \
+ ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_N_POSTING_OFFSET_MASK \
+ )
+#define BTreeTupleSetNPosting(itup, n) \
+ do { \
+ ItemPointerSetOffsetNumber(&(itup)->t_tid, (n) & BT_N_POSTING_OFFSET_MASK); \
+ Assert((itup)->t_info & INDEX_ALT_TID_MASK); \
+ Assert(!((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) != 0)); \
+ ItemPointerSetOffsetNumber(&(itup)->t_tid, \
+ ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) | BT_IS_POSTING); \
+ } while(0)
+
+/*
+ * If tuple is posting, t_tid.ip_blkid contains offset of the posting list
+ */
+#define BTreeTupleGetPostingOffset(itup) \
+ ( \
+ AssertMacro(BTreeTupleIsPosting(itup)), \
+ ItemPointerGetBlockNumberNoCheck(&((itup)->t_tid)) \
+ )
+#define BTreeSetPostingMeta(itup, nposting, off) \
+ do { \
+ BTreeTupleSetNPosting(itup, nposting); \
+ Assert(BTreeTupleIsPosting(itup)); \
+ ItemPointerSetBlockNumber(&((itup)->t_tid), (off)); \
+ } while(0)
+
+#define BTreeTupleGetPosting(itup) \
+ (ItemPointer) ((char*) (itup) + BTreeTupleGetPostingOffset(itup))
+#define BTreeTupleGetPostingN(itup,n) \
+ (BTreeTupleGetPosting(itup) + (n))
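
The way the posting tuple macros above repurpose t_tid can be sketched in isolation. This is only an illustration: SketchTid and the sketch_* functions are hypothetical stand-ins for ItemPointerData and the BTreeTuple* macros, and the sketch leaves out the INDEX_ALT_TID_MASK bit in t_info that the real macros also check.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Same bit values as BT_N_POSTING_OFFSET_MASK and BT_IS_POSTING above */
#define SKETCH_N_POSTING_MASK 0x0FFF
#define SKETCH_IS_POSTING     0x2000

/* Hypothetical stand-in for ItemPointerData: block number + offset number */
typedef struct SketchTid
{
    uint32_t block;     /* posting tuples: byte offset of the TID array */
    uint16_t offset;    /* posting tuples: status bits + number of TIDs */
} SketchTid;

/* Store the TID count in the low 12 bits and flag the tuple as posting */
static void
sketch_set_posting_meta(SketchTid *tid, unsigned nposting, uint32_t posting_off)
{
    assert(nposting <= SKETCH_N_POSTING_MASK);
    tid->offset = (uint16_t) ((nposting & SKETCH_N_POSTING_MASK) |
                              SKETCH_IS_POSTING);
    tid->block = posting_off;
}

static bool
sketch_is_posting(const SketchTid *tid)
{
    return (tid->offset & SKETCH_IS_POSTING) != 0;
}

static unsigned
sketch_get_nposting(const SketchTid *tid)
{
    return tid->offset & SKETCH_N_POSTING_MASK;
}

static uint32_t
sketch_get_posting_offset(const SketchTid *tid)
{
    return tid->block;
}
```

With 12 bits for the count, a single posting tuple can describe at most 4095 heap TIDs in this encoding, though the 1/3-of-a-page tuple size limit is reached well before that on default-sized pages.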
/* Get/set downlink block number */
#define BTreeInnerTupleGetDownLink(itup) \
@@ -326,40 +460,69 @@ typedef struct BTMetaPageData
*/
#define BTreeTupleGetNAtts(itup, rel) \
( \
- (itup)->t_info & INDEX_ALT_TID_MASK ? \
+ (BTreeTupleIsPivot(itup)) ? \
( \
ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_N_KEYS_OFFSET_MASK \
) \
: \
IndexRelationGetNumberOfAttributes(rel) \
)
-#define BTreeTupleSetNAtts(itup, n) \
- do { \
- (itup)->t_info |= INDEX_ALT_TID_MASK; \
- ItemPointerSetOffsetNumber(&(itup)->t_tid, (n) & BT_N_KEYS_OFFSET_MASK); \
- } while(0)
+
+static inline void
+BTreeTupleSetNAtts(IndexTuple itup, int n)
+{
+ Assert(!BTreeTupleIsPosting(itup));
+ itup->t_info |= INDEX_ALT_TID_MASK;
+ ItemPointerSetOffsetNumber(&itup->t_tid, n & BT_N_KEYS_OFFSET_MASK);
+}
/*
- * Get tiebreaker heap TID attribute, if any. Macro works with both pivot
- * and non-pivot tuples, despite differences in how heap TID is represented.
+ * Get tiebreaker heap TID attribute, if any.
+ *
+ * This returns the first/lowest heap TID in the case of a posting list tuple.
*/
-#define BTreeTupleGetHeapTID(itup) \
- ( \
- (itup)->t_info & INDEX_ALT_TID_MASK && \
- (ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_HEAP_TID_ATTR) != 0 ? \
- ( \
- (ItemPointer) (((char *) (itup) + IndexTupleSize(itup)) - \
- sizeof(ItemPointerData)) \
- ) \
- : (itup)->t_info & INDEX_ALT_TID_MASK ? NULL : (ItemPointer) &((itup)->t_tid) \
- )
+static inline ItemPointer
+BTreeTupleGetHeapTID(IndexTuple itup)
+{
+ if (BTreeTupleIsPivot(itup))
+ {
+ /* Pivot tuple heap TID representation? */
+ if ((ItemPointerGetOffsetNumberNoCheck(&itup->t_tid) &
+ BT_HEAP_TID_ATTR) != 0)
+ return (ItemPointer) ((char *) itup + IndexTupleSize(itup) -
+ sizeof(ItemPointerData));
+
+ /* Heap TID attribute was truncated */
+ return NULL;
+ }
+ else if (BTreeTupleIsPosting(itup))
+ return BTreeTupleGetPosting(itup);
+
+ return &(itup->t_tid);
+}
+
/*
- * Set the heap TID attribute for a tuple that uses the INDEX_ALT_TID_MASK
- * representation (currently limited to pivot tuples)
+ * Get maximum heap TID attribute, which could be the only TID in the case of
+ * a non-pivot tuple that does not have a posting list. Works with
+ * non-pivot tuples only.
+ */
+static inline ItemPointer
+BTreeTupleGetMaxHeapTID(IndexTuple itup)
+{
+ Assert(!BTreeTupleIsPivot(itup));
+
+ if (BTreeTupleIsPosting(itup))
+ return BTreeTupleGetPosting(itup) + (BTreeTupleGetNPosting(itup) - 1);
+
+ return &(itup->t_tid);
+}
+
+/*
+ * Set the heap TID attribute for a pivot tuple
*/
#define BTreeTupleSetAltHeapTID(itup) \
do { \
- Assert((itup)->t_info & INDEX_ALT_TID_MASK); \
+ Assert(BTreeTupleIsPivot(itup)); \
ItemPointerSetOffsetNumber(&(itup)->t_tid, \
ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) | BT_HEAP_TID_ATTR); \
} while(0)
@@ -434,6 +597,11 @@ typedef BTStackData *BTStack;
* indexes whose version is >= version 4. It's convenient to keep this close
* by, rather than accessing the metapage repeatedly.
*
+ * safededup is set to indicate that index may use dynamic deduplication
+ * safely (index storage parameter separately indicates if deduplication is
+ * currently in use). This is also a property of the index relation rather
+ * than an indexscan that is kept around for convenience.
+ *
* anynullkeys indicates if any of the keys had NULL value when scankey was
* built from index tuple (note that already-truncated tuple key attributes
* set NULL as a placeholder key value, which also affects value of
@@ -469,6 +637,7 @@ typedef BTStackData *BTStack;
typedef struct BTScanInsertData
{
bool heapkeyspace;
+ bool safededup;
bool anynullkeys;
bool nextkey;
bool pivotsearch;
@@ -507,6 +676,13 @@ typedef struct BTInsertStateData
bool bounds_valid;
OffsetNumber low;
OffsetNumber stricthigh;
+
+ /*
+ * If _bt_binsrch_insert found the location inside an existing posting
+ * list, save the position inside the list. This will be -1 in rare cases
+ * where the overlapping posting list is LP_DEAD.
+ */
+ int postingoff;
} BTInsertStateData;
typedef BTInsertStateData *BTInsertState;
@@ -534,7 +710,10 @@ typedef BTInsertStateData *BTInsertState;
* If we are doing an index-only scan, we save the entire IndexTuple for each
* matched item, otherwise only its heap TID and offset. The IndexTuples go
* into a separate workspace array; each BTScanPosItem stores its tuple's
- * offset within that array.
+ * offset within that array. Posting list tuples store a "base" tuple once,
+ * allowing the same key to be returned for each logical tuple associated
+ * with the physical posting list tuple (i.e. for each TID from the posting
+ * list).
*/
typedef struct BTScanPosItem /* what we remember about each match */
@@ -567,6 +746,12 @@ typedef struct BTScanPosData
*/
int nextTupleOffset;
+ /*
+ * Posting list tuples use postingTupleOffset to store the current
+ * location of the tuple that is returned multiple times.
+ */
+ int postingTupleOffset;
+
/*
* The items array is always ordered in index order (ie, increasing
* indexoffset). When scanning backwards it is convenient to fill the
@@ -578,7 +763,7 @@ typedef struct BTScanPosData
int lastItem; /* last valid index in items[] */
int itemIndex; /* current index in items[] */
- BTScanPosItem items[MaxIndexTuplesPerPage]; /* MUST BE LAST */
+ BTScanPosItem items[MaxBTreeIndexTuplesPerPage]; /* MUST BE LAST */
} BTScanPosData;
typedef BTScanPosData *BTScanPos;
@@ -680,6 +865,57 @@ typedef BTScanOpaqueData *BTScanOpaque;
#define SK_BT_DESC (INDOPTION_DESC << SK_BT_INDOPTION_SHIFT)
#define SK_BT_NULLS_FIRST (INDOPTION_NULLS_FIRST << SK_BT_INDOPTION_SHIFT)
+/*
+ * State used to represent a pending posting list during deduplication.
+ *
+ * Each entry represents a group of consecutive items from the page, starting
+ * from page offset number 'baseoff', which is the offset number of the "base"
+ * tuple on the page undergoing deduplication. 'nitems' is the total number
+ * of items from the page that will be merged to make a new posting tuple.
+ *
+ * Note: 'nitems' means the number of physical index tuples/line pointers on
+ * the page, starting with and including the item at offset number 'baseoff'
+ * (so nitems should be at least 2 when interval is used). These existing
+ * tuples may be posting list tuples or regular tuples.
+ */
+typedef struct BTDedupInterval
+{
+ OffsetNumber baseoff;
+ OffsetNumber nitems;
+} BTDedupInterval;
+
+/*
+ * Btree-private state used to deduplicate items on a leaf page
+ */
+typedef struct BTDedupState
+{
+ Relation rel;
+ /* Deduplication status info for entire page/operation */
+ Size maxitemsize; /* Limit on size of final tuple */
+ IndexTuple newitem;
+ bool checkingunique; /* Use unique index strategy? */
+ OffsetNumber skippedbase; /* First offset skipped by checkingunique */
+
+ /* Metadata about current pending posting list */
+ ItemPointer htids; /* Heap TIDs in pending posting list */
+ int nhtids; /* # heap TIDs in htids array */
+ int nitems; /* See BTDedupInterval definition */
+ Size alltupsize; /* Includes line pointer overhead */
+ bool overlap; /* Avoid overlapping posting lists? */
+
+ /* Metadata about base tuple of current pending posting list */
+ IndexTuple base; /* Use to form new posting list */
+ OffsetNumber baseoff; /* page offset of base */
+ Size basetupsize; /* base size without posting list */
+
+ /*
+ * Pending posting list. Contains information about a group of
+ * consecutive items that will be deduplicated by creating a new posting
+ * list tuple.
+ */
+ BTDedupInterval interval;
+} BTDedupState;
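
As a rough picture of what a BTDedupInterval describes, the sketch below walks an array of keys in page order and reports each run of two or more equal keys as a (baseoff, nitems) pair. It is deliberately simplified and is not how nbtdedup.c is written: keys are plain ints, offsets simply start at 1, and there is no handling of the high key, tuple size limits, or existing posting lists. SketchInterval and sketch_find_intervals are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for BTDedupInterval */
typedef struct SketchInterval
{
    unsigned baseoff;   /* 1-based offset of the first ("base") item */
    unsigned nitems;    /* number of consecutive items with an equal key */
} SketchInterval;

/*
 * Scan keys[] (already sorted, as on a leaf page) and record every run of
 * at least two equal keys.  Returns the number of intervals written to out[].
 */
static size_t
sketch_find_intervals(const int *keys, size_t nkeys,
                      SketchInterval *out, size_t maxout)
{
    size_t nout = 0;
    size_t i = 0;

    while (i < nkeys)
    {
        size_t j = i + 1;

        /* Extend the run while the next key equals the base key */
        while (j < nkeys && keys[j] == keys[i])
            j++;

        /* A run of one item is not worth merging */
        if (j - i >= 2 && nout < maxout)
        {
            out[nout].baseoff = (unsigned) (i + 1);
            out[nout].nitems = (unsigned) (j - i);
            nout++;
        }
        i = j;
    }
    return nout;
}
```

For keys 1, 2, 2, 2, 3, 4, 4 this yields two intervals: one starting at offset 2 covering 3 items, and one starting at offset 6 covering 2 items.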
+
/*
* Constant definition for progress reporting. Phase numbers must match
* btbuildphasename.
@@ -725,6 +961,22 @@ extern void _bt_parallel_release(IndexScanDesc scan, BlockNumber scan_page);
extern void _bt_parallel_done(IndexScanDesc scan);
extern void _bt_parallel_advance_array_keys(IndexScanDesc scan);
+/*
+ * prototypes for functions in nbtdedup.c
+ */
+extern void _bt_dedup_one_page(Relation rel, Buffer buffer, Relation heapRel,
+ IndexTuple newitem, Size newitemsz,
+ bool checkingunique);
+extern void _bt_dedup_start_pending(BTDedupState *state, IndexTuple base,
+ OffsetNumber base_off);
+extern bool _bt_dedup_save_htid(BTDedupState *state, IndexTuple itup);
+extern Size _bt_dedup_finish_pending(Buffer buffer, BTDedupState *state,
+ bool need_wal);
+extern IndexTuple _bt_form_posting(IndexTuple tuple, ItemPointer htids,
+ int nhtids);
+extern IndexTuple _bt_swap_posting(IndexTuple newitem, IndexTuple oposting,
+ int postingoff);
+
/*
* prototypes for functions in nbtinsert.c
*/
@@ -743,7 +995,8 @@ extern OffsetNumber _bt_findsplitloc(Relation rel, Page page,
/*
* prototypes for functions in nbtpage.c
*/
-extern void _bt_initmetapage(Page page, BlockNumber rootbknum, uint32 level);
+extern void _bt_initmetapage(Page page, BlockNumber rootbknum, uint32 level,
+ bool safededup);
extern void _bt_update_meta_cleanup_info(Relation rel,
TransactionId oldestBtpoXact, float8 numHeapTuples);
extern void _bt_upgrademetapage(Page page);
@@ -751,6 +1004,7 @@ extern Buffer _bt_getroot(Relation rel, int access);
extern Buffer _bt_gettrueroot(Relation rel);
extern int _bt_getrootheight(Relation rel);
extern bool _bt_heapkeyspace(Relation rel);
+extern bool _bt_safededup(Relation rel);
extern void _bt_checkpage(Relation rel, Buffer buf);
extern Buffer _bt_getbuf(Relation rel, BlockNumber blkno, int access);
extern Buffer _bt_relandgetbuf(Relation rel, Buffer obuf,
@@ -762,6 +1016,8 @@ extern void _bt_delitems_delete(Relation rel, Buffer buf,
OffsetNumber *itemnos, int nitems, Relation heapRel);
extern void _bt_delitems_vacuum(Relation rel, Buffer buf,
OffsetNumber *itemnos, int nitems,
+ OffsetNumber *updateitemnos,
+ IndexTuple *updated, int nupdateable,
BlockNumber lastBlockVacuumed);
extern int _bt_pagedel(Relation rel, Buffer buf);
@@ -812,6 +1068,7 @@ extern bool _bt_check_natts(Relation rel, bool heapkeyspace, Page page,
OffsetNumber offnum);
extern void _bt_check_third_page(Relation rel, Relation heap,
bool needheaptidspace, Page page, IndexTuple newtup);
+extern bool _bt_opclasses_support_dedup(Relation index);
/*
* prototypes for functions in nbtvalidate.c
diff --git a/src/include/access/nbtxlog.h b/src/include/access/nbtxlog.h
index 91b9ee00cf..b21e6f8082 100644
--- a/src/include/access/nbtxlog.h
+++ b/src/include/access/nbtxlog.h
@@ -28,7 +28,8 @@
#define XLOG_BTREE_INSERT_META 0x20 /* same, plus update metapage */
#define XLOG_BTREE_SPLIT_L 0x30 /* add index tuple with split */
#define XLOG_BTREE_SPLIT_R 0x40 /* as above, new item on right */
-/* 0x50 and 0x60 are unused */
+#define XLOG_BTREE_DEDUP_PAGE 0x50 /* deduplicate tuples on leaf page */
+/* 0x60 is unused */
#define XLOG_BTREE_DELETE 0x70 /* delete leaf index tuples for a page */
#define XLOG_BTREE_UNLINK_PAGE 0x80 /* delete a half-dead page */
#define XLOG_BTREE_UNLINK_PAGE_META 0x90 /* same, and update metapage */
@@ -53,6 +54,7 @@ typedef struct xl_btree_metadata
uint32 fastlevel;
TransactionId oldest_btpo_xact;
float8 last_cleanup_num_heap_tuples;
+ bool btm_safededup;
} xl_btree_metadata;
/*
@@ -61,16 +63,21 @@ typedef struct xl_btree_metadata
* This data record is used for INSERT_LEAF, INSERT_UPPER, INSERT_META.
* Note that INSERT_META implies it's not a leaf page.
*
- * Backup Blk 0: original page (data contains the inserted tuple)
+ * Backup Blk 0: original page (data contains the inserted tuple);
+ * if postingoff is set, this started out as an insertion
+ * into an existing posting tuple at the offset before
+ * offnum (i.e. it's a posting list split). (REDO will
+ * have to update split posting list, too.)
* Backup Blk 1: child's left sibling, if INSERT_UPPER or INSERT_META
* Backup Blk 2: xl_btree_metadata, if INSERT_META
*/
typedef struct xl_btree_insert
{
OffsetNumber offnum;
+ OffsetNumber postingoff;
} xl_btree_insert;
-#define SizeOfBtreeInsert (offsetof(xl_btree_insert, offnum) + sizeof(OffsetNumber))
+#define SizeOfBtreeInsert (offsetof(xl_btree_insert, postingoff) + sizeof(OffsetNumber))
/*
* On insert with split, we save all the items going into the right sibling
@@ -91,9 +98,19 @@ typedef struct xl_btree_insert
*
* Backup Blk 0: original page / new left page
*
- * The left page's data portion contains the new item, if it's the _L variant.
- * An IndexTuple representing the high key of the left page must follow with
- * either variant.
+ * The left page's data portion contains the new item, if it's the _L variant
+ * (though _R variant page split records with a posting list split sometimes
+ * need to include newitem). An IndexTuple representing the high key of the
+ * left page must follow in all cases.
+ *
+ * The newitem is actually an "original" newitem when a posting list split
+ * occurs that requires that the original posting list be updated in passing.
+ * Recovery recognizes this case when postingoff is set, and must use the
+ * posting offset to do an in-place update of the existing posting list that
+ * was actually split, and change the newitem to the "final" newitem. This
+ * corresponds to the xl_btree_insert postingoff-is-set case. postingoff
+ * won't be set when a posting list split occurs where both original posting
+ * list and newitem go on the right page.
*
* Backup Blk 1: new right page
*
@@ -111,10 +128,26 @@ typedef struct xl_btree_split
{
uint32 level; /* tree level of page being split */
OffsetNumber firstright; /* first item moved to right page */
- OffsetNumber newitemoff; /* new item's offset (useful for _L variant) */
+ OffsetNumber newitemoff; /* new item's offset */
+ OffsetNumber postingoff; /* offset inside orig posting tuple */
} xl_btree_split;
-#define SizeOfBtreeSplit (offsetof(xl_btree_split, newitemoff) + sizeof(OffsetNumber))
+#define SizeOfBtreeSplit (offsetof(xl_btree_split, postingoff) + sizeof(OffsetNumber))
+
+/*
+ * When a page is deduplicated, consecutive groups of tuples with equal keys
+ * are merged together into posting list tuples.
+ *
+ * The WAL record represents the interval that describes the posting tuple
+ * that should be added to the page.
+ */
+typedef struct xl_btree_dedup
+{
+ OffsetNumber baseoff;
+ OffsetNumber nitems;
+} xl_btree_dedup;
+
+#define SizeOfBtreeDedup (offsetof(xl_btree_dedup, nitems) + sizeof(OffsetNumber))
/*
* This is what we need to know about delete of individual leaf index tuples.
@@ -166,16 +199,27 @@ typedef struct xl_btree_reuse_page
* block numbers aren't given.
*
* Note that the *last* WAL record in any vacuum of an index is allowed to
- * have a zero length array of offsets. Earlier records must have at least one.
+ * have a zero length array of target offsets (i.e. no deletes or updates).
+ * Earlier records must have at least one.
*/
typedef struct xl_btree_vacuum
{
BlockNumber lastBlockVacuumed;
- /* TARGET OFFSET NUMBERS FOLLOW */
+ /*
+ * This field helps us find the beginning of the updated versions of
+ * tuples, which follow the array of offset numbers. It is needed when a
+ * posting list is vacuumed without killing all of its logical tuples.
+ */
+ uint32 nupdated;
+ uint32 ndeleted;
+
+ /* UPDATED TARGET OFFSET NUMBERS FOLLOW (if any) */
+ /* UPDATED TUPLES TO ADD BACK FOLLOW (if any) */
+ /* DELETED TARGET OFFSET NUMBERS FOLLOW (if any) */
} xl_btree_vacuum;
-#define SizeOfBtreeVacuum (offsetof(xl_btree_vacuum, lastBlockVacuumed) + sizeof(BlockNumber))
+#define SizeOfBtreeVacuum (offsetof(xl_btree_vacuum, ndeleted) + sizeof(uint32))
/*
* This is what we need to know about marking an empty branch for deletion.
@@ -256,6 +300,8 @@ typedef struct xl_btree_newroot
extern void btree_redo(XLogReaderState *record);
extern void btree_desc(StringInfo buf, XLogReaderState *record);
extern const char *btree_identify(uint8 info);
+extern void btree_xlog_startup(void);
+extern void btree_xlog_cleanup(void);
extern void btree_mask(char *pagedata, BlockNumber blkno);
#endif /* NBTXLOG_H */
diff --git a/src/include/access/rmgrlist.h b/src/include/access/rmgrlist.h
index 3c0db2ccf5..2b8c6c7fc8 100644
--- a/src/include/access/rmgrlist.h
+++ b/src/include/access/rmgrlist.h
@@ -36,7 +36,7 @@ PG_RMGR(RM_RELMAP_ID, "RelMap", relmap_redo, relmap_desc, relmap_identify, NULL,
PG_RMGR(RM_STANDBY_ID, "Standby", standby_redo, standby_desc, standby_identify, NULL, NULL, NULL)
PG_RMGR(RM_HEAP2_ID, "Heap2", heap2_redo, heap2_desc, heap2_identify, NULL, NULL, heap_mask)
PG_RMGR(RM_HEAP_ID, "Heap", heap_redo, heap_desc, heap_identify, NULL, NULL, heap_mask)
-PG_RMGR(RM_BTREE_ID, "Btree", btree_redo, btree_desc, btree_identify, NULL, NULL, btree_mask)
+PG_RMGR(RM_BTREE_ID, "Btree", btree_redo, btree_desc, btree_identify, btree_xlog_startup, btree_xlog_cleanup, btree_mask)
PG_RMGR(RM_HASH_ID, "Hash", hash_redo, hash_desc, hash_identify, NULL, NULL, hash_mask)
PG_RMGR(RM_GIN_ID, "Gin", gin_redo, gin_desc, gin_identify, gin_xlog_startup, gin_xlog_cleanup, gin_mask)
PG_RMGR(RM_GIST_ID, "Gist", gist_redo, gist_desc, gist_identify, gist_xlog_startup, gist_xlog_cleanup, gist_mask)
diff --git a/src/backend/access/common/reloptions.c b/src/backend/access/common/reloptions.c
index 3f22a6c354..8535e4210b 100644
--- a/src/backend/access/common/reloptions.c
+++ b/src/backend/access/common/reloptions.c
@@ -158,6 +158,15 @@ static relopt_bool boolRelOpts[] =
},
true
},
+ {
+ {
+ "deduplication",
+ "Enables deduplication on btree index leaf pages",
+ RELOPT_KIND_BTREE,
+ ShareUpdateExclusiveLock
+ },
+ true
+ },
/* list terminator */
{{NULL}}
};
@@ -1521,8 +1530,6 @@ default_reloptions(Datum reloptions, bool validate, relopt_kind kind)
offsetof(StdRdOptions, user_catalog_table)},
{"parallel_workers", RELOPT_TYPE_INT,
offsetof(StdRdOptions, parallel_workers)},
- {"vacuum_cleanup_index_scale_factor", RELOPT_TYPE_REAL,
- offsetof(StdRdOptions, vacuum_cleanup_index_scale_factor)},
{"vacuum_index_cleanup", RELOPT_TYPE_BOOL,
offsetof(StdRdOptions, vacuum_index_cleanup)},
{"vacuum_truncate", RELOPT_TYPE_BOOL,
diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
index 2599b5d342..6e1dc596e1 100644
--- a/src/backend/access/index/genam.c
+++ b/src/backend/access/index/genam.c
@@ -276,6 +276,10 @@ BuildIndexValueDescription(Relation indexRelation,
/*
* Get the latestRemovedXid from the table entries pointed at by the index
* tuples being deleted.
+ *
+ * Note: index access methods that don't consistently use the standard
+ * IndexTuple + heap TID item pointer representation will need to provide
+ * their own version of this function.
*/
TransactionId
index_compute_xid_horizon_for_tuples(Relation irel,
diff --git a/src/backend/access/nbtree/Makefile b/src/backend/access/nbtree/Makefile
index bf245f5dab..d69808e78c 100644
--- a/src/backend/access/nbtree/Makefile
+++ b/src/backend/access/nbtree/Makefile
@@ -14,6 +14,7 @@ include $(top_builddir)/src/Makefile.global
OBJS = \
nbtcompare.o \
+ nbtdedup.o \
nbtinsert.o \
nbtpage.o \
nbtree.o \
diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index 6db203e75c..54cb9db49d 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -432,7 +432,10 @@ because we allow LP_DEAD to be set with only a share lock (it's exactly
like a hint bit for a heap tuple), but physically removing tuples requires
exclusive lock. In the current code we try to remove LP_DEAD tuples when
we are otherwise faced with having to split a page to do an insertion (and
-hence have exclusive lock on it already).
+hence have exclusive lock on it already). Deduplication can also prevent
+a page split, but removing LP_DEAD tuples is the preferred approach.
+(Note that posting list tuples can only have their LP_DEAD bit set when
+every "logical" tuple represented within the posting list is known dead.)
This leaves the index in a state where it has no entry for a dead tuple
that still exists in the heap. This is not a problem for the current
@@ -710,6 +713,75 @@ the fallback strategy assumes that duplicates are mostly inserted in
ascending heap TID order. The page is split in a way that leaves the left
half of the page mostly full, and the right half of the page mostly empty.
+Notes about deduplication
+-------------------------
+
+We deduplicate non-pivot tuples in non-unique indexes to reduce storage
+overhead, and to avoid or at least delay page splits. Deduplication alters
+the physical representation of tuples without changing the logical contents
+of the index, and without adding overhead to read queries. Non-pivot
+tuples are folded together into a single physical tuple with a posting list
+(a simple array of heap TIDs with the standard item pointer format).
+Deduplication is always applied lazily, at the point where it would
+otherwise be necessary to perform a page split. It occurs only after
+LP_DEAD items have been removed, as our last line of defense against
+splitting a leaf page. We can set the LP_DEAD bit with posting list
+tuples, though only when all table tuples are known dead. (Bitmap scans
+cannot perform LP_DEAD bit setting, and are the common case with indexes
+that contain lots of duplicates, so this downside is considered
+acceptable.)
+
+Large groups of logical duplicates tend to appear together on the same leaf
+page due to the special duplicate logic used when choosing a split point.
+This facilitates lazy/dynamic deduplication. Deduplication can reliably
+deduplicate a large localized group of duplicates before it can span
+multiple leaf pages. Posting list tuples are subject to the same 1/3 of a
+page restriction as any other tuple.
+
+Lazy deduplication allows the page space accounting used during page splits
+to have absolutely minimal special case logic for posting lists. A posting
+list can be thought of as extra payload that suffix truncation will
+reliably truncate away as needed during page splits, just like non-key
+columns from an INCLUDE index tuple. An incoming tuple (which might cause
+a page split) can always be thought of as a non-posting-list tuple that
+must be inserted alongside existing items, without needing to consider
+deduplication. Most of the time, that's what actually happens: incoming
+tuples are either not duplicates, or are duplicates with a heap TID that
+doesn't overlap with any existing posting list tuple. When the incoming
+tuple really does overlap with an existing posting list, a posting list
+split is performed. Posting list splits work in a way that more or less
+preserves the illusion that all incoming tuples do not need to be merged
+with any existing posting list tuple.
+
+Posting list splits work by "overriding" the details of the incoming tuple.
+The heap TID of the incoming tuple is altered to make it match the
+rightmost heap TID from the existing/originally overlapping posting list.
+The offset number that the new/incoming tuple is to be inserted at is
+incremented so that it will be inserted to the right of the existing
+posting list. The insertion (or page split) operation that completes the
+insert does one extra step: an in-place update of the posting list. The
+update changes the posting list such that the "true" heap TID from the
+original incoming tuple is now contained in the posting list. We make
+space in the posting list by removing the heap TID that became the new
+item. The size of the posting list won't change, and so the page split
+space accounting does not need to care about posting lists. Also, overall
+space utilization is improved by keeping existing posting lists large.
+
+The representation of posting lists is identical to the posting lists used
+by GIN, so it would be straightforward to apply GIN's varbyte encoding
+compression scheme to individual posting lists. Posting list compression
+would break the assumptions made by posting list splits about page space
+accounting, though, so it's not clear how compression could be integrated
+with nbtree. Besides, posting list compression does not offer a compelling
+trade-off for nbtree, since in general nbtree is optimized for consistent
+performance with many concurrent readers and writers. A major goal of
+nbtree's lazy approach to deduplication is to limit the performance impact
+of deduplication with random updates. Even concurrent append-only inserts
+of the same key value will tend to have inserts of individual index tuples
+in an order that doesn't quite match heap TID order. In general, delaying
+deduplication avoids many unnecessary posting list splits, and minimizes
+page level fragmentation.
+
Notes About Data Representation
-------------------------------
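The posting list split described in the README text above can be illustrated outside of nbtree. The following is a minimal sketch, not the patch's implementation: DemoTid, demo_tid_cmp() and demo_posting_split() are hypothetical names, and heap TIDs are simplified to a (block, offset) pair. It shows only the core idea, namely that the incoming TID takes a slot inside the posting list while the list's original rightmost TID becomes the TID of the tuple that is actually inserted, so the posting list keeps its size.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical simplified heap TID, ordered by (block, offset) */
typedef struct DemoTid
{
	uint32_t	block;
	uint16_t	offset;
} DemoTid;

static int
demo_tid_cmp(const DemoTid *a, const DemoTid *b)
{
	if (a->block != b->block)
		return (a->block < b->block) ? -1 : 1;
	if (a->offset != b->offset)
		return (a->offset < b->offset) ? -1 : 1;
	return 0;
}

/*
 * Split a sorted posting list in place around an incoming TID.
 *
 * On success the incoming TID is stored inside the posting list, *incoming
 * is overwritten with the list's original rightmost TID, and the index of
 * the slot that received the incoming TID is returned.  The number of TIDs
 * in the list does not change.
 *
 * Returns -1 (and changes nothing) when the incoming TID does not fall
 * strictly inside the list's range, or is already present in the list.
 */
static int
demo_posting_split(DemoTid *posting, int ntids, DemoTid *incoming)
{
	DemoTid		rightmost;
	int			pos;

	if (ntids < 2 ||
		demo_tid_cmp(incoming, &posting[0]) <= 0 ||
		demo_tid_cmp(incoming, &posting[ntids - 1]) >= 0)
		return -1;

	/* Find the first existing TID that is not less than the incoming TID */
	pos = 1;
	while (demo_tid_cmp(&posting[pos], incoming) < 0)
		pos++;
	if (demo_tid_cmp(&posting[pos], incoming) == 0)
		return -1;

	/* Shift TIDs one place to the right, dropping the rightmost TID */
	rightmost = posting[ntids - 1];
	memmove(&posting[pos + 1], &posting[pos],
			(size_t) (ntids - pos - 1) * sizeof(DemoTid));

	/* Fill the gap, then hand the displaced TID back to the caller */
	posting[pos] = *incoming;
	*incoming = rightmost;

	return pos;
}
```

A caller would then insert a plain tuple carrying the returned (displaced) TID immediately to the right of the posting list, which is why the page space accounting can treat the insertion like any other.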
diff --git a/src/backend/access/nbtree/nbtdedup.c b/src/backend/access/nbtree/nbtdedup.c
new file mode 100644
index 0000000000..dde1d68d6f
--- /dev/null
+++ b/src/backend/access/nbtree/nbtdedup.c
@@ -0,0 +1,710 @@
+/*-------------------------------------------------------------------------
+ *
+ * nbtdedup.c
+ * Deduplicate items in Lehman and Yao btrees for Postgres.
+ *
+ * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ * src/backend/access/nbtree/nbtdedup.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/nbtree.h"
+#include "access/nbtxlog.h"
+#include "miscadmin.h"
+#include "utils/rel.h"
+
+
+/*
+ * Try to deduplicate items to free at least enough space to avoid a page
+ * split. This function should be called during insertion, only after LP_DEAD
+ * items were removed by _bt_vacuum_one_page() to prevent a page split.
+ * (We'll have to kill LP_DEAD items here when the page's BTP_HAS_GARBAGE hint
+ * was not set, but that should be rare.)
+ *
+ * The strategy for !checkingunique callers is to perform as much
+ * deduplication as possible to free as much space as possible now, since
+ * making it harder to set LP_DEAD bits is considered an acceptable price for
+ * not having to deduplicate the same page many times. It is unlikely that
+ * the items on the page will have their LP_DEAD bit set in the future, since
+ * that hasn't happened before now (besides, entire posting lists can still
+ * have their LP_DEAD bit set).
+ *
+ * The strategy for checkingunique callers is rather different, since the
+ * overall goal is different. Deduplication cooperates with and enhances
+ * garbage collection, especially the LP_DEAD bit setting that takes place in
+ * _bt_check_unique(). Deduplication does as little as possible while still
+ * preventing a page split for caller, since it's less likely that posting
+ * lists will have their LP_DEAD bit set. Deduplication avoids creating new
+ * posting lists with only two heap TIDs, and also avoids creating new posting
+ * lists from an existing posting list. Deduplication is only useful when it
+ * delays a page split long enough for garbage collection to prevent the page
+ * split altogether. checkingunique deduplication can make all the difference
+ * in cases where VACUUM keeps up with dead index tuples, but "recently dead"
+ * index tuples are still numerous enough to cause page splits that are truly
+ * unnecessary.
+ *
+ * Note: If newitem contains NULL values in key attributes, caller will be
+ * !checkingunique even when rel is a unique index. The page in question will
+ * usually have many existing items with NULLs.
+ */
+void
+_bt_dedup_one_page(Relation rel, Buffer buffer, Relation heapRel,
+ IndexTuple newitem, Size newitemsz, bool checkingunique)
+{
+ OffsetNumber offnum,
+ minoff,
+ maxoff;
+ Page page = BufferGetPage(buffer);
+ BTPageOpaque oopaque;
+ BTDedupState *state = NULL;
+ int natts = IndexRelationGetNumberOfAttributes(rel);
+ OffsetNumber deletable[MaxIndexTuplesPerPage];
+ bool minimal = checkingunique;
+ int ndeletable = 0;
+ Size pagesaving = 0;
+ int count = 0;
+ bool singlevalue = false;
+
+ oopaque = (BTPageOpaque) PageGetSpecialPointer(page);
+ /* init deduplication state needed to build posting tuples */
+ state = (BTDedupState *) palloc(sizeof(BTDedupState));
+ state->rel = rel;
+
+ state->maxitemsize = BTMaxItemSize(page);
+ state->newitem = newitem;
+ state->checkingunique = checkingunique;
+ state->skippedbase = InvalidOffsetNumber;
+ /* Metadata about current pending posting list */
+ state->htids = NULL;
+ state->nhtids = 0;
+ state->nitems = 0;
+ state->alltupsize = 0;
+ state->overlap = false;
+ /* Metadata about base tuple of current pending posting list */
+ state->base = NULL;
+ state->baseoff = InvalidOffsetNumber;
+ state->basetupsize = 0;
+
+ minoff = P_FIRSTDATAKEY(oopaque);
+ maxoff = PageGetMaxOffsetNumber(page);
+
+ /*
+ * Delete dead tuples if any. We cannot simply skip them in the loop
+ * below, because it's necessary to generate a special WAL record
+ * containing such tuples to compute latestRemovedXid on a standby server
+ * later.
+ *
+ * This should not affect performance, since it only can happen in a rare
+ * situation when BTP_HAS_GARBAGE flag was not set and _bt_vacuum_one_page
+ * was not called, or _bt_vacuum_one_page didn't remove all dead items.
+ */
+ for (offnum = minoff;
+ offnum <= maxoff;
+ offnum = OffsetNumberNext(offnum))
+ {
+ ItemId itemid = PageGetItemId(page, offnum);
+
+ if (ItemIdIsDead(itemid))
+ deletable[ndeletable++] = offnum;
+ }
+
+ if (ndeletable > 0)
+ {
+ /*
+ * Skip deduplication in rare cases where there were LP_DEAD items
+ * encountered here when that frees sufficient space for caller to
+ * avoid a page split
+ */
+ _bt_delitems_delete(rel, buffer, deletable, ndeletable, heapRel);
+ if (PageGetFreeSpace(page) >= newitemsz)
+ {
+ pfree(state);
+ return;
+ }
+
+ /* Continue with deduplication */
+ minoff = P_FIRSTDATAKEY(oopaque);
+ maxoff = PageGetMaxOffsetNumber(page);
+ }
+
+ /* Make sure that new page won't have garbage flag set */
+ oopaque->btpo_flags &= ~BTP_HAS_GARBAGE;
+
+ /* Passed-in newitemsz is MAXALIGNED but does not include line pointer */
+ newitemsz += sizeof(ItemIdData);
+ /* Conservatively size array */
+ state->htids = palloc(state->maxitemsize);
+
+ /*
+ * Determine if a "single value" strategy page split is likely to occur
+ * shortly after deduplication finishes. It should be possible for the
+ * single value split to find a split point that packs the left half of
+ * the split BTREE_SINGLEVAL_FILLFACTOR% full.
+ */
+ if (!checkingunique)
+ {
+ ItemId itemid;
+ IndexTuple itup;
+
+ itemid = PageGetItemId(page, minoff);
+ itup = (IndexTuple) PageGetItem(page, itemid);
+
+ if (_bt_keep_natts_fast(rel, newitem, itup) > natts)
+ {
+ itemid = PageGetItemId(page, PageGetMaxOffsetNumber(page));
+ itup = (IndexTuple) PageGetItem(page, itemid);
+
+ /*
+ * Use different strategy if future page split likely to need to
+ * use "single value" strategy
+ */
+ if (_bt_keep_natts_fast(rel, newitem, itup) > natts)
+ singlevalue = true;
+ }
+ }
+
+ /*
+ * Iterate over tuples on the page, try to deduplicate them into posting
+ * lists and insert into new page. NOTE: It's essential to reassess the
+ * max offset on each iteration, since it will change as items are
+ * deduplicated.
+ */
+ offnum = minoff;
+retry:
+ while (offnum <= PageGetMaxOffsetNumber(page))
+ {
+ ItemId itemid = PageGetItemId(page, offnum);
+ IndexTuple itup = (IndexTuple) PageGetItem(page, itemid);
+
+ Assert(!ItemIdIsDead(itemid));
+
+ if (state->nitems == 0)
+ {
+ /*
+ * No previous/base tuple for the data item -- use the data item
+ * as base tuple of pending posting list
+ */
+ _bt_dedup_start_pending(state, itup, offnum);
+ }
+ else if (_bt_keep_natts_fast(rel, state->base, itup) > natts &&
+ _bt_dedup_save_htid(state, itup))
+ {
+ /*
+ * Tuple is equal to base tuple of pending posting list. Heap
+ * TID(s) for itup have been saved in state. The next iteration
+ * will also end up here if it's possible to merge the next tuple
+ * into the same pending posting list.
+ */
+ }
+ else
+ {
+ /*
+ * Tuple is not equal to pending posting list tuple, or
+ * _bt_dedup_save_htid() opted to not merge current item into
+ * pending posting list for some other reason (e.g., adding more
+ * TIDs would have caused posting list to exceed BTMaxItemSize()
+ * limit).
+ *
+ * If state contains pending posting list with more than one item,
+ * form new posting tuple, and update the page. Otherwise, reset
+ * the state and move on.
+ */
+ pagesaving += _bt_dedup_finish_pending(buffer, state,
+ RelationNeedsWAL(rel));
+
+ count++;
+
+ /*
+ * When caller is a checkingunique caller and we have deduplicated
+ * enough to avoid a page split, do minimal deduplication in case
+ * the remaining items are about to be marked dead within
+ * _bt_check_unique().
+ */
+ if (minimal && pagesaving >= newitemsz)
+ break;
+
+ /*
+ * Consider special steps when a future page split of the leaf
+ * page is likely to occur using nbtsplitloc.c's "single value"
+ * strategy
+ */
+ if (singlevalue)
+ {
+ /*
+ * Adjust maxitemsize so that there isn't a third and final
+ * 1/3 of a page width tuple that fills the page to capacity.
+ * The third tuple produced should be smaller than the first
+ * two by an amount equal to the free space that nbtsplitloc.c
+ * is likely to want to leave behind when the page is split.
+ * When there are 3 posting lists on the page, then we end
+ * deduplication. Remaining tuples on the page can be
+ * deduplicated later, when they're on the new right sibling
+ * of this page, and the new sibling page needs to be split in
+ * turn.
+ *
+ * Note that it doesn't matter if there are items on the page
+ * that were already 1/3 of a page during current pass;
+ * they'll still count as the first two posting list tuples.
+ */
+ if (count == 2)
+ {
+ Size leftfree;
+
+ /* This calculation needs to match nbtsplitloc.c */
+ leftfree = PageGetPageSize(page) - SizeOfPageHeaderData -
+ MAXALIGN(sizeof(BTPageOpaqueData));
+ /* Subtract predicted size of new high key */
+ leftfree -= newitemsz + MAXALIGN(sizeof(ItemPointerData));
+
+ /*
+ * Reduce maxitemsize by an amount equal to target free
+ * space on left half of page
+ */
+ state->maxitemsize -= leftfree *
+ ((100 - BTREE_SINGLEVAL_FILLFACTOR) / 100.0);
+ }
+ else if (count == 3)
+ break;
+ }
+
+ /*
+ * Next iteration starts immediately after base tuple offset (this
+ * will be the next offset on the page when we didn't modify the
+ * page)
+ */
+ offnum = state->baseoff;
+ }
+
+ offnum = OffsetNumberNext(offnum);
+ }
+
+ /* Handle the last item when pending posting list is not empty */
+ if (state->nitems != 0)
+ {
+ pagesaving += _bt_dedup_finish_pending(buffer, state,
+ RelationNeedsWAL(rel));
+ count++;
+ }
+
+ if (pagesaving < newitemsz && state->skippedbase != InvalidOffsetNumber)
+ {
+ /*
+ * Didn't free enough space for new item in first checkingunique pass.
+ * Try making a second pass over the page, this time starting from the
+ * first candidate posting list base offset that was skipped over in
+ * the first pass (only do a second pass when this actually happened).
+ *
+ * The second pass over the page may deduplicate items that were
+ * initially passed over due to concerns about limiting the
+ * effectiveness of LP_DEAD bit setting within _bt_check_unique().
+ * Note that the second pass will still stop deduplicating as soon as
+ * enough space has been freed to avoid an immediate page split.
+ */
+ Assert(state->checkingunique);
+ offnum = state->skippedbase;
+
+ state->checkingunique = false;
+ state->skippedbase = InvalidOffsetNumber;
+ state->alltupsize = 0;
+ state->nitems = 0;
+ state->base = NULL;
+ state->baseoff = InvalidOffsetNumber;
+ state->basetupsize = 0;
+ goto retry;
+ }
+
+ /* Local space accounting should agree with page accounting */
+ Assert(pagesaving < newitemsz || PageGetExactFreeSpace(page) >= newitemsz);
+
+ /* be tidy */
+ pfree(state->htids);
+ pfree(state);
+}
+
+/*
+ * Create a new pending posting list tuple based on caller's tuple.
+ *
+ * Every tuple processed by the deduplication routines either becomes the base
+ * tuple for a posting list, or gets its heap TID(s) accepted into a pending
+ * posting list. A tuple that starts out as the base tuple for a posting list
+ * will only actually be rewritten within _bt_dedup_finish_pending() when
+ * there was at least one successful call to _bt_dedup_save_htid().
+ */
+void
+_bt_dedup_start_pending(BTDedupState *state, IndexTuple base,
+ OffsetNumber baseoff)
+{
+ Assert(state->nhtids == 0);
+ Assert(state->nitems == 0);
+
+ /*
+ * Copy heap TIDs from new base tuple for new candidate posting list into
+ * htids array. Assume that we'll eventually create a new posting tuple by
+ * merging later tuples with this existing one, though we may not.
+ */
+ if (!BTreeTupleIsPosting(base))
+ {
+ memcpy(state->htids, &base->t_tid, sizeof(ItemPointerData));
+ state->nhtids = 1;
+ /* Save size of tuple without any posting list */
+ state->basetupsize = IndexTupleSize(base);
+ }
+ else
+ {
+ int nposting;
+
+ nposting = BTreeTupleGetNPosting(base);
+ memcpy(state->htids, BTreeTupleGetPosting(base),
+ sizeof(ItemPointerData) * nposting);
+ state->nhtids = nposting;
+ /* Save size of tuple without any posting list */
+ state->basetupsize = BTreeTupleGetPostingOffset(base);
+ }
+
+ /*
+ * Save new base tuple itself -- it'll be needed if we actually create a
+ * new posting list from new pending posting list.
+ *
+ * Must maintain size of all tuples (including line pointer overhead) to
+ * calculate space savings on page within _bt_dedup_finish_pending().
+ * Also, save number of base tuple logical tuples so that we can save
+ * cycles in the common case where an existing posting list can't or won't
+ * be merged with other tuples on the page.
+ */
+ state->nitems = 1;
+ state->base = base;
+ state->baseoff = baseoff;
+ state->alltupsize = MAXALIGN(IndexTupleSize(base)) + sizeof(ItemIdData);
+ /* Also save baseoff in pending state for interval */
+ state->interval.baseoff = state->baseoff;
+ state->overlap = false;
+ if (state->newitem)
+ {
+ /* Might overlap with new item -- mark it as possible if it is */
+ if (ItemPointerCompare(BTreeTupleGetHeapTID(base), BTreeTupleGetHeapTID(state->newitem)) < 0)
+ state->overlap = true;
+ }
+}
+
+/*
+ * Save itup heap TID(s) into pending posting list where possible.
+ *
+ * Returns bool indicating if the pending posting list managed by state has
+ * itup's heap TID(s) saved. When this is false, enlarging the pending
+ * posting list by the required amount would exceed the maxitemsize limit, so
+ * caller must finish the pending posting list tuple. (Generally itup becomes
+ * the base tuple of caller's new pending posting list).
+ */
+bool
+_bt_dedup_save_htid(BTDedupState *state, IndexTuple itup)
+{
+ int nhtids;
+ ItemPointer htids;
+ Size mergedtupsz;
+
+ if (!BTreeTupleIsPosting(itup))
+ {
+ nhtids = 1;
+ htids = &itup->t_tid;
+ }
+ else
+ {
+ nhtids = BTreeTupleGetNPosting(itup);
+ htids = BTreeTupleGetPosting(itup);
+ }
+
+ /*
+ * Don't append (have caller finish pending posting list as-is) if
+ * appending heap TID(s) from itup would put us over limit
+ */
+ mergedtupsz = MAXALIGN(state->basetupsize +
+ (state->nhtids + nhtids) *
+ sizeof(ItemPointerData));
+
+ if (mergedtupsz > state->maxitemsize)
+ return false;
+
+ /* Don't merge existing posting lists with checkingunique */
+ if (state->checkingunique &&
+ (BTreeTupleIsPosting(state->base) || nhtids > 1))
+ {
+ /* May begin here if second pass over page is required */
+ if (state->skippedbase == InvalidOffsetNumber)
+ state->skippedbase = state->baseoff;
+ return false;
+ }
+
+ if (state->overlap)
+ {
+ if (ItemPointerCompare(BTreeTupleGetMaxHeapTID(itup), BTreeTupleGetHeapTID(state->newitem)) > 0)
+ {
+ /*
+ * newitem has heap TID in the range of the would-be new posting
+ * list. Avoid an immediate posting list split for caller.
+ */
+ if (_bt_keep_natts_fast(state->rel, state->newitem, itup) >
+ IndexRelationGetNumberOfAttributes(state->rel))
+ {
+ state->newitem = NULL; /* avoid unnecessary comparisons */
+ return false;
+ }
+ }
+ }
+
+ /*
+ * Save heap TIDs to pending posting list tuple -- itup can be merged into
+ * pending posting list
+ */
+ state->nitems++;
+ memcpy(state->htids + state->nhtids, htids,
+ sizeof(ItemPointerData) * nhtids);
+ state->nhtids += nhtids;
+ state->alltupsize += MAXALIGN(IndexTupleSize(itup)) + sizeof(ItemIdData);
+
+ return true;
+}
+
+/*
+ * Finalize pending posting list tuple, and add it to the page. Final tuple
+ * is based on saved base tuple, and saved list of heap TIDs.
+ *
+ * Returns space saving from deduplicating to make a new posting list tuple.
+ * Note that this includes line pointer overhead. This is zero in the case
+ * where no deduplication was possible.
+ */
+Size
+_bt_dedup_finish_pending(Buffer buffer, BTDedupState *state, bool need_wal)
+{
+ Size spacesaving = 0;
+ Page page = BufferGetPage(buffer);
+ int minimum = 2;
+
+ Assert(state->nitems > 0);
+ Assert(state->nitems <= state->nhtids);
+ Assert(state->interval.baseoff == state->baseoff);
+
+ /*
+ * Only create a posting list when at least 3 heap TIDs will appear in the
+ * checkingunique case (checkingunique strategy won't merge existing
+ * posting list tuples, so we know that the number of items here must also
+ * be the total number of heap TIDs). Creating a new posting list with
+ * only two heap TIDs won't even save enough space to fit another
+ * duplicate with the same key as the posting list. This is a bad
+ * trade-off if there is a chance that the LP_DEAD bit can be set for
+ * either existing tuple by putting off deduplication.
+ *
+ * (Note that a second pass over the page can deduplicate the item if that
+ * is truly the only way to avoid a page split for checkingunique caller)
+ */
+ Assert(!state->checkingunique || state->nitems == 1 ||
+ state->nhtids == state->nitems);
+ if (state->checkingunique)
+ {
+ minimum = 3;
+ /* May begin here if second pass over page is required */
+ if (state->nitems == 2 && state->skippedbase == InvalidOffsetNumber)
+ state->skippedbase = state->baseoff;
+ }
+
+ if (state->nitems >= minimum)
+ {
+ IndexTuple final;
+ Size finalsz;
+ OffsetNumber offnum;
+ OffsetNumber deletable[MaxOffsetNumber];
+ int ndeletable = 0;
+
+ /* find all tuples that will be replaced with this new posting tuple */
+ for (offnum = state->baseoff;
+ offnum < state->baseoff + state->nitems;
+ offnum = OffsetNumberNext(offnum))
+ deletable[ndeletable++] = offnum;
+
+ /* Form a tuple with a posting list */
+ final = _bt_form_posting(state->base, state->htids, state->nhtids);
+ finalsz = IndexTupleSize(final);
+ spacesaving = state->alltupsize - (finalsz + sizeof(ItemIdData));
+ /* Must have saved some space */
+ Assert(spacesaving > 0 && spacesaving < BLCKSZ);
+
+ /* Save final number of items for posting list */
+ state->interval.nitems = state->nitems;
+
+ Assert(finalsz <= state->maxitemsize);
+ Assert(finalsz == MAXALIGN(IndexTupleSize(final)));
+
+ START_CRIT_SECTION();
+
+ /* Delete items to replace */
+ PageIndexMultiDelete(page, deletable, ndeletable);
+ /* Insert posting tuple */
+ if (PageAddItem(page, (Item) final, finalsz, state->baseoff, false,
+ false) == InvalidOffsetNumber)
+ elog(ERROR, "deduplication failed to add tuple to page");
+
+ MarkBufferDirty(buffer);
+
+ /* Log deduplicated items */
+ if (need_wal)
+ {
+ XLogRecPtr recptr;
+ xl_btree_dedup xlrec_dedup;
+
+ xlrec_dedup.baseoff = state->interval.baseoff;
+ xlrec_dedup.nitems = state->interval.nitems;
+
+ XLogBeginInsert();
+ XLogRegisterBuffer(0, buffer, REGBUF_STANDARD);
+ XLogRegisterData((char *) &xlrec_dedup, SizeOfBtreeDedup);
+
+ recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_DEDUP_PAGE);
+
+ PageSetLSN(page, recptr);
+ }
+
+ END_CRIT_SECTION();
+
+ pfree(final);
+ }
+
+ /* Reset state for next pending posting list */
+ state->nhtids = 0;
+ state->nitems = 0;
+ state->alltupsize = 0;
+
+ return spacesaving;
+}
+
+/*
+ * Build a posting list tuple from a "base" index tuple and a list of heap
+ * TIDs for posting list.
+ *
+ * Caller's "htids" array must be sorted in ascending order. Any heap TIDs
+ * from caller's base tuple will not appear in returned posting list.
+ *
+ * If nhtids == 1, builds a non-posting tuple (posting list tuples can never
+ * have a single heap TID).
+ */
+IndexTuple
+_bt_form_posting(IndexTuple tuple, ItemPointer htids, int nhtids)
+{
+ uint32 keysize,
+ newsize = 0;
+ IndexTuple itup;
+
+ /* We only need key part of the tuple */
+ if (BTreeTupleIsPosting(tuple))
+ keysize = BTreeTupleGetPostingOffset(tuple);
+ else
+ keysize = IndexTupleSize(tuple);
+
+ Assert(nhtids > 0);
+
+ /* Add space needed for posting list */
+ if (nhtids > 1)
+ newsize = SHORTALIGN(keysize) + sizeof(ItemPointerData) * nhtids;
+ else
+ newsize = keysize;
+
+ newsize = MAXALIGN(newsize);
+ itup = palloc0(newsize);
+ memcpy(itup, tuple, keysize);
+ itup->t_info &= ~INDEX_SIZE_MASK;
+ itup->t_info |= newsize;
+
+ if (nhtids > 1)
+ {
+ /* Form posting tuple, fill posting fields */
+
+ itup->t_info |= INDEX_ALT_TID_MASK;
+ BTreeSetPostingMeta(itup, nhtids, SHORTALIGN(keysize));
+ /* Copy posting list into the posting tuple */
+ memcpy(BTreeTupleGetPosting(itup), htids,
+ sizeof(ItemPointerData) * nhtids);
+
+#ifdef USE_ASSERT_CHECKING
+ {
+ /* Assert that htid array is sorted and has unique TIDs */
+ ItemPointerData last;
+ ItemPointer current;
+
+ ItemPointerCopy(BTreeTupleGetHeapTID(itup), &last);
+
+ for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
+ {
+ current = BTreeTupleGetPostingN(itup, i);
+ Assert(ItemPointerCompare(current, &last) > 0);
+ ItemPointerCopy(current, &last);
+ }
+ }
+#endif
+ }
+ else
+ {
+ /* To finish building a non-posting tuple, copy TID from htids */
+ itup->t_info &= ~INDEX_ALT_TID_MASK;
+ ItemPointerCopy(htids, &itup->t_tid);
+ }
+
+ return itup;
+}
+
+/*
+ * Prepare for a posting list split by swapping heap TID in newitem with heap
+ * TID from original posting list (the 'oposting' heap TID located at offset
+ * 'postingoff').
+ *
+ * Returns new posting list tuple, which is palloc()'d in caller's context.
+ * This is guaranteed to be the same size as 'oposting'. Modified version of
+ * newitem is what caller actually inserts inside the critical section that
+ * also performs an in-place update of posting list.
+ *
+ * Explicit WAL-logging of newitem must use the original version of newitem in
+ * order to make it possible for our nbtxlog.c callers to correctly REDO
+ * original steps. (This approach avoids any explicit WAL-logging of a
+ * posting list tuple. This is important because posting lists are often much
+ * larger than plain tuples.)
+ */
+IndexTuple
+_bt_swap_posting(IndexTuple newitem, IndexTuple oposting, int postingoff)
+{
+ int nhtids;
+ char *replacepos;
+ char *rightpos;
+ Size nbytes;
+ IndexTuple nposting;
+
+ nhtids = BTreeTupleGetNPosting(oposting);
+ Assert(postingoff > 0 && postingoff < nhtids);
+
+ nposting = CopyIndexTuple(oposting);
+ replacepos = (char *) BTreeTupleGetPostingN(nposting, postingoff);
+ rightpos = replacepos + sizeof(ItemPointerData);
+ nbytes = (nhtids - postingoff - 1) * sizeof(ItemPointerData);
+
+ /*
+ * Move item pointers in posting list to make a gap for the new item's
+ * heap TID (shift TIDs one place to the right, losing original rightmost
+ * TID)
+ */
+ memmove(rightpos, replacepos, nbytes);
+
+ /* Fill the gap with the TID of the new item */
+ ItemPointerCopy(&newitem->t_tid, (ItemPointer) replacepos);
+
+ /* Copy original posting list's rightmost TID into new item */
+ ItemPointerCopy(BTreeTupleGetPostingN(oposting, nhtids - 1),
+ &newitem->t_tid);
+ Assert(ItemPointerCompare(BTreeTupleGetMaxHeapTID(nposting),
+ BTreeTupleGetHeapTID(newitem)) < 0);
+ Assert(BTreeTupleGetNPosting(oposting) == BTreeTupleGetNPosting(nposting));
+
+ return nposting;
+}
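The space accounting in _bt_dedup_finish_pending() and _bt_form_posting() can be checked with a little arithmetic. The sketch below is only an illustration under stated assumptions: 8-byte MAXALIGN, a 6-byte heap TID, a 4-byte line pointer, and a group of plain (non-posting) duplicates that all have the same tuple size. The DEMO_* macros and demo_* functions are hypothetical names, not part of the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed platform and format constants (see lead-in) */
#define DEMO_MAXALIGN(x)	(((size_t) (x) + 7) & ~(size_t) 7)
#define DEMO_SHORTALIGN(x)	(((size_t) (x) + 1) & ~(size_t) 1)
#define DEMO_TID_SIZE		6	/* assumed sizeof(ItemPointerData) */
#define DEMO_LINE_POINTER	4	/* assumed sizeof(ItemIdData) */

/* Size of the tuple built from a base tuple of 'keysize' bytes and nhtids TIDs */
static size_t
demo_posting_tuple_size(size_t keysize, int nhtids)
{
	if (nhtids > 1)
		return DEMO_MAXALIGN(DEMO_SHORTALIGN(keysize) +
							 (size_t) DEMO_TID_SIZE * (size_t) nhtids);
	return DEMO_MAXALIGN(keysize);
}

/*
 * Bytes freed on the page by replacing 'nitems' plain duplicates of
 * 'tupsize' bytes each with a single posting list tuple.  Line pointer
 * overhead is counted on both sides.
 */
static long
demo_space_saving(size_t tupsize, int nitems)
{
	size_t		before;
	size_t		after;

	before = (size_t) nitems * (DEMO_MAXALIGN(tupsize) + DEMO_LINE_POINTER);
	after = demo_posting_tuple_size(tupsize, nitems) + DEMO_LINE_POINTER;

	return (long) before - (long) after;
}
```

Under these assumptions, merging two 16-byte tuples frees only 4 bytes, far less than the 20 bytes another duplicate of the same size would need, which is consistent with the comment explaining why checkingunique callers do not create two-TID posting lists. Merging ten such tuples frees 116 bytes.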
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index b93b2a0ffd..0bfe9cdb7e 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -47,10 +47,12 @@ static void _bt_insertonpg(Relation rel, BTScanInsert itup_key,
BTStack stack,
IndexTuple itup,
OffsetNumber newitemoff,
+ int postingoff,
bool split_only_page);
static Buffer _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf,
Buffer cbuf, OffsetNumber newitemoff, Size newitemsz,
- IndexTuple newitem);
+ IndexTuple newitem, IndexTuple orignewitem,
+ IndexTuple nposting, OffsetNumber postingoff);
static void _bt_insert_parent(Relation rel, Buffer buf, Buffer rbuf,
BTStack stack, bool is_root, bool is_only);
static bool _bt_pgaddtup(Page page, Size itemsize, IndexTuple itup,
@@ -61,7 +63,8 @@ static void _bt_vacuum_one_page(Relation rel, Buffer buffer, Relation heapRel);
* _bt_doinsert() -- Handle insertion of a single index tuple in the tree.
*
* This routine is called by the public interface routine, btinsert.
- * By here, itup is filled in, including the TID.
+ * By here, itup is filled in, including the TID. Caller should be
+ * prepared for us to scribble on 'itup'.
*
* If checkUnique is UNIQUE_CHECK_NO or UNIQUE_CHECK_PARTIAL, this
* will allow duplicates. Otherwise (UNIQUE_CHECK_YES or
@@ -125,6 +128,7 @@ _bt_doinsert(Relation rel, IndexTuple itup,
insertstate.itup_key = itup_key;
insertstate.bounds_valid = false;
insertstate.buf = InvalidBuffer;
+ insertstate.postingoff = 0;
/*
* It's very common to have an index on an auto-incremented or
@@ -300,7 +304,7 @@ top:
newitemoff = _bt_findinsertloc(rel, &insertstate, checkingunique,
stack, heapRel);
_bt_insertonpg(rel, itup_key, insertstate.buf, InvalidBuffer, stack,
- itup, newitemoff, false);
+ itup, newitemoff, insertstate.postingoff, false);
}
else
{
@@ -353,6 +357,9 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
BTPageOpaque opaque;
Buffer nbuf = InvalidBuffer;
bool found = false;
+ bool inposting = false;
+ bool prev_all_dead = true;
+ int curposti = 0;
/* Assume unique until we find a duplicate */
*is_unique = true;
@@ -374,6 +381,11 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
/*
* Scan over all equal tuples, looking for live conflicts.
+ *
+ * Note that each iteration of the loop processes one heap TID, not one
+ * index tuple. The page offset number won't be advanced for iterations
+ * which process heap TIDs from posting list tuples until the last such
+ * heap TID for the posting list (curposti will be advanced instead).
*/
Assert(!insertstate->bounds_valid || insertstate->low == offset);
Assert(!itup_key->anynullkeys);
@@ -435,7 +447,27 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
/* okay, we gotta fetch the heap tuple ... */
curitup = (IndexTuple) PageGetItem(page, curitemid);
- htid = curitup->t_tid;
+
+ /*
+ * decide if this is the first heap TID in tuple we'll
+ * process, or if we should continue to process current
+ * posting list
+ */
+ if (!BTreeTupleIsPosting(curitup))
+ {
+ htid = curitup->t_tid;
+ inposting = false;
+ }
+ else if (!inposting)
+ {
+ /* First heap TID in posting list */
+ inposting = true;
+ prev_all_dead = true;
+ curposti = 0;
+ }
+
+ if (inposting)
+ htid = *BTreeTupleGetPostingN(curitup, curposti);
/*
* If we are doing a recheck, we expect to find the tuple we
@@ -511,8 +543,7 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
* not part of this chain because it had a different index
* entry.
*/
- htid = itup->t_tid;
- if (table_index_fetch_tuple_check(heapRel, &htid,
+ if (table_index_fetch_tuple_check(heapRel, &itup->t_tid,
SnapshotSelf, NULL))
{
/* Normal case --- it's still live */
@@ -570,12 +601,14 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
RelationGetRelationName(rel))));
}
}
- else if (all_dead)
+ else if (all_dead && (!inposting ||
+ (prev_all_dead &&
+ curposti == BTreeTupleGetNPosting(curitup) - 1)))
{
/*
- * The conflicting tuple (or whole HOT chain) is dead to
- * everyone, so we may as well mark the index entry
- * killed.
+ * The conflicting tuple (or all HOT chains pointed to by
+ * all posting list TIDs) is dead to everyone, so mark the
+ * index entry killed.
*/
ItemIdMarkDead(curitemid);
opaque->btpo_flags |= BTP_HAS_GARBAGE;
@@ -589,14 +622,29 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
else
MarkBufferDirtyHint(insertstate->buf, true);
}
+
+ /*
+ * Remember if posting list tuple has even a single HOT chain
+ * whose members are not all dead
+ */
+ if (!all_dead && inposting)
+ prev_all_dead = false;
}
}
- /*
- * Advance to next tuple to continue checking.
- */
- if (offset < maxoff)
+ if (inposting && curposti < BTreeTupleGetNPosting(curitup) - 1)
+ {
+ /* Advance to next TID in same posting list */
+ curposti++;
+ continue;
+ }
+ else if (offset < maxoff)
+ {
+ /* Advance to next tuple */
+ curposti = 0;
+ inposting = false;
offset = OffsetNumberNext(offset);
+ }
else
{
int highkeycmp;
@@ -621,6 +669,8 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
elog(ERROR, "fell off the end of index \"%s\"",
RelationGetRelationName(rel));
}
+ curposti = 0;
+ inposting = false;
maxoff = PageGetMaxOffsetNumber(page);
offset = P_FIRSTDATAKEY(opaque);
/* Don't invalidate binary search bounds */
@@ -689,6 +739,7 @@ _bt_findinsertloc(Relation rel,
BTScanInsert itup_key = insertstate->itup_key;
Page page = BufferGetPage(insertstate->buf);
BTPageOpaque lpageop;
+ OffsetNumber location;
lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -751,13 +802,26 @@ _bt_findinsertloc(Relation rel,
/*
* If the target page is full, see if we can obtain enough space by
- * erasing LP_DEAD items
+ * erasing LP_DEAD items. If that doesn't work out, and if the index
+ * deduplication is both possible and enabled, try deduplication.
*/
- if (PageGetFreeSpace(page) < insertstate->itemsz &&
- P_HAS_GARBAGE(lpageop))
+ if (PageGetFreeSpace(page) < insertstate->itemsz)
{
- _bt_vacuum_one_page(rel, insertstate->buf, heapRel);
- insertstate->bounds_valid = false;
+ if (P_HAS_GARBAGE(lpageop))
+ {
+ _bt_vacuum_one_page(rel, insertstate->buf, heapRel);
+ insertstate->bounds_valid = false;
+ }
+
+ if (insertstate->itup_key->safededup &&
+ BtreeGetDoDedupOption(rel) &&
+ PageGetFreeSpace(page) < insertstate->itemsz)
+ {
+ _bt_dedup_one_page(rel, insertstate->buf, heapRel,
+ insertstate->itup, insertstate->itemsz,
+ checkingunique);
+ insertstate->bounds_valid = false;
+ }
}
}
else
@@ -839,7 +903,38 @@ _bt_findinsertloc(Relation rel,
Assert(P_RIGHTMOST(lpageop) ||
_bt_compare(rel, itup_key, page, P_HIKEY) <= 0);
- return _bt_binsrch_insert(rel, insertstate);
+ location = _bt_binsrch_insert(rel, insertstate);
+
+ /*
+ * Insertion is not prepared for the case where an LP_DEAD posting list
+ * tuple must be split. In the unlikely event that this happens, call
+ * _bt_dedup_one_page() to force it to kill all LP_DEAD items.
+ */
+ if (unlikely(insertstate->postingoff == -1))
+ {
+ Assert(insertstate->itup_key->safededup);
+
+ /*
+ * Don't check if the option is enabled, since no actual deduplication
+ * will be done, just cleanup.
+ */
+ _bt_dedup_one_page(rel, insertstate->buf, heapRel, insertstate->itup,
+ 0, checkingunique);
+ Assert(!P_HAS_GARBAGE(lpageop));
+
+ /* Must reset insertstate ahead of new _bt_binsrch_insert() call */
+ insertstate->bounds_valid = false;
+ insertstate->postingoff = 0;
+ location = _bt_binsrch_insert(rel, insertstate);
+
+ /*
+ * Might still have to split some other posting list now, but that
+ * should never be LP_DEAD
+ */
+ Assert(insertstate->postingoff >= 0);
+ }
+
+ return location;
}
/*
@@ -905,10 +1000,12 @@ _bt_stepright(Relation rel, BTInsertState insertstate, BTStack stack)
*
* This recursive procedure does the following things:
*
+ * + if necessary, splits an existing posting list on page.
+ * This is only needed when 'postingoff' is non-zero.
* + if necessary, splits the target page, using 'itup_key' for
* suffix truncation on leaf pages (caller passes NULL for
* non-leaf pages).
- * + inserts the tuple.
+ * + inserts the new tuple (could be from split posting list).
* + if the page was split, pops the parent stack, and finds the
* right place to insert the new child pointer (by walking
* right using information stored in the parent stack).
@@ -918,7 +1015,8 @@ _bt_stepright(Relation rel, BTInsertState insertstate, BTStack stack)
*
* On entry, we must have the correct buffer in which to do the
* insertion, and the buffer must be pinned and write-locked. On return,
- * we will have dropped both the pin and the lock on the buffer.
+ * we will have dropped both the pin and the lock on the buffer. Caller
+ * should be prepared for us to scribble on 'itup'.
*
* This routine only performs retail tuple insertions. 'itup' should
* always be either a non-highkey leaf item, or a downlink (new high
@@ -936,11 +1034,15 @@ _bt_insertonpg(Relation rel,
BTStack stack,
IndexTuple itup,
OffsetNumber newitemoff,
+ int postingoff,
bool split_only_page)
{
Page page;
BTPageOpaque lpageop;
Size itemsz;
+ IndexTuple oposting;
+ IndexTuple origitup = NULL;
+ IndexTuple nposting = NULL;
page = BufferGetPage(buf);
lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -954,6 +1056,8 @@ _bt_insertonpg(Relation rel,
Assert(P_ISLEAF(lpageop) ||
BTreeTupleGetNAtts(itup, rel) <=
IndexRelationGetNumberOfKeyAttributes(rel));
+ /* retail insertions of posting list tuples are disallowed */
+ Assert(!BTreeTupleIsPosting(itup));
/* The caller should've finished any incomplete splits already. */
if (P_INCOMPLETE_SPLIT(lpageop))
@@ -964,6 +1068,39 @@ _bt_insertonpg(Relation rel,
itemsz = MAXALIGN(itemsz); /* be safe, PageAddItem will do this but we
* need to be consistent */
+ /*
+ * Do we need to split an existing posting list item?
+ */
+ if (postingoff != 0)
+ {
+ ItemId itemid = PageGetItemId(page, newitemoff);
+
+ /*
+ * The new tuple is a duplicate with a heap TID that falls inside the
+ * range of an existing posting list tuple on a leaf page. Prepare to
+ * split an existing posting list by swapping new item's heap TID with
+ * the rightmost heap TID from original posting list, and generating a
+ * new version of the posting list that has new item's heap TID.
+ *
+ * Posting list splits work by modifying the overlapping posting list
+ * as part of the same atomic operation that inserts the "new item".
+ * The space accounting is kept simple, since it does not need to
+ * consider posting list splits at all (this is particularly important
+ * for the case where we also have to split the page). Overwriting
+ * the posting list with its post-split version is treated as an extra
+ * step in either the insert or page split critical section.
+ */
+ Assert(P_ISLEAF(lpageop) && !ItemIdIsDead(itemid));
+ oposting = (IndexTuple) PageGetItem(page, itemid);
+
+ /* save a copy of itup with unchanged TID for xlog record */
+ origitup = CopyIndexTuple(itup);
+ nposting = _bt_swap_posting(itup, oposting, postingoff);
+
+ /* Alter offset so that it goes after existing posting list */
+ newitemoff = OffsetNumberNext(newitemoff);
+ }
+
/*
* Do we need to split the page to fit the item on it?
*
@@ -996,7 +1133,8 @@ _bt_insertonpg(Relation rel,
BlockNumberIsValid(RelationGetTargetBlock(rel))));
/* split the buffer into left and right halves */
- rbuf = _bt_split(rel, itup_key, buf, cbuf, newitemoff, itemsz, itup);
+ rbuf = _bt_split(rel, itup_key, buf, cbuf, newitemoff, itemsz, itup,
+ origitup, nposting, postingoff);
PredicateLockPageSplit(rel,
BufferGetBlockNumber(buf),
BufferGetBlockNumber(rbuf));
@@ -1075,6 +1213,13 @@ _bt_insertonpg(Relation rel,
elog(PANIC, "failed to add new item to block %u in index \"%s\"",
itup_blkno, RelationGetRelationName(rel));
+ /*
+ * Posting list split requires an in-place update of the existing
+ * posting list
+ */
+ if (nposting)
+ memcpy(oposting, nposting, MAXALIGN(IndexTupleSize(nposting)));
+
MarkBufferDirty(buf);
if (BufferIsValid(metabuf))
@@ -1116,6 +1261,7 @@ _bt_insertonpg(Relation rel,
XLogRecPtr recptr;
xlrec.offnum = itup_off;
+ xlrec.postingoff = postingoff;
XLogBeginInsert();
XLogRegisterData((char *) &xlrec, SizeOfBtreeInsert);
@@ -1144,6 +1290,7 @@ _bt_insertonpg(Relation rel,
xlmeta.oldest_btpo_xact = metad->btm_oldest_btpo_xact;
xlmeta.last_cleanup_num_heap_tuples =
metad->btm_last_cleanup_num_heap_tuples;
+ xlmeta.btm_safededup = metad->btm_safededup;
XLogRegisterBuffer(2, metabuf, REGBUF_WILL_INIT | REGBUF_STANDARD);
XLogRegisterBufData(2, (char *) &xlmeta, sizeof(xl_btree_metadata));
@@ -1152,7 +1299,19 @@ _bt_insertonpg(Relation rel,
}
XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
- XLogRegisterBufData(0, (char *) itup, IndexTupleSize(itup));
+
+ /*
+ * We always write newitem to the page, but when there is an
+ * original newitem due to a posting list split then we log the
+ * original item instead. REDO routine must reconstruct the final
+ * newitem at the same time it reconstructs nposting.
+ */
+ if (postingoff == 0)
+ XLogRegisterBufData(0, (char *) itup,
+ IndexTupleSize(itup));
+ else
+ XLogRegisterBufData(0, (char *) origitup,
+ IndexTupleSize(origitup));
recptr = XLogInsert(RM_BTREE_ID, xlinfo);
@@ -1194,6 +1353,13 @@ _bt_insertonpg(Relation rel,
_bt_getrootheight(rel) >= BTREE_FASTPATH_MIN_LEVEL)
RelationSetTargetBlock(rel, cachedBlock);
}
+
+ /* be tidy */
+ if (postingoff != 0)
+ {
+ pfree(nposting);
+ pfree(origitup);
+ }
}
/*
@@ -1209,12 +1375,25 @@ _bt_insertonpg(Relation rel,
* This function will clear the INCOMPLETE_SPLIT flag on it, and
* release the buffer.
*
+ * orignewitem, nposting, and postingoff are needed when an insert of
+ * orignewitem results in both a posting list split and a page split.
+ * newitem and nposting are replacements for orignewitem and the
+ * existing posting list on the page respectively. These extra
+ * posting list split details are used here in the same way as they
+ * are used in the more common case where a posting list split does
+ * not coincide with a page split. We need to deal with posting list
+ * splits directly in order to ensure that everything that follows
+ * from the insert of orignewitem is handled as a single atomic
+ * operation (though caller's insert of a new pivot/downlink into
+ * parent page will still be a separate operation).
+ *
* Returns the new right sibling of buf, pinned and write-locked.
* The pin and lock on buf are maintained.
*/
static Buffer
_bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
- OffsetNumber newitemoff, Size newitemsz, IndexTuple newitem)
+ OffsetNumber newitemoff, Size newitemsz, IndexTuple newitem,
+ IndexTuple orignewitem, IndexTuple nposting, OffsetNumber postingoff)
{
Buffer rbuf;
Page origpage;
@@ -1236,12 +1415,23 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
OffsetNumber firstright;
OffsetNumber maxoff;
OffsetNumber i;
+ OffsetNumber replacepostingoff = InvalidOffsetNumber;
bool newitemonleft,
isleaf;
IndexTuple lefthikey;
int indnatts = IndexRelationGetNumberOfAttributes(rel);
int indnkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
+ /*
+ * Determine offset number of existing posting list on page when a split
+ * of a posting list needs to take place as the page is split
+ */
+ if (nposting != NULL)
+ {
+ Assert(itup_key->heapkeyspace);
+ replacepostingoff = OffsetNumberPrev(newitemoff);
+ }
+
/*
* origpage is the original page to be split. leftpage is a temporary
* buffer that receives the left-sibling data, which will be copied back
@@ -1273,6 +1463,13 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
* newitemoff == firstright. In all other cases it's clear which side of
* the split every tuple goes on from context. newitemonleft is usually
* (but not always) redundant information.
+ *
+ * Note: In theory, the split point choice logic should operate against a
+ * version of the page that already replaced the posting list at offset
+ * replacepostingoff with nposting where applicable. We don't bother with
+ * that, though. Both versions of the posting list must be the same size,
+ * and both will have the same base tuple key values, so split point
+ * choice is never affected.
*/
firstright = _bt_findsplitloc(rel, origpage, newitemoff, newitemsz,
newitem, &newitemonleft);
@@ -1340,6 +1537,9 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
itemid = PageGetItemId(origpage, firstright);
itemsz = ItemIdGetLength(itemid);
item = (IndexTuple) PageGetItem(origpage, itemid);
+ /* Behave as if origpage posting list has already been swapped */
+ if (firstright == replacepostingoff)
+ item = nposting;
}
/*
@@ -1373,6 +1573,9 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
Assert(lastleftoff >= P_FIRSTDATAKEY(oopaque));
itemid = PageGetItemId(origpage, lastleftoff);
lastleft = (IndexTuple) PageGetItem(origpage, itemid);
+ /* Behave as if origpage posting list has already been swapped */
+ if (lastleftoff == replacepostingoff)
+ lastleft = nposting;
}
Assert(lastleft != item);
@@ -1480,8 +1683,23 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
itemsz = ItemIdGetLength(itemid);
item = (IndexTuple) PageGetItem(origpage, itemid);
+ /*
+ * did caller pass new replacement posting list tuple due to posting
+ * list split?
+ */
+ if (i == replacepostingoff)
+ {
+ /*
+ * swap origpage posting list with post-posting-list-split version
+ * from caller
+ */
+ Assert(isleaf);
+ Assert(itemsz == MAXALIGN(IndexTupleSize(nposting)));
+ item = nposting;
+ }
+
/* does new item belong before this one? */
- if (i == newitemoff)
+ else if (i == newitemoff)
{
if (newitemonleft)
{
@@ -1650,8 +1868,12 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
XLogRecPtr recptr;
xlrec.level = ropaque->btpo.level;
+ /* See comments below on newitem, orignewitem, and posting lists */
xlrec.firstright = firstright;
xlrec.newitemoff = newitemoff;
+ xlrec.postingoff = InvalidOffsetNumber;
+ if (replacepostingoff < firstright)
+ xlrec.postingoff = postingoff;
XLogBeginInsert();
XLogRegisterData((char *) &xlrec, SizeOfBtreeSplit);
@@ -1670,11 +1892,45 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
* because it's included with all the other items on the right page.)
* Show the new item as belonging to the left page buffer, so that it
* is not stored if XLogInsert decides it needs a full-page image of
- * the left page. We store the offset anyway, though, to support
- * archive compression of these records.
+ * the left page. We always store newitemoff in the record, though.
+ *
+ * The details are sometimes slightly different for page splits that
+ * coincide with a posting list split. If both the replacement
+ * posting list and newitem go on the right page, then we don't need
+ * to log anything extra, just like the simple !newitemonleft
+ * no-posting-split case (postingoff isn't set in the WAL record, so
+ * recovery doesn't need to process a posting list split at all).
+ * Otherwise, we set postingoff and log orignewitem instead of
+ * newitem, despite having actually inserted newitem. Recovery must
+ * reconstruct nposting and newitem by calling _bt_swap_posting().
+ *
+ * Note: It's possible that our page split point is the point that
+ * makes the posting list lastleft and newitem firstright. This is
+ * the only case where we log orignewitem despite newitem going on the
+ * right page. If XLogInsert decides that it can omit orignewitem due
+ * to logging a full-page image of the left page, everything still
+ * works out, since recovery only needs orignewitem for items
+ * on the left page (just like the regular newitem-logged case).
*/
- if (newitemonleft)
- XLogRegisterBufData(0, (char *) newitem, MAXALIGN(newitemsz));
+ if (newitemonleft || xlrec.postingoff != InvalidOffsetNumber)
+ {
+ if (xlrec.postingoff == InvalidOffsetNumber)
+ {
+ /* Must WAL-log newitem, since it's on left page */
+ Assert(newitemonleft);
+ Assert(orignewitem == NULL && nposting == NULL);
+ XLogRegisterBufData(0, (char *) newitem, MAXALIGN(newitemsz));
+ }
+ else
+ {
+ /* Must WAL-log orignewitem following posting list split */
+ Assert(newitemonleft || firstright == newitemoff);
+ Assert(ItemPointerCompare(&orignewitem->t_tid,
+ &newitem->t_tid) < 0);
+ XLogRegisterBufData(0, (char *) orignewitem,
+ MAXALIGN(IndexTupleSize(orignewitem)));
+ }
+ }
/* Log the left page's new high key */
itemid = PageGetItemId(origpage, P_HIKEY);
@@ -1834,7 +2090,7 @@ _bt_insert_parent(Relation rel,
/* Recursively insert into the parent */
_bt_insertonpg(rel, NULL, pbuf, buf, stack->bts_parent,
- new_item, stack->bts_offset + 1,
+ new_item, stack->bts_offset + 1, 0,
is_only);
/* be tidy */
@@ -2190,6 +2446,7 @@ _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf)
md.fastlevel = metad->btm_level;
md.oldest_btpo_xact = metad->btm_oldest_btpo_xact;
md.last_cleanup_num_heap_tuples = metad->btm_last_cleanup_num_heap_tuples;
+ md.btm_safededup = metad->btm_safededup;
XLogRegisterBufData(2, (char *) &md, sizeof(xl_btree_metadata));
@@ -2303,6 +2560,6 @@ _bt_vacuum_one_page(Relation rel, Buffer buffer, Relation heapRel)
* Note: if we didn't find any LP_DEAD items, then the page's
* BTP_HAS_GARBAGE hint bit is falsely set. We do not bother expending a
* separate write to clear it, however. We will clear it when we split
- * the page.
+ * the page (or when deduplication runs).
*/
}
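To make the posting list split in _bt_insertonpg() easier to follow, here is a minimal standalone sketch of the swap step described in its comments. It is not the patch's _bt_swap_posting(): DemoTid, demo_tid_cmp() and demo_swap_posting() are hypothetical names, and a plain sorted array stands in for the posting list tuple. It only shows why the posting list keeps its size while the incoming item ends up carrying the old rightmost heap TID.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Simplified stand-in for a heap TID (hypothetical, not ItemPointerData) */
typedef struct DemoTid
{
	uint32_t	block;
	uint16_t	offset;
} DemoTid;

/* Compare two TIDs by block number, then by offset */
static int
demo_tid_cmp(const DemoTid *a, const DemoTid *b)
{
	if (a->block != b->block)
		return (a->block < b->block) ? -1 : 1;
	if (a->offset != b->offset)
		return (a->offset < b->offset) ? -1 : 1;
	return 0;
}

/*
 * Swap step of a posting list split.
 *
 * posting[0..ntids-1] is sorted, and *newtid sorts strictly between
 * posting[postingoff - 1] and posting[postingoff].  On return the array
 * still holds ntids entries, now including the caller's TID at position
 * postingoff, and *newtid has been overwritten with the old rightmost TID.
 */
static void
demo_swap_posting(DemoTid *posting, int ntids, int postingoff,
				  DemoTid *newtid)
{
	DemoTid		rightmost = posting[ntids - 1];

	assert(postingoff > 0 && postingoff < ntids);
	assert(demo_tid_cmp(&posting[postingoff - 1], newtid) < 0);
	assert(demo_tid_cmp(newtid, &posting[postingoff]) < 0);

	/* Shift the tail one slot to the right, dropping the rightmost TID */
	memmove(&posting[postingoff + 1], &posting[postingoff],
			(size_t) (ntids - postingoff - 1) * sizeof(DemoTid));
	posting[postingoff] = *newtid;

	/* The "new item" now carries the TID that fell off the end */
	*newtid = rightmost;
}
```

Because the array length never changes, the space accounting for the page only has to consider the new item itself, which is the property the comments in _bt_insertonpg() and _bt_split() rely on.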
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 268f869a36..77f443f7a9 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -24,6 +24,7 @@
#include "access/nbtree.h"
#include "access/nbtxlog.h"
+#include "access/tableam.h"
#include "access/transam.h"
#include "access/xlog.h"
#include "access/xloginsert.h"
@@ -42,12 +43,18 @@ static bool _bt_lock_branch_parent(Relation rel, BlockNumber child,
BlockNumber *target, BlockNumber *rightsib);
static void _bt_log_reuse_page(Relation rel, BlockNumber blkno,
TransactionId latestRemovedXid);
+static TransactionId _bt_compute_xid_horizon_for_tuples(Relation rel,
+ Relation heapRel,
+ Buffer buf,
+ OffsetNumber *itemnos,
+ int nitems);
/*
* _bt_initmetapage() -- Fill a page buffer with a correct metapage image
*/
void
-_bt_initmetapage(Page page, BlockNumber rootbknum, uint32 level)
+_bt_initmetapage(Page page, BlockNumber rootbknum, uint32 level,
+ bool safededup)
{
BTMetaPageData *metad;
BTPageOpaque metaopaque;
@@ -63,6 +70,7 @@ _bt_initmetapage(Page page, BlockNumber rootbknum, uint32 level)
metad->btm_fastlevel = level;
metad->btm_oldest_btpo_xact = InvalidTransactionId;
metad->btm_last_cleanup_num_heap_tuples = -1.0;
+ metad->btm_safededup = safededup;
metaopaque = (BTPageOpaque) PageGetSpecialPointer(page);
metaopaque->btpo_flags = BTP_META;
@@ -102,6 +110,9 @@ _bt_upgrademetapage(Page page)
metad->btm_version = BTREE_NOVAC_VERSION;
metad->btm_oldest_btpo_xact = InvalidTransactionId;
metad->btm_last_cleanup_num_heap_tuples = -1.0;
+ /* Only a REINDEX can set this field */
+ Assert(!metad->btm_safededup);
+ metad->btm_safededup = false;
/* Adjust pd_lower (see _bt_initmetapage() for details) */
((PageHeader) page)->pd_lower =
@@ -213,6 +224,7 @@ _bt_update_meta_cleanup_info(Relation rel, TransactionId oldestBtpoXact,
md.fastlevel = metad->btm_fastlevel;
md.oldest_btpo_xact = oldestBtpoXact;
md.last_cleanup_num_heap_tuples = numHeapTuples;
+ md.btm_safededup = metad->btm_safededup;
XLogRegisterBufData(0, (char *) &md, sizeof(xl_btree_metadata));
@@ -274,6 +286,8 @@ _bt_getroot(Relation rel, int access)
Assert(metad->btm_magic == BTREE_MAGIC);
Assert(metad->btm_version >= BTREE_MIN_VERSION);
Assert(metad->btm_version <= BTREE_VERSION);
+ Assert(!metad->btm_safededup ||
+ metad->btm_version > BTREE_NOVAC_VERSION);
Assert(metad->btm_root != P_NONE);
rootblkno = metad->btm_fastroot;
@@ -394,6 +408,7 @@ _bt_getroot(Relation rel, int access)
md.fastlevel = 0;
md.oldest_btpo_xact = InvalidTransactionId;
md.last_cleanup_num_heap_tuples = -1.0;
+ md.btm_safededup = metad->btm_safededup;
XLogRegisterBufData(2, (char *) &md, sizeof(xl_btree_metadata));
@@ -618,6 +633,7 @@ _bt_getrootheight(Relation rel)
Assert(metad->btm_magic == BTREE_MAGIC);
Assert(metad->btm_version >= BTREE_MIN_VERSION);
Assert(metad->btm_version <= BTREE_VERSION);
+ Assert(!metad->btm_safededup || metad->btm_version > BTREE_NOVAC_VERSION);
Assert(metad->btm_fastroot != P_NONE);
return metad->btm_fastlevel;
@@ -683,6 +699,56 @@ _bt_heapkeyspace(Relation rel)
return metad->btm_version > BTREE_NOVAC_VERSION;
}
+/*
+ * _bt_safededup() -- can deduplication safely be used by index?
+ *
+ * Uses field from index relation's metapage/cached metapage.
+ */
+bool
+_bt_safededup(Relation rel)
+{
+ BTMetaPageData *metad;
+
+ if (rel->rd_amcache == NULL)
+ {
+ Buffer metabuf;
+
+ metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_READ);
+ metad = _bt_getmeta(rel, metabuf);
+
+ /*
+ * If there's no root page yet, _bt_getroot() doesn't expect a cache
+ * to be made, so just stop here. (XXX perhaps _bt_getroot() should
+ * be changed to allow this case.)
+ *
+ * Note that we rely on the assumption that this field will be zero'ed
+ * on indexes that were pg_upgrade'd.
+ */
+ if (metad->btm_root == P_NONE)
+ {
+ _bt_relbuf(rel, metabuf);
+ return metad->btm_safededup;
+ }
+
+ /* Cache the metapage data for next time */
+ rel->rd_amcache = MemoryContextAlloc(rel->rd_indexcxt,
+ sizeof(BTMetaPageData));
+ memcpy(rel->rd_amcache, metad, sizeof(BTMetaPageData));
+ _bt_relbuf(rel, metabuf);
+ }
+
+ /* Get cached page */
+ metad = (BTMetaPageData *) rel->rd_amcache;
+ /* We shouldn't have cached it if any of these fail */
+ Assert(metad->btm_magic == BTREE_MAGIC);
+ Assert(metad->btm_version >= BTREE_MIN_VERSION);
+ Assert(metad->btm_version <= BTREE_VERSION);
+ Assert(!metad->btm_safededup || metad->btm_version > BTREE_NOVAC_VERSION);
+ Assert(metad->btm_fastroot != P_NONE);
+
+ return metad->btm_safededup;
+}
+
/*
* _bt_checkpage() -- Verify that a freshly-read page looks sane.
*/
@@ -983,14 +1049,52 @@ _bt_page_recyclable(Page page)
void
_bt_delitems_vacuum(Relation rel, Buffer buf,
OffsetNumber *itemnos, int nitems,
+ OffsetNumber *updateitemnos,
+ IndexTuple *updated, int nupdatable,
BlockNumber lastBlockVacuumed)
{
Page page = BufferGetPage(buf);
BTPageOpaque opaque;
+ Size itemsz;
+ Size updated_sz = 0;
+ char *updated_buf = NULL;
+
+ /* XLOG stuff: pack the updated tuples into one contiguous buffer */
+ if (nupdatable > 0 && RelationNeedsWAL(rel))
+ {
+ Size offset = 0;
+
+ for (int i = 0; i < nupdatable; i++)
+ updated_sz += MAXALIGN(IndexTupleSize(updated[i]));
+
+ updated_buf = palloc(updated_sz);
+ for (int i = 0; i < nupdatable; i++)
+ {
+ itemsz = IndexTupleSize(updated[i]);
+ memcpy(updated_buf + offset, (char *) updated[i], itemsz);
+ offset += MAXALIGN(itemsz);
+ }
+ Assert(offset == updated_sz);
+ }
/* No ereport(ERROR) until changes are logged */
START_CRIT_SECTION();
+ /* Replace posting list tuples that lost some of their heap TIDs */
+ for (int i = 0; i < nupdatable; i++)
+ {
+ /* First, delete the old tuple */
+ PageIndexTupleDelete(page, updateitemnos[i]);
+
+ itemsz = IndexTupleSize(updated[i]);
+ itemsz = MAXALIGN(itemsz);
+
+ /* Add tuple with updated ItemPointers to the page. */
+ if (PageAddItem(page, (Item) updated[i], itemsz, updateitemnos[i],
+ false, false) == InvalidOffsetNumber)
+ elog(PANIC, "failed to rewrite posting list item in index while doing vacuum");
+ }
+
/* Fix the page */
if (nitems > 0)
PageIndexMultiDelete(page, itemnos, nitems);
@@ -1020,6 +1124,8 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
xl_btree_vacuum xlrec_vacuum;
xlrec_vacuum.lastBlockVacuumed = lastBlockVacuumed;
+ xlrec_vacuum.nupdated = nupdatable;
+ xlrec_vacuum.ndeleted = nitems;
XLogBeginInsert();
XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
@@ -1033,6 +1139,19 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
if (nitems > 0)
XLogRegisterBufData(0, (char *) itemnos, nitems * sizeof(OffsetNumber));
+ /*
+ * Register the offset numbers of the updated tuples, followed by the
+ * updated tuples themselves. Order matters: REDO must apply the
+ * updates before it processes the remaining deleted items.
+ */
+ if (nupdatable > 0)
+ {
+ Assert(updated_buf != NULL);
+ XLogRegisterBufData(0, (char *) updateitemnos,
+ nupdatable * sizeof(OffsetNumber));
+ XLogRegisterBufData(0, updated_buf, updated_sz);
+ }
+
recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_VACUUM);
PageSetLSN(page, recptr);
@@ -1041,6 +1160,91 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
END_CRIT_SECTION();
}
+/*
+ * Get the latestRemovedXid from the table entries pointed at by the index
+ * tuples being deleted.
+ *
+ * This is a version of index_compute_xid_horizon_for_tuples() specialized to
+ * nbtree, which can handle posting lists.
+ */
+static TransactionId
+_bt_compute_xid_horizon_for_tuples(Relation rel, Relation heapRel,
+ Buffer buf, OffsetNumber *itemnos,
+ int nitems)
+{
+ ItemPointer htids;
+ TransactionId latestRemovedXid = InvalidTransactionId;
+ Page page = BufferGetPage(buf);
+ int arraynitems;
+ int finalnitems;
+
+ /*
+ * Initial size of array can fit everything when it turns out that there are
+ * no posting lists
+ */
+ arraynitems = nitems;
+ htids = (ItemPointer) palloc(sizeof(ItemPointerData) * arraynitems);
+
+ finalnitems = 0;
+ /* identify what the index tuples about to be deleted point to */
+ for (int i = 0; i < nitems; i++)
+ {
+ ItemId itemid;
+ IndexTuple itup;
+
+ itemid = PageGetItemId(page, itemnos[i]);
+ itup = (IndexTuple) PageGetItem(page, itemid);
+
+ Assert(ItemIdIsDead(itemid));
+
+ if (!BTreeTupleIsPosting(itup))
+ {
+ /* Make sure that we have space for additional heap TID */
+ if (finalnitems + 1 > arraynitems)
+ {
+ arraynitems = arraynitems * 2;
+ htids = (ItemPointer)
+ repalloc(htids, sizeof(ItemPointerData) * arraynitems);
+ }
+
+ Assert(ItemPointerIsValid(&itup->t_tid));
+ ItemPointerCopy(&itup->t_tid, &htids[finalnitems]);
+ finalnitems++;
+ }
+ else
+ {
+ int nposting = BTreeTupleGetNPosting(itup);
+
+ /* Make sure that we have space for additional heap TIDs */
+ if (finalnitems + nposting > arraynitems)
+ {
+ arraynitems = Max(arraynitems * 2, finalnitems + nposting);
+ htids = (ItemPointer)
+ repalloc(htids, sizeof(ItemPointerData) * arraynitems);
+ }
+
+ for (int j = 0; j < nposting; j++)
+ {
+ ItemPointer htid = BTreeTupleGetPostingN(itup, j);
+
+ Assert(ItemPointerIsValid(htid));
+ ItemPointerCopy(htid, &htids[finalnitems]);
+ finalnitems++;
+ }
+ }
+ }
+
+ Assert(finalnitems >= nitems);
+
+ /* determine the actual xid horizon */
+ latestRemovedXid =
+ table_compute_xid_horizon_for_tuples(heapRel, htids, finalnitems);
+
+ pfree(htids);
+
+ return latestRemovedXid;
+}
+
/*
* Delete item(s) from a btree page during single-page cleanup.
*
@@ -1067,8 +1271,8 @@ _bt_delitems_delete(Relation rel, Buffer buf,
if (XLogStandbyInfoActive() && RelationNeedsWAL(rel))
latestRemovedXid =
- index_compute_xid_horizon_for_tuples(rel, heapRel, buf,
- itemnos, nitems);
+ _bt_compute_xid_horizon_for_tuples(rel, heapRel, buf,
+ itemnos, nitems);
/* No ereport(ERROR) until changes are logged */
START_CRIT_SECTION();
@@ -2066,6 +2270,7 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, bool *rightsib_empty)
xlmeta.fastlevel = metad->btm_fastlevel;
xlmeta.oldest_btpo_xact = metad->btm_oldest_btpo_xact;
xlmeta.last_cleanup_num_heap_tuples = metad->btm_last_cleanup_num_heap_tuples;
+ xlmeta.btm_safededup = metad->btm_safededup;
XLogRegisterBufData(4, (char *) &xlmeta, sizeof(xl_btree_metadata));
xlinfo = XLOG_BTREE_UNLINK_PAGE_META;
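The array growth in _bt_compute_xid_horizon_for_tuples() can be sketched outside the backend as follows. This is only an illustration under simplifying assumptions: DemoIndexTuple and demo_collect_tids() are hypothetical, heap TIDs are plain ints, and malloc/realloc replace palloc/repalloc. A tuple with ntids == 1 plays the role of a regular tuple, and larger values play the role of a posting list.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for an index tuple: one or more heap TIDs */
typedef struct DemoIndexTuple
{
	int			ntids;		/* 1 for a plain tuple, >1 for a posting list */
	const int  *tids;		/* the heap TIDs, simplified to ints */
} DemoIndexTuple;

/*
 * Flatten the heap TIDs of all tuples into one malloc'd array.
 *
 * The array starts with room for one TID per tuple, which is enough when
 * there are no posting lists, and grows to at least double its size (or to
 * the exact amount needed, if that is larger) whenever a posting list does
 * not fit.  Returns NULL on allocation failure.
 */
static int *
demo_collect_tids(const DemoIndexTuple *tuples, int ntuples, int *nresult)
{
	int			capacity = (ntuples > 0) ? ntuples : 1;
	int			count = 0;
	int		   *out = malloc(sizeof(int) * (size_t) capacity);

	if (out == NULL)
		return NULL;

	for (int i = 0; i < ntuples; i++)
	{
		int			need = count + tuples[i].ntids;

		if (need > capacity)
		{
			int			newcap = (capacity * 2 > need) ? capacity * 2 : need;
			int		   *tmp = realloc(out, sizeof(int) * (size_t) newcap);

			if (tmp == NULL)
			{
				free(out);
				return NULL;
			}
			out = tmp;
			capacity = newcap;
		}

		for (int j = 0; j < tuples[i].ntids; j++)
			out[count++] = tuples[i].tids[j];
	}

	*nresult = count;
	return out;
}
```

Taking the larger of "double" and "exactly enough" mirrors the Max() call in the patch, so a single very long posting list cannot overflow the array after one doubling.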
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 4cfd5289ad..2cdc3d499f 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -97,6 +97,8 @@ static void btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
BTCycleId cycleid, TransactionId *oldestBtpoXact);
static void btvacuumpage(BTVacState *vstate, BlockNumber blkno,
BlockNumber orig_blkno);
+static ItemPointer btreevacuumposting(BTVacState *vstate, IndexTuple itup,
+ int *nremaining);
/*
@@ -160,7 +162,7 @@ btbuildempty(Relation index)
/* Construct metapage. */
metapage = (Page) palloc(BLCKSZ);
- _bt_initmetapage(metapage, P_NONE, 0);
+ _bt_initmetapage(metapage, P_NONE, 0, _bt_opclasses_support_dedup(index));
/*
* Write the page and log it. It might seem that an immediate sync would
@@ -263,8 +265,8 @@ btgettuple(IndexScanDesc scan, ScanDirection dir)
*/
if (so->killedItems == NULL)
so->killedItems = (int *)
- palloc(MaxIndexTuplesPerPage * sizeof(int));
- if (so->numKilled < MaxIndexTuplesPerPage)
+ palloc(MaxBTreeIndexTuplesPerPage * sizeof(int));
+ if (so->numKilled < MaxBTreeIndexTuplesPerPage)
so->killedItems[so->numKilled++] = so->currPos.itemIndex;
}
@@ -816,7 +818,7 @@ _bt_vacuum_needs_cleanup(IndexVacuumInfo *info)
}
else
{
- StdRdOptions *relopts;
+ BtreeOptions *relopts;
float8 cleanup_scale_factor;
float8 prev_num_heap_tuples;
@@ -827,7 +829,7 @@ _bt_vacuum_needs_cleanup(IndexVacuumInfo *info)
* tuples exceeds vacuum_cleanup_index_scale_factor fraction of
* original tuples count.
*/
- relopts = (StdRdOptions *) info->index->rd_options;
+ relopts = (BtreeOptions *) info->index->rd_options;
cleanup_scale_factor = (relopts &&
relopts->vacuum_cleanup_index_scale_factor >= 0)
? relopts->vacuum_cleanup_index_scale_factor
@@ -1069,7 +1071,8 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
RBM_NORMAL, info->strategy);
LockBufferForCleanup(buf);
_bt_checkpage(rel, buf);
- _bt_delitems_vacuum(rel, buf, NULL, 0, vstate.lastBlockVacuumed);
+ _bt_delitems_vacuum(rel, buf, NULL, 0, NULL, NULL, 0,
+ vstate.lastBlockVacuumed);
_bt_relbuf(rel, buf);
}
@@ -1188,8 +1191,17 @@ restart:
}
else if (P_ISLEAF(opaque))
{
+ /* Deletable item state */
OffsetNumber deletable[MaxOffsetNumber];
int ndeletable;
+ int nhtidsdead;
+ int nhtidslive;
+
+ /* Updatable item state (for posting lists) */
+ IndexTuple updated[MaxOffsetNumber];
+ OffsetNumber updatable[MaxOffsetNumber];
+ int nupdatable;
+
OffsetNumber offnum,
minoff,
maxoff;
@@ -1229,6 +1241,10 @@ restart:
* callback function.
*/
ndeletable = 0;
+ nupdatable = 0;
+ /* Maintain stats counters for index tuple versions/heap TIDs */
+ nhtidsdead = 0;
+ nhtidslive = 0;
minoff = P_FIRSTDATAKEY(opaque);
maxoff = PageGetMaxOffsetNumber(page);
if (callback)
@@ -1238,11 +1254,9 @@ restart:
offnum = OffsetNumberNext(offnum))
{
IndexTuple itup;
- ItemPointer htup;
itup = (IndexTuple) PageGetItem(page,
PageGetItemId(page, offnum));
- htup = &(itup->t_tid);
/*
* During Hot Standby we currently assume that
@@ -1265,8 +1279,71 @@ restart:
* applies to *any* type of index that marks index tuples as
* killed.
*/
- if (callback(htup, callback_state))
- deletable[ndeletable++] = offnum;
+ if (!BTreeTupleIsPosting(itup))
+ {
+ /* Regular tuple, standard heap TID representation */
+ ItemPointer htid = &(itup->t_tid);
+
+ if (callback(htid, callback_state))
+ {
+ deletable[ndeletable++] = offnum;
+ nhtidsdead++;
+ }
+ else
+ nhtidslive++;
+ }
+ else
+ {
+ ItemPointer newhtids;
+ int nremaining;
+
+ /*
+ * Posting list tuple, a physical tuple that represents
+ * two or more logical tuples, any of which could be an
+ * index row version that must be removed
+ */
+ newhtids = btreevacuumposting(vstate, itup, &nremaining);
+ if (newhtids == NULL)
+ {
+ /*
+ * All TIDs/logical tuples from the posting tuple
+ * remain, so no update or delete required
+ */
+ Assert(nremaining == BTreeTupleGetNPosting(itup));
+ }
+ else if (nremaining > 0)
+ {
+ IndexTuple updatedtuple;
+
+ /*
+ * Form new tuple that contains only remaining TIDs.
+ * Remember this tuple and the offset of the old tuple
+ * for when we update it in place
+ */
+ Assert(nremaining < BTreeTupleGetNPosting(itup));
+ updatedtuple = _bt_form_posting(itup, newhtids,
+ nremaining);
+ updated[nupdatable] = updatedtuple;
+ updatable[nupdatable++] = offnum;
+ nhtidsdead += BTreeTupleGetNPosting(itup) - nremaining;
+ pfree(newhtids);
+ }
+ else
+ {
+ /*
+ * All TIDs/logical tuples from the posting list must
+ * be deleted. We'll delete the physical tuple
+ * completely.
+ */
+ deletable[ndeletable++] = offnum;
+ nhtidsdead += BTreeTupleGetNPosting(itup);
+
+ /* Free empty array of live items */
+ pfree(newhtids);
+ }
+
+ nhtidslive += nremaining;
+ }
}
}
@@ -1274,7 +1351,7 @@ restart:
* Apply any needed deletes. We issue just one _bt_delitems_vacuum()
* call per page, so as to minimize WAL traffic.
*/
- if (ndeletable > 0)
+ if (ndeletable > 0 || nupdatable > 0)
{
/*
* Notice that the issued XLOG_BTREE_VACUUM WAL record includes
@@ -1290,7 +1367,8 @@ restart:
* doesn't seem worth the amount of bookkeeping it'd take to avoid
* that.
*/
- _bt_delitems_vacuum(rel, buf, deletable, ndeletable,
+ _bt_delitems_vacuum(rel, buf, deletable, ndeletable, updatable,
+ updated, nupdatable,
vstate->lastBlockVacuumed);
/*
@@ -1300,7 +1378,7 @@ restart:
if (blkno > vstate->lastBlockVacuumed)
vstate->lastBlockVacuumed = blkno;
- stats->tuples_removed += ndeletable;
+ stats->tuples_removed += nhtidsdead;
/* must recompute maxoff */
maxoff = PageGetMaxOffsetNumber(page);
}
@@ -1315,6 +1393,7 @@ restart:
* We treat this like a hint-bit update because there's no need to
* WAL-log it.
*/
+ Assert(nhtidsdead == 0);
if (vstate->cycleid != 0 &&
opaque->btpo_cycleid == vstate->cycleid)
{
@@ -1324,15 +1403,16 @@ restart:
}
/*
- * If it's now empty, try to delete; else count the live tuples. We
- * don't delete when recursing, though, to avoid putting entries into
+ * If it's now empty, try to delete; else count the live tuples (live
+ * heap TIDs in posting lists are counted as live tuples). We don't
+ * delete when recursing, though, to avoid putting entries into
* freePages out-of-order (doesn't seem worth any extra code to handle
* the case).
*/
if (minoff > maxoff)
delete_now = (blkno == orig_blkno);
else
- stats->num_index_tuples += maxoff - minoff + 1;
+ stats->num_index_tuples += nhtidslive;
}
if (delete_now)
@@ -1375,6 +1455,68 @@ restart:
}
}
+/*
+ * btreevacuumposting() -- determines which logical tuples must remain when
+ * VACUUMing a posting list tuple.
+ *
+ * Returns new palloc'd array of item pointers needed to build replacement
+ * posting list without the index row versions that are to be deleted.
+ *
+ * Note that returned array is NULL in the common case where there is nothing
+ * to delete in caller's posting list tuple. The number of TIDs that should
+ * remain in the posting list tuple is set for caller in *nremaining. This is
+ * also the size of the returned array (though only when array isn't just
+ * NULL).
+ */
+static ItemPointer
+btreevacuumposting(BTVacState *vstate, IndexTuple itup, int *nremaining)
+{
+ int live = 0;
+ int nitem = BTreeTupleGetNPosting(itup);
+ ItemPointer tmpitems = NULL,
+ items = BTreeTupleGetPosting(itup);
+
+ Assert(BTreeTupleIsPosting(itup));
+
+ /*
+ * Check each tuple in the posting list. Save live tuples into tmpitems,
+ * though try to avoid memory allocation as an optimization.
+ */
+ for (int i = 0; i < nitem; i++)
+ {
+ if (!vstate->callback(items + i, vstate->callback_state))
+ {
+ /*
+ * Live heap TID.
+ *
+ * Only save live TID when we know that we're going to have to
+ * kill at least one TID, and have already allocated memory.
+ */
+ if (tmpitems)
+ tmpitems[live] = items[i];
+ live++;
+ }
+
+ /* Dead heap TID */
+ else if (tmpitems == NULL)
+ {
+ /*
+ * Turns out we need to delete one or more dead heap TIDs, so
+ * start maintaining an array of live TIDs for caller to
+ * reconstruct a smaller replacement posting list tuple.
+ */
+ tmpitems = palloc(sizeof(ItemPointerData) * nitem);
+
+ /* Copy live heap TIDs from previous loop iterations */
+ if (live > 0)
+ memcpy(tmpitems, items, sizeof(ItemPointerData) * live);
+ }
+ }
+
+ *nremaining = live;
+ return tmpitems;
+}
+
/*
* btcanreturn() -- Check whether btree indexes support index-only scans.
*
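Reviewer note on btreevacuumposting() above: the allocate-on-first-dead-TID pattern can be tried outside the backend. The following is only a minimal sketch under stated assumptions. Tid, DeadCallback, filter_live_tids and odd_block_is_dead are hypothetical stand-ins (not names from the patch), and malloc() replaces palloc().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for ItemPointerData: (block, offset) */
typedef struct
{
	unsigned	block;
	unsigned short offset;
} Tid;

/* Hypothetical callback type: returns true when the TID is dead */
typedef bool (*DeadCallback) (const Tid *tid, void *state);

/*
 * Returns NULL when every TID is live (nothing to rewrite), otherwise a
 * malloc'd array holding the live TIDs in their original order.
 * *nremaining is set in both cases.
 */
static Tid *
filter_live_tids(const Tid *items, int nitem, DeadCallback is_dead,
				 void *state, int *nremaining)
{
	Tid		   *live_items = NULL;
	int			live = 0;

	for (int i = 0; i < nitem; i++)
	{
		if (!is_dead(&items[i], state))
		{
			/* Only copy once we know a rewrite is needed */
			if (live_items)
				live_items[live] = items[i];
			live++;
		}
		else if (live_items == NULL)
		{
			/* First dead TID: allocate, then back-fill the live prefix */
			live_items = malloc(sizeof(Tid) * nitem);
			assert(live_items != NULL);
			if (live > 0)
				memcpy(live_items, items, sizeof(Tid) * live);
		}
	}

	*nremaining = live;
	return live_items;
}

/* Example callback (assumption): TIDs on odd-numbered blocks are dead */
static bool
odd_block_is_dead(const Tid *tid, void *state)
{
	(void) state;
	return (tid->block % 2) == 1;
}
```

The back-fill memcpy is valid because every TID seen before the first dead one is live, so the live prefix is exactly the first "live" entries of the input array.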
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 8e512461a0..c954926f2d 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -26,10 +26,18 @@
static void _bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp);
static OffsetNumber _bt_binsrch(Relation rel, BTScanInsert key, Buffer buf);
+static int _bt_binsrch_posting(BTScanInsert key, Page page,
+ OffsetNumber offnum);
static bool _bt_readpage(IndexScanDesc scan, ScanDirection dir,
OffsetNumber offnum);
static void _bt_saveitem(BTScanOpaque so, int itemIndex,
OffsetNumber offnum, IndexTuple itup);
+static void _bt_setuppostingitems(BTScanOpaque so, int itemIndex,
+ OffsetNumber offnum, ItemPointer heapTid,
+ IndexTuple itup);
+static inline void _bt_savepostingitem(BTScanOpaque so, int itemIndex,
+ OffsetNumber offnum,
+ ItemPointer heapTid);
static bool _bt_steppage(IndexScanDesc scan, ScanDirection dir);
static bool _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir);
static bool _bt_parallel_readpage(IndexScanDesc scan, BlockNumber blkno,
@@ -434,7 +442,10 @@ _bt_binsrch(Relation rel,
* low) makes bounds invalid.
*
* Caller is responsible for invalidating bounds when it modifies the page
- * before calling here a second time.
+ * before calling here a second time, and for dealing with posting list
+ * tuple matches (callers can use insertstate's postingoff field to
+ * determine which existing heap TID will need to be replaced by their
+ * scantid/new heap TID).
*/
OffsetNumber
_bt_binsrch_insert(Relation rel, BTInsertState insertstate)
@@ -453,6 +464,7 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
Assert(P_ISLEAF(opaque));
Assert(!key->nextkey);
+ Assert(insertstate->postingoff == 0);
if (!insertstate->bounds_valid)
{
@@ -509,6 +521,16 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
if (result != 0)
stricthigh = high;
}
+
+ /*
+ * If tuple at offset located by binary search is a posting list whose
+ * TID range overlaps with caller's scantid, perform posting list
+ * binary search to set postingoff for caller. Caller must split the
+ * posting list when postingoff is set. This should happen
+ * infrequently.
+ */
+ if (unlikely(result == 0 && key->scantid != NULL))
+ insertstate->postingoff = _bt_binsrch_posting(key, page, mid);
}
/*
@@ -528,6 +550,68 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
return low;
}
+/*----------
+ * _bt_binsrch_posting() -- posting list binary search.
+ *
+ * Returns offset into posting list where caller's scantid belongs.
+ *----------
+ */
+static int
+_bt_binsrch_posting(BTScanInsert key, Page page, OffsetNumber offnum)
+{
+ IndexTuple itup;
+ ItemId itemid;
+ int low,
+ high,
+ mid,
+ res;
+
+ /*
+ * If this isn't a posting tuple, then the index must be corrupt (if it is
+ * an ordinary non-pivot tuple then there must be an existing tuple with a
+ * heap TID that equals inserter's new heap TID/scantid). Defensively
+ * check that tuple is a posting list tuple whose posting list range
+ * includes caller's scantid.
+ *
+ * (This is also needed because contrib/amcheck's rootdescend option needs
+ * to be able to relocate a non-pivot tuple using _bt_binsrch_insert().)
+ */
+ itemid = PageGetItemId(page, offnum);
+ itup = (IndexTuple) PageGetItem(page, itemid);
+ if (!BTreeTupleIsPosting(itup))
+ return 0;
+
+ /*
+ * In the unlikely event that posting list tuple has LP_DEAD bit set,
+ * signal to caller that it should kill the item and restart its binary
+ * search.
+ */
+ if (ItemIdIsDead(itemid))
+ return -1;
+
+ /* "high" is past end of posting list for loop invariant */
+ low = 0;
+ high = BTreeTupleGetNPosting(itup);
+ Assert(high >= 2);
+
+ while (high > low)
+ {
+ mid = low + ((high - low) / 2);
+ res = ItemPointerCompare(key->scantid,
+ BTreeTupleGetPostingN(itup, mid));
+
+ if (res > 0)
+ low = mid + 1;
+ else if (res < 0)
+ high = mid;
+ else
+ return mid;
+ }
+
+ /* Exact match not found */
+ return low;
+}
+
/*----------
* _bt_compare() -- Compare insertion-type scankey to tuple on a page.
*
@@ -537,9 +621,14 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
* <0 if scankey < tuple at offnum;
* 0 if scankey == tuple at offnum;
* >0 if scankey > tuple at offnum.
- * NULLs in the keys are treated as sortable values. Therefore
- * "equality" does not necessarily mean that the item should be
- * returned to the caller as a matching key!
+ *
+ * NULLs in the keys are treated as sortable values. Therefore
+ * "equality" does not necessarily mean that the item should be returned
+ * to the caller as a matching key. Similarly, an insertion scankey
+ * with its scantid set is treated as equal to a posting tuple whose TID
+ * range contains that scantid. There generally won't be a matching
+ * TID in the posting tuple, which the caller must handle itself
+ * (e.g., by splitting the posting list tuple).
*
* CRUCIAL NOTE: on a non-leaf page, the first data key is assumed to be
* "minus infinity": this routine will always claim it is less than the
@@ -563,6 +652,7 @@ _bt_compare(Relation rel,
ScanKey scankey;
int ncmpkey;
int ntupatts;
+ int32 result;
Assert(_bt_check_natts(rel, key->heapkeyspace, page, offnum));
Assert(key->keysz <= IndexRelationGetNumberOfKeyAttributes(rel));
@@ -597,7 +687,6 @@ _bt_compare(Relation rel,
{
Datum datum;
bool isNull;
- int32 result;
datum = index_getattr(itup, scankey->sk_attno, itupdesc, &isNull);
@@ -713,8 +802,25 @@ _bt_compare(Relation rel,
if (heapTid == NULL)
return 1;
+ /*
+ * scankey must be treated as equal to a posting list tuple if its scantid
+ * value falls within the range of the posting list. In all other cases
+ * there can only be a single heap TID value, which is compared directly
+ * as a simple scalar value.
+ */
Assert(ntupatts >= IndexRelationGetNumberOfKeyAttributes(rel));
- return ItemPointerCompare(key->scantid, heapTid);
+ result = ItemPointerCompare(key->scantid, heapTid);
+ if (result <= 0 || !BTreeTupleIsPosting(itup))
+ return result;
+ else
+ {
+ result = ItemPointerCompare(key->scantid,
+ BTreeTupleGetMaxHeapTID(itup));
+ if (result > 0)
+ return 1;
+ }
+
+ return 0;
}
/*
@@ -1230,6 +1336,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
/* Initialize remaining insertion scan key fields */
inskey.heapkeyspace = _bt_heapkeyspace(rel);
+ inskey.safededup = false; /* unused */
inskey.anynullkeys = false; /* unused */
inskey.nextkey = nextkey;
inskey.pivotsearch = false;
@@ -1451,6 +1558,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
/* initialize tuple workspace to empty */
so->currPos.nextTupleOffset = 0;
+ so->currPos.postingTupleOffset = 0;
/*
* Now that the current page has been made consistent, the macro should be
@@ -1484,9 +1592,31 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
if (_bt_checkkeys(scan, itup, indnatts, dir, &continuescan))
{
- /* tuple passes all scan key conditions, so remember it */
- _bt_saveitem(so, itemIndex, offnum, itup);
- itemIndex++;
+ /* tuple passes all scan key conditions */
+ if (!BTreeTupleIsPosting(itup))
+ {
+ /* Remember it */
+ _bt_saveitem(so, itemIndex, offnum, itup);
+ itemIndex++;
+ }
+ else
+ {
+ /*
+ * Set up state to return posting list, and remember first
+ * "logical" tuple
+ */
+ _bt_setuppostingitems(so, itemIndex, offnum,
+ BTreeTupleGetPostingN(itup, 0),
+ itup);
+ itemIndex++;
+ /* Remember additional logical tuples */
+ for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
+ {
+ _bt_savepostingitem(so, itemIndex, offnum,
+ BTreeTupleGetPostingN(itup, i));
+ itemIndex++;
+ }
+ }
}
/* When !continuescan, there can't be any more matches, so stop */
if (!continuescan)
@@ -1519,7 +1649,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
if (!continuescan)
so->currPos.moreRight = false;
- Assert(itemIndex <= MaxIndexTuplesPerPage);
+ Assert(itemIndex <= MaxBTreeIndexTuplesPerPage);
so->currPos.firstItem = 0;
so->currPos.lastItem = itemIndex - 1;
so->currPos.itemIndex = 0;
@@ -1527,7 +1657,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
else
{
/* load items[] in descending order */
- itemIndex = MaxIndexTuplesPerPage;
+ itemIndex = MaxBTreeIndexTuplesPerPage;
offnum = Min(offnum, maxoff);
@@ -1568,9 +1698,37 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
&continuescan);
if (passes_quals && tuple_alive)
{
- /* tuple passes all scan key conditions, so remember it */
- itemIndex--;
- _bt_saveitem(so, itemIndex, offnum, itup);
+ /* tuple passes all scan key conditions */
+ if (!BTreeTupleIsPosting(itup))
+ {
+ /* Remember it */
+ itemIndex--;
+ _bt_saveitem(so, itemIndex, offnum, itup);
+ }
+ else
+ {
+ int i = BTreeTupleGetNPosting(itup) - 1;
+
+ /*
+ * Set up state to return posting list, and remember last
+ * "logical" tuple (since we'll return it first)
+ */
+ itemIndex--;
+ _bt_setuppostingitems(so, itemIndex, offnum,
+ BTreeTupleGetPostingN(itup, i--),
+ itup);
+
+ /*
+ * Remember additional logical tuples (use desc order to
+ * be consistent with order of entire scan)
+ */
+ for (; i >= 0; i--)
+ {
+ itemIndex--;
+ _bt_savepostingitem(so, itemIndex, offnum,
+ BTreeTupleGetPostingN(itup, i));
+ }
+ }
}
if (!continuescan)
{
@@ -1584,8 +1742,8 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
Assert(itemIndex >= 0);
so->currPos.firstItem = itemIndex;
- so->currPos.lastItem = MaxIndexTuplesPerPage - 1;
- so->currPos.itemIndex = MaxIndexTuplesPerPage - 1;
+ so->currPos.lastItem = MaxBTreeIndexTuplesPerPage - 1;
+ so->currPos.itemIndex = MaxBTreeIndexTuplesPerPage - 1;
}
return (so->currPos.firstItem <= so->currPos.lastItem);
@@ -1598,6 +1756,8 @@ _bt_saveitem(BTScanOpaque so, int itemIndex,
{
BTScanPosItem *currItem = &so->currPos.items[itemIndex];
+ Assert(!BTreeTupleIsPosting(itup));
+
currItem->heapTid = itup->t_tid;
currItem->indexOffset = offnum;
if (so->currTuples)
@@ -1610,6 +1770,64 @@ _bt_saveitem(BTScanOpaque so, int itemIndex,
}
}
+/*
+ * Set up state to save posting items from a single posting list tuple.
+ *
+ * Saves an index item into so->currPos.items[itemIndex] for the logical
+ * tuple that is returned to the scan first. The second and subsequent heap
+ * TIDs from the posting list should be saved by calling
+ * _bt_savepostingitem().
+ */
+static void
+_bt_setuppostingitems(BTScanOpaque so, int itemIndex, OffsetNumber offnum,
+ ItemPointer heapTid, IndexTuple itup)
+{
+ BTScanPosItem *currItem = &so->currPos.items[itemIndex];
+
+ currItem->heapTid = *heapTid;
+ currItem->indexOffset = offnum;
+
+ if (so->currTuples)
+ {
+ /* Save base IndexTuple (truncate posting list) */
+ IndexTuple base;
+ Size itupsz = BTreeTupleGetPostingOffset(itup);
+
+ itupsz = MAXALIGN(itupsz);
+ currItem->tupleOffset = so->currPos.nextTupleOffset;
+ base = (IndexTuple) (so->currTuples + so->currPos.nextTupleOffset);
+ memcpy(base, itup, itupsz);
+ /* Defensively reduce work area index tuple header size */
+ base->t_info &= ~INDEX_SIZE_MASK;
+ base->t_info |= itupsz;
+ so->currPos.nextTupleOffset += itupsz;
+ so->currPos.postingTupleOffset = currItem->tupleOffset;
+ }
+}
+
+/*
+ * Save an index item into so->currPos.items[itemIndex] for posting tuple.
+ *
+ * Assumes that _bt_setuppostingitems() has already been called for current
+ * posting list tuple.
+ */
+static inline void
+_bt_savepostingitem(BTScanOpaque so, int itemIndex, OffsetNumber offnum,
+ ItemPointer heapTid)
+{
+ BTScanPosItem *currItem = &so->currPos.items[itemIndex];
+
+ currItem->heapTid = *heapTid;
+ currItem->indexOffset = offnum;
+
+ /*
+ * Have index-only scans return the same base IndexTuple for every logical
+ * tuple that originates from the same posting list
+ */
+ if (so->currTuples)
+ currItem->tupleOffset = so->currPos.postingTupleOffset;
+}
+
/*
* _bt_steppage() -- Step to next page containing valid data for scan
*
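Reviewer note on the nbtsearch.c changes above: the two-step lookup (range comparison in _bt_compare(), then the search in _bt_binsrch_posting()) can be illustrated in isolation. This is a sketch under assumptions, not the patch's code. Tid, tid_compare, compare_to_posting_range and posting_binsrch are hypothetical names, and a plain sorted array stands in for the posting list stored in an index tuple.

```c
#include <assert.h>

/* Hypothetical stand-in for ItemPointerData: (block, offset) */
typedef struct
{
	unsigned	block;
	unsigned short offset;
} Tid;

/* Three-way comparison, block number first, then offset */
static int
tid_compare(const Tid *a, const Tid *b)
{
	if (a->block != b->block)
		return (a->block < b->block) ? -1 : 1;
	if (a->offset != b->offset)
		return (a->offset < b->offset) ? -1 : 1;
	return 0;
}

/*
 * Treat scantid as "equal" to the posting list when it falls within
 * [first TID, last TID]; otherwise report which side it lies on.
 */
static int
compare_to_posting_range(const Tid *scantid, const Tid *posting, int nposting)
{
	int			result;

	assert(nposting >= 1);
	result = tid_compare(scantid, &posting[0]);
	if (result <= 0)
		return result;
	if (tid_compare(scantid, &posting[nposting - 1]) > 0)
		return 1;
	return 0;
}

/*
 * Returns the offset into the posting list where scantid belongs: the
 * index of an exact match, or of the first TID greater than scantid.
 */
static int
posting_binsrch(const Tid *scantid, const Tid *posting, int nposting)
{
	int			low = 0;
	int			high = nposting;	/* one past the end (loop invariant) */

	while (high > low)
	{
		int			mid = low + (high - low) / 2;
		int			res = tid_compare(scantid, &posting[mid]);

		if (res > 0)
			low = mid + 1;
		else if (res < 0)
			high = mid;
		else
			return mid;
	}

	return low;
}
```

With posting list (1,1), (1,5), (3,2), (7,1) and scantid (2,4), the range comparison reports equality and the search returns offset 2, which is where an inserter would have to split the list.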
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index fc7d43a0f3..ad961c305f 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -243,6 +243,7 @@ typedef struct BTPageState
BlockNumber btps_blkno; /* block # to write this page at */
IndexTuple btps_lowkey; /* page's strict lower bound pivot tuple */
OffsetNumber btps_lastoff; /* last item offset loaded */
+ Size btps_lastextra; /* last item's extra posting list space */
uint32 btps_level; /* tree level (0 = leaf) */
Size btps_full; /* "full" if less than this much free space */
struct BTPageState *btps_next; /* link to parent level, if any */
@@ -277,7 +278,10 @@ static void _bt_slideleft(Page page);
static void _bt_sortaddtup(Page page, Size itemsize,
IndexTuple itup, OffsetNumber itup_off);
static void _bt_buildadd(BTWriteState *wstate, BTPageState *state,
- IndexTuple itup);
+ IndexTuple itup, Size truncextra);
+static void _bt_sort_dedup_finish_pending(BTWriteState *wstate,
+ BTPageState *state,
+ BTDedupState *dstate);
static void _bt_uppershutdown(BTWriteState *wstate, BTPageState *state);
static void _bt_load(BTWriteState *wstate,
BTSpool *btspool, BTSpool *btspool2);
@@ -711,13 +715,14 @@ _bt_pagestate(BTWriteState *wstate, uint32 level)
state->btps_lowkey = NULL;
/* initialize lastoff so first item goes into P_FIRSTKEY */
state->btps_lastoff = P_HIKEY;
+ state->btps_lastextra = 0;
state->btps_level = level;
/* set "full" threshold based on level. See notes at head of file. */
if (level > 0)
state->btps_full = (BLCKSZ * (100 - BTREE_NONLEAF_FILLFACTOR) / 100);
else
- state->btps_full = RelationGetTargetPageFreeSpace(wstate->index,
- BTREE_DEFAULT_FILLFACTOR);
+ state->btps_full = BtreeGetTargetPageFreeSpace(wstate->index,
+ BTREE_DEFAULT_FILLFACTOR);
/* no parent level, yet */
state->btps_next = NULL;
@@ -789,7 +794,8 @@ _bt_sortaddtup(Page page,
}
/*----------
- * Add an item to a disk page from the sort output.
+ * Add an item to a disk page from the sort output (or add a posting list
+ * item formed from the sort output).
*
* We must be careful to observe the page layout conventions of nbtsearch.c:
* - rightmost pages start data items at P_HIKEY instead of at P_FIRSTKEY.
@@ -821,14 +827,27 @@ _bt_sortaddtup(Page page,
* the truncated high key at offset 1.
*
* 'last' pointer indicates the last offset added to the page.
+ *
+ * 'truncextra' is the size of the posting list in itup, if any. This
+ * information is stashed for the next call here, when we may benefit
+ * from considering the impact of truncating away the posting list on
+ * the page before deciding to finish the page off. Posting lists are
+ * often relatively large, so it is worth going to the trouble of
+ * accounting for the saving from truncating away the posting list of
+ * the tuple that becomes the high key (that may be the only way to
+ * get close to target free space on the page). Note that this is
+ * only used for the soft fillfactor-wise limit, not the critical hard
+ * limit.
*----------
*/
static void
-_bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
+_bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup,
+ Size truncextra)
{
Page npage;
BlockNumber nblkno;
OffsetNumber last_off;
+ Size last_truncextra;
Size pgspc;
Size itupsz;
bool isleaf;
@@ -842,6 +861,8 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
npage = state->btps_page;
nblkno = state->btps_blkno;
last_off = state->btps_lastoff;
+ last_truncextra = state->btps_lastextra;
+ state->btps_lastextra = truncextra;
pgspc = PageGetFreeSpace(npage);
itupsz = IndexTupleSize(itup);
@@ -883,10 +904,10 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
* page. Disregard fillfactor and insert on "full" current page if we
* don't have the minimum number of items yet. (Note that we deliberately
* assume that suffix truncation neither enlarges nor shrinks new high key
- * when applying soft limit.)
+ * when applying soft limit, except when last tuple had a posting list.)
*/
if (pgspc < itupsz + (isleaf ? MAXALIGN(sizeof(ItemPointerData)) : 0) ||
- (pgspc < state->btps_full && last_off > P_FIRSTKEY))
+ (pgspc + last_truncextra < state->btps_full && last_off > P_FIRSTKEY))
{
/*
* Finish off the page and write it out.
@@ -944,11 +965,11 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
* We don't try to bias our choice of split point to make it more
* likely that _bt_truncate() can truncate away more attributes,
* whereas the split point used within _bt_split() is chosen much
- * more delicately. Suffix truncation is mostly useful because it
- * improves space utilization for workloads with random
- * insertions. It doesn't seem worthwhile to add logic for
- * choosing a split point here for a benefit that is bound to be
- * much smaller.
+ * more delicately. On the other hand, non-unique index builds
+ * usually deduplicate, which often results in every "physical"
+ * tuple on the page having distinct key values. When that
+ * happens, _bt_truncate() will never need to include a heap TID
+ * in the new high key.
*
* Overwrite the old item with new truncated high key directly.
* oitup is already located at the physical beginning of tuple
@@ -983,7 +1004,7 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
Assert(BTreeTupleGetNAtts(state->btps_lowkey, wstate->index) == 0 ||
!P_LEFTMOST((BTPageOpaque) PageGetSpecialPointer(opage)));
BTreeInnerTupleSetDownLink(state->btps_lowkey, oblkno);
- _bt_buildadd(wstate, state->btps_next, state->btps_lowkey);
+ _bt_buildadd(wstate, state->btps_next, state->btps_lowkey, 0);
pfree(state->btps_lowkey);
/*
@@ -1045,6 +1066,47 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
state->btps_lastoff = last_off;
}
+/*
+ * Finalize pending posting list tuple, and add it to the index. Final tuple
+ * is based on saved base tuple, and saved list of heap TIDs.
+ *
+ * This is almost like _bt_dedup_finish_pending(), but it adds a new tuple
+ * using _bt_buildadd() and does not maintain the intervals array.
+ */
+static void
+_bt_sort_dedup_finish_pending(BTWriteState *wstate, BTPageState *state,
+ BTDedupState *dstate)
+{
+ IndexTuple final;
+ Size truncextra;
+
+ Assert(dstate->nitems > 0);
+ truncextra = 0;
+ if (dstate->nitems == 1)
+ final = dstate->base;
+ else
+ {
+ IndexTuple postingtuple;
+
+ /* form a tuple with a posting list */
+ postingtuple = _bt_form_posting(dstate->base,
+ dstate->htids,
+ dstate->nhtids);
+ final = postingtuple;
+ /* Determine size of posting list */
+ truncextra = IndexTupleSize(final) -
+ BTreeTupleGetPostingOffset(final);
+ }
+
+ _bt_buildadd(wstate, state, final, truncextra);
+
+ if (dstate->nitems > 1)
+ pfree(final);
+ /* Don't maintain dedup_intervals array, or alltupsize */
+ dstate->nhtids = 0;
+ dstate->nitems = 0;
+}
+
/*
* Finish writing out the completed btree.
*/
@@ -1090,7 +1152,7 @@ _bt_uppershutdown(BTWriteState *wstate, BTPageState *state)
Assert(BTreeTupleGetNAtts(s->btps_lowkey, wstate->index) == 0 ||
!P_LEFTMOST(opaque));
BTreeInnerTupleSetDownLink(s->btps_lowkey, blkno);
- _bt_buildadd(wstate, s->btps_next, s->btps_lowkey);
+ _bt_buildadd(wstate, s->btps_next, s->btps_lowkey, 0);
pfree(s->btps_lowkey);
s->btps_lowkey = NULL;
}
@@ -1111,7 +1173,8 @@ _bt_uppershutdown(BTWriteState *wstate, BTPageState *state)
* by filling in a valid magic number in the metapage.
*/
metapage = (Page) palloc(BLCKSZ);
- _bt_initmetapage(metapage, rootblkno, rootlevel);
+ _bt_initmetapage(metapage, rootblkno, rootlevel,
+ wstate->inskey->safededup);
_bt_blwritepage(wstate, metapage, BTREE_METAPAGE);
}
@@ -1132,6 +1195,10 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
keysz = IndexRelationGetNumberOfKeyAttributes(wstate->index);
SortSupport sortKeys;
int64 tuples_done = 0;
+ bool deduplicate;
+
+ deduplicate = wstate->inskey->safededup &&
+ BtreeGetDoDedupOption(wstate->index);
if (merge)
{
@@ -1228,12 +1295,12 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
if (load1)
{
- _bt_buildadd(wstate, state, itup);
+ _bt_buildadd(wstate, state, itup, 0);
itup = tuplesort_getindextuple(btspool->sortstate, true);
}
else
{
- _bt_buildadd(wstate, state, itup2);
+ _bt_buildadd(wstate, state, itup2, 0);
itup2 = tuplesort_getindextuple(btspool2->sortstate, true);
}
@@ -1243,9 +1310,113 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
}
pfree(sortKeys);
}
+ else if (deduplicate)
+ {
+ /* merge is unnecessary, deduplicate into posting lists */
+ BTDedupState *dstate;
+ IndexTuple newbase;
+
+ dstate = (BTDedupState *) palloc(sizeof(BTDedupState));
+ dstate->maxitemsize = 0; /* set later */
+ dstate->checkingunique = false; /* unused */
+ dstate->skippedbase = InvalidOffsetNumber;
+ dstate->newitem = NULL;
+ /* Metadata about current pending posting list */
+ dstate->htids = NULL;
+ dstate->nhtids = 0;
+ dstate->nitems = 0;
+ dstate->overlap = false;
+ dstate->alltupsize = 0; /* unused */
+ /* Metadata about base tuple of current pending posting list */
+ dstate->base = NULL;
+ dstate->baseoff = InvalidOffsetNumber; /* unused */
+ dstate->basetupsize = 0;
+
+ while ((itup = tuplesort_getindextuple(btspool->sortstate,
+ true)) != NULL)
+ {
+ /* When we see first tuple, create first index page */
+ if (state == NULL)
+ {
+ state = _bt_pagestate(wstate, 0);
+
+ /*
+ * Limit size of posting list tuples to the size of the free
+ * space we want to leave behind on the page, plus space for
+ * final item's line pointer (but make sure that posting list
+ * tuple size won't exceed the generic 1/3 of a page limit).
+ *
+ * This is more conservative than the approach taken in the
+ * retail insert path, but it allows us to get most of the
+ * space savings deduplication provides without noticeably
+ * impacting how much free space is left behind on each leaf
+ * page.
+ */
+ dstate->maxitemsize =
+ Min(BTMaxItemSize(state->btps_page),
+ MAXALIGN_DOWN(state->btps_full) - sizeof(ItemIdData));
+ /* Minimum posting tuple size used here is arbitrary: */
+ dstate->maxitemsize = Max(dstate->maxitemsize, 100);
+ dstate->htids = palloc(dstate->maxitemsize);
+
+ /*
+ * No previous/base tuple, since itup is the first item
+ * returned by the tuplesort -- use itup as base tuple of
+ * first pending posting list for entire index build
+ */
+ newbase = CopyIndexTuple(itup);
+ _bt_dedup_start_pending(dstate, newbase, InvalidOffsetNumber);
+ }
+ else if (_bt_keep_natts_fast(wstate->index, dstate->base,
+ itup) > keysz &&
+ _bt_dedup_save_htid(dstate, itup))
+ {
+ /*
+ * Tuple is equal to base tuple of pending posting list, and
+ * merging itup into pending posting list won't exceed the
+ * maxitemsize limit. Heap TID(s) for itup have been saved in
+ * state. The next iteration will also end up here if it's
+ * possible to merge the next tuple into the same pending
+ * posting list.
+ */
+ }
+ else
+ {
+ /*
+ * Tuple is not equal to pending posting list tuple, or
+ * maxitemsize limit was reached
+ */
+ _bt_sort_dedup_finish_pending(wstate, state, dstate);
+ /* Base tuple is always a copy */
+ pfree(dstate->base);
+
+ /* itup starts new pending posting list */
+ newbase = CopyIndexTuple(itup);
+ _bt_dedup_start_pending(dstate, newbase, InvalidOffsetNumber);
+ }
+
+ /* Report progress */
+ pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
+ ++tuples_done);
+ }
+
+ /*
+ * Handle the last item (there must be a last item when the tuplesort
+ * returned one or more tuples)
+ */
+ if (state)
+ {
+ _bt_sort_dedup_finish_pending(wstate, state, dstate);
+ /* Base tuple is always a copy */
+ pfree(dstate->base);
+ pfree(dstate->htids);
+ }
+
+ pfree(dstate);
+ }
else
{
- /* merge is unnecessary */
+ /* merging and deduplication are both unnecessary */
while ((itup = tuplesort_getindextuple(btspool->sortstate,
true)) != NULL)
{
@@ -1253,7 +1424,7 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
if (state == NULL)
state = _bt_pagestate(wstate, 0);
- _bt_buildadd(wstate, state, itup);
+ _bt_buildadd(wstate, state, itup, 0);
/* Report progress */
pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
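Reviewer note on the deduplicating branch of _bt_load() above: the loop merges each run of equal keys from the sorted input into one pending posting list and starts a new one when the key changes or the size cap is hit. The sketch below shows only that grouping logic. It makes simplifying assumptions that differ from the patch: SortedItem, PendingGroup and dedup_sorted are hypothetical names, keys are plain ints, and the cap is a TID count (max_tids) rather than a byte limit like maxitemsize.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sorted input item: a key plus one heap TID (as an integer) */
typedef struct
{
	int			key;
	unsigned	tid;
} SortedItem;

/* Hypothetical output: one group corresponds to one physical index tuple */
typedef struct
{
	int			key;
	size_t		first;			/* index of the group's first input item */
	size_t		ntids;			/* number of TIDs merged into the group */
} PendingGroup;

/*
 * Groups a key-sorted input into runs of equal keys, capping each run at
 * max_tids.  Caller must provide room for nitems groups in out[].
 * Returns the number of groups written.
 */
static size_t
dedup_sorted(const SortedItem *items, size_t nitems, size_t max_tids,
			 PendingGroup *out)
{
	size_t		ngroups = 0;

	assert(max_tids >= 1);

	for (size_t i = 0; i < nitems; i++)
	{
		if (ngroups > 0 &&
			out[ngroups - 1].key == items[i].key &&
			out[ngroups - 1].ntids < max_tids)
		{
			/* Same key and still under the cap: merge into pending group */
			out[ngroups - 1].ntids++;
		}
		else
		{
			/* Key changed or cap reached: item starts a new group */
			out[ngroups].key = items[i].key;
			out[ngroups].first = i;
			out[ngroups].ntids = 1;
			ngroups++;
		}
	}

	return ngroups;
}
```

A group with ntids == 1 corresponds to the case where _bt_sort_dedup_finish_pending() passes the base tuple through unchanged instead of forming a posting list.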
diff --git a/src/backend/access/nbtree/nbtsplitloc.c b/src/backend/access/nbtree/nbtsplitloc.c
index a04d4e25d6..8078522b5c 100644
--- a/src/backend/access/nbtree/nbtsplitloc.c
+++ b/src/backend/access/nbtree/nbtsplitloc.c
@@ -51,6 +51,7 @@ typedef struct
Size newitemsz; /* size of newitem (includes line pointer) */
bool is_leaf; /* T if splitting a leaf page */
bool is_rightmost; /* T if splitting rightmost page on level */
+ bool is_deduped; /* T if posting list truncation expected */
OffsetNumber newitemoff; /* where the new item is to be inserted */
int leftspace; /* space available for items on left page */
int rightspace; /* space available for items on right page */
@@ -167,7 +168,7 @@ _bt_findsplitloc(Relation rel,
/* Count up total space in data items before actually scanning 'em */
olddataitemstotal = rightspace - (int) PageGetExactFreeSpace(page);
- leaffillfactor = RelationGetFillFactor(rel, BTREE_DEFAULT_FILLFACTOR);
+ leaffillfactor = BtreeGetFillFactor(rel, BTREE_DEFAULT_FILLFACTOR);
/* Passed-in newitemsz is MAXALIGNED but does not include line pointer */
newitemsz += sizeof(ItemIdData);
@@ -177,12 +178,16 @@ _bt_findsplitloc(Relation rel,
state.newitemsz = newitemsz;
state.is_leaf = P_ISLEAF(opaque);
state.is_rightmost = P_RIGHTMOST(opaque);
+ state.is_deduped = state.is_leaf && BtreeGetDoDedupOption(rel);
state.leftspace = leftspace;
state.rightspace = rightspace;
state.olddataitemstotal = olddataitemstotal;
state.minfirstrightsz = SIZE_MAX;
state.newitemoff = newitemoff;
+ /* newitem cannot be a posting list item */
+ Assert(!BTreeTupleIsPosting(newitem));
+
/*
* maxsplits should never exceed maxoff because there will be at most as
* many candidate split points as there are points _between_ tuples, once
@@ -459,6 +464,7 @@ _bt_recsplitloc(FindSplitData *state,
int16 leftfree,
rightfree;
Size firstrightitemsz;
+ Size postingsz = 0;
bool newitemisfirstonright;
/* Is the new item going to be the first item on the right page? */
@@ -468,8 +474,31 @@ _bt_recsplitloc(FindSplitData *state,
if (newitemisfirstonright)
firstrightitemsz = state->newitemsz;
else
+ {
firstrightitemsz = firstoldonrightsz;
+ /*
+ * Calculate suffix truncation space saving when firstright is a
+ * posting list tuple.
+ *
+ * Individual posting lists often take up a significant fraction of
+ * all space on a page. Failing to consider that the new high key
+ * won't need to store the posting list a second time really matters.
+ */
+ if (state->is_leaf && state->is_deduped)
+ {
+ ItemId itemid;
+ IndexTuple newhighkey;
+
+ itemid = PageGetItemId(state->page, firstoldonright);
+ newhighkey = (IndexTuple) PageGetItem(state->page, itemid);
+
+ if (BTreeTupleIsPosting(newhighkey))
+ postingsz = IndexTupleSize(newhighkey) -
+ BTreeTupleGetPostingOffset(newhighkey);
+ }
+ }
+
/* Account for all the old tuples */
leftfree = state->leftspace - olddataitemstoleft;
rightfree = state->rightspace -
@@ -492,9 +521,11 @@ _bt_recsplitloc(FindSplitData *state,
* adding a heap TID to the left half's new high key when splitting at the
* leaf level. In practice the new high key will often be smaller and
* will rarely be larger, but conservatively assume the worst case.
+ * Truncation always truncates away any posting list that appears in the
+ * first right tuple, though, so it's safe to subtract that overhead.
*/
if (state->is_leaf)
- leftfree -= (int16) (firstrightitemsz +
+ leftfree -= (int16) ((firstrightitemsz - postingsz) +
MAXALIGN(sizeof(ItemPointerData)));
else
leftfree -= (int16) firstrightitemsz;
@@ -691,7 +722,8 @@ _bt_afternewitemoff(FindSplitData *state, OffsetNumber maxoff,
itemid = PageGetItemId(state->page, OffsetNumberPrev(state->newitemoff));
tup = (IndexTuple) PageGetItem(state->page, itemid);
/* Do cheaper test first */
- if (!_bt_adjacenthtid(&tup->t_tid, &state->newitem->t_tid))
+ if (BTreeTupleIsPosting(tup) ||
+ !_bt_adjacenthtid(&tup->t_tid, &state->newitem->t_tid))
return false;
/* Check same conditions as rightmost item case, too */
keepnatts = _bt_keep_natts_fast(state->rel, tup, state->newitem);
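Reviewer note on the _bt_recsplitloc() change above: the effect of subtracting postingsz is easiest to see with numbers. The helper below is a sketch of the arithmetic only. left_free_after_highkey and TID_SIZE_ALIGNED are hypothetical names, and the value 8 assumes a build where MAXALIGN(sizeof(ItemPointerData)) is 8 bytes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumption: MAXALIGN(sizeof(ItemPointerData)) == 8 on this build */
#define TID_SIZE_ALIGNED 8

/*
 * Free space left on the left page once the new high key is charged.
 * On leaf pages the high key is assumed to need a heap TID (worst case),
 * but never the posting list of the first right tuple.
 */
static int
left_free_after_highkey(int leftfree, size_t firstrightitemsz,
						size_t postingsz, bool is_leaf)
{
	assert(postingsz <= firstrightitemsz);

	if (is_leaf)
		return leftfree - (int) ((firstrightitemsz - postingsz) +
								 TID_SIZE_ALIGNED);

	return leftfree - (int) firstrightitemsz;
}
```

For a 400-byte first right tuple of which 384 bytes are posting list, the high key is charged 16 + 8 = 24 bytes instead of 408, so a split point that previously looked infeasible or lopsided can now be considered.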
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 7669a1a66f..ac8e403635 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -20,6 +20,7 @@
#include "access/nbtree.h"
#include "access/reloptions.h"
#include "access/relscan.h"
+#include "catalog/catalog.h"
#include "commands/progress.h"
#include "lib/qunique.h"
#include "miscadmin.h"
@@ -98,8 +99,6 @@ _bt_mkscankey(Relation rel, IndexTuple itup)
indoption = rel->rd_indoption;
tupnatts = itup ? BTreeTupleGetNAtts(itup, rel) : 0;
- Assert(tupnatts <= IndexRelationGetNumberOfAttributes(rel));
-
/*
* We'll execute search using scan key constructed on key columns.
* Truncated attributes and non-key attributes are omitted from the final
@@ -108,12 +107,25 @@ _bt_mkscankey(Relation rel, IndexTuple itup)
key = palloc(offsetof(BTScanInsertData, scankeys) +
sizeof(ScanKeyData) * indnkeyatts);
key->heapkeyspace = itup == NULL || _bt_heapkeyspace(rel);
+ key->safededup = itup == NULL ? _bt_opclasses_support_dedup(rel) :
+ _bt_safededup(rel);
key->anynullkeys = false; /* initial assumption */
key->nextkey = false;
key->pivotsearch = false;
+ key->scantid = NULL;
key->keysz = Min(indnkeyatts, tupnatts);
- key->scantid = key->heapkeyspace && itup ?
- BTreeTupleGetHeapTID(itup) : NULL;
+
+ Assert(tupnatts <= IndexRelationGetNumberOfAttributes(rel));
+ Assert(!itup || !BTreeTupleIsPosting(itup) || key->heapkeyspace);
+
+ /*
+ * When caller passes a tuple with a heap TID, use it to set scantid. Note
+ * that this handles posting list tuples by setting scantid to the lowest
+ * heap TID in the posting list.
+ */
+ if (itup && key->heapkeyspace)
+ key->scantid = BTreeTupleGetHeapTID(itup);
+
skey = key->scankeys;
for (i = 0; i < indnkeyatts; i++)
{
@@ -1373,6 +1385,7 @@ _bt_checkkeys(IndexScanDesc scan, IndexTuple tuple, int tupnatts,
* attribute passes the qual.
*/
Assert(ScanDirectionIsForward(dir));
+ Assert(BTreeTupleIsPivot(tuple));
continue;
}
@@ -1534,6 +1547,7 @@ _bt_check_rowcompare(ScanKey skey, IndexTuple tuple, int tupnatts,
* attribute passes the qual.
*/
Assert(ScanDirectionIsForward(dir));
+ Assert(BTreeTupleIsPivot(tuple));
cmpresult = 0;
if (subkey->sk_flags & SK_ROW_END)
break;
@@ -1773,10 +1787,35 @@ _bt_killitems(IndexScanDesc scan)
{
ItemId iid = PageGetItemId(page, offnum);
IndexTuple ituple = (IndexTuple) PageGetItem(page, iid);
+ bool killtuple = false;
- if (ItemPointerEquals(&ituple->t_tid, &kitem->heapTid))
+ if (BTreeTupleIsPosting(ituple))
{
- /* found the item */
+ int pi = i + 1;
+ int nposting = BTreeTupleGetNPosting(ituple);
+ int j;
+
+ for (j = 0; j < nposting; j++)
+ {
+ ItemPointer item = BTreeTupleGetPostingN(ituple, j);
+
+ if (!ItemPointerEquals(item, &kitem->heapTid))
+ break; /* out of posting list loop */
+
+ /* Read-ahead to later kitems */
+ if (pi < numKilled)
+ kitem = &so->currPos.items[so->killedItems[pi++]];
+ }
+
+ if (j == nposting)
+ killtuple = true;
+ }
+ else if (ItemPointerEquals(&ituple->t_tid, &kitem->heapTid))
+ killtuple = true;
+
+ if (killtuple)
+ {
+ /* found the item/all posting list items */
ItemIdMarkDead(iid);
killedsomething = true;
break; /* out of inner search loop */
@@ -2014,7 +2053,18 @@ BTreeShmemInit(void)
bytea *
btoptions(Datum reloptions, bool validate)
{
- return default_reloptions(reloptions, validate, RELOPT_KIND_BTREE);
+ static const relopt_parse_elt tab[] = {
+ {"fillfactor", RELOPT_TYPE_INT, offsetof(BtreeOptions, fillfactor)},
+ {"vacuum_cleanup_index_scale_factor", RELOPT_TYPE_REAL,
+ offsetof(BtreeOptions, vacuum_cleanup_index_scale_factor)},
+ {"deduplication", RELOPT_TYPE_BOOL,
+ offsetof(BtreeOptions, deduplication)}
+ };
+
+ return (bytea *) build_reloptions(reloptions, validate,
+ RELOPT_KIND_BTREE,
+ sizeof(BtreeOptions),
+ tab, lengthof(tab));
}
/*
@@ -2127,6 +2177,24 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
pivot = index_truncate_tuple(itupdesc, firstright, keepnatts);
+ if (BTreeTupleIsPosting(firstright))
+ {
+ BTreeTupleClearBtIsPosting(pivot);
+ BTreeTupleSetNAtts(pivot, keepnatts);
+ if (keepnatts == natts)
+ {
+ /*
+ * index_truncate_tuple() just returned a copy of the
+ * original, so make sure that the size of the new pivot tuple
+ * doesn't have posting list overhead
+ */
+ pivot->t_info &= ~INDEX_SIZE_MASK;
+ pivot->t_info |= MAXALIGN(BTreeTupleGetPostingOffset(firstright));
+ }
+ }
+
+ Assert(!BTreeTupleIsPosting(pivot));
+
/*
* If there is a distinguishing key attribute within new pivot tuple,
* there is no need to add an explicit heap TID attribute
@@ -2143,6 +2211,8 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
* attribute to the new pivot tuple.
*/
Assert(natts != nkeyatts);
+ Assert(!BTreeTupleIsPosting(lastleft) &&
+ !BTreeTupleIsPosting(firstright));
newsize = IndexTupleSize(pivot) + MAXALIGN(sizeof(ItemPointerData));
tidpivot = palloc0(newsize);
memcpy(tidpivot, pivot, IndexTupleSize(pivot));
@@ -2150,6 +2220,24 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
pfree(pivot);
pivot = tidpivot;
}
+ else if (BTreeTupleIsPosting(firstright))
+ {
+ /*
+ * No truncation was possible, since key attributes are all equal. We
+ * can always truncate away a posting list, though.
+ *
+ * It's necessary to add a heap TID attribute to the new pivot tuple.
+ */
+ newsize = MAXALIGN(BTreeTupleGetPostingOffset(firstright)) +
+ MAXALIGN(sizeof(ItemPointerData));
+ pivot = palloc0(newsize);
+ memcpy(pivot, firstright, BTreeTupleGetPostingOffset(firstright));
+
+ pivot->t_info &= ~INDEX_SIZE_MASK;
+ pivot->t_info |= newsize;
+ BTreeTupleClearBtIsPosting(pivot);
+ BTreeTupleSetAltHeapTID(pivot);
+ }
else
{
/*
@@ -2157,7 +2245,8 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
* It's necessary to add a heap TID attribute to the new pivot tuple.
*/
Assert(natts == nkeyatts);
- newsize = IndexTupleSize(firstright) + MAXALIGN(sizeof(ItemPointerData));
+ newsize = MAXALIGN(IndexTupleSize(firstright)) +
+ MAXALIGN(sizeof(ItemPointerData));
pivot = palloc0(newsize);
memcpy(pivot, firstright, IndexTupleSize(firstright));
}
@@ -2175,6 +2264,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
* nbtree (e.g., there is no pg_attribute entry).
*/
Assert(itup_key->heapkeyspace);
+ Assert(!BTreeTupleIsPosting(pivot));
pivot->t_info &= ~INDEX_SIZE_MASK;
pivot->t_info |= newsize;
@@ -2187,7 +2277,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
*/
pivotheaptid = (ItemPointer) ((char *) pivot + newsize -
sizeof(ItemPointerData));
- ItemPointerCopy(&lastleft->t_tid, pivotheaptid);
+ ItemPointerCopy(BTreeTupleGetMaxHeapTID(lastleft), pivotheaptid);
/*
* Lehman and Yao require that the downlink to the right page, which is to
@@ -2198,9 +2288,12 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
* tiebreaker.
*/
#ifndef DEBUG_NO_TRUNCATE
- Assert(ItemPointerCompare(&lastleft->t_tid, &firstright->t_tid) < 0);
- Assert(ItemPointerCompare(pivotheaptid, &lastleft->t_tid) >= 0);
- Assert(ItemPointerCompare(pivotheaptid, &firstright->t_tid) < 0);
+ Assert(ItemPointerCompare(BTreeTupleGetMaxHeapTID(lastleft),
+ BTreeTupleGetHeapTID(firstright)) < 0);
+ Assert(ItemPointerCompare(pivotheaptid,
+ BTreeTupleGetHeapTID(lastleft)) >= 0);
+ Assert(ItemPointerCompare(pivotheaptid,
+ BTreeTupleGetHeapTID(firstright)) < 0);
#else
/*
@@ -2213,7 +2306,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
* attribute values along with lastleft's heap TID value when lastleft's
* TID happens to be greater than firstright's TID.
*/
- ItemPointerCopy(&firstright->t_tid, pivotheaptid);
+ ItemPointerCopy(BTreeTupleGetHeapTID(firstright), pivotheaptid);
/*
* Pivot heap TID should never be fully equal to firstright. Note that
@@ -2222,7 +2315,8 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
*/
ItemPointerSetOffsetNumber(pivotheaptid,
OffsetNumberPrev(ItemPointerGetOffsetNumber(pivotheaptid)));
- Assert(ItemPointerCompare(pivotheaptid, &firstright->t_tid) < 0);
+ Assert(ItemPointerCompare(pivotheaptid,
+ BTreeTupleGetHeapTID(firstright)) < 0);
#endif
BTreeTupleSetNAtts(pivot, nkeyatts);
@@ -2303,13 +2397,16 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
* The approach taken here usually provides the same answer as _bt_keep_natts
* will (for the same pair of tuples from a heapkeyspace index), since the
* majority of btree opclasses can never indicate that two datums are equal
- * unless they're bitwise equal after detoasting.
+ * unless they're bitwise equal after detoasting. When an index is considered
+ * deduplication-safe by _bt_opclasses_support_dedup, this routine is
+ * guaranteed to give the same result as _bt_keep_natts would.
*
- * These issues must be acceptable to callers, typically because they're only
- * concerned about making suffix truncation as effective as possible without
- * leaving excessive amounts of free space on either side of page split.
- * Callers can rely on the fact that attributes considered equal here are
- * definitely also equal according to _bt_keep_natts.
+ * Suffix truncation callers can rely on the fact that attributes considered
+ * equal here are definitely also equal according to _bt_keep_natts, even when
+ * the index uses an opclass or collation that is not deduplication-safe.
+ * This weaker guarantee is good enough for these callers, since false
+ * negatives generally only have the effect of making leaf page splits use a
+ * more balanced split point.
*/
int
_bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
@@ -2387,22 +2484,30 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
tupnatts = BTreeTupleGetNAtts(itup, rel);
+ /* !heapkeyspace indexes do not support deduplication */
+ if (!heapkeyspace && BTreeTupleIsPosting(itup))
+ return false;
+
+ /* INCLUDE indexes do not support deduplication */
+ if (natts != nkeyatts && BTreeTupleIsPosting(itup))
+ return false;
+
if (P_ISLEAF(opaque))
{
if (offnum >= P_FIRSTDATAKEY(opaque))
{
/*
- * Non-pivot tuples currently never use alternative heap TID
- * representation -- even those within heapkeyspace indexes
+ * Non-pivot tuple should never be explicitly marked as a pivot
+ * tuple
*/
- if ((itup->t_info & INDEX_ALT_TID_MASK) != 0)
+ if (BTreeTupleIsPivot(itup))
return false;
/*
* Leaf tuples that are not the page high key (non-pivot tuples)
* should never be truncated. (Note that tupnatts must have been
- * inferred, rather than coming from an explicit on-disk
- * representation.)
+ * inferred, even with a posting list tuple, because only pivot
+ * tuples store tupnatts directly.)
*/
return tupnatts == natts;
}
@@ -2446,12 +2551,12 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
* non-zero, or when there is no explicit representation and the
* tuple is evidently not a pre-pg_upgrade tuple.
*
- * Prior to v11, downlinks always had P_HIKEY as their offset. Use
- * that to decide if the tuple is a pre-v11 tuple.
+ * Prior to v11, downlinks always had P_HIKEY as their offset.
+ * Accept that as an alternative indication of a valid
+ * !heapkeyspace negative infinity tuple.
*/
return tupnatts == 0 ||
- ((itup->t_info & INDEX_ALT_TID_MASK) == 0 &&
- ItemPointerGetOffsetNumber(&(itup->t_tid)) == P_HIKEY);
+ ItemPointerGetOffsetNumber(&(itup->t_tid)) == P_HIKEY;
}
else
{
@@ -2477,7 +2582,11 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
* heapkeyspace index pivot tuples, regardless of whether or not there are
* non-key attributes.
*/
- if ((itup->t_info & INDEX_ALT_TID_MASK) == 0)
+ if (!BTreeTupleIsPivot(itup))
+ return false;
+
+ /* Pivot tuple should not use posting list representation (redundant) */
+ if (BTreeTupleIsPosting(itup))
return false;
/*
@@ -2547,11 +2656,54 @@ _bt_check_third_page(Relation rel, Relation heap, bool needheaptidspace,
BTMaxItemSizeNoHeapTid(page),
RelationGetRelationName(rel)),
errdetail("Index row references tuple (%u,%u) in relation \"%s\".",
- ItemPointerGetBlockNumber(&newtup->t_tid),
- ItemPointerGetOffsetNumber(&newtup->t_tid),
+ ItemPointerGetBlockNumber(BTreeTupleGetHeapTID(newtup)),
+ ItemPointerGetOffsetNumber(BTreeTupleGetHeapTID(newtup)),
RelationGetRelationName(heap)),
errhint("Values larger than 1/3 of a buffer page cannot be indexed.\n"
"Consider a function index of an MD5 hash of the value, "
"or use full text indexing."),
errtableconstraint(heap, RelationGetRelationName(rel))));
}
+
+/*
+ * Is it safe to perform deduplication for an index, given the opclasses and
+ * collations used?
+ *
+ * Returned value is stored in index metapage during index builds. Function
+ * does not account for incompatibilities caused by index being on an earlier
+ * nbtree version.
+ */
+bool
+_bt_opclasses_support_dedup(Relation index)
+{
+ /* INCLUDE indexes don't support deduplication */
+ if (IndexRelationGetNumberOfAttributes(index) !=
+ IndexRelationGetNumberOfKeyAttributes(index))
+ return false;
+
+ /*
+ * There is no reason why deduplication cannot be used with system catalog
+ * indexes. However, we deem it generally unsafe because it's not clear
+ * how it could be disabled. (ALTER INDEX is not supported with system
+ * catalog indexes, so users have no way to set the "deduplication" storage
+ * parameter.)
+ */
+ if (IsCatalogRelation(index))
+ return false;
+
+ for (int i = 0; i < IndexRelationGetNumberOfKeyAttributes(index); i++)
+ {
+ Oid opfamily = index->rd_opfamily[i];
+ Oid collation = index->rd_indcollation[i];
+
+ /* TODO add adequate check of opclasses and collations */
+ elog(DEBUG4, "index %s column %d opfamilyOid %u collationOid %u",
+ RelationGetRelationName(index), i, opfamily, collation);
+
+ /* NUMERIC btree opfamily OID is 1988 */
+ if (opfamily == 1988)
+ return false;
+ }
+
+ return true;
+}
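The posting-list branch added to _bt_killitems() in the nbtutils.c hunks above marks a posting list tuple LP_DEAD only when every heap TID in it matches a killed item. A minimal standalone sketch of that rule follows, assuming the TIDs within a posting list are distinct. `Tid`, `tid_equal` and `posting_all_killed` are hypothetical names used for illustration only; they are not nbtree functions.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for ItemPointerData: a (block, offset) pair */
typedef struct
{
    unsigned       block;
    unsigned short offset;
} Tid;

static bool
tid_equal(const Tid *a, const Tid *b)
{
    return a->block == b->block && a->offset == b->offset;
}

/*
 * Return true only when posting[0..nposting) matches killed[start],
 * killed[start + 1], ... one for one.  Running out of killed items, or
 * hitting any mismatch, means the posting list tuple must stay alive.
 */
static bool
posting_all_killed(const Tid *posting, int nposting,
                   const Tid *killed, int nkilled, int start)
{
    int k = start;

    assert(nposting > 0 && start >= 0);

    for (int j = 0; j < nposting; j++, k++)
    {
        if (k >= nkilled || !tid_equal(&posting[j], &killed[k]))
            return false;
    }
    return true;
}
```

Unlike the loop in the patch, which reads ahead through so->killedItems in place, this sketch takes an explicit start index. In both cases a partial match leaves the whole tuple alive.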
diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index 44f6283950..d36d31c758 100644
--- a/src/backend/access/nbtree/nbtxlog.c
+++ b/src/backend/access/nbtree/nbtxlog.c
@@ -22,6 +22,9 @@
#include "access/xlogutils.h"
#include "miscadmin.h"
#include "storage/procarray.h"
+#include "utils/memutils.h"
+
+static MemoryContext opCtx; /* working memory for operations */
/*
* _bt_restore_page -- re-enter all the index tuples on a page
@@ -111,6 +114,7 @@ _bt_restore_meta(XLogReaderState *record, uint8 block_id)
Assert(md->btm_version >= BTREE_NOVAC_VERSION);
md->btm_oldest_btpo_xact = xlrec->oldest_btpo_xact;
md->btm_last_cleanup_num_heap_tuples = xlrec->last_cleanup_num_heap_tuples;
+ md->btm_safededup = xlrec->btm_safededup;
pageop = (BTPageOpaque) PageGetSpecialPointer(metapg);
pageop->btpo_flags = BTP_META;
@@ -181,9 +185,45 @@ btree_xlog_insert(bool isleaf, bool ismeta, XLogReaderState *record)
page = BufferGetPage(buffer);
- if (PageAddItem(page, (Item) datapos, datalen, xlrec->offnum,
- false, false) == InvalidOffsetNumber)
- elog(PANIC, "btree_xlog_insert: failed to add item");
+ if (xlrec->postingoff == InvalidOffsetNumber)
+ {
+ /* Simple retail insertion */
+ if (PageAddItem(page, (Item) datapos, datalen, xlrec->offnum,
+ false, false) == InvalidOffsetNumber)
+ elog(PANIC, "btree_xlog_insert: failed to add item");
+ }
+ else
+ {
+ ItemId itemid;
+ IndexTuple oposting,
+ newitem,
+ nposting;
+
+ /*
+ * A posting list split occurred during insertion.
+ *
+ * Use _bt_swap_posting() to repeat posting list split steps from
+ * primary. Note that newitem from WAL record is 'orignewitem',
+ * not the final version of newitem that is actually inserted on
+ * page.
+ */
+ Assert(isleaf);
+ itemid = PageGetItemId(page, OffsetNumberPrev(xlrec->offnum));
+ oposting = (IndexTuple) PageGetItem(page, itemid);
+
+ /* newitem must be mutable copy for _bt_swap_posting() */
+ newitem = CopyIndexTuple((IndexTuple) datapos);
+ nposting = _bt_swap_posting(newitem, oposting, xlrec->postingoff);
+
+ /* Replace existing posting list with post-split version */
+ memcpy(oposting, nposting, MAXALIGN(IndexTupleSize(nposting)));
+
+ /* insert new item */
+ Assert(IndexTupleSize(newitem) == datalen);
+ if (PageAddItem(page, (Item) newitem, datalen, xlrec->offnum,
+ false, false) == InvalidOffsetNumber)
+ elog(PANIC, "btree_xlog_insert: failed to add posting split new item");
+ }
PageSetLSN(page, lsn);
MarkBufferDirty(buffer);
@@ -265,20 +305,38 @@ btree_xlog_split(bool onleft, XLogReaderState *record)
BTPageOpaque lopaque = (BTPageOpaque) PageGetSpecialPointer(lpage);
OffsetNumber off;
IndexTuple newitem = NULL,
- left_hikey = NULL;
+ left_hikey = NULL,
+ nposting = NULL;
Size newitemsz = 0,
left_hikeysz = 0;
Page newlpage;
- OffsetNumber leftoff;
+ OffsetNumber leftoff,
+ replacepostingoff = InvalidOffsetNumber;
datapos = XLogRecGetBlockData(record, 0, &datalen);
- if (onleft)
+ if (onleft || xlrec->postingoff != 0)
{
newitem = (IndexTuple) datapos;
newitemsz = MAXALIGN(IndexTupleSize(newitem));
datapos += newitemsz;
datalen -= newitemsz;
+
+ if (xlrec->postingoff != 0)
+ {
+ ItemId itemid;
+ IndexTuple oposting;
+
+ /* Posting list must be at offset number before new item's */
+ replacepostingoff = OffsetNumberPrev(xlrec->newitemoff);
+
+ /* newitem must be mutable copy for _bt_swap_posting() */
+ newitem = CopyIndexTuple(newitem);
+ itemid = PageGetItemId(lpage, replacepostingoff);
+ oposting = (IndexTuple) PageGetItem(lpage, itemid);
+ nposting = _bt_swap_posting(newitem, oposting,
+ xlrec->postingoff);
+ }
}
/* Extract left hikey and its size (assuming 16-bit alignment) */
@@ -304,8 +362,20 @@ btree_xlog_split(bool onleft, XLogReaderState *record)
Size itemsz;
IndexTuple item;
+ /* Add replacement posting list when required */
+ if (off == replacepostingoff)
+ {
+ Assert(onleft || xlrec->firstright == xlrec->newitemoff);
+ if (PageAddItem(newlpage, (Item) nposting,
+ MAXALIGN(IndexTupleSize(nposting)), leftoff,
+ false, false) == InvalidOffsetNumber)
+ elog(ERROR, "failed to add new posting list item to left page after split");
+ leftoff = OffsetNumberNext(leftoff);
+ continue;
+ }
+
/* add the new item if it was inserted on left page */
- if (onleft && off == xlrec->newitemoff)
+ else if (onleft && off == xlrec->newitemoff)
{
if (PageAddItem(newlpage, (Item) newitem, newitemsz, leftoff,
false, false) == InvalidOffsetNumber)
@@ -379,6 +449,84 @@ btree_xlog_split(bool onleft, XLogReaderState *record)
}
}
+static void
+btree_xlog_dedup(XLogReaderState *record)
+{
+ XLogRecPtr lsn = record->EndRecPtr;
+ Buffer buf;
+ xl_btree_dedup *xlrec = (xl_btree_dedup *) XLogRecGetData(record);
+
+ if (XLogReadBufferForRedo(record, 0, &buf) == BLK_NEEDS_REDO)
+ {
+ /*
+ * Initialize a temporary empty page and copy all the items to that in
+ * item number order.
+ */
+ Page page = (Page) BufferGetPage(buf);
+ OffsetNumber offnum;
+ BTDedupState *state;
+
+ state = (BTDedupState *) palloc(sizeof(BTDedupState));
+
+ state->maxitemsize = BTMaxItemSize(page);
+ state->checkingunique = false; /* unused */
+ state->skippedbase = InvalidOffsetNumber;
+ state->newitem = NULL;
+ /* Metadata about current pending posting list */
+ state->htids = NULL;
+ state->nhtids = 0;
+ state->nitems = 0;
+ state->alltupsize = 0;
+ state->overlap = false;
+ /* Metadata about base tuple of current pending posting list */
+ state->base = NULL;
+ state->baseoff = InvalidOffsetNumber;
+ state->basetupsize = 0;
+
+ /* Conservatively size array */
+ state->htids = palloc(state->maxitemsize);
+
+ /*
+ * Iterate over tuples on the page belonging to the interval to
+ * deduplicate them into a posting list.
+ */
+ for (offnum = xlrec->baseoff;
+ offnum < xlrec->baseoff + xlrec->nitems;
+ offnum = OffsetNumberNext(offnum))
+ {
+ ItemId itemid = PageGetItemId(page, offnum);
+ IndexTuple itup = (IndexTuple) PageGetItem(page, itemid);
+
+ Assert(!ItemIdIsDead(itemid));
+
+ if (offnum == xlrec->baseoff)
+ {
+ /*
+ * No previous/base tuple for first data item -- use first
+ * data item as base tuple of first pending posting list
+ */
+ _bt_dedup_start_pending(state, itup, offnum);
+ }
+ else
+ {
+ /* Heap TID(s) for itup will be saved in state */
+ if (!_bt_dedup_save_htid(state, itup))
+ elog(ERROR, "could not add heap tid to pending posting list");
+ }
+ }
+
+ Assert(state->nitems == xlrec->nitems);
+ /* Handle the last item */
+ _bt_dedup_finish_pending(buf, state, false);
+
+ PageSetLSN(page, lsn);
+ MarkBufferDirty(buf);
+ }
+
+ if (BufferIsValid(buf))
+ UnlockReleaseBuffer(buf);
+}
+
static void
btree_xlog_vacuum(XLogReaderState *record)
{
@@ -386,8 +534,8 @@ btree_xlog_vacuum(XLogReaderState *record)
Buffer buffer;
Page page;
BTPageOpaque opaque;
-#ifdef UNUSED
xl_btree_vacuum *xlrec = (xl_btree_vacuum *) XLogRecGetData(record);
+#ifdef UNUSED
/*
* This section of code is thought to be no longer needed, after analysis
@@ -478,14 +626,34 @@ btree_xlog_vacuum(XLogReaderState *record)
if (len > 0)
{
- OffsetNumber *unused;
- OffsetNumber *unend;
+ if (xlrec->nupdated > 0)
+ {
+ OffsetNumber *updatedoffsets;
+ IndexTuple updated;
+ Size itemsz;
- unused = (OffsetNumber *) ptr;
- unend = (OffsetNumber *) ((char *) ptr + len);
+ updatedoffsets = (OffsetNumber *)
+ (ptr + xlrec->ndeleted * sizeof(OffsetNumber));
+ updated = (IndexTuple) ((char *) updatedoffsets +
+ xlrec->nupdated * sizeof(OffsetNumber));
- if ((unend - unused) > 0)
- PageIndexMultiDelete(page, unused, unend - unused);
+ /* Handle posting tuples */
+ for (int i = 0; i < xlrec->nupdated; i++)
+ {
+ PageIndexTupleDelete(page, updatedoffsets[i]);
+
+ itemsz = MAXALIGN(IndexTupleSize(updated));
+
+ if (PageAddItem(page, (Item) updated, itemsz, updatedoffsets[i],
+ false, false) == InvalidOffsetNumber)
+ elog(PANIC, "btree_xlog_vacuum: failed to add updated posting list item");
+
+ updated = (IndexTuple) ((char *) updated + itemsz);
+ }
+ }
+
+ if (xlrec->ndeleted)
+ PageIndexMultiDelete(page, (OffsetNumber *) ptr, xlrec->ndeleted);
}
/*
@@ -820,7 +988,9 @@ void
btree_redo(XLogReaderState *record)
{
uint8 info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+ MemoryContext oldCtx;
+ oldCtx = MemoryContextSwitchTo(opCtx);
switch (info)
{
case XLOG_BTREE_INSERT_LEAF:
@@ -838,6 +1008,9 @@ btree_redo(XLogReaderState *record)
case XLOG_BTREE_SPLIT_R:
btree_xlog_split(false, record);
break;
+ case XLOG_BTREE_DEDUP_PAGE:
+ btree_xlog_dedup(record);
+ break;
case XLOG_BTREE_VACUUM:
btree_xlog_vacuum(record);
break;
@@ -863,6 +1036,23 @@ btree_redo(XLogReaderState *record)
default:
elog(PANIC, "btree_redo: unknown op code %u", info);
}
+ MemoryContextSwitchTo(oldCtx);
+ MemoryContextReset(opCtx);
+}
+
+void
+btree_xlog_startup(void)
+{
+ opCtx = AllocSetContextCreate(CurrentMemoryContext,
+ "Btree recovery temporary context",
+ ALLOCSET_DEFAULT_SIZES);
+}
+
+void
+btree_xlog_cleanup(void)
+{
+ MemoryContextDelete(opCtx);
+ opCtx = NULL;
}
/*
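Both btree_xlog_insert() and btree_xlog_split() replay a posting list split by calling _bt_swap_posting() on a mutable copy of the new item. The sketch below shows one plausible reading of that swap, with heap TIDs modeled as plain integers: the incoming TID takes its sorted position inside the posting list, and the posting list's former maximum TID becomes the TID of the item that is actually inserted. This is an assumption about _bt_swap_posting(), whose body does not appear in these hunks, and `swap_posting` is a hypothetical name.

```c
#include <assert.h>
#include <string.h>

/*
 * Insert *newtid into the sorted array posting[0..nposting) at index
 * postingoff, pushing the old maximum TID out of the array.  On return
 * the array is still sorted and has the same length, and *newtid holds
 * the old maximum.
 */
static void
swap_posting(unsigned *posting, int nposting, int postingoff,
             unsigned *newtid)
{
    unsigned oldmax;

    /* The new TID must fall strictly inside the posting list's range */
    assert(postingoff > 0 && postingoff < nposting);

    oldmax = posting[nposting - 1];

    /* Shift the tail right by one slot, dropping the old maximum */
    memmove(&posting[postingoff + 1], &posting[postingoff],
            (size_t) (nposting - postingoff - 1) * sizeof(unsigned));

    posting[postingoff] = *newtid;
    *newtid = oldmax;
}
```

If this reading holds, it would explain why the redo routines need a mutable copy of newitem: the WAL record carries the original new item, while the tuple added to the page carries the swapped TID.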
diff --git a/src/backend/access/rmgrdesc/nbtdesc.c b/src/backend/access/rmgrdesc/nbtdesc.c
index 4ee6d04a68..1dde2da285 100644
--- a/src/backend/access/rmgrdesc/nbtdesc.c
+++ b/src/backend/access/rmgrdesc/nbtdesc.c
@@ -30,7 +30,8 @@ btree_desc(StringInfo buf, XLogReaderState *record)
{
xl_btree_insert *xlrec = (xl_btree_insert *) rec;
- appendStringInfo(buf, "off %u", xlrec->offnum);
+ appendStringInfo(buf, "off %u; postingoff %u",
+ xlrec->offnum, xlrec->postingoff);
break;
}
case XLOG_BTREE_SPLIT_L:
@@ -38,16 +39,30 @@ btree_desc(StringInfo buf, XLogReaderState *record)
{
xl_btree_split *xlrec = (xl_btree_split *) rec;
- appendStringInfo(buf, "level %u, firstright %d, newitemoff %d",
- xlrec->level, xlrec->firstright, xlrec->newitemoff);
+ appendStringInfo(buf, "level %u, firstright %d, newitemoff %d, postingoff %d",
+ xlrec->level,
+ xlrec->firstright,
+ xlrec->newitemoff,
+ xlrec->postingoff);
+ break;
+ }
+ case XLOG_BTREE_DEDUP_PAGE:
+ {
+ xl_btree_dedup *xlrec = (xl_btree_dedup *) rec;
+
+ appendStringInfo(buf, "baseoff %u; nitems %u",
+ xlrec->baseoff,
+ xlrec->nitems);
break;
}
case XLOG_BTREE_VACUUM:
{
xl_btree_vacuum *xlrec = (xl_btree_vacuum *) rec;
- appendStringInfo(buf, "lastBlockVacuumed %u",
- xlrec->lastBlockVacuumed);
+ appendStringInfo(buf, "lastBlockVacuumed %u; nupdated %u; ndeleted %u",
+ xlrec->lastBlockVacuumed,
+ xlrec->nupdated,
+ xlrec->ndeleted);
break;
}
case XLOG_BTREE_DELETE:
@@ -131,6 +146,9 @@ btree_identify(uint8 info)
case XLOG_BTREE_SPLIT_R:
id = "SPLIT_R";
break;
+ case XLOG_BTREE_DEDUP_PAGE:
+ id = "DEDUPLICATE";
+ break;
case XLOG_BTREE_VACUUM:
id = "VACUUM";
break;
diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c
index 98c917bf7a..b2b29a1ae2 100644
--- a/src/bin/psql/tab-complete.c
+++ b/src/bin/psql/tab-complete.c
@@ -1677,14 +1677,14 @@ psql_completion(const char *text, int start, int end)
/* ALTER INDEX <foo> SET|RESET ( */
else if (Matches("ALTER", "INDEX", MatchAny, "RESET", "("))
COMPLETE_WITH("fillfactor",
- "vacuum_cleanup_index_scale_factor", /* BTREE */
+ "vacuum_cleanup_index_scale_factor", "deduplication", /* BTREE */
"fastupdate", "gin_pending_list_limit", /* GIN */
"buffering", /* GiST */
"pages_per_range", "autosummarize" /* BRIN */
);
else if (Matches("ALTER", "INDEX", MatchAny, "SET", "("))
COMPLETE_WITH("fillfactor =",
- "vacuum_cleanup_index_scale_factor =", /* BTREE */
+ "vacuum_cleanup_index_scale_factor =", "deduplication =", /* BTREE */
"fastupdate =", "gin_pending_list_limit =", /* GIN */
"buffering =", /* GiST */
"pages_per_range =", "autosummarize =" /* BRIN */
diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index 3542545de5..8b1223a817 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -145,6 +145,7 @@ static void bt_tuple_present_callback(Relation index, ItemPointer tid,
bool tupleIsAlive, void *checkstate);
static IndexTuple bt_normalize_tuple(BtreeCheckState *state,
IndexTuple itup);
+static inline IndexTuple bt_posting_logical_tuple(IndexTuple itup, int n);
static bool bt_rootdescend(BtreeCheckState *state, IndexTuple itup);
static inline bool offset_is_negative_infinity(BTPageOpaque opaque,
OffsetNumber offset);
@@ -419,12 +420,13 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
/*
* Size Bloom filter based on estimated number of tuples in index,
* while conservatively assuming that each block must contain at least
- * MaxIndexTuplesPerPage / 5 non-pivot tuples. (Non-leaf pages cannot
- * contain non-pivot tuples. That's okay because they generally make
- * up no more than about 1% of all pages in the index.)
+ * MaxBTreeIndexTuplesPerPage / 3 "logical" tuples. heapallindexed
+ * verification fingerprints posting list heap TIDs as plain non-pivot
+ * tuples, complete with index keys. This allows its heap scan to
+ * behave as if posting lists do not exist.
*/
total_pages = RelationGetNumberOfBlocks(rel);
- total_elems = Max(total_pages * (MaxIndexTuplesPerPage / 5),
+ total_elems = Max(total_pages * (MaxBTreeIndexTuplesPerPage / 3),
(int64) state->rel->rd_rel->reltuples);
/* Random seed relies on backend srandom() call to avoid repetition */
seed = random();
@@ -924,6 +926,7 @@ bt_target_page_check(BtreeCheckState *state)
size_t tupsize;
BTScanInsert skey;
bool lowersizelimit;
+ ItemPointer scantid;
CHECK_FOR_INTERRUPTS();
@@ -994,29 +997,72 @@ bt_target_page_check(BtreeCheckState *state)
/*
* Readonly callers may optionally verify that non-pivot tuples can
- * each be found by an independent search that starts from the root
+ * each be found by an independent search that starts from the root.
+ * Note that we deliberately don't do individual searches for each
+ * "logical" posting list tuple, since the posting list itself is
+ * validated by other checks.
*/
if (state->rootdescend && P_ISLEAF(topaque) &&
!bt_rootdescend(state, itup))
{
+ ItemPointer tid = BTreeTupleGetHeapTID(itup);
char *itid,
*htid;
itid = psprintf("(%u,%u)", state->targetblock, offset);
- htid = psprintf("(%u,%u)",
- ItemPointerGetBlockNumber(&(itup->t_tid)),
- ItemPointerGetOffsetNumber(&(itup->t_tid)));
+ htid = psprintf("(%u,%u)", ItemPointerGetBlockNumber(tid),
+ ItemPointerGetOffsetNumber(tid));
ereport(ERROR,
(errcode(ERRCODE_INDEX_CORRUPTED),
errmsg("could not find tuple using search from root page in index \"%s\"",
RelationGetRelationName(state->rel)),
- errdetail_internal("Index tid=%s points to heap tid=%s page lsn=%X/%X.",
+ errdetail_internal("Index tid=%s min heap tid=%s page lsn=%X/%X.",
itid, htid,
(uint32) (state->targetlsn >> 32),
(uint32) state->targetlsn)));
}
+ /*
+ * If tuple is actually a posting list, make sure posting list TIDs
+ * are in order.
+ */
+ if (BTreeTupleIsPosting(itup))
+ {
+ ItemPointerData last;
+ ItemPointer current;
+
+ ItemPointerCopy(BTreeTupleGetHeapTID(itup), &last);
+
+ for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
+ {
+
+ current = BTreeTupleGetPostingN(itup, i);
+
+ if (ItemPointerCompare(current, &last) <= 0)
+ {
+ char *itid,
+ *htid;
+
+ itid = psprintf("(%u,%u)", state->targetblock, offset);
+ htid = psprintf("(%u,%u)",
+ ItemPointerGetBlockNumberNoCheck(current),
+ ItemPointerGetOffsetNumberNoCheck(current));
+
+ ereport(ERROR,
+ (errcode(ERRCODE_INDEX_CORRUPTED),
+ errmsg("posting list heap TIDs out of order in index \"%s\"",
+ RelationGetRelationName(state->rel)),
+ errdetail_internal("Index tid=%s min heap tid=%s page lsn=%X/%X.",
+ itid, htid,
+ (uint32) (state->targetlsn >> 32),
+ (uint32) state->targetlsn)));
+ }
+
+ ItemPointerCopy(current, &last);
+ }
+ }
+
/* Build insertion scankey for current page offset */
skey = bt_mkscankey_pivotsearch(state->rel, itup);
@@ -1074,12 +1120,32 @@ bt_target_page_check(BtreeCheckState *state)
{
IndexTuple norm;
- norm = bt_normalize_tuple(state, itup);
- bloom_add_element(state->filter, (unsigned char *) norm,
- IndexTupleSize(norm));
- /* Be tidy */
- if (norm != itup)
- pfree(norm);
+ if (BTreeTupleIsPosting(itup))
+ {
+ /* Fingerprint all elements as distinct "logical" tuples */
+ for (int i = 0; i < BTreeTupleGetNPosting(itup); i++)
+ {
+ IndexTuple logtuple;
+
+ logtuple = bt_posting_logical_tuple(itup, i);
+ norm = bt_normalize_tuple(state, logtuple);
+ bloom_add_element(state->filter, (unsigned char *) norm,
+ IndexTupleSize(norm));
+ /* Be tidy */
+ if (norm != logtuple)
+ pfree(norm);
+ pfree(logtuple);
+ }
+ }
+ else
+ {
+ norm = bt_normalize_tuple(state, itup);
+ bloom_add_element(state->filter, (unsigned char *) norm,
+ IndexTupleSize(norm));
+ /* Be tidy */
+ if (norm != itup)
+ pfree(norm);
+ }
}
/*
@@ -1087,7 +1153,8 @@ bt_target_page_check(BtreeCheckState *state)
*
* If there is a high key (if this is not the rightmost page on its
* entire level), check that high key actually is upper bound on all
- * page items.
+ * page items. If this is a posting list tuple, we'll need to set
+ * scantid to be the highest TID in posting list.
*
* We prefer to check all items against high key rather than checking
* just the last and trusting that the operator class obeys the
@@ -1127,6 +1194,9 @@ bt_target_page_check(BtreeCheckState *state)
* tuple. (See also: "Notes About Data Representation" in the nbtree
* README.)
*/
+ scantid = skey->scantid;
+ if (state->heapkeyspace && !BTreeTupleIsPivot(itup))
+ skey->scantid = BTreeTupleGetMaxHeapTID(itup);
if (!P_RIGHTMOST(topaque) &&
!(P_ISLEAF(topaque) ? invariant_leq_offset(state, skey, P_HIKEY) :
invariant_l_offset(state, skey, P_HIKEY)))
@@ -1150,6 +1220,7 @@ bt_target_page_check(BtreeCheckState *state)
(uint32) (state->targetlsn >> 32),
(uint32) state->targetlsn)));
}
+ skey->scantid = scantid;
/*
* * Item order check *
@@ -1160,15 +1231,17 @@ bt_target_page_check(BtreeCheckState *state)
if (OffsetNumberNext(offset) <= max &&
!invariant_l_offset(state, skey, OffsetNumberNext(offset)))
{
+ ItemPointer tid;
char *itid,
*htid,
*nitid,
*nhtid;
itid = psprintf("(%u,%u)", state->targetblock, offset);
+ tid = BTreeTupleGetHeapTID(itup);
htid = psprintf("(%u,%u)",
- ItemPointerGetBlockNumberNoCheck(&(itup->t_tid)),
- ItemPointerGetOffsetNumberNoCheck(&(itup->t_tid)));
+ ItemPointerGetBlockNumberNoCheck(tid),
+ ItemPointerGetOffsetNumberNoCheck(tid));
nitid = psprintf("(%u,%u)", state->targetblock,
OffsetNumberNext(offset));
@@ -1177,9 +1250,11 @@ bt_target_page_check(BtreeCheckState *state)
state->target,
OffsetNumberNext(offset));
itup = (IndexTuple) PageGetItem(state->target, itemid);
+
+ tid = BTreeTupleGetHeapTID(itup);
nhtid = psprintf("(%u,%u)",
- ItemPointerGetBlockNumberNoCheck(&(itup->t_tid)),
- ItemPointerGetOffsetNumberNoCheck(&(itup->t_tid)));
+ ItemPointerGetBlockNumberNoCheck(tid),
+ ItemPointerGetOffsetNumberNoCheck(tid));
ereport(ERROR,
(errcode(ERRCODE_INDEX_CORRUPTED),
@@ -1189,10 +1264,10 @@ bt_target_page_check(BtreeCheckState *state)
"higher index tid=%s (points to %s tid=%s) "
"page lsn=%X/%X.",
itid,
- P_ISLEAF(topaque) ? "heap" : "index",
+ P_ISLEAF(topaque) ? "min heap" : "index",
htid,
nitid,
- P_ISLEAF(topaque) ? "heap" : "index",
+ P_ISLEAF(topaque) ? "min heap" : "index",
nhtid,
(uint32) (state->targetlsn >> 32),
(uint32) state->targetlsn)));
@@ -1953,10 +2028,10 @@ bt_tuple_present_callback(Relation index, ItemPointer tid, Datum *values,
* verification. In particular, it won't try to normalize opclass-equal
* datums with potentially distinct representations (e.g., btree/numeric_ops
* index datums will not get their display scale normalized-away here).
- * Normalization may need to be expanded to handle more cases in the future,
- * though. For example, it's possible that non-pivot tuples could in the
- * future have alternative logically equivalent representations due to using
- * the INDEX_ALT_TID_MASK bit to implement intelligent deduplication.
+ * Caller does normalization for non-pivot tuples that have a posting list,
+ * since dummy CREATE INDEX callback code generates new tuples with the same
+ * normalized representation. Deduplication is performed opportunistically,
+ * and in general there is no guarantee about how or when it will be applied.
*/
static IndexTuple
bt_normalize_tuple(BtreeCheckState *state, IndexTuple itup)
@@ -1969,6 +2044,9 @@ bt_normalize_tuple(BtreeCheckState *state, IndexTuple itup)
IndexTuple reformed;
int i;
+ /* Caller should only pass "logical" non-pivot tuples here */
+ Assert(!BTreeTupleIsPosting(itup) && !BTreeTupleIsPivot(itup));
+
/* Easy case: It's immediately clear that tuple has no varlena datums */
if (!IndexTupleHasVarwidths(itup))
return itup;
@@ -2031,6 +2109,30 @@ bt_normalize_tuple(BtreeCheckState *state, IndexTuple itup)
return reformed;
}
+/*
+ * Produce palloc()'d "logical" tuple for nth posting list entry.
+ *
+ * In general, deduplication is not supposed to change the logical contents of
+ * an index. Multiple logical index tuples are folded together into one
+ * physical posting list index tuple when convenient.
+ *
+ * heapallindexed verification must normalize-away this variation in
+ * representation by converting posting list tuples into two or more "logical"
+ * tuples. Each logical tuple must be fingerprinted separately -- there must
+ * be one logical tuple for each corresponding Bloom filter probe during the
+ * heap scan.
+ *
+ * Note: Caller needs to call bt_normalize_tuple() with returned tuple.
+ */
+static inline IndexTuple
+bt_posting_logical_tuple(IndexTuple itup, int n)
+{
+ Assert(BTreeTupleIsPosting(itup));
+
+ /* Returns non-posting-list tuple */
+ return _bt_form_posting(itup, BTreeTupleGetPostingN(itup, n), 1);
+}
+
/*
* Search for itup in index, starting from fast root page. itup must be a
* non-pivot tuple. This is only supported with heapkeyspace indexes, since
@@ -2087,6 +2189,7 @@ bt_rootdescend(BtreeCheckState *state, IndexTuple itup)
insertstate.itup = itup;
insertstate.itemsz = MAXALIGN(IndexTupleSize(itup));
insertstate.itup_key = key;
+ insertstate.postingoff = 0;
insertstate.bounds_valid = false;
insertstate.buf = lbuf;
@@ -2094,7 +2197,9 @@ bt_rootdescend(BtreeCheckState *state, IndexTuple itup)
offnum = _bt_binsrch_insert(state->rel, &insertstate);
/* Compare first >= matching item on leaf page, if any */
page = BufferGetPage(lbuf);
+ /* Should match on first heap TID when tuple has a posting list */
if (offnum <= PageGetMaxOffsetNumber(page) &&
+ insertstate.postingoff <= 0 &&
_bt_compare(state->rel, key, page, offnum) == 0)
exists = true;
_bt_relbuf(state->rel, lbuf);
@@ -2548,26 +2653,25 @@ PageGetItemIdCareful(BtreeCheckState *state, BlockNumber block, Page page,
}
/*
- * BTreeTupleGetHeapTID() wrapper that lets caller enforce that a heap TID must
- * be present in cases where that is mandatory.
- *
- * This doesn't add much as of BTREE_VERSION 4, since the INDEX_ALT_TID_MASK
- * bit is effectively a proxy for whether or not the tuple is a pivot tuple.
- * It may become more useful in the future, when non-pivot tuples support their
- * own alternative INDEX_ALT_TID_MASK representation.
+ * BTreeTupleGetHeapTID() wrapper that enforces that a heap TID is present in
+ * cases where that is mandatory (i.e. for non-pivot tuples).
*/
static inline ItemPointer
BTreeTupleGetHeapTIDCareful(BtreeCheckState *state, IndexTuple itup,
bool nonpivot)
{
- ItemPointer result = BTreeTupleGetHeapTID(itup);
- BlockNumber targetblock = state->targetblock;
+ Assert(state->heapkeyspace);
- if (result == NULL && nonpivot)
+ /*
+ * Make sure that tuple type (pivot vs non-pivot) matches caller's
+ * expectation
+ */
+ if (BTreeTupleIsPivot(itup) == nonpivot)
ereport(ERROR,
(errcode(ERRCODE_INDEX_CORRUPTED),
errmsg("block %u or its right sibling block or child block in index \"%s\" contains non-pivot tuple that lacks a heap TID",
- targetblock, RelationGetRelationName(state->rel))));
+ state->targetblock,
+ RelationGetRelationName(state->rel))));
- return result;
+ return BTreeTupleGetHeapTID(itup);
}
diff --git a/doc/src/sgml/btree.sgml b/doc/src/sgml/btree.sgml
index 5881ea5dd6..a231bbe1f2 100644
--- a/doc/src/sgml/btree.sgml
+++ b/doc/src/sgml/btree.sgml
@@ -433,11 +433,55 @@ returns bool
<sect1 id="btree-implementation">
<title>Implementation</title>
+ <para>
+ Internally, a B-tree index consists of a tree structure with leaf
+ pages. Each leaf page contains tuples that point to table entries
+ using a heap item pointer. Each tuple's key is unique, since the
+ item pointer is treated as part of the key.
+ </para>
+ <para>
+ An introduction to the btree index implementation can be found in
+ <filename>src/backend/access/nbtree/README</filename>.
+ </para>
+
+ <sect2 id="btree-deduplication">
+ <title>Deduplication</title>
<para>
- An introduction to the btree index implementation can be found in
- <filename>src/backend/access/nbtree/README</filename>.
+ B-Tree supports <firstterm>deduplication</firstterm>. Existing
+ leaf page tuples with fully equal keys prior to the heap item
+ pointer are folded together into a compressed representation called
+ a <quote>posting list</quote>. The user-visible keys appear only
+ once, followed by a simple list of heap item pointers. Posting
+ lists are formed at the point where an insertion would otherwise
+ have to split the page. This can greatly increase index space
+ efficiency with data sets where each distinct key appears a few
+ times on average. Cases that don't benefit will incur a small
+ performance penalty.
+ </para>
+ <para>
+ Deduplication can only be used with indexes that use B-Tree
+ operator classes that were declared <literal>BITWISE</literal>.
+ Deduplication is not supported with nondeterministic collations,
+ nor is it supported with <literal>INCLUDE</literal> indexes. The
+ deduplication storage parameter must be set to
+ <literal>ON</literal> for new posting lists to be formed
+ (deduplication is enabled by default in the case of non-unique
+ indexes).
+ </para>
+ </sect2>
+
+ <sect2 id="btree-deduplication-unique">
+ <title>Unique indexes and deduplication</title>
+
+ <para>
+ Unique indexes can also use deduplication. This can be useful with
+ unique indexes that are prone to becoming bloated despite
+ aggressive vacuuming. Deduplication may delay leaf page splits for
+ long enough that vacuuming can prevent unnecessary page splits
+ altogether.
</para>
+ </sect2>
</sect1>
</chapter>
diff --git a/doc/src/sgml/charset.sgml b/doc/src/sgml/charset.sgml
index 55669b5cad..9f371d3e3a 100644
--- a/doc/src/sgml/charset.sgml
+++ b/doc/src/sgml/charset.sgml
@@ -928,10 +928,11 @@ CREATE COLLATION ignore_accents (provider = icu, locale = 'und-u-ks-level1-kc-tr
nondeterministic collations give a more <quote>correct</quote> behavior,
especially when considering the full power of Unicode and its many
special cases, they also have some drawbacks. Foremost, their use leads
- to a performance penalty. Also, certain operations are not possible with
- nondeterministic collations, such as pattern matching operations.
- Therefore, they should be used only in cases where they are specifically
- wanted.
+ to a performance penalty. Note, in particular, that B-tree cannot use
+ deduplication with indexes that use a nondeterministic collation. Also,
+ certain operations are not possible with nondeterministic collations,
+ such as pattern matching operations. Therefore, they should be used
+ only in cases where they are specifically wanted.
</para>
</sect3>
</sect2>
diff --git a/doc/src/sgml/ref/create_index.sgml b/doc/src/sgml/ref/create_index.sgml
index 629a31ef79..2261226965 100644
--- a/doc/src/sgml/ref/create_index.sgml
+++ b/doc/src/sgml/ref/create_index.sgml
@@ -166,6 +166,8 @@ CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ [ IF NOT EXISTS ] <replaceable class=
maximum size allowed for the index type, data insertion will fail.
In any case, non-key columns duplicate data from the index's table
and bloat the size of the index, thus potentially slowing searches.
+ Moreover, B-tree deduplication is never used with indexes that
+ have a non-key column.
</para>
<para>
@@ -388,10 +390,38 @@ CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ [ IF NOT EXISTS ] <replaceable class=
</variablelist>
<para>
- B-tree indexes additionally accept this parameter:
+ B-tree indexes also accept these parameters:
</para>
<variablelist>
+ <varlistentry id="index-reloption-deduplication" xreflabel="deduplication">
+ <term><literal>deduplication</literal>
+ <indexterm>
+ <primary><varname>deduplication</varname></primary>
+ <secondary>storage parameter</secondary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ This setting controls usage of the B-tree deduplication
+ technique described in <xref linkend="btree-deduplication"/>.
+ Defaults to <literal>ON</literal> for non-unique indexes, and
+ <literal>OFF</literal> for unique indexes. (Alternative
+ spellings of <literal>ON</literal> and <literal>OFF</literal>
+ are allowed as described in <xref linkend="config-setting"/>.)
+ </para>
+
+ <note>
+ <para>
+ Turning <literal>deduplication</literal> off via <command>ALTER
+ INDEX</command> prevents future insertions from triggering
+ deduplication, but does not in itself make existing posting list
+ tuples use the standard tuple representation.
+ </para>
+ </note>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="index-reloption-vacuum-cleanup-index-scale-factor" xreflabel="vacuum_cleanup_index_scale_factor">
<term><literal>vacuum_cleanup_index_scale_factor</literal>
<indexterm>
@@ -446,9 +476,7 @@ CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ [ IF NOT EXISTS ] <replaceable class=
This setting controls usage of the fast update technique described in
<xref linkend="gin-fast-update"/>. It is a Boolean parameter:
<literal>ON</literal> enables fast update, <literal>OFF</literal> disables it.
- (Alternative spellings of <literal>ON</literal> and <literal>OFF</literal> are
- allowed as described in <xref linkend="config-setting"/>.) The
- default is <literal>ON</literal>.
+ The default is <literal>ON</literal>.
</para>
<note>
@@ -831,6 +859,13 @@ CREATE UNIQUE INDEX title_idx ON films (title) WITH (fillfactor = 70);
</programlisting>
</para>
+ <para>
+ To create a unique index with deduplication enabled:
+<programlisting>
+CREATE UNIQUE INDEX title_idx ON films (title) WITH (deduplication = on);
+</programlisting>
+ </para>
+
<para>
To create a <acronym>GIN</acronym> index with fast updates disabled:
<programlisting>
diff --git a/doc/src/sgml/ref/reindex.sgml b/doc/src/sgml/ref/reindex.sgml
index 10881ab03a..c9a5349019 100644
--- a/doc/src/sgml/ref/reindex.sgml
+++ b/doc/src/sgml/ref/reindex.sgml
@@ -58,8 +58,9 @@ REINDEX [ ( VERBOSE ) ] { INDEX | TABLE | SCHEMA | DATABASE | SYSTEM } [ CONCURR
<listitem>
<para>
- You have altered a storage parameter (such as fillfactor)
- for an index, and wish to ensure that the change has taken full effect.
+ You have altered a storage parameter (such as fillfactor or
+ deduplication) for an index, and wish to ensure that the change has
+ taken full effect.
</para>
</listitem>
diff --git a/src/test/regress/expected/btree_index.out b/src/test/regress/expected/btree_index.out
index acab8e0b11..de55d3cc7c 100644
--- a/src/test/regress/expected/btree_index.out
+++ b/src/test/regress/expected/btree_index.out
@@ -199,6 +199,22 @@ reset enable_seqscan;
reset enable_indexscan;
reset enable_bitmapscan;
--
+-- Test deduplication within a unique index
+--
+CREATE TABLE dedup_unique_test_table (a int) WITH (autovacuum_enabled=false);
+CREATE UNIQUE INDEX dedup_unique ON dedup_unique_test_table (a) WITH (deduplication=on);
+CREATE UNIQUE INDEX plain_unique ON dedup_unique_test_table (a) WITH (deduplication=off);
+-- Generate enough garbage tuples in index to ensure that even the unique index
+-- with deduplication enabled has to check multiple leaf pages during unique
+-- checking (at least with a BLCKSZ of 8192 or less)
+DO $$
+BEGIN
+ FOR r IN 1..1350 LOOP
+ DELETE FROM dedup_unique_test_table;
+ INSERT INTO dedup_unique_test_table SELECT 1;
+ END LOOP;
+END$$;
+--
-- Test B-tree fast path (cache rightmost leaf page) optimization.
--
-- First create a tree that's at least three levels deep (i.e. has one level
diff --git a/src/test/regress/sql/btree_index.sql b/src/test/regress/sql/btree_index.sql
index 48eaf4fe42..d175a19bf5 100644
--- a/src/test/regress/sql/btree_index.sql
+++ b/src/test/regress/sql/btree_index.sql
@@ -83,6 +83,23 @@ reset enable_seqscan;
reset enable_indexscan;
reset enable_bitmapscan;
+--
+-- Test deduplication within a unique index
+--
+CREATE TABLE dedup_unique_test_table (a int) WITH (autovacuum_enabled=false);
+CREATE UNIQUE INDEX dedup_unique ON dedup_unique_test_table (a) WITH (deduplication=on);
+CREATE UNIQUE INDEX plain_unique ON dedup_unique_test_table (a) WITH (deduplication=off);
+-- Generate enough garbage tuples in index to ensure that even the unique index
+-- with deduplication enabled has to check multiple leaf pages during unique
+-- checking (at least with a BLCKSZ of 8192 or less)
+DO $$
+BEGIN
+ FOR r IN 1..1350 LOOP
+ DELETE FROM dedup_unique_test_table;
+ INSERT INTO dedup_unique_test_table SELECT 1;
+ END LOOP;
+END$$;
+
--
-- Test B-tree fast path (cache rightmost leaf page) optimization.
--
--
2.17.1
On Mon, Nov 18, 2019 at 05:26:37PM -0800, Peter Geoghegan wrote:
> Attached is v24. This revision doesn't fix the problem with
> xl_btree_insert record bloat, but it does fix the bitrot against the
> master branch that was caused by commit 50d22de9. (This patch has had
> a surprisingly large number of conflicts against the master branch
> recently.)
Please note that I have moved this patch to the next CF per this last
update. Anastasia, the ball is on your side of the field, as
the CF entry has been marked as waiting on author for some time now.
--
Michael
On Mon, Nov 18, 2019 at 5:26 PM Peter Geoghegan <pg@bowt.ie> wrote:
> Attached is v24. This revision doesn't fix the problem with
> xl_btree_insert record bloat
Attached is v25. This version:
* Adds more documentation.
* Adds a new GUC -- btree_deduplication.
A new GUC seems necessary. Users will want to be able to configure the
feature system-wide. A storage parameter won't let them do that --
only a GUC will. This also makes it easy to enable the feature with
unique indexes.
* Fixes the xl_btree_insert record bloat issue.
* Fixes a smaller issue with VACUUM/xl_btree_vacuum record bloat.
We shouldn't be using noticeably more WAL than before, at least in
cases that don't use deduplication. These two items fix cases where
that was possible.
There is a new refactoring patch included with v25 that helps with
the xl_btree_vacuum issue. This new patch removes unnecessary "pin
scan" code used by B-Tree VACUUMs, which was effectively disabled by
commit 3e4b7d87 without being removed. This is independently useful
work that I planned on doing already, that also cleans things up for
VACUUM with posting list tuples. It reclaims some space within the
xl_btree_vacuum record type that was wasted (we don't even use the
lastBlockVacuumed field anymore), allowing us to use that space for
new deduplication-related fields without increasing total WAL space.
Anastasia: I hope to be able to commit the first patch before too
long. It would be great if you could review that.
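For anyone reviewing who wants a quick mental model of the representation, here is a standalone sketch of the idea. It is not nbtree code: the Sketch* types and sketch_* functions are made up for illustration, and the real patch works on IndexTuple and ItemPointerData within a page. It folds a sorted run of (key, heap TID) pairs into posting tuples that store each key once, and it rebuilds the nth "logical" tuple from a posting tuple, which is roughly what bt_posting_logical_tuple() has to do for heapallindexed verification.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for a heap TID (block number, offset number) */
typedef struct SketchTid
{
	uint32_t	block;
	uint16_t	offset;
} SketchTid;

/* Hypothetical "logical" tuple: one key plus one heap TID */
typedef struct SketchTuple
{
	int64_t		key;
	SketchTid	tid;
} SketchTuple;

/* Hypothetical posting tuple: key stored once, followed by its TIDs */
typedef struct SketchPosting
{
	int64_t		key;
	size_t		ntids;
	SketchTid  *tids;			/* malloc'd, kept in input (TID) order */
} SketchPosting;

/*
 * Fold n logical tuples, already sorted by (key, tid), into posting tuples.
 * Caller provides room for up to n entries in out.  Returns the number of
 * posting tuples written.
 */
static size_t
sketch_fold(const SketchTuple *in, size_t n, SketchPosting *out)
{
	size_t		nout = 0;
	size_t		i = 0;

	while (i < n)
	{
		size_t		start = i;
		SketchPosting *p = &out[nout++];

		/* Advance past every tuple that shares the key at "start" */
		while (i < n && in[i].key == in[start].key)
			i++;

		p->key = in[start].key;
		p->ntids = i - start;
		p->tids = malloc(p->ntids * sizeof(SketchTid));
		assert(p->tids != NULL);
		for (size_t k = 0; k < p->ntids; k++)
			p->tids[k] = in[start + k].tid;
	}

	return nout;
}

/* Rebuild the nth logical tuple from a posting tuple */
static SketchTuple
sketch_logical_tuple(const SketchPosting *p, size_t nth)
{
	SketchTuple t;

	assert(nth < p->ntids);
	t.key = p->key;
	t.tid = p->tids[nth];
	return t;
}
```

The point of the second function is that folding should never change the logical contents of the index: expanding every posting tuple gives back exactly the (key, TID) pairs that went in, which is why amcheck can fingerprint each logical tuple separately.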
--
Peter Geoghegan
Attachments:
v25-0004-DEBUG-Show-index-values-in-pageinspect.patch (application/x-patch)
From fa83d38bfd1ad868b22ad5fc390447be81c1c704 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Mon, 18 Nov 2019 19:35:30 -0800
Subject: [PATCH v25 4/4] DEBUG: Show index values in pageinspect
This is not intended for commit. It is included as a convenience for
reviewers.
---
contrib/pageinspect/btreefuncs.c | 65 ++++++++++++++++++--------
contrib/pageinspect/expected/btree.out | 2 +-
2 files changed, 47 insertions(+), 20 deletions(-)
diff --git a/contrib/pageinspect/btreefuncs.c b/contrib/pageinspect/btreefuncs.c
index 17f7ad186e..4eab8df098 100644
--- a/contrib/pageinspect/btreefuncs.c
+++ b/contrib/pageinspect/btreefuncs.c
@@ -27,6 +27,7 @@
#include "postgres.h"
+#include "access/genam.h"
#include "access/nbtree.h"
#include "access/relation.h"
#include "catalog/namespace.h"
@@ -245,6 +246,7 @@ bt_page_stats(PG_FUNCTION_ARGS)
*/
struct user_args
{
+ Relation rel;
Page page;
OffsetNumber offset;
bool leafpage;
@@ -261,6 +263,7 @@ struct user_args
static Datum
bt_page_print_tuples(FuncCallContext *fctx, struct user_args *uargs)
{
+ Relation rel = uargs->rel;
Page page = uargs->page;
OffsetNumber offset = uargs->offset;
bool leafpage = uargs->leafpage;
@@ -295,26 +298,48 @@ bt_page_print_tuples(FuncCallContext *fctx, struct user_args *uargs)
values[j++] = BoolGetDatum(IndexTupleHasVarwidths(itup));
ptr = (char *) itup + IndexInfoFindDataOffset(itup->t_info);
- dlen = IndexTupleSize(itup) - IndexInfoFindDataOffset(itup->t_info);
-
- /*
- * Make sure that "data" column does not include posting list or pivot
- * tuple representation of heap TID
- */
- if (BTreeTupleIsPosting(itup))
- dlen -= IndexTupleSize(itup) - BTreeTupleGetPostingOffset(itup);
- else if (BTreeTupleIsPivot(itup) && BTreeTupleGetHeapTID(itup) != NULL)
- dlen -= MAXALIGN(sizeof(ItemPointerData));
-
- dump = palloc0(dlen * 3 + 1);
- datacstring = dump;
- for (off = 0; off < dlen; off++)
+ if (rel)
{
- if (off > 0)
- *dump++ = ' ';
- sprintf(dump, "%02x", *(ptr + off) & 0xff);
- dump += 2;
+ TupleDesc itupdesc = RelationGetDescr(rel);
+ Datum datvalues[INDEX_MAX_KEYS];
+ bool isnull[INDEX_MAX_KEYS];
+ int natts;
+ int indnkeyatts = rel->rd_index->indnkeyatts;
+
+ natts = BTreeTupleGetNAtts(itup, rel);
+
+ itupdesc->natts = Min(indnkeyatts, natts);
+ memset(&isnull, 0xFF, sizeof(isnull));
+ index_deform_tuple(itup, itupdesc, datvalues, isnull);
+ rel->rd_index->indnkeyatts = natts;
+ datacstring = BuildIndexValueDescription(rel, datvalues, isnull);
+ itupdesc->natts = IndexRelationGetNumberOfAttributes(rel);
+ rel->rd_index->indnkeyatts = indnkeyatts;
}
+ else
+ {
+ dlen = IndexTupleSize(itup) - IndexInfoFindDataOffset(itup->t_info);
+
+ /*
+ * Make sure that "data" column does not include posting list or pivot
+ * tuple representation of heap TID
+ */
+ if (BTreeTupleIsPosting(itup))
+ dlen -= IndexTupleSize(itup) - BTreeTupleGetPostingOffset(itup);
+ else if (BTreeTupleIsPivot(itup) && BTreeTupleGetHeapTID(itup) != NULL)
+ dlen -= MAXALIGN(sizeof(ItemPointerData));
+
+ dump = palloc0(dlen * 3 + 1);
+ datacstring = dump;
+ for (off = 0; off < dlen; off++)
+ {
+ if (off > 0)
+ *dump++ = ' ';
+ sprintf(dump, "%02x", *(ptr + off) & 0xff);
+ dump += 2;
+ }
+ }
+
values[j++] = CStringGetTextDatum(datacstring);
pfree(datacstring);
@@ -437,11 +462,11 @@ bt_page_items(PG_FUNCTION_ARGS)
uargs = palloc(sizeof(struct user_args));
+ uargs->rel = rel;
uargs->page = palloc(BLCKSZ);
memcpy(uargs->page, BufferGetPage(buffer), BLCKSZ);
UnlockReleaseBuffer(buffer);
- relation_close(rel, AccessShareLock);
uargs->offset = FirstOffsetNumber;
@@ -475,6 +500,7 @@ bt_page_items(PG_FUNCTION_ARGS)
}
else
{
+ relation_close(uargs->rel, AccessShareLock);
pfree(uargs->page);
pfree(uargs);
SRF_RETURN_DONE(fctx);
@@ -522,6 +548,7 @@ bt_page_items_bytea(PG_FUNCTION_ARGS)
uargs = palloc(sizeof(struct user_args));
+ uargs->rel = NULL;
uargs->page = VARDATA(raw_page);
uargs->offset = FirstOffsetNumber;
diff --git a/contrib/pageinspect/expected/btree.out b/contrib/pageinspect/expected/btree.out
index 1d45cd5c1e..3da5f37c3e 100644
--- a/contrib/pageinspect/expected/btree.out
+++ b/contrib/pageinspect/expected/btree.out
@@ -40,7 +40,7 @@ ctid | (0,1)
itemlen | 16
nulls | f
vars | f
-data | 01 00 00 00 00 00 00 01
+data | (a)=(72057594037927937)
dead | f
htid | (0,1)
tids |
--
2.17.1
v25-0003-Teach-pageinspect-about-nbtree-posting-lists.patch (application/x-patch)
From 826ac5d3ffc05285a7549d64e47472f73231fe40 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Mon, 10 Sep 2018 19:53:51 -0700
Subject: [PATCH v25 3/4] Teach pageinspect about nbtree posting lists.
Add a column for posting list TIDs to bt_page_items(). Also add a
column that displays a single heap TID value for each tuple, regardless
of whether or not "ctid" is used for heap TID. In the case of posting
list tuples, the value is the lowest heap TID in the posting list.
Arguably I should have done this when commit dd299df8 went in, since
that added a pivot tuple representation that could have a heap TID but
didn't use ctid for that purpose.
Also add a boolean column that displays the LP_DEAD bit value for each
non-pivot tuple.
No version bump for the pageinspect extension, since there hasn't been a
stable release since the last version bump (see commit 58b4cb30).
---
contrib/pageinspect/btreefuncs.c | 111 +++++++++++++++---
contrib/pageinspect/expected/btree.out | 6 +
contrib/pageinspect/pageinspect--1.7--1.8.sql | 36 ++++++
doc/src/sgml/pageinspect.sgml | 80 +++++++------
4 files changed, 181 insertions(+), 52 deletions(-)
diff --git a/contrib/pageinspect/btreefuncs.c b/contrib/pageinspect/btreefuncs.c
index 78cdc69ec7..17f7ad186e 100644
--- a/contrib/pageinspect/btreefuncs.c
+++ b/contrib/pageinspect/btreefuncs.c
@@ -31,9 +31,11 @@
#include "access/relation.h"
#include "catalog/namespace.h"
#include "catalog/pg_am.h"
+#include "catalog/pg_type.h"
#include "funcapi.h"
#include "miscadmin.h"
#include "pageinspect.h"
+#include "utils/array.h"
#include "utils/builtins.h"
#include "utils/rel.h"
#include "utils/varlena.h"
@@ -45,6 +47,8 @@ PG_FUNCTION_INFO_V1(bt_page_stats);
#define IS_INDEX(r) ((r)->rd_rel->relkind == RELKIND_INDEX)
#define IS_BTREE(r) ((r)->rd_rel->relam == BTREE_AM_OID)
+#define DatumGetItemPointer(X) ((ItemPointer) DatumGetPointer(X))
+#define ItemPointerGetDatum(X) PointerGetDatum(X)
/* note: BlockNumber is unsigned, hence can't be negative */
#define CHECK_RELATION_BLOCK_RANGE(rel, blkno) { \
@@ -243,6 +247,9 @@ struct user_args
{
Page page;
OffsetNumber offset;
+ bool leafpage;
+ bool rightmost;
+ TupleDesc tupd;
};
/*-------------------------------------------------------
@@ -252,17 +259,25 @@ struct user_args
* ------------------------------------------------------
*/
static Datum
-bt_page_print_tuples(FuncCallContext *fctx, Page page, OffsetNumber offset)
+bt_page_print_tuples(FuncCallContext *fctx, struct user_args *uargs)
{
- char *values[6];
+ Page page = uargs->page;
+ OffsetNumber offset = uargs->offset;
+ bool leafpage = uargs->leafpage;
+ bool rightmost = uargs->rightmost;
+ bool pivotoffset;
+ Datum values[9];
+ bool nulls[9];
HeapTuple tuple;
ItemId id;
IndexTuple itup;
int j;
int off;
int dlen;
- char *dump;
+ char *dump,
+ *datacstring;
char *ptr;
+ ItemPointer htid;
id = PageGetItemId(page, offset);
@@ -272,18 +287,27 @@ bt_page_print_tuples(FuncCallContext *fctx, Page page, OffsetNumber offset)
itup = (IndexTuple) PageGetItem(page, id);
j = 0;
- values[j++] = psprintf("%d", offset);
- values[j++] = psprintf("(%u,%u)",
- ItemPointerGetBlockNumberNoCheck(&itup->t_tid),
- ItemPointerGetOffsetNumberNoCheck(&itup->t_tid));
- values[j++] = psprintf("%d", (int) IndexTupleSize(itup));
- values[j++] = psprintf("%c", IndexTupleHasNulls(itup) ? 't' : 'f');
- values[j++] = psprintf("%c", IndexTupleHasVarwidths(itup) ? 't' : 'f');
+ memset(nulls, 0, sizeof(nulls));
+ values[j++] = Int16GetDatum(offset);
+ values[j++] = ItemPointerGetDatum(&itup->t_tid);
+ values[j++] = Int32GetDatum((int) IndexTupleSize(itup));
+ values[j++] = BoolGetDatum(IndexTupleHasNulls(itup));
+ values[j++] = BoolGetDatum(IndexTupleHasVarwidths(itup));
ptr = (char *) itup + IndexInfoFindDataOffset(itup->t_info);
dlen = IndexTupleSize(itup) - IndexInfoFindDataOffset(itup->t_info);
+
+ /*
+ * Make sure that "data" column does not include posting list or pivot
+ * tuple representation of heap TID
+ */
+ if (BTreeTupleIsPosting(itup))
+ dlen -= IndexTupleSize(itup) - BTreeTupleGetPostingOffset(itup);
+ else if (BTreeTupleIsPivot(itup) && BTreeTupleGetHeapTID(itup) != NULL)
+ dlen -= MAXALIGN(sizeof(ItemPointerData));
+
dump = palloc0(dlen * 3 + 1);
- values[j] = dump;
+ datacstring = dump;
for (off = 0; off < dlen; off++)
{
if (off > 0)
@@ -291,8 +315,57 @@ bt_page_print_tuples(FuncCallContext *fctx, Page page, OffsetNumber offset)
sprintf(dump, "%02x", *(ptr + off) & 0xff);
dump += 2;
}
+ values[j++] = CStringGetTextDatum(datacstring);
+ pfree(datacstring);
- tuple = BuildTupleFromCStrings(fctx->attinmeta, values);
+ /*
+ * Avoid indicating that pivot tuple from !heapkeyspace index (which won't
+ * have v4+ status bit set) is dead or has a heap TID -- that can only
+ * happen with non-pivot tuples. (Most backend code can use the
+ * heapkeyspace field from the metapage to figure out which representation
+ * to expect, but we have to be a bit creative here.)
+ */
+ pivotoffset = (!leafpage || (!rightmost && offset == P_HIKEY));
+
+ /* LP_DEAD status bit */
+ if (!pivotoffset)
+ values[j++] = BoolGetDatum(ItemIdIsDead(id));
+ else
+ nulls[j++] = true;
+
+ htid = BTreeTupleGetHeapTID(itup);
+ if (pivotoffset && !BTreeTupleIsPivot(itup))
+ htid = NULL;
+
+ if (htid)
+ values[j++] = ItemPointerGetDatum(htid);
+ else
+ nulls[j++] = true;
+
+ if (BTreeTupleIsPosting(itup))
+ {
+ /* build an array of item pointers */
+ ItemPointer tids;
+ Datum *tids_datum;
+ int nposting;
+
+ tids = BTreeTupleGetPosting(itup);
+ nposting = BTreeTupleGetNPosting(itup);
+ tids_datum = (Datum *) palloc(nposting * sizeof(Datum));
+ for (int i = 0; i < nposting; i++)
+ tids_datum[i] = ItemPointerGetDatum(&tids[i]);
+ values[j++] = PointerGetDatum(construct_array(tids_datum,
+ nposting,
+ TIDOID,
+ sizeof(ItemPointerData),
+ false, 's'));
+ pfree(tids_datum);
+ }
+ else
+ nulls[j++] = true;
+
+ /* Build and return the result tuple */
+ tuple = heap_form_tuple(uargs->tupd, values, nulls);
return HeapTupleGetDatum(tuple);
}
@@ -378,12 +451,13 @@ bt_page_items(PG_FUNCTION_ARGS)
elog(NOTICE, "page is deleted");
fctx->max_calls = PageGetMaxOffsetNumber(uargs->page);
+ uargs->leafpage = P_ISLEAF(opaque);
+ uargs->rightmost = P_RIGHTMOST(opaque);
/* Build a tuple descriptor for our result type */
if (get_call_result_type(fcinfo, NULL, &tupleDesc) != TYPEFUNC_COMPOSITE)
elog(ERROR, "return type must be a row type");
-
- fctx->attinmeta = TupleDescGetAttInMetadata(tupleDesc);
+ uargs->tupd = tupleDesc;
fctx->user_fctx = uargs;
@@ -395,7 +469,7 @@ bt_page_items(PG_FUNCTION_ARGS)
if (fctx->call_cntr < fctx->max_calls)
{
- result = bt_page_print_tuples(fctx, uargs->page, uargs->offset);
+ result = bt_page_print_tuples(fctx, uargs);
uargs->offset++;
SRF_RETURN_NEXT(fctx, result);
}
@@ -463,12 +537,13 @@ bt_page_items_bytea(PG_FUNCTION_ARGS)
elog(NOTICE, "page is deleted");
fctx->max_calls = PageGetMaxOffsetNumber(uargs->page);
+ uargs->leafpage = P_ISLEAF(opaque);
+ uargs->rightmost = P_RIGHTMOST(opaque);
/* Build a tuple descriptor for our result type */
if (get_call_result_type(fcinfo, NULL, &tupleDesc) != TYPEFUNC_COMPOSITE)
elog(ERROR, "return type must be a row type");
-
- fctx->attinmeta = TupleDescGetAttInMetadata(tupleDesc);
+ uargs->tupd = tupleDesc;
fctx->user_fctx = uargs;
@@ -480,7 +555,7 @@ bt_page_items_bytea(PG_FUNCTION_ARGS)
if (fctx->call_cntr < fctx->max_calls)
{
- result = bt_page_print_tuples(fctx, uargs->page, uargs->offset);
+ result = bt_page_print_tuples(fctx, uargs);
uargs->offset++;
SRF_RETURN_NEXT(fctx, result);
}
diff --git a/contrib/pageinspect/expected/btree.out b/contrib/pageinspect/expected/btree.out
index 07c2dcd771..1d45cd5c1e 100644
--- a/contrib/pageinspect/expected/btree.out
+++ b/contrib/pageinspect/expected/btree.out
@@ -41,6 +41,9 @@ itemlen | 16
nulls | f
vars | f
data | 01 00 00 00 00 00 00 01
+dead | f
+htid | (0,1)
+tids |
SELECT * FROM bt_page_items('test1_a_idx', 2);
ERROR: block number out of range
@@ -54,6 +57,9 @@ itemlen | 16
nulls | f
vars | f
data | 01 00 00 00 00 00 00 01
+dead | f
+htid | (0,1)
+tids |
SELECT * FROM bt_page_items(get_raw_page('test1_a_idx', 2));
ERROR: block number 2 is out of range for relation "test1_a_idx"
diff --git a/contrib/pageinspect/pageinspect--1.7--1.8.sql b/contrib/pageinspect/pageinspect--1.7--1.8.sql
index 2a7c4b3516..70f1ab0467 100644
--- a/contrib/pageinspect/pageinspect--1.7--1.8.sql
+++ b/contrib/pageinspect/pageinspect--1.7--1.8.sql
@@ -14,3 +14,39 @@ CREATE FUNCTION heap_tuple_infomask_flags(
RETURNS record
AS 'MODULE_PATHNAME', 'heap_tuple_infomask_flags'
LANGUAGE C STRICT PARALLEL SAFE;
+
+--
+-- bt_page_items(text, int4)
+--
+DROP FUNCTION bt_page_items(text, int4);
+CREATE FUNCTION bt_page_items(IN relname text, IN blkno int4,
+ OUT itemoffset smallint,
+ OUT ctid tid,
+ OUT itemlen smallint,
+ OUT nulls bool,
+ OUT vars bool,
+ OUT data text,
+ OUT dead boolean,
+ OUT htid tid,
+ OUT tids tid[])
+RETURNS SETOF record
+AS 'MODULE_PATHNAME', 'bt_page_items'
+LANGUAGE C STRICT PARALLEL SAFE;
+
+--
+-- bt_page_items(bytea)
+--
+DROP FUNCTION bt_page_items(bytea);
+CREATE FUNCTION bt_page_items(IN page bytea,
+ OUT itemoffset smallint,
+ OUT ctid tid,
+ OUT itemlen smallint,
+ OUT nulls bool,
+ OUT vars bool,
+ OUT data text,
+ OUT dead boolean,
+ OUT htid tid,
+ OUT tids tid[])
+RETURNS SETOF record
+AS 'MODULE_PATHNAME', 'bt_page_items_bytea'
+LANGUAGE C STRICT PARALLEL SAFE;
diff --git a/doc/src/sgml/pageinspect.sgml b/doc/src/sgml/pageinspect.sgml
index 7e2e1487d7..1763e9c6f0 100644
--- a/doc/src/sgml/pageinspect.sgml
+++ b/doc/src/sgml/pageinspect.sgml
@@ -329,11 +329,11 @@ test=# SELECT * FROM bt_page_stats('pg_cast_oid_index', 1);
-[ RECORD 1 ]-+-----
blkno | 1
type | l
-live_items | 256
+live_items | 224
dead_items | 0
-avg_item_size | 12
+avg_item_size | 16
page_size | 8192
-free_size | 4056
+free_size | 3668
btpo_prev | 0
btpo_next | 0
btpo | 0
@@ -356,33 +356,45 @@ btpo_flags | 3
<function>bt_page_items</function> returns detailed information about
all of the items on a B-tree index page. For example:
<screen>
-test=# SELECT * FROM bt_page_items('pg_cast_oid_index', 1);
- itemoffset | ctid | itemlen | nulls | vars | data
-------------+---------+---------+-------+------+-------------
- 1 | (0,1) | 12 | f | f | 23 27 00 00
- 2 | (0,2) | 12 | f | f | 24 27 00 00
- 3 | (0,3) | 12 | f | f | 25 27 00 00
- 4 | (0,4) | 12 | f | f | 26 27 00 00
- 5 | (0,5) | 12 | f | f | 27 27 00 00
- 6 | (0,6) | 12 | f | f | 28 27 00 00
- 7 | (0,7) | 12 | f | f | 29 27 00 00
- 8 | (0,8) | 12 | f | f | 2a 27 00 00
+regression=# SELECT * FROM bt_page_items('tenk2_unique1', 5);
+ itemoffset | ctid | itemlen | nulls | vars | data | dead | htid | tids
+------------+----------+---------+-------+------+-------------------------+------+----------+------
+ 1 | (40,1) | 16 | f | f | b8 05 00 00 00 00 00 00 | | |
+ 2 | (58,11) | 16 | f | f | 4a 04 00 00 00 00 00 00 | f | (58,11) |
+ 3 | (266,4) | 16 | f | f | 4b 04 00 00 00 00 00 00 | f | (266,4) |
+ 4 | (279,25) | 16 | f | f | 4c 04 00 00 00 00 00 00 | f | (279,25) |
+ 5 | (333,11) | 16 | f | f | 4d 04 00 00 00 00 00 00 | f | (333,11) |
+ 6 | (87,24) | 16 | f | f | 4e 04 00 00 00 00 00 00 | f | (87,24) |
+ 7 | (38,22) | 16 | f | f | 4f 04 00 00 00 00 00 00 | f | (38,22) |
+ 8 | (272,17) | 16 | f | f | 50 04 00 00 00 00 00 00 | f | (272,17) |
</screen>
- In a B-tree leaf page, <structfield>ctid</structfield> points to a heap tuple.
- In an internal page, the block number part of <structfield>ctid</structfield>
- points to another page in the index itself, while the offset part
- (the second number) is ignored and is usually 1.
+ In a B-tree leaf page, <structfield>ctid</structfield> usually
+ points to a heap tuple, and <structfield>dead</structfield> may
+ indicate that the item has its <literal>LP_DEAD</literal> bit
+ set. In an internal page, the block number part of
+ <structfield>ctid</structfield> points to another page in the
+ index itself, while the offset part (the second number) encodes
+ metadata about the tuple. Posting list tuples on leaf pages
+ also use <structfield>ctid</structfield> for metadata.
+ <structfield>htid</structfield> always shows a single heap TID
+ for the tuple, regardless of how it is represented (internal
+ page tuples may need to store a heap TID when there are many
+ duplicate tuples on descendent leaf pages).
+ <structfield>tids</structfield> is a list of TIDs that is stored
+ within posting list tuples (tuples created by deduplication).
</para>
<para>
Note that the first item on any non-rightmost page (any page with
a non-zero value in the <structfield>btpo_next</structfield> field) is the
page's <quote>high key</quote>, meaning its <structfield>data</structfield>
serves as an upper bound on all items appearing on the page, while
- its <structfield>ctid</structfield> field is meaningless. Also, on non-leaf
- pages, the first real data item (the first item that is not a high
- key) is a <quote>minus infinity</quote> item, with no actual value
- in its <structfield>data</structfield> field. Such an item does have a valid
- downlink in its <structfield>ctid</structfield> field, however.
+ its <structfield>ctid</structfield> field does not point to
+ another block. Also, on non-leaf pages, the first real data item
+ (the first item that is not a high key) is a <quote>minus
+ infinity</quote> item, with no actual value in its
+ <structfield>data</structfield> field. Such an item does have a
+ valid downlink in its <structfield>ctid</structfield> field,
+ however.
</para>
</listitem>
</varlistentry>
@@ -402,17 +414,17 @@ test=# SELECT * FROM bt_page_items('pg_cast_oid_index', 1);
with <function>get_raw_page</function> should be passed as argument. So
the last example could also be rewritten like this:
<screen>
-test=# SELECT * FROM bt_page_items(get_raw_page('pg_cast_oid_index', 1));
- itemoffset | ctid | itemlen | nulls | vars | data
-------------+---------+---------+-------+------+-------------
- 1 | (0,1) | 12 | f | f | 23 27 00 00
- 2 | (0,2) | 12 | f | f | 24 27 00 00
- 3 | (0,3) | 12 | f | f | 25 27 00 00
- 4 | (0,4) | 12 | f | f | 26 27 00 00
- 5 | (0,5) | 12 | f | f | 27 27 00 00
- 6 | (0,6) | 12 | f | f | 28 27 00 00
- 7 | (0,7) | 12 | f | f | 29 27 00 00
- 8 | (0,8) | 12 | f | f | 2a 27 00 00
+regression=# SELECT * FROM bt_page_items(get_raw_page('tenk2_unique1', 5));
+ itemoffset | ctid | itemlen | nulls | vars | data | dead | htid | tids
+------------+----------+---------+-------+------+-------------------------+------+----------+------
+ 1 | (40,1) | 16 | f | f | b8 05 00 00 00 00 00 00 | | |
+ 2 | (58,11) | 16 | f | f | 4a 04 00 00 00 00 00 00 | f | (58,11) |
+ 3 | (266,4) | 16 | f | f | 4b 04 00 00 00 00 00 00 | f | (266,4) |
+ 4 | (279,25) | 16 | f | f | 4c 04 00 00 00 00 00 00 | f | (279,25) |
+ 5 | (333,11) | 16 | f | f | 4d 04 00 00 00 00 00 00 | f | (333,11) |
+ 6 | (87,24) | 16 | f | f | 4e 04 00 00 00 00 00 00 | f | (87,24) |
+ 7 | (38,22) | 16 | f | f | 4f 04 00 00 00 00 00 00 | f | (38,22) |
+ 8 | (272,17) | 16 | f | f | 50 04 00 00 00 00 00 00 | f | (272,17) |
</screen>
All the other details are the same as explained in the previous item.
</para>
--
2.17.1
Attachment: v25-0002-Add-deduplication-to-nbtree.patch (application/x-patch)
From 0efdbf168fc324b0173cbc1a2019c4748d5f312a Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Wed, 25 Sep 2019 10:08:53 -0700
Subject: [PATCH v25 2/4] Add deduplication to nbtree
---
src/include/access/nbtree.h | 329 +++++++-
src/include/access/nbtxlog.h | 71 +-
src/include/access/rmgrlist.h | 2 +-
src/backend/access/common/reloptions.c | 9 +
src/backend/access/index/genam.c | 4 +
src/backend/access/nbtree/Makefile | 1 +
src/backend/access/nbtree/README | 74 +-
src/backend/access/nbtree/nbtdedup.c | 715 ++++++++++++++++++
src/backend/access/nbtree/nbtinsert.c | 343 ++++++++-
src/backend/access/nbtree/nbtpage.c | 238 +++++-
src/backend/access/nbtree/nbtree.c | 167 +++-
src/backend/access/nbtree/nbtsearch.c | 250 +++++-
src/backend/access/nbtree/nbtsort.c | 204 ++++-
src/backend/access/nbtree/nbtsplitloc.c | 36 +-
src/backend/access/nbtree/nbtutils.c | 204 ++++-
src/backend/access/nbtree/nbtxlog.c | 236 +++++-
src/backend/access/rmgrdesc/nbtdesc.c | 25 +-
src/backend/utils/misc/guc.c | 28 +
src/backend/utils/misc/postgresql.conf.sample | 1 +
src/bin/psql/tab-complete.c | 4 +-
contrib/amcheck/verify_nbtree.c | 180 ++++-
doc/src/sgml/btree.sgml | 123 ++-
doc/src/sgml/charset.sgml | 9 +-
doc/src/sgml/config.sgml | 33 +
doc/src/sgml/maintenance.sgml | 8 +
doc/src/sgml/ref/create_index.sgml | 44 +-
doc/src/sgml/ref/reindex.sgml | 5 +-
src/test/regress/expected/btree_index.out | 16 +
src/test/regress/sql/btree_index.sql | 17 +
29 files changed, 3131 insertions(+), 245 deletions(-)
create mode 100644 src/backend/access/nbtree/nbtdedup.c
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 9833cc10bd..1482d5ab1a 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -24,6 +24,17 @@
#include "storage/bufmgr.h"
#include "storage/shm_toc.h"
+/* deduplication GUC modes */
+typedef enum DeduplicationMode
+{
+ DEDUP_OFF = 0, /* disabled */
+ DEDUP_ON, /* enabled generally */
+ DEDUP_NONUNIQUE /* enabled with non-unique indexes only (default) */
+} DeduplicationMode;
+
+/* GUC parameter */
+extern int btree_deduplication;
+
/* There's room for a 16-bit vacuum cycle ID in BTPageOpaqueData */
typedef uint16 BTCycleId;
@@ -108,6 +119,7 @@ typedef struct BTMetaPageData
* pages */
float8 btm_last_cleanup_num_heap_tuples; /* number of heap tuples
* during last cleanup */
+ bool btm_safededup; /* deduplication known to be safe? */
} BTMetaPageData;
#define BTPageGetMeta(p) \
@@ -115,7 +127,8 @@ typedef struct BTMetaPageData
/*
* The current Btree version is 4. That's what you'll get when you create
- * a new index.
+ * a new index. The btm_safededup field can only be set if this happened
+ * on Postgres 13, but it's safe to read with version 3 indexes.
*
* Btree version 3 was used in PostgreSQL v11. It is mostly the same as
* version 4, but heap TIDs were not part of the keyspace. Index tuples
@@ -132,8 +145,8 @@ typedef struct BTMetaPageData
#define BTREE_METAPAGE 0 /* first page is meta */
#define BTREE_MAGIC 0x053162 /* magic number in metapage */
#define BTREE_VERSION 4 /* current version number */
-#define BTREE_MIN_VERSION 2 /* minimal supported version number */
-#define BTREE_NOVAC_VERSION 3 /* minimal version with all meta fields */
+#define BTREE_MIN_VERSION 2 /* minimum supported version */
+#define BTREE_NOVAC_VERSION 3 /* version with all meta fields set */
/*
* Maximum size of a btree index entry, including its tuple header.
@@ -155,6 +168,26 @@ typedef struct BTMetaPageData
MAXALIGN_DOWN((PageGetPageSize(page) - \
MAXALIGN(SizeOfPageHeaderData + 3*sizeof(ItemIdData)) - \
MAXALIGN(sizeof(BTPageOpaqueData))) / 3)
+/*
+ * MaxBTreeIndexTuplesPerPage is an upper bound on the number of "logical"
+ * tuples that may be stored on a btree leaf page. This is comparable to
+ * the generic/physical MaxIndexTuplesPerPage upper bound. A separate
+ * upper bound is needed in certain contexts due to posting list tuples,
+ * which only use a single physical page entry to store many logical
+ * tuples. (MaxBTreeIndexTuplesPerPage is used to size the per-page
+ * temporary buffers used by index scans.)
+ *
+ * Note: we don't bother considering per-physical-tuple overheads here to
+ * keep things simple (value is based on how many elements a single array
+ * of heap TIDs must have to fill the space between the page header and
+ * special area). The value is slightly higher (i.e. more conservative)
+ * than necessary as a result, which is considered acceptable. There will
+ * only be three (very large) physical posting list tuples in leaf pages
+ * that have the largest possible number of heap TIDs/logical tuples.
+ */
+#define MaxBTreeIndexTuplesPerPage \
+ (int) ((BLCKSZ - SizeOfPageHeaderData - sizeof(BTPageOpaqueData)) / \
+ sizeof(ItemPointerData))
/*
* The leaf-page fillfactor defaults to 90% but is user-adjustable.
@@ -230,16 +263,15 @@ typedef struct BTMetaPageData
* tuples (non-pivot tuples). _bt_check_natts() enforces the rules
* described here.
*
- * Non-pivot tuple format:
+ * Non-pivot tuple format (plain/non-posting variant):
*
* t_tid | t_info | key values | INCLUDE columns, if any
*
* t_tid points to the heap TID, which is a tiebreaker key column as of
- * BTREE_VERSION 4. Currently, the INDEX_ALT_TID_MASK status bit is never
- * set for non-pivot tuples.
+ * BTREE_VERSION 4.
*
- * All other types of index tuples ("pivot" tuples) only have key columns,
- * since pivot tuples only exist to represent how the key space is
+ * Non-pivot tuples complement pivot tuples, which only have key columns.
+ * The sole purpose of pivot tuples is to represent how the key space is
* separated. In general, any B-Tree index that has more than one level
* (i.e. any index that does not just consist of a metapage and a single
* leaf root page) must have some number of pivot tuples, since pivot
@@ -283,20 +315,103 @@ typedef struct BTMetaPageData
* future use. BT_N_KEYS_OFFSET_MASK should be large enough to store any
* number of columns/attributes <= INDEX_MAX_KEYS.
*
+ * Sometimes non-pivot tuples also use a representation that repurposes
+ * t_tid to store metadata rather than a TID. Postgres 13 introduced a new
+ * non-pivot tuple format to support deduplication: posting list tuples.
+ * Deduplication folds together multiple equal non-pivot tuples into a
+ * logically equivalent, space efficient representation. A posting list is
+ * an array of ItemPointerData elements. Regular non-pivot tuples are
+ * merged together to form posting list tuples lazily, at the point where
+ * we'd otherwise have to split a leaf page.
+ *
+ * Posting tuple format (alternative non-pivot tuple representation):
+ *
+ * t_tid | t_info | key values | posting list (TID array)
+ *
+ * Posting list tuples are recognized as such by having the
+ * INDEX_ALT_TID_MASK status bit set in t_info and the BT_IS_POSTING status
+ * bit set in t_tid. These flags redefine the content of the posting
+ * tuple's t_tid to store an offset to the posting list, as well as the
+ * total number of posting list array elements.
+ *
+ * The 12 least significant offset bits from t_tid are used to represent
+ * the number of posting items present in the tuple, leaving 4 status
+ * bits (BT_RESERVED_OFFSET_MASK bits), 3 of which are reserved for
+ * future use. Like any non-pivot tuple, the number of columns stored is
+ * always implicitly the total number in the index (in practice there can
+ * never be non-key columns stored, since deduplication is not supported
+ * with INCLUDE indexes).
+ *
* Note well: The macros that deal with the number of attributes in tuples
- * assume that a tuple with INDEX_ALT_TID_MASK set must be a pivot tuple,
- * and that a tuple without INDEX_ALT_TID_MASK set must be a non-pivot
- * tuple (or must have the same number of attributes as the index has
- * generally in the case of !heapkeyspace indexes). They will need to be
- * updated if non-pivot tuples ever get taught to use INDEX_ALT_TID_MASK
- * for something else.
+ * assume that a tuple with INDEX_ALT_TID_MASK set must be a pivot tuple or
+ * non-pivot posting tuple, and that a tuple without INDEX_ALT_TID_MASK set
+ * must be a non-pivot tuple (or must have the same number of attributes as
+ * the index has generally in the case of !heapkeyspace indexes).
*/
#define INDEX_ALT_TID_MASK INDEX_AM_RESERVED_BIT
/* Item pointer offset bits */
#define BT_RESERVED_OFFSET_MASK 0xF000
#define BT_N_KEYS_OFFSET_MASK 0x0FFF
+#define BT_N_POSTING_OFFSET_MASK 0x0FFF
#define BT_HEAP_TID_ATTR 0x1000
+#define BT_IS_POSTING 0x2000
+
+/*
+ * N.B.: BTreeTupleIsPivot() should only be used in code that deals with
+ * heapkeyspace indexes specifically. BTreeTupleIsPosting() works with all
+ * nbtree indexes, though.
+ */
+#define BTreeTupleIsPivot(itup) \
+ ( \
+ ((itup)->t_info & INDEX_ALT_TID_MASK && \
+ ((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) == 0))\
+ )
+#define BTreeTupleIsPosting(itup) \
+ ( \
+ ((itup)->t_info & INDEX_ALT_TID_MASK && \
+ ((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) != 0))\
+ )
+
+#define BTreeTupleClearBtIsPosting(itup) \
+ do { \
+ ItemPointerSetOffsetNumber(&(itup)->t_tid, \
+ ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & ~BT_IS_POSTING); \
+ } while(0)
+
+#define BTreeTupleGetNPosting(itup) \
+ ( \
+ AssertMacro(BTreeTupleIsPosting(itup)), \
+ ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_N_POSTING_OFFSET_MASK \
+ )
+#define BTreeTupleSetNPosting(itup, n) \
+ do { \
+ ItemPointerSetOffsetNumber(&(itup)->t_tid, (n) & BT_N_POSTING_OFFSET_MASK); \
+ Assert((itup)->t_info & INDEX_ALT_TID_MASK); \
+ Assert(!((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) != 0)); \
+ ItemPointerSetOffsetNumber(&(itup)->t_tid, \
+ ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) | BT_IS_POSTING); \
+ } while(0)
+
+/*
+ * If tuple is posting, t_tid.ip_blkid contains the byte offset of the
+ * posting list, measured from the start of the tuple
+ */
+#define BTreeTupleGetPostingOffset(itup) \
+ ( \
+ AssertMacro(BTreeTupleIsPosting(itup)), \
+ ItemPointerGetBlockNumberNoCheck(&((itup)->t_tid)) \
+ )
+#define BTreeSetPostingMeta(itup, nposting, off) \
+ do { \
+ BTreeTupleSetNPosting(itup, nposting); \
+ Assert(BTreeTupleIsPosting(itup)); \
+ ItemPointerSetBlockNumber(&((itup)->t_tid), (off)); \
+ } while(0)
+
+#define BTreeTupleGetPosting(itup) \
+ (ItemPointer) ((char*) (itup) + BTreeTupleGetPostingOffset(itup))
+#define BTreeTupleGetPostingN(itup,n) \
+ (BTreeTupleGetPosting(itup) + (n))
/* Get/set downlink block number */
#define BTreeInnerTupleGetDownLink(itup) \
@@ -327,40 +442,69 @@ typedef struct BTMetaPageData
*/
#define BTreeTupleGetNAtts(itup, rel) \
( \
- (itup)->t_info & INDEX_ALT_TID_MASK ? \
+ (BTreeTupleIsPivot(itup)) ? \
( \
ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_N_KEYS_OFFSET_MASK \
) \
: \
IndexRelationGetNumberOfAttributes(rel) \
)
-#define BTreeTupleSetNAtts(itup, n) \
- do { \
- (itup)->t_info |= INDEX_ALT_TID_MASK; \
- ItemPointerSetOffsetNumber(&(itup)->t_tid, (n) & BT_N_KEYS_OFFSET_MASK); \
- } while(0)
+
+static inline void
+BTreeTupleSetNAtts(IndexTuple itup, int n)
+{
+ Assert(!BTreeTupleIsPosting(itup));
+ itup->t_info |= INDEX_ALT_TID_MASK;
+ ItemPointerSetOffsetNumber(&itup->t_tid, n & BT_N_KEYS_OFFSET_MASK);
+}
/*
- * Get tiebreaker heap TID attribute, if any. Macro works with both pivot
- * and non-pivot tuples, despite differences in how heap TID is represented.
+ * Get tiebreaker heap TID attribute, if any.
+ *
+ * This returns the first/lowest heap TID in the case of a posting list tuple.
*/
-#define BTreeTupleGetHeapTID(itup) \
- ( \
- (itup)->t_info & INDEX_ALT_TID_MASK && \
- (ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_HEAP_TID_ATTR) != 0 ? \
- ( \
- (ItemPointer) (((char *) (itup) + IndexTupleSize(itup)) - \
- sizeof(ItemPointerData)) \
- ) \
- : (itup)->t_info & INDEX_ALT_TID_MASK ? NULL : (ItemPointer) &((itup)->t_tid) \
- )
+static inline ItemPointer
+BTreeTupleGetHeapTID(IndexTuple itup)
+{
+ if (BTreeTupleIsPivot(itup))
+ {
+ /* Pivot tuple heap TID representation? */
+ if ((ItemPointerGetOffsetNumberNoCheck(&itup->t_tid) &
+ BT_HEAP_TID_ATTR) != 0)
+ return (ItemPointer) ((char *) itup + IndexTupleSize(itup) -
+ sizeof(ItemPointerData));
+
+ /* Heap TID attribute was truncated */
+ return NULL;
+ }
+ else if (BTreeTupleIsPosting(itup))
+ return BTreeTupleGetPosting(itup);
+
+ return &(itup->t_tid);
+}
+
/*
- * Set the heap TID attribute for a tuple that uses the INDEX_ALT_TID_MASK
- * representation (currently limited to pivot tuples)
+ * Get maximum heap TID attribute, which could be the only TID in the case of
+ * a non-pivot tuple that is not a posting list tuple. Works with
+ * non-pivot tuples only.
+ */
+static inline ItemPointer
+BTreeTupleGetMaxHeapTID(IndexTuple itup)
+{
+ Assert(!BTreeTupleIsPivot(itup));
+
+ if (BTreeTupleIsPosting(itup))
+ return BTreeTupleGetPosting(itup) + (BTreeTupleGetNPosting(itup) - 1);
+
+ return &(itup->t_tid);
+}
+
+/*
+ * Set the heap TID attribute for a pivot tuple
*/
#define BTreeTupleSetAltHeapTID(itup) \
do { \
- Assert((itup)->t_info & INDEX_ALT_TID_MASK); \
+ Assert(BTreeTupleIsPivot(itup)); \
ItemPointerSetOffsetNumber(&(itup)->t_tid, \
ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) | BT_HEAP_TID_ATTR); \
} while(0)
@@ -435,6 +579,11 @@ typedef BTStackData *BTStack;
* indexes whose version is >= version 4. It's convenient to keep this close
* by, rather than accessing the metapage repeatedly.
*
+ * safededup is set to indicate that index may use dynamic deduplication
+ * safely (index storage parameter separately indicates if deduplication is
+ * currently in use). This is also a property of the index relation rather
+ * than an indexscan that is kept around for convenience.
+ *
* anynullkeys indicates if any of the keys had NULL value when scankey was
* built from index tuple (note that already-truncated tuple key attributes
* set NULL as a placeholder key value, which also affects value of
@@ -470,6 +619,7 @@ typedef BTStackData *BTStack;
typedef struct BTScanInsertData
{
bool heapkeyspace;
+ bool safededup;
bool anynullkeys;
bool nextkey;
bool pivotsearch;
@@ -508,10 +658,70 @@ typedef struct BTInsertStateData
bool bounds_valid;
OffsetNumber low;
OffsetNumber stricthigh;
+
+ /*
+ * if _bt_binsrch_insert found the location inside existing posting list,
+ * save the position inside the list. This will be -1 in rare cases where
+ * the overlapping posting list is LP_DEAD.
+ */
+ int postingoff;
} BTInsertStateData;
typedef BTInsertStateData *BTInsertState;
+/*
+ * State used to represent a pending posting list during deduplication.
+ *
+ * Each entry represents a group of consecutive items from the page, starting
+ * from page offset number 'baseoff', which is the offset number of the "base"
+ * tuple on the page undergoing deduplication. 'nitems' is the total number
+ * of items from the page that will be merged to make a new posting tuple.
+ *
+ * Note: 'nitems' means the number of physical index tuples/line pointers on
+ * the page, starting with and including the item at offset number 'baseoff'
+ * (so nitems should be at least 2 when interval is used). These existing
+ * tuples may be posting list tuples or regular tuples.
+ */
+typedef struct BTDedupInterval
+{
+ OffsetNumber baseoff;
+ uint16 nitems;
+} BTDedupInterval;
+
+/*
+ * Btree-private state used to deduplicate items on a leaf page
+ */
+typedef struct BTDedupStateData
+{
+ Relation rel;
+ /* Deduplication status info for entire page/operation */
+ Size maxitemsize; /* Limit on size of final tuple */
+ IndexTuple newitem;
+ bool checkingunique; /* Use unique index strategy? */
+ OffsetNumber skippedbase; /* First offset skipped by checkingunique */
+
+ /* Metadata about current pending posting list */
+ ItemPointer htids; /* Heap TIDs in pending posting list */
+ int nhtids; /* # heap TIDs in htids array */
+ int nitems; /* See BTDedupInterval definition */
+ Size alltupsize; /* Includes line pointer overhead */
+ bool overlap; /* Avoid overlapping posting lists? */
+
+ /* Metadata about base tuple of current pending posting list */
+ IndexTuple base; /* Use to form new posting list */
+ OffsetNumber baseoff; /* page offset of base */
+ Size basetupsize; /* base size without posting list */
+
+ /*
+ * Pending posting list. Contains information about a group of
+ * consecutive items that will be deduplicated by creating a new posting
+ * list tuple.
+ */
+ BTDedupInterval interval;
+} BTDedupStateData;
+
+typedef BTDedupStateData *BTDedupState;
+
/*
* BTScanOpaqueData is the btree-private state needed for an indexscan.
* This consists of preprocessed scan keys (see _bt_preprocess_keys() for
@@ -535,7 +745,10 @@ typedef BTInsertStateData *BTInsertState;
* If we are doing an index-only scan, we save the entire IndexTuple for each
* matched item, otherwise only its heap TID and offset. The IndexTuples go
* into a separate workspace array; each BTScanPosItem stores its tuple's
- * offset within that array.
+ * offset within that array. Posting list tuples store a "base" tuple once,
+ * allowing the same key to be returned for each logical tuple associated
+ * with the physical posting list tuple (i.e. for each TID from the posting
+ * list).
*/
typedef struct BTScanPosItem /* what we remember about each match */
@@ -568,6 +781,12 @@ typedef struct BTScanPosData
*/
int nextTupleOffset;
+ /*
+ * Posting list tuples use postingTupleOffset to store the current
+ * location of the tuple that is returned multiple times.
+ */
+ int postingTupleOffset;
+
/*
* The items array is always ordered in index order (ie, increasing
* indexoffset). When scanning backwards it is convenient to fill the
@@ -579,7 +798,7 @@ typedef struct BTScanPosData
int lastItem; /* last valid index in items[] */
int itemIndex; /* current index in items[] */
- BTScanPosItem items[MaxIndexTuplesPerPage]; /* MUST BE LAST */
+ BTScanPosItem items[MaxBTreeIndexTuplesPerPage]; /* MUST BE LAST */
} BTScanPosData;
typedef BTScanPosData *BTScanPos;
@@ -687,6 +906,7 @@ typedef struct BTOptions
int fillfactor; /* page fill factor in percent (0..100) */
/* fraction of newly inserted tuples prior to trigger index cleanup */
float8 vacuum_cleanup_index_scale_factor;
+ bool deduplication; /* Use deduplication where safe? */
} BTOptions;
#define BTGetFillFactor(relation) \
@@ -695,8 +915,18 @@ typedef struct BTOptions
(relation)->rd_options ? \
((BTOptions *) (relation)->rd_options)->fillfactor : \
BTREE_DEFAULT_FILLFACTOR)
+#define BTGetUseDedup(relation) \
+ (AssertMacro(relation->rd_rel->relkind == RELKIND_INDEX && \
+ relation->rd_rel->relam == BTREE_AM_OID), \
+ ((relation)->rd_options ? \
+ ((BTOptions *) (relation)->rd_options)->deduplication : \
+ BTGetUseDedupGUC(relation)))
#define BTGetTargetPageFreeSpace(relation) \
(BLCKSZ * (100 - BTGetFillFactor(relation)) / 100)
+#define BTGetUseDedupGUC(relation) \
+ (relation->rd_index->indisunique ? \
+ btree_deduplication == DEDUP_ON : \
+ btree_deduplication != DEDUP_OFF)
/*
* Constant definition for progress reporting. Phase numbers must match
@@ -743,6 +973,22 @@ extern void _bt_parallel_release(IndexScanDesc scan, BlockNumber scan_page);
extern void _bt_parallel_done(IndexScanDesc scan);
extern void _bt_parallel_advance_array_keys(IndexScanDesc scan);
+/*
+ * prototypes for functions in nbtdedup.c
+ */
+extern void _bt_dedup_one_page(Relation rel, Buffer buffer, Relation heapRel,
+ IndexTuple newitem, Size newitemsz,
+ bool checkingunique);
+extern void _bt_dedup_start_pending(BTDedupState state, IndexTuple base,
+ OffsetNumber base_off);
+extern bool _bt_dedup_save_htid(BTDedupState state, IndexTuple itup);
+extern Size _bt_dedup_finish_pending(Buffer buffer, BTDedupState state,
+ bool need_wal);
+extern IndexTuple _bt_form_posting(IndexTuple tuple, ItemPointer htids,
+ int nhtids);
+extern IndexTuple _bt_swap_posting(IndexTuple newitem, IndexTuple oposting,
+ int postingoff);
+
/*
* prototypes for functions in nbtinsert.c
*/
@@ -761,7 +1007,8 @@ extern OffsetNumber _bt_findsplitloc(Relation rel, Page page,
/*
* prototypes for functions in nbtpage.c
*/
-extern void _bt_initmetapage(Page page, BlockNumber rootbknum, uint32 level);
+extern void _bt_initmetapage(Page page, BlockNumber rootbknum, uint32 level,
+ bool safededup);
extern void _bt_update_meta_cleanup_info(Relation rel,
TransactionId oldestBtpoXact, float8 numHeapTuples);
extern void _bt_upgrademetapage(Page page);
@@ -769,6 +1016,7 @@ extern Buffer _bt_getroot(Relation rel, int access);
extern Buffer _bt_gettrueroot(Relation rel);
extern int _bt_getrootheight(Relation rel);
extern bool _bt_heapkeyspace(Relation rel);
+extern bool _bt_safededup(Relation rel);
extern void _bt_checkpage(Relation rel, Buffer buf);
extern Buffer _bt_getbuf(Relation rel, BlockNumber blkno, int access);
extern Buffer _bt_relandgetbuf(Relation rel, Buffer obuf,
@@ -779,7 +1027,9 @@ extern bool _bt_page_recyclable(Page page);
extern void _bt_delitems_delete(Relation rel, Buffer buf,
OffsetNumber *itemnos, int nitems, Relation heapRel);
extern void _bt_delitems_vacuum(Relation rel, Buffer buf,
- OffsetNumber *deletable, int ndeletable);
+ OffsetNumber *deletable, int ndeletable,
+ OffsetNumber *updateitemnos,
+ IndexTuple *updated, int nupdateable);
extern int _bt_pagedel(Relation rel, Buffer buf);
/*
@@ -829,6 +1079,7 @@ extern bool _bt_check_natts(Relation rel, bool heapkeyspace, Page page,
OffsetNumber offnum);
extern void _bt_check_third_page(Relation rel, Relation heap,
bool needheaptidspace, Page page, IndexTuple newtup);
+extern bool _bt_opclasses_support_dedup(Relation index);
/*
* prototypes for functions in nbtvalidate.c
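
As a reading aid for the nbtree.h comments above on how a posting list tuple repurposes the offset half of t_tid, here is a minimal standalone sketch. It is not code from the patch. The two mask values are copied from the hunk above, but the sketch_* names are hypothetical, and the sketch works on a bare 16-bit value rather than on an IndexTuple. The real BTreeTupleIsPosting() also requires INDEX_ALT_TID_MASK to be set in t_info, which is left out here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Mask values copied from the nbtree.h hunk in this patch. */
#define BT_N_POSTING_OFFSET_MASK 0x0FFF
#define BT_IS_POSTING            0x2000

/*
 * Hypothetical stand-in for the 16-bit offset half of t_tid.  The patch
 * itself goes through ItemPointerSetOffsetNumber() and
 * ItemPointerGetOffsetNumberNoCheck() instead.
 */
typedef uint16_t sketch_offset;

/*
 * Store the number of posting list items in the low 12 bits and set the
 * BT_IS_POSTING status bit, mirroring what BTreeTupleSetNPosting() does.
 */
static inline sketch_offset
sketch_set_nposting(unsigned nposting)
{
	return (sketch_offset) ((nposting & BT_N_POSTING_OFFSET_MASK) |
							BT_IS_POSTING);
}

/* Test only the BT_IS_POSTING bit (the t_info check is omitted). */
static inline bool
sketch_is_posting(sketch_offset off)
{
	return (off & BT_IS_POSTING) != 0;
}

/* Recover the item count from the low 12 bits. */
static inline unsigned
sketch_get_nposting(sketch_offset off)
{
	return off & BT_N_POSTING_OFFSET_MASK;
}
```

One consequence of the 12-bit field is that a single posting list tuple can describe at most 4095 heap TIDs; a larger count would be silently masked, so callers have to cap the list before that point.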
diff --git a/src/include/access/nbtxlog.h b/src/include/access/nbtxlog.h
index 71435a13b3..d387905cc0 100644
--- a/src/include/access/nbtxlog.h
+++ b/src/include/access/nbtxlog.h
@@ -28,7 +28,8 @@
#define XLOG_BTREE_INSERT_META 0x20 /* same, plus update metapage */
#define XLOG_BTREE_SPLIT_L 0x30 /* add index tuple with split */
#define XLOG_BTREE_SPLIT_R 0x40 /* as above, new item on right */
-/* 0x50 and 0x60 are unused */
+#define XLOG_BTREE_DEDUP_PAGE 0x50 /* deduplicate tuples on leaf page */
+#define XLOG_BTREE_INSERT_POST 0x60 /* add index tuple with posting split */
#define XLOG_BTREE_DELETE 0x70 /* delete leaf index tuples for a page */
#define XLOG_BTREE_UNLINK_PAGE 0x80 /* delete a half-dead page */
#define XLOG_BTREE_UNLINK_PAGE_META 0x90 /* same, and update metapage */
@@ -53,21 +54,32 @@ typedef struct xl_btree_metadata
uint32 fastlevel;
TransactionId oldest_btpo_xact;
float8 last_cleanup_num_heap_tuples;
+ bool btm_safededup;
} xl_btree_metadata;
/*
* This is what we need to know about simple (without split) insert.
*
- * This data record is used for INSERT_LEAF, INSERT_UPPER, INSERT_META.
- * Note that INSERT_META implies it's not a leaf page.
+ * This data record is used for INSERT_LEAF, INSERT_UPPER, INSERT_META, and
+ * INSERT_POST. Note that INSERT_META and INSERT_UPPER imply it's not a
+ * leaf page, while INSERT_POST and INSERT_LEAF imply that it is.
*
- * Backup Blk 0: original page (data contains the inserted tuple)
+ * Backup Blk 0: original page
* Backup Blk 1: child's left sibling, if INSERT_UPPER or INSERT_META
* Backup Blk 2: xl_btree_metadata, if INSERT_META
+ *
+ * Note: The new tuple is actually the "original" new item in the posting
+ * list split insert case (i.e. the INSERT_POST case). A split offset for
+ * the posting list is logged before the original new item. Recovery needs
+ * both, since it must do an in-place update of the existing posting list
+ * that was split as an extra step. Also, recovery generates a "final"
+ * newitem. See _bt_swap_posting().
*/
typedef struct xl_btree_insert
{
OffsetNumber offnum;
+ /* posting split offset (INSERT_POST only) */
+ /* new tuple that was inserted (or orignewitem in INSERT_POST case) */
} xl_btree_insert;
#define SizeOfBtreeInsert (offsetof(xl_btree_insert, offnum) + sizeof(OffsetNumber))
@@ -91,9 +103,18 @@ typedef struct xl_btree_insert
*
* Backup Blk 0: original page / new left page
*
- * The left page's data portion contains the new item, if it's the _L variant.
- * An IndexTuple representing the high key of the left page must follow with
- * either variant.
+ * The left page's data portion contains the new item, if it's the _L variant
+ * (though _R variant page split records with a posting list split sometimes
+ * need to include newitem). An IndexTuple representing the high key of the
+ * left page must follow in all cases.
+ *
+ * The newitem is actually an "original" newitem when a posting list split
+ * occurs that requires that the original posting list be updated in passing.
+ * Recovery recognizes this case when postingoff is set. This corresponds to
+ * the xl_btree_insert INSERT_POST case. Note that postingoff will be set to
+ * zero (no posting split) when a posting list split occurs where both
+ * original posting list and newitem go on the right page, since recovery
+ * doesn't need to consider the posting list split at all.
*
* Backup Blk 1: new right page
*
@@ -111,10 +132,26 @@ typedef struct xl_btree_split
{
uint32 level; /* tree level of page being split */
OffsetNumber firstright; /* first item moved to right page */
- OffsetNumber newitemoff; /* new item's offset (useful for _L variant) */
+ OffsetNumber newitemoff; /* new item's offset */
+ uint16 postingoff; /* offset inside orig posting tuple */
} xl_btree_split;
-#define SizeOfBtreeSplit (offsetof(xl_btree_split, newitemoff) + sizeof(OffsetNumber))
+#define SizeOfBtreeSplit (offsetof(xl_btree_split, postingoff) + sizeof(uint16))
+
+/*
+ * When page is deduplicated, consecutive groups of tuples with equal keys are
+ * merged together into posting list tuples.
+ *
+ * The WAL record represents the interval that describes the posting tuple
+ * that should be added to the page.
+ */
+typedef struct xl_btree_dedup
+{
+ OffsetNumber baseoff;
+ uint16 nitems;
+} xl_btree_dedup;
+
+#define SizeOfBtreeDedup (offsetof(xl_btree_dedup, nitems) + sizeof(uint16))
/*
* This is what we need to know about delete of individual leaf index tuples.
@@ -148,19 +185,25 @@ typedef struct xl_btree_reuse_page
/*
* This is what we need to know about vacuum of individual leaf index tuples.
* The WAL record can represent deletion of any number of index tuples on a
- * single index page when executed by VACUUM.
+ * single index page when executed by VACUUM. It can also support "updates"
+ * of index tuples, which are actually deletions of "logical" tuples contained
+ * in an existing posting list tuple that will still have some remaining
+ * logical tuples once VACUUM finishes.
*
* Note that the WAL record in any vacuum of an index must have at least one
- * item to delete.
+ * item to delete or update.
*/
typedef struct xl_btree_vacuum
{
- uint32 ndeleted;
+ uint16 ndeleted;
+ uint16 nupdated;
/* DELETED TARGET OFFSET NUMBERS FOLLOW */
+ /* UPDATED TARGET OFFSET NUMBERS FOLLOW */
+ /* UPDATED TUPLES TO ADD BACK FOLLOW */
} xl_btree_vacuum;
-#define SizeOfBtreeVacuum (offsetof(xl_btree_vacuum, ndeleted) + sizeof(uint32))
+#define SizeOfBtreeVacuum (offsetof(xl_btree_vacuum, nupdated) + sizeof(uint16))
/*
* This is what we need to know about marking an empty branch for deletion.
@@ -241,6 +284,8 @@ typedef struct xl_btree_newroot
extern void btree_redo(XLogReaderState *record);
extern void btree_desc(StringInfo buf, XLogReaderState *record);
extern const char *btree_identify(uint8 info);
+extern void btree_xlog_startup(void);
+extern void btree_xlog_cleanup(void);
extern void btree_mask(char *pagedata, BlockNumber blkno);
#endif /* NBTXLOG_H */
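
The SizeOf* macros in the nbtxlog.h hunks above follow the usual WAL record idiom: the fixed header size is the offset of the last field plus that field's size, so trailing struct padding is never written to WAL. The standalone sketch below restates two of the structs from the patch with fixed-width types from stdint.h (an assumption for portability; the patch uses PostgreSQL's own uint16/uint32 typedefs) to show what the macros evaluate to.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed to match PostgreSQL's 16-bit OffsetNumber. */
typedef uint16_t OffsetNumber;

/* Field layout copied from the xl_btree_split hunk in this patch. */
typedef struct xl_btree_split
{
	uint32_t	level;		/* tree level of page being split */
	OffsetNumber firstright;	/* first item moved to right page */
	OffsetNumber newitemoff;	/* new item's offset */
	uint16_t	postingoff;	/* offset inside orig posting tuple */
} xl_btree_split;

/* Ends right after postingoff, excluding any trailing padding. */
#define SizeOfBtreeSplit \
	(offsetof(xl_btree_split, postingoff) + sizeof(uint16_t))

/* Field layout copied from the xl_btree_vacuum hunk in this patch. */
typedef struct xl_btree_vacuum
{
	uint16_t	ndeleted;
	uint16_t	nupdated;
} xl_btree_vacuum;

#define SizeOfBtreeVacuum \
	(offsetof(xl_btree_vacuum, nupdated) + sizeof(uint16_t))
```

On common ABIs the compiler pads xl_btree_split out to a multiple of 4 bytes, while SizeOfBtreeSplit stays at the 10 bytes that actually carry data. The variable-length payload (offset arrays, tuples) is registered separately and follows the fixed header.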
diff --git a/src/include/access/rmgrlist.h b/src/include/access/rmgrlist.h
index 3c0db2ccf5..2b8c6c7fc8 100644
--- a/src/include/access/rmgrlist.h
+++ b/src/include/access/rmgrlist.h
@@ -36,7 +36,7 @@ PG_RMGR(RM_RELMAP_ID, "RelMap", relmap_redo, relmap_desc, relmap_identify, NULL,
PG_RMGR(RM_STANDBY_ID, "Standby", standby_redo, standby_desc, standby_identify, NULL, NULL, NULL)
PG_RMGR(RM_HEAP2_ID, "Heap2", heap2_redo, heap2_desc, heap2_identify, NULL, NULL, heap_mask)
PG_RMGR(RM_HEAP_ID, "Heap", heap_redo, heap_desc, heap_identify, NULL, NULL, heap_mask)
-PG_RMGR(RM_BTREE_ID, "Btree", btree_redo, btree_desc, btree_identify, NULL, NULL, btree_mask)
+PG_RMGR(RM_BTREE_ID, "Btree", btree_redo, btree_desc, btree_identify, btree_xlog_startup, btree_xlog_cleanup, btree_mask)
PG_RMGR(RM_HASH_ID, "Hash", hash_redo, hash_desc, hash_identify, NULL, NULL, hash_mask)
PG_RMGR(RM_GIN_ID, "Gin", gin_redo, gin_desc, gin_identify, gin_xlog_startup, gin_xlog_cleanup, gin_mask)
PG_RMGR(RM_GIST_ID, "Gist", gist_redo, gist_desc, gist_identify, gist_xlog_startup, gist_xlog_cleanup, gist_mask)
diff --git a/src/backend/access/common/reloptions.c b/src/backend/access/common/reloptions.c
index 48377ace24..2b37afd9e5 100644
--- a/src/backend/access/common/reloptions.c
+++ b/src/backend/access/common/reloptions.c
@@ -158,6 +158,15 @@ static relopt_bool boolRelOpts[] =
},
true
},
+ {
+ {
+ "deduplication",
+ "Enables deduplication on btree index leaf pages",
+ RELOPT_KIND_BTREE,
+ ShareUpdateExclusiveLock
+ },
+ true
+ },
/* list terminator */
{{NULL}}
};
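
To make the interaction between this new reloption and the btree_deduplication GUC easier to follow, here is a minimal standalone sketch of the decision that the BTGetUseDedup() and BTGetUseDedupGUC() macros in nbtree.h encode. The enum is copied from the patch. SketchOptions and the sketch_* functions are hypothetical stand-ins for BTOptions, rd_options and rd_index->indisunique, so treat this as an illustration of the macro logic as written in this version of the patch, not as the patch's code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Copied from the nbtree.h hunk in this patch. */
typedef enum DeduplicationMode
{
	DEDUP_OFF = 0,			/* disabled */
	DEDUP_ON,				/* enabled generally */
	DEDUP_NONUNIQUE			/* enabled with non-unique indexes only */
} DeduplicationMode;

/* Hypothetical stand-in for parsed BTOptions; NULL models rd_options == NULL. */
typedef struct SketchOptions
{
	bool		deduplication;
} SketchOptions;

/* GUC fallback: unique indexes need DEDUP_ON, others anything but DEDUP_OFF. */
static bool
sketch_use_dedup_guc(bool indisunique, DeduplicationMode guc)
{
	return indisunique ? (guc == DEDUP_ON) : (guc != DEDUP_OFF);
}

/* The stored reloption wins whenever parsed options exist for the index. */
static bool
sketch_use_dedup(const SketchOptions *opts, bool indisunique,
				 DeduplicationMode guc)
{
	if (opts != NULL)
		return opts->deduplication;
	return sketch_use_dedup_guc(indisunique, guc);
}
```

With the default GUC setting (DEDUP_NONUNIQUE) and no stored options, a unique index does not deduplicate while a non-unique index does. Note that both macros only say whether deduplication is wanted; the separate btm_safededup metapage flag decides whether it is safe.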
diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
index 2599b5d342..6e1dc596e1 100644
--- a/src/backend/access/index/genam.c
+++ b/src/backend/access/index/genam.c
@@ -276,6 +276,10 @@ BuildIndexValueDescription(Relation indexRelation,
/*
* Get the latestRemovedXid from the table entries pointed at by the index
* tuples being deleted.
+ *
+ * Note: index access methods that don't consistently use the standard
+ * IndexTuple + heap TID item pointer representation will need to provide
+ * their own version of this function.
*/
TransactionId
index_compute_xid_horizon_for_tuples(Relation irel,
diff --git a/src/backend/access/nbtree/Makefile b/src/backend/access/nbtree/Makefile
index bf245f5dab..d69808e78c 100644
--- a/src/backend/access/nbtree/Makefile
+++ b/src/backend/access/nbtree/Makefile
@@ -14,6 +14,7 @@ include $(top_builddir)/src/Makefile.global
OBJS = \
nbtcompare.o \
+ nbtdedup.o \
nbtinsert.o \
nbtpage.o \
nbtree.o \
diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index 6db203e75c..54cb9db49d 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -432,7 +432,10 @@ because we allow LP_DEAD to be set with only a share lock (it's exactly
like a hint bit for a heap tuple), but physically removing tuples requires
exclusive lock. In the current code we try to remove LP_DEAD tuples when
we are otherwise faced with having to split a page to do an insertion (and
-hence have exclusive lock on it already).
+hence have exclusive lock on it already). Deduplication can also prevent
+a page split, but removing LP_DEAD tuples is the preferred approach.
+(Note that posting list tuples can only have their LP_DEAD bit set when
+every "logical" tuple represented within the posting list is known dead.)
This leaves the index in a state where it has no entry for a dead tuple
that still exists in the heap. This is not a problem for the current
@@ -710,6 +713,75 @@ the fallback strategy assumes that duplicates are mostly inserted in
ascending heap TID order. The page is split in a way that leaves the left
half of the page mostly full, and the right half of the page mostly empty.
+Notes about deduplication
+-------------------------
+
+We deduplicate non-pivot tuples in non-unique indexes to reduce storage
+overhead, and to avoid or at least delay page splits. Deduplication alters
+the physical representation of tuples without changing the logical contents
+of the index, and without adding overhead to read queries. Non-pivot
+tuples are folded together into a single physical tuple with a posting list
+(a simple array of heap TIDs with the standard item pointer format).
+Deduplication is always applied lazily, at the point where it would
+otherwise be necessary to perform a page split. It occurs only after
+LP_DEAD items have been removed, as our last line of defense against
+splitting a leaf page. We can set the LP_DEAD bit on posting list
+tuples, though only when all table tuples are known dead. (Bitmap scans
+cannot perform LP_DEAD bit setting, and are the common case with indexes
+that contain lots of duplicates, so this downside is considered
+acceptable.)
+
+Large groups of logical duplicates tend to appear together on the same leaf
+page due to the special duplicate logic used when choosing a split point.
+This facilitates lazy/dynamic deduplication. Deduplication can reliably
+deduplicate a large localized group of duplicates before it can span
+multiple leaf pages. Posting list tuples are subject to the same 1/3 of a
+page restriction as any other tuple.
+
+Lazy deduplication allows the page space accounting used during page splits
+to have absolutely minimal special case logic for posting lists. A posting
+list can be thought of as extra payload that suffix truncation will
+reliably truncate away as needed during page splits, just like non-key
+columns from an INCLUDE index tuple. An incoming tuple (which might cause
+a page split) can always be thought of as a non-posting-list tuple that
+must be inserted alongside existing items, without needing to consider
+deduplication. Most of the time, that's what actually happens: incoming
+tuples are either not duplicates, or are duplicates with a heap TID that
+doesn't overlap with any existing posting list tuple. When the incoming
+tuple really does overlap with an existing posting list, a posting list
+split is performed. Posting list splits work in a way that more or less
+preserves the illusion that all incoming tuples do not need to be merged
+with any existing posting list tuple.
+
+Posting list splits work by "overriding" the details of the incoming tuple.
+The heap TID of the incoming tuple is altered to make it match the
+rightmost heap TID from the existing/originally overlapping posting list.
+The offset number that the new/incoming tuple is to be inserted at is
+incremented so that it will be inserted to the right of the existing
+posting list. The insertion (or page split) operation that completes the
+insert does one extra step: an in-place update of the posting list. The
+update changes the posting list such that the "true" heap TID from the
+original incoming tuple is now contained in the posting list. We make
+space in the posting list by removing the heap TID that became the new
+item. The size of the posting list won't change, and so the page split
+space accounting does not need to care about posting lists. Also, overall
+space utilization is improved by keeping existing posting lists large.
+
+The representation of posting lists is identical to the posting lists used
+by GIN, so it would be straightforward to apply GIN's varbyte encoding
+compression scheme to individual posting lists. Posting list compression
+would break the assumptions made by posting list splits about page space
+accounting, though, so it's not clear how compression could be integrated
+with nbtree. Besides, posting list compression does not offer a compelling
+trade-off for nbtree, since in general nbtree is optimized for consistent
+performance with many concurrent readers and writers. A major goal of
+nbtree's lazy approach to deduplication is to limit the performance impact
+of deduplication with random updates. Even concurrent append-only inserts
+of the same key value will tend to have inserts of individual index tuples
+in an order that doesn't quite match heap TID order. In general, delaying
+deduplication avoids many unnecessary posting list splits, and minimizes
+page level fragmentation.
+
Notes About Data Representation
-------------------------------
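The posting list split described in the README notes above can be illustrated with a small standalone sketch. This is not the patch's code: the Tid type and the swap_posting() name are hypothetical stand-ins for ItemPointerData and _bt_swap_posting(), and the posting list is assumed to be a plain sorted array. The incoming TID takes a slot inside the list, the old rightmost TID becomes the TID of the item that is actually inserted, and the list keeps its original length.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-in for a heap TID (block number plus offset) */
typedef struct
{
	unsigned		block;
	unsigned short	offset;
} Tid;

/*
 * Place *newtid into the sorted posting list at position postingoff and
 * return the list's old rightmost TID through *newtid.  The list still
 * holds exactly nhtids entries afterwards, so its size does not change.
 */
static void
swap_posting(Tid *posting, int nhtids, int postingoff, Tid *newtid)
{
	Tid		rightmost = posting[nhtids - 1];

	assert(postingoff > 0 && postingoff < nhtids);

	/* Shift TIDs one place to the right, dropping the old rightmost TID */
	memmove(&posting[postingoff + 1], &posting[postingoff],
			(nhtids - postingoff - 1) * sizeof(Tid));

	/* Fill the gap with the incoming TID */
	posting[postingoff] = *newtid;

	/* The item that gets inserted now carries the old rightmost TID */
	*newtid = rightmost;
}
```

For example, with a posting list of (1,1) (1,3) (1,7) (2,2) and an incoming TID of (1,5) at postingoff 2, the list would become (1,1) (1,3) (1,5) (1,7) and the item to insert to its right would carry (2,2).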
diff --git a/src/backend/access/nbtree/nbtdedup.c b/src/backend/access/nbtree/nbtdedup.c
new file mode 100644
index 0000000000..1dbc32b70a
--- /dev/null
+++ b/src/backend/access/nbtree/nbtdedup.c
@@ -0,0 +1,715 @@
+/*-------------------------------------------------------------------------
+ *
+ * nbtdedup.c
+ * Deduplicate items in Lehman and Yao btrees for Postgres.
+ *
+ * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ * src/backend/access/nbtree/nbtdedup.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/nbtree.h"
+#include "access/nbtxlog.h"
+#include "miscadmin.h"
+#include "utils/rel.h"
+
+
+/*
+ * Try to deduplicate items to free at least enough space to avoid a page
+ * split. This function should be called during insertion, only after LP_DEAD
+ * items were removed by _bt_vacuum_one_page() to prevent a page split.
+ * (We'll have to kill LP_DEAD items here when the page's BTP_HAS_GARBAGE hint
+ * was not set, but that should be rare.)
+ *
+ * The strategy for !checkingunique callers is to perform as much
+ * deduplication as possible to free as much space as possible now, since
+ * making it harder to set LP_DEAD bits is considered an acceptable price for
+ * not having to deduplicate the same page many times. It is unlikely that
+ * the items on the page will have their LP_DEAD bit set in the future, since
+ * that hasn't happened before now (besides, entire posting lists can still
+ * have their LP_DEAD bit set).
+ *
+ * The strategy for checkingunique callers is rather different, since the
+ * overall goal is different. Deduplication cooperates with and enhances
+ * garbage collection, especially the LP_DEAD bit setting that takes place in
+ * _bt_check_unique(). Deduplication does as little as possible while still
+ * preventing a page split for caller, since it's less likely that posting
+ * lists will have their LP_DEAD bit set. Deduplication avoids creating new
+ * posting lists with only two heap TIDs, and also avoids creating new posting
+ * lists from an existing posting list. Deduplication is only useful when it
+ * delays a page split long enough for garbage collection to prevent the page
+ * split altogether. checkingunique deduplication can make all the difference
+ * in cases where VACUUM keeps up with dead index tuples, but "recently dead"
+ * index tuples are still numerous enough to cause page splits that are truly
+ * unnecessary.
+ *
+ * Note: If newitem contains NULL values in key attributes, caller will be
+ * !checkingunique even when rel is a unique index. The page in question will
+ * usually have many existing items with NULLs.
+ */
+void
+_bt_dedup_one_page(Relation rel, Buffer buffer, Relation heapRel,
+ IndexTuple newitem, Size newitemsz, bool checkingunique)
+{
+ OffsetNumber offnum,
+ minoff,
+ maxoff;
+ Page page = BufferGetPage(buffer);
+ BTPageOpaque oopaque;
+ BTDedupState state = NULL;
+ int natts = IndexRelationGetNumberOfAttributes(rel);
+ OffsetNumber deletable[MaxIndexTuplesPerPage];
+ bool minimal = checkingunique;
+ int ndeletable = 0;
+ Size pagesaving = 0;
+ int count = 0;
+ bool singlevalue = false;
+
+ oopaque = (BTPageOpaque) PageGetSpecialPointer(page);
+ /* init deduplication state needed to build posting tuples */
+ state = (BTDedupState) palloc(sizeof(BTDedupStateData));
+ state->rel = rel;
+
+ state->maxitemsize = BTMaxItemSize(page);
+ state->newitem = newitem;
+ state->checkingunique = checkingunique;
+ state->skippedbase = InvalidOffsetNumber;
+ /* Metadata about current pending posting list */
+ state->htids = NULL;
+ state->nhtids = 0;
+ state->nitems = 0;
+ state->alltupsize = 0;
+ state->overlap = false;
+ /* Metadata about base tuple of current pending posting list */
+ state->base = NULL;
+ state->baseoff = InvalidOffsetNumber;
+ state->basetupsize = 0;
+
+ minoff = P_FIRSTDATAKEY(oopaque);
+ maxoff = PageGetMaxOffsetNumber(page);
+
+ /*
+ * Delete dead tuples, if any. We cannot simply skip them in the loop
+ * below, because it's necessary to generate a special WAL record
+ * containing such tuples to compute latestRemovedXid on a standby server
+ * later.
+ *
+ * This should not affect performance, since it can only happen in a rare
+ * situation when BTP_HAS_GARBAGE flag was not set and _bt_vacuum_one_page
+ * was not called, or _bt_vacuum_one_page didn't remove all dead items.
+ */
+ for (offnum = minoff;
+ offnum <= maxoff;
+ offnum = OffsetNumberNext(offnum))
+ {
+ ItemId itemid = PageGetItemId(page, offnum);
+
+ if (ItemIdIsDead(itemid))
+ deletable[ndeletable++] = offnum;
+ }
+
+ if (ndeletable > 0)
+ {
+ /*
+ * Skip deduplication in rare cases where there were LP_DEAD items
+ * encountered here when that frees sufficient space for caller to
+ * avoid a page split
+ */
+ _bt_delitems_delete(rel, buffer, deletable, ndeletable, heapRel);
+ if (PageGetFreeSpace(page) >= newitemsz)
+ {
+ pfree(state);
+ return;
+ }
+
+ /* Continue with deduplication */
+ minoff = P_FIRSTDATAKEY(oopaque);
+ maxoff = PageGetMaxOffsetNumber(page);
+ }
+
+ /* No LP_DEAD items remain on the page, so clear the garbage flag */
+ oopaque->btpo_flags &= ~BTP_HAS_GARBAGE;
+
+ /* Passed-in newitemsz is MAXALIGNED but does not include line pointer */
+ newitemsz += sizeof(ItemIdData);
+ /* Conservatively size array */
+ state->htids = palloc(state->maxitemsize);
+
+ /*
+ * Determine if a "single value" strategy page split is likely to occur
+ * shortly after deduplication finishes. It should be possible for the
+ * single value split to find a split point that packs the left half of
+ * the split BTREE_SINGLEVAL_FILLFACTOR% full.
+ */
+ if (!checkingunique)
+ {
+ ItemId itemid;
+ IndexTuple itup;
+
+ itemid = PageGetItemId(page, minoff);
+ itup = (IndexTuple) PageGetItem(page, itemid);
+
+ if (_bt_keep_natts_fast(rel, newitem, itup) > natts)
+ {
+ itemid = PageGetItemId(page, PageGetMaxOffsetNumber(page));
+ itup = (IndexTuple) PageGetItem(page, itemid);
+
+ /*
+ * Use a different strategy if a future page split is likely to need to
+ * use "single value" strategy
+ */
+ if (_bt_keep_natts_fast(rel, newitem, itup) > natts)
+ singlevalue = true;
+ }
+ }
+
+ /*
+ * Iterate over tuples on the page, try to deduplicate them into posting
+ * lists, and update the page in place. NOTE: It's essential to reassess
+ * the max offset on each iteration, since it will change as items are
+ * deduplicated.
+ */
+ offnum = minoff;
+retry:
+ while (offnum <= PageGetMaxOffsetNumber(page))
+ {
+ ItemId itemid = PageGetItemId(page, offnum);
+ IndexTuple itup = (IndexTuple) PageGetItem(page, itemid);
+
+ Assert(!ItemIdIsDead(itemid));
+
+ if (state->nitems == 0)
+ {
+ /*
+ * No previous/base tuple for the data item -- use the data item
+ * as base tuple of pending posting list
+ */
+ _bt_dedup_start_pending(state, itup, offnum);
+ }
+ else if (_bt_keep_natts_fast(rel, state->base, itup) > natts &&
+ _bt_dedup_save_htid(state, itup))
+ {
+ /*
+ * Tuple is equal to base tuple of pending posting list. Heap
+ * TID(s) for itup have been saved in state. The next iteration
+ * will also end up here if it's possible to merge the next tuple
+ * into the same pending posting list.
+ */
+ }
+ else
+ {
+ /*
+ * Tuple is not equal to pending posting list tuple, or
+ * _bt_dedup_save_htid() opted to not merge current item into
+ * pending posting list for some other reason (e.g., adding more
+ * TIDs would have caused posting list to exceed BTMaxItemSize()
+ * limit).
+ *
+ * If state contains pending posting list with more than one item,
+ * form new posting tuple, and update the page. Otherwise, reset
+ * the state and move on.
+ */
+ pagesaving += _bt_dedup_finish_pending(buffer, state,
+ RelationNeedsWAL(rel));
+
+ count++;
+
+ /*
+ * When caller is a checkingunique caller and we have deduplicated
+ * enough to avoid a page split, do minimal deduplication in case
+ * the remaining items are about to be marked dead within
+ * _bt_check_unique().
+ */
+ if (minimal && pagesaving >= newitemsz)
+ break;
+
+ /*
+ * Consider special steps when a future page split of the leaf
+ * page is likely to occur using nbtsplitloc.c's "single value"
+ * strategy
+ */
+ if (singlevalue)
+ {
+ /*
+ * Adjust maxitemsize so that there isn't a third and final
+ * 1/3 of a page width tuple that fills the page to capacity.
+ * The third tuple produced should be smaller than the first
+ * two by an amount equal to the free space that nbtsplitloc.c
+ * is likely to want to leave behind when the page is split.
+ * When there are 3 posting lists on the page, then we end
+ * deduplication. Remaining tuples on the page can be
+ * deduplicated later, when they're on the new right sibling
+ * of this page, and the new sibling page needs to be split in
+ * turn.
+ *
+ * Note that it doesn't matter if there are items on the page
+ * that were already 1/3 of a page during current pass;
+ * they'll still count as the first two posting list tuples.
+ */
+ if (count == 2)
+ {
+ Size leftfree;
+
+ /* This calculation needs to match nbtsplitloc.c */
+ leftfree = PageGetPageSize(page) - SizeOfPageHeaderData -
+ MAXALIGN(sizeof(BTPageOpaqueData));
+ /* Subtract predicted size of new high key */
+ leftfree -= newitemsz + MAXALIGN(sizeof(ItemPointerData));
+
+ /*
+ * Reduce maxitemsize by an amount equal to target free
+ * space on left half of page
+ */
+ state->maxitemsize -= leftfree *
+ ((100 - BTREE_SINGLEVAL_FILLFACTOR) / 100.0);
+ }
+ else if (count == 3)
+ break;
+ }
+
+ /*
+ * Next iteration starts immediately after base tuple offset (this
+ * will be the next offset on the page when we didn't modify the
+ * page)
+ */
+ offnum = state->baseoff;
+ }
+
+ offnum = OffsetNumberNext(offnum);
+ }
+
+ /* Handle the last item when pending posting list is not empty */
+ if (state->nitems != 0)
+ {
+ pagesaving += _bt_dedup_finish_pending(buffer, state,
+ RelationNeedsWAL(rel));
+ count++;
+ }
+
+ if (pagesaving < newitemsz && state->skippedbase != InvalidOffsetNumber)
+ {
+ /*
+ * Didn't free enough space for new item in first checkingunique pass.
+ * Try making a second pass over the page, this time starting from the
+ * first candidate posting list base offset that was skipped over in
+ * the first pass (only do a second pass when this actually happened).
+ *
+ * The second pass over the page may deduplicate items that were
+ * initially passed over due to concerns about limiting the
+ * effectiveness of LP_DEAD bit setting within _bt_check_unique().
+ * Note that the second pass will still stop deduplicating as soon as
+ * enough space has been freed to avoid an immediate page split.
+ */
+ Assert(state->checkingunique);
+ offnum = state->skippedbase;
+
+ state->checkingunique = false;
+ state->skippedbase = InvalidOffsetNumber;
+ state->alltupsize = 0;
+ state->nitems = 0;
+ state->base = NULL;
+ state->baseoff = InvalidOffsetNumber;
+ state->basetupsize = 0;
+ goto retry;
+ }
+
+ /* Local space accounting should agree with page accounting */
+ Assert(pagesaving < newitemsz || PageGetExactFreeSpace(page) >= newitemsz);
+
+ /* be tidy */
+ pfree(state->htids);
+ pfree(state);
+}
+
+/*
+ * Create a new pending posting list tuple based on caller's tuple.
+ *
+ * Every tuple processed by the deduplication routines either becomes the base
+ * tuple for a posting list, or gets its heap TID(s) accepted into a pending
+ * posting list. A tuple that starts out as the base tuple for a posting list
+ * will only actually be rewritten within _bt_dedup_finish_pending() when
+ * there was at least one successful call to _bt_dedup_save_htid().
+ */
+void
+_bt_dedup_start_pending(BTDedupState state, IndexTuple base,
+ OffsetNumber baseoff)
+{
+ Assert(state->nhtids == 0);
+ Assert(state->nitems == 0);
+
+ /*
+ * Copy heap TIDs from new base tuple for new candidate posting list into
+ * htids array. Assume that we'll eventually create a new posting tuple by
+ * merging later tuples with this existing one, though we may not.
+ */
+ if (!BTreeTupleIsPosting(base))
+ {
+ memcpy(state->htids, &base->t_tid, sizeof(ItemPointerData));
+ state->nhtids = 1;
+ /* Save size of tuple without any posting list */
+ state->basetupsize = IndexTupleSize(base);
+ }
+ else
+ {
+ int nposting;
+
+ nposting = BTreeTupleGetNPosting(base);
+ memcpy(state->htids, BTreeTupleGetPosting(base),
+ sizeof(ItemPointerData) * nposting);
+ state->nhtids = nposting;
+ /* Save size of tuple without any posting list */
+ state->basetupsize = BTreeTupleGetPostingOffset(base);
+ }
+
+ /*
+ * Save new base tuple itself -- it'll be needed if we actually create a
+ * new posting list from new pending posting list.
+ *
+ * Must maintain size of all tuples (including line pointer overhead) to
+ * calculate space savings on page within _bt_dedup_finish_pending().
+ * Also, save number of base tuple logical tuples so that we can save
+ * cycles in the common case where an existing posting list can't or won't
+ * be merged with other tuples on the page.
+ */
+ state->nitems = 1;
+ state->base = base;
+ state->baseoff = baseoff;
+ state->alltupsize = MAXALIGN(IndexTupleSize(base)) + sizeof(ItemIdData);
+ /* Also save baseoff in pending state for interval */
+ state->interval.baseoff = state->baseoff;
+ state->overlap = false;
+ if (state->newitem)
+ {
+ /* Pending posting list might overlap with newitem's heap TID */
+ if (ItemPointerCompare(BTreeTupleGetHeapTID(base), BTreeTupleGetHeapTID(state->newitem)) < 0)
+ state->overlap = true;
+ }
+}
+
+/*
+ * Save itup heap TID(s) into pending posting list where possible.
+ *
+ * Returns bool indicating if the pending posting list managed by state has
+ * itup's heap TID(s) saved. When this is false, enlarging the pending
+ * posting list by the required amount would exceed the maxitemsize limit, so
+ * caller must finish the pending posting list tuple. (Generally itup becomes
+ * the base tuple of caller's new pending posting list).
+ */
+bool
+_bt_dedup_save_htid(BTDedupState state, IndexTuple itup)
+{
+ int nhtids;
+ ItemPointer htids;
+ Size mergedtupsz;
+
+ if (!BTreeTupleIsPosting(itup))
+ {
+ nhtids = 1;
+ htids = &itup->t_tid;
+ }
+ else
+ {
+ nhtids = BTreeTupleGetNPosting(itup);
+ htids = BTreeTupleGetPosting(itup);
+ }
+
+ /*
+ * Don't append (have caller finish pending posting list as-is) if
+ * appending heap TID(s) from itup would put us over limit
+ */
+ mergedtupsz = MAXALIGN(state->basetupsize +
+ (state->nhtids + nhtids) *
+ sizeof(ItemPointerData));
+
+ if (mergedtupsz > state->maxitemsize)
+ return false;
+
+ /* Don't merge existing posting lists with checkingunique */
+ if (state->checkingunique &&
+ (BTreeTupleIsPosting(state->base) || nhtids > 1))
+ {
+ /* May begin here if second pass over page is required */
+ if (state->skippedbase == InvalidOffsetNumber)
+ state->skippedbase = state->baseoff;
+ return false;
+ }
+
+ if (state->overlap)
+ {
+ if (ItemPointerCompare(BTreeTupleGetMaxHeapTID(itup), BTreeTupleGetHeapTID(state->newitem)) > 0)
+ {
+ /*
+ * newitem has heap TID in the range of the would-be new posting
+ * list. Avoid an immediate posting list split for caller.
+ */
+ if (_bt_keep_natts_fast(state->rel, state->newitem, itup) >
+ IndexRelationGetNumberOfAttributes(state->rel))
+ {
+ state->newitem = NULL; /* avoid unnecessary comparisons */
+ return false;
+ }
+ }
+ }
+
+ /*
+ * Save heap TIDs to pending posting list tuple -- itup can be merged into
+ * pending posting list
+ */
+ state->nitems++;
+ memcpy(state->htids + state->nhtids, htids,
+ sizeof(ItemPointerData) * nhtids);
+ state->nhtids += nhtids;
+ state->alltupsize += MAXALIGN(IndexTupleSize(itup)) + sizeof(ItemIdData);
+
+ return true;
+}
+
+/*
+ * Finalize pending posting list tuple, and add it to the page. Final tuple
+ * is based on saved base tuple, and saved list of heap TIDs.
+ *
+ * Returns space saving from deduplicating to make a new posting list tuple.
+ * Note that this includes line pointer overhead. This is zero in the case
+ * where no deduplication was possible.
+ */
+Size
+_bt_dedup_finish_pending(Buffer buffer, BTDedupState state, bool need_wal)
+{
+ Size spacesaving = 0;
+ Page page = BufferGetPage(buffer);
+ int minimum = 2;
+
+ Assert(state->nitems > 0);
+ Assert(state->nitems <= state->nhtids);
+ Assert(state->interval.baseoff == state->baseoff);
+
+ /*
+ * Only create a posting list when at least 3 heap TIDs will appear in the
+ * checkingunique case (checkingunique strategy won't merge existing
+ * posting list tuples, so we know that the number of items here must also
+ * be the total number of heap TIDs). Creating a new posting list with
+ * only two heap TIDs won't even save enough space to fit another
+ * duplicate with the same key as the posting list. This is a bad
+ * trade-off if there is a chance that the LP_DEAD bit can be set for
+ * either existing tuple by putting off deduplication.
+ *
+ * (Note that a second pass over the page can deduplicate the item if that
+ * is truly the only way to avoid a page split for checkingunique caller)
+ */
+ Assert(!state->checkingunique || state->nitems == 1 ||
+ state->nhtids == state->nitems);
+ if (state->checkingunique)
+ {
+ minimum = 3;
+ /* May begin here if second pass over page is required */
+ if (state->nitems == 2 && state->skippedbase == InvalidOffsetNumber)
+ state->skippedbase = state->baseoff;
+ }
+
+ if (state->nitems >= minimum)
+ {
+ IndexTuple final;
+ Size finalsz;
+ OffsetNumber offnum;
+ OffsetNumber deletable[MaxOffsetNumber];
+ int ndeletable = 0;
+
+ /* find all tuples that will be replaced with this new posting tuple */
+ for (offnum = state->baseoff;
+ offnum < state->baseoff + state->nitems;
+ offnum = OffsetNumberNext(offnum))
+ deletable[ndeletable++] = offnum;
+
+ /* Form a tuple with a posting list */
+ final = _bt_form_posting(state->base, state->htids, state->nhtids);
+ finalsz = IndexTupleSize(final);
+ spacesaving = state->alltupsize - (finalsz + sizeof(ItemIdData));
+ /* Must have saved some space */
+ Assert(spacesaving > 0 && spacesaving < BLCKSZ);
+
+ /* Save final number of items for posting list */
+ state->interval.nitems = state->nitems;
+
+ Assert(finalsz <= state->maxitemsize);
+ Assert(finalsz == MAXALIGN(IndexTupleSize(final)));
+
+ START_CRIT_SECTION();
+
+ /* Delete items to replace */
+ PageIndexMultiDelete(page, deletable, ndeletable);
+ /* Insert posting tuple */
+ if (PageAddItem(page, (Item) final, finalsz, state->baseoff, false,
+ false) == InvalidOffsetNumber)
+ elog(ERROR, "deduplication failed to add tuple to page");
+
+ MarkBufferDirty(buffer);
+
+ /* Log deduplicated items */
+ if (need_wal)
+ {
+ XLogRecPtr recptr;
+ xl_btree_dedup xlrec_dedup;
+
+ xlrec_dedup.baseoff = state->interval.baseoff;
+ xlrec_dedup.nitems = state->interval.nitems;
+
+ XLogBeginInsert();
+ XLogRegisterBuffer(0, buffer, REGBUF_STANDARD);
+ XLogRegisterData((char *) &xlrec_dedup, SizeOfBtreeDedup);
+
+ recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_DEDUP_PAGE);
+
+ PageSetLSN(page, recptr);
+ }
+
+ END_CRIT_SECTION();
+
+ pfree(final);
+ }
+
+ /* Reset state for next pending posting list */
+ state->nhtids = 0;
+ state->nitems = 0;
+ state->alltupsize = 0;
+
+ return spacesaving;
+}
+
+/*
+ * Build a posting list tuple from a "base" index tuple and a list of heap
+ * TIDs for posting list.
+ *
+ * Caller's "htids" array must be sorted in ascending order. Any heap TIDs
+ * from caller's base tuple will not appear in returned posting list.
+ *
+ * If nhtids == 1, builds a non-posting tuple (posting list tuples can never
+ * have a single heap TID).
+ */
+IndexTuple
+_bt_form_posting(IndexTuple tuple, ItemPointer htids, int nhtids)
+{
+ uint32 keysize,
+ newsize = 0;
+ IndexTuple itup;
+
+ /* We only need key part of the tuple */
+ if (BTreeTupleIsPosting(tuple))
+ keysize = BTreeTupleGetPostingOffset(tuple);
+ else
+ keysize = IndexTupleSize(tuple);
+
+ Assert(nhtids > 0 && nhtids <= PG_UINT16_MAX);
+
+ /* Add space needed for posting list */
+ if (nhtids > 1)
+ newsize = SHORTALIGN(keysize) + sizeof(ItemPointerData) * nhtids;
+ else
+ newsize = keysize;
+
+ newsize = MAXALIGN(newsize);
+ itup = palloc0(newsize);
+ memcpy(itup, tuple, keysize);
+ itup->t_info &= ~INDEX_SIZE_MASK;
+ itup->t_info |= newsize;
+
+ if (nhtids > 1)
+ {
+ /* Form posting tuple, fill posting fields */
+
+ itup->t_info |= INDEX_ALT_TID_MASK;
+ BTreeSetPostingMeta(itup, nhtids, SHORTALIGN(keysize));
+ /* Copy posting list into the posting tuple */
+ memcpy(BTreeTupleGetPosting(itup), htids,
+ sizeof(ItemPointerData) * nhtids);
+
+#ifdef USE_ASSERT_CHECKING
+ {
+ /* Assert that htid array is sorted and has unique TIDs */
+ ItemPointerData last;
+ ItemPointer current;
+
+ ItemPointerCopy(BTreeTupleGetHeapTID(itup), &last);
+
+ for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
+ {
+ current = BTreeTupleGetPostingN(itup, i);
+ Assert(ItemPointerCompare(current, &last) > 0);
+ ItemPointerCopy(current, &last);
+ }
+ }
+#endif
+ }
+ else
+ {
+ /* To finish building a non-posting tuple, copy TID from htids */
+ itup->t_info &= ~INDEX_ALT_TID_MASK;
+ ItemPointerCopy(htids, &itup->t_tid);
+ }
+
+ return itup;
+}
+
+/*
+ * Prepare for a posting list split by swapping heap TID in newitem with heap
+ * TID from original posting list (the 'oposting' heap TID located at offset
+ * 'postingoff').
+ *
+ * Returns new posting list tuple, which is palloc()'d in caller's context.
+ * This is guaranteed to be the same size as 'oposting'. Modified version of
+ * newitem is what caller actually inserts inside the critical section that
+ * also performs an in-place update of posting list.
+ *
+ * Explicit WAL-logging of newitem must use the original version of newitem in
+ * order to make it possible for our nbtxlog.c callers to correctly REDO
+ * original steps. This approach avoids any explicit WAL-logging of a posting
+ * list tuple. This is important because posting lists are often much larger
+ * than plain tuples.
+ *
+ * Caller should avoid assuming that the IndexTuple-wise key representation in
+ * newitem is bitwise equal to the representation used within oposting. Note,
+ * in particular, that one may even be larger than the other. This could
+ * occur due to differences in TOAST input state, for example.
+ */
+IndexTuple
+_bt_swap_posting(IndexTuple newitem, IndexTuple oposting, int postingoff)
+{
+ int nhtids;
+ char *replacepos;
+ char *rightpos;
+ Size nbytes;
+ IndexTuple nposting;
+
+ nhtids = BTreeTupleGetNPosting(oposting);
+ Assert(postingoff > 0 && postingoff < nhtids);
+
+ nposting = CopyIndexTuple(oposting);
+ replacepos = (char *) BTreeTupleGetPostingN(nposting, postingoff);
+ rightpos = replacepos + sizeof(ItemPointerData);
+ nbytes = (nhtids - postingoff - 1) * sizeof(ItemPointerData);
+
+ /*
+ * Move item pointers in posting list to make a gap for the new item's
+ * heap TID (shift TIDs one place to the right, losing original rightmost
+ * TID)
+ */
+ memmove(rightpos, replacepos, nbytes);
+
+ /* Fill the gap with the TID of the new item */
+ ItemPointerCopy(&newitem->t_tid, (ItemPointer) replacepos);
+
+ /* Copy original posting list's rightmost TID into new item */
+ ItemPointerCopy(BTreeTupleGetPostingN(oposting, nhtids - 1),
+ &newitem->t_tid);
+ Assert(ItemPointerCompare(BTreeTupleGetMaxHeapTID(nposting),
+ BTreeTupleGetHeapTID(newitem)) < 0);
+ Assert(BTreeTupleGetNPosting(oposting) == BTreeTupleGetNPosting(nposting));
+
+ return nposting;
+}
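The start/save/finish cycle that nbtdedup.c runs over a leaf page can be approximated by a much smaller standalone sketch. The Item type, the dedup_pass() name, and the maxhtids cap are all assumptions made for illustration: the cap stands in for the maxitemsize byte limit, and keys are plain integers rather than index tuples. The sketch only shows the merging rule (adjacent items with equal keys are folded together until the cap would be exceeded) and leaves out WAL logging, LP_DEAD handling, and the checkingunique strategy.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical simplified leaf item: an integer key and a heap TID count */
typedef struct
{
	int		key;
	int		nhtids;
} Item;

/*
 * Merge adjacent items with equal keys in place, never letting a merged
 * item hold more than maxhtids TIDs.  Returns the new number of items.
 */
static size_t
dedup_pass(Item *items, size_t nitems, int maxhtids)
{
	size_t	out = 0;

	for (size_t i = 0; i < nitems; i++)
	{
		if (out > 0 &&
			items[out - 1].key == items[i].key &&
			items[out - 1].nhtids + items[i].nhtids <= maxhtids)
		{
			/* Same key and still under the cap: extend the pending list */
			items[out - 1].nhtids += items[i].nhtids;
		}
		else
		{
			/* Different key, or cap reached: start a new pending list */
			items[out++] = items[i];
		}
	}

	return out;
}
```

With six single-TID items whose keys are 1, 1, 1, 2, 3, 3 and a cap of two TIDs, the pass would leave four items: key 1 with two TIDs, key 1 with one TID, key 2 with one TID, and key 3 with two TIDs.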
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index b93b2a0ffd..d816c45f2c 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -28,6 +28,8 @@
/* Minimum tree height for application of fastpath optimization */
#define BTREE_FASTPATH_MIN_LEVEL 2
+/* GUC parameter */
+int btree_deduplication = DEDUP_NONUNIQUE;
static Buffer _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf);
@@ -47,10 +49,12 @@ static void _bt_insertonpg(Relation rel, BTScanInsert itup_key,
BTStack stack,
IndexTuple itup,
OffsetNumber newitemoff,
+ int postingoff,
bool split_only_page);
static Buffer _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf,
Buffer cbuf, OffsetNumber newitemoff, Size newitemsz,
- IndexTuple newitem);
+ IndexTuple newitem, IndexTuple orignewitem,
+ IndexTuple nposting, uint16 postingoff);
static void _bt_insert_parent(Relation rel, Buffer buf, Buffer rbuf,
BTStack stack, bool is_root, bool is_only);
static bool _bt_pgaddtup(Page page, Size itemsize, IndexTuple itup,
@@ -61,7 +65,8 @@ static void _bt_vacuum_one_page(Relation rel, Buffer buffer, Relation heapRel);
* _bt_doinsert() -- Handle insertion of a single index tuple in the tree.
*
* This routine is called by the public interface routine, btinsert.
- * By here, itup is filled in, including the TID.
+ * By here, itup is filled in, including the TID. Caller should be
+ * prepared for us to scribble on 'itup'.
*
* If checkUnique is UNIQUE_CHECK_NO or UNIQUE_CHECK_PARTIAL, this
* will allow duplicates. Otherwise (UNIQUE_CHECK_YES or
@@ -125,6 +130,7 @@ _bt_doinsert(Relation rel, IndexTuple itup,
insertstate.itup_key = itup_key;
insertstate.bounds_valid = false;
insertstate.buf = InvalidBuffer;
+ insertstate.postingoff = 0;
/*
* It's very common to have an index on an auto-incremented or
@@ -300,7 +306,7 @@ top:
newitemoff = _bt_findinsertloc(rel, &insertstate, checkingunique,
stack, heapRel);
_bt_insertonpg(rel, itup_key, insertstate.buf, InvalidBuffer, stack,
- itup, newitemoff, false);
+ itup, newitemoff, insertstate.postingoff, false);
}
else
{
@@ -353,6 +359,9 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
BTPageOpaque opaque;
Buffer nbuf = InvalidBuffer;
bool found = false;
+ bool inposting = false;
+ bool prev_all_dead = true;
+ int curposti = 0;
/* Assume unique until we find a duplicate */
*is_unique = true;
@@ -374,6 +383,11 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
/*
* Scan over all equal tuples, looking for live conflicts.
+ *
+ * Note that each iteration of the loop processes one heap TID, not one
+ * index tuple. The page offset number won't be advanced for iterations
+ * which process heap TIDs from posting list tuples until the last such
+ * heap TID for the posting list (curposti will be advanced instead).
*/
Assert(!insertstate->bounds_valid || insertstate->low == offset);
Assert(!itup_key->anynullkeys);
@@ -435,7 +449,27 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
/* okay, we gotta fetch the heap tuple ... */
curitup = (IndexTuple) PageGetItem(page, curitemid);
- htid = curitup->t_tid;
+
+ /*
+ * decide if this is the first heap TID in tuple we'll
+ * process, or if we should continue to process current
+ * posting list
+ */
+ if (!BTreeTupleIsPosting(curitup))
+ {
+ htid = curitup->t_tid;
+ inposting = false;
+ }
+ else if (!inposting)
+ {
+ /* First heap TID in posting list */
+ inposting = true;
+ prev_all_dead = true;
+ curposti = 0;
+ }
+
+ if (inposting)
+ htid = *BTreeTupleGetPostingN(curitup, curposti);
/*
* If we are doing a recheck, we expect to find the tuple we
@@ -511,8 +545,7 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
* not part of this chain because it had a different index
* entry.
*/
- htid = itup->t_tid;
- if (table_index_fetch_tuple_check(heapRel, &htid,
+ if (table_index_fetch_tuple_check(heapRel, &itup->t_tid,
SnapshotSelf, NULL))
{
/* Normal case --- it's still live */
@@ -570,12 +603,14 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
RelationGetRelationName(rel))));
}
}
- else if (all_dead)
+ else if (all_dead && (!inposting ||
+ (prev_all_dead &&
+ curposti == BTreeTupleGetNPosting(curitup) - 1)))
{
/*
- * The conflicting tuple (or whole HOT chain) is dead to
- * everyone, so we may as well mark the index entry
- * killed.
+ * The conflicting tuple (or all HOT chains pointed to by
+ * all posting list TIDs) is dead to everyone, so mark the
+ * index entry killed.
*/
ItemIdMarkDead(curitemid);
opaque->btpo_flags |= BTP_HAS_GARBAGE;
@@ -589,14 +624,29 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
else
MarkBufferDirtyHint(insertstate->buf, true);
}
+
+ /*
+ * Remember if posting list tuple has even a single HOT chain
+ * whose members are not all dead
+ */
+ if (!all_dead && inposting)
+ prev_all_dead = false;
}
}
- /*
- * Advance to next tuple to continue checking.
- */
- if (offset < maxoff)
+ if (inposting && curposti < BTreeTupleGetNPosting(curitup) - 1)
+ {
+ /* Advance to next TID in same posting list */
+ curposti++;
+ continue;
+ }
+ else if (offset < maxoff)
+ {
+ /* Advance to next tuple */
+ curposti = 0;
+ inposting = false;
offset = OffsetNumberNext(offset);
+ }
else
{
int highkeycmp;
@@ -621,6 +671,8 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
elog(ERROR, "fell off the end of index \"%s\"",
RelationGetRelationName(rel));
}
+ curposti = 0;
+ inposting = false;
maxoff = PageGetMaxOffsetNumber(page);
offset = P_FIRSTDATAKEY(opaque);
/* Don't invalidate binary search bounds */
@@ -689,6 +741,7 @@ _bt_findinsertloc(Relation rel,
BTScanInsert itup_key = insertstate->itup_key;
Page page = BufferGetPage(insertstate->buf);
BTPageOpaque lpageop;
+ OffsetNumber location;
lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -751,13 +804,25 @@ _bt_findinsertloc(Relation rel,
/*
* If the target page is full, see if we can obtain enough space by
- * erasing LP_DEAD items
+ * erasing LP_DEAD items. If that doesn't work out, and if the index
+ * deduplication is both possible and enabled, try deduplication.
*/
- if (PageGetFreeSpace(page) < insertstate->itemsz &&
- P_HAS_GARBAGE(lpageop))
+ if (PageGetFreeSpace(page) < insertstate->itemsz)
{
- _bt_vacuum_one_page(rel, insertstate->buf, heapRel);
- insertstate->bounds_valid = false;
+ if (P_HAS_GARBAGE(lpageop))
+ {
+ _bt_vacuum_one_page(rel, insertstate->buf, heapRel);
+ insertstate->bounds_valid = false;
+ }
+
+ if (insertstate->itup_key->safededup && BTGetUseDedup(rel) &&
+ PageGetFreeSpace(page) < insertstate->itemsz)
+ {
+ _bt_dedup_one_page(rel, insertstate->buf, heapRel,
+ insertstate->itup, insertstate->itemsz,
+ checkingunique);
+ insertstate->bounds_valid = false;
+ }
}
}
else
@@ -839,7 +904,38 @@ _bt_findinsertloc(Relation rel,
Assert(P_RIGHTMOST(lpageop) ||
_bt_compare(rel, itup_key, page, P_HIKEY) <= 0);
- return _bt_binsrch_insert(rel, insertstate);
+ location = _bt_binsrch_insert(rel, insertstate);
+
+ /*
+ * Insertion is not prepared for the case where an LP_DEAD posting list
+ * tuple must be split. In the unlikely event that this happens, call
+ * _bt_dedup_one_page() to force it to kill all LP_DEAD items.
+ */
+ if (unlikely(insertstate->postingoff == -1))
+ {
+ Assert(insertstate->itup_key->safededup);
+
+ /*
+ * Don't check if the option is enabled, since no actual deduplication
+ * will be done, just cleanup.
+ */
+ _bt_dedup_one_page(rel, insertstate->buf, heapRel, insertstate->itup,
+ 0, checkingunique);
+ Assert(!P_HAS_GARBAGE(lpageop));
+
+ /* Must reset insertstate ahead of new _bt_binsrch_insert() call */
+ insertstate->bounds_valid = false;
+ insertstate->postingoff = 0;
+ location = _bt_binsrch_insert(rel, insertstate);
+
+ /*
+ * Might still have to split some other posting list now, but that
+ * should never be LP_DEAD
+ */
+ Assert(insertstate->postingoff >= 0);
+ }
+
+ return location;
}
/*
@@ -905,10 +1001,12 @@ _bt_stepright(Relation rel, BTInsertState insertstate, BTStack stack)
*
* This recursive procedure does the following things:
*
+ * + if necessary, splits an existing posting list on page.
+ * This is only needed when 'postingoff' is non-zero.
* + if necessary, splits the target page, using 'itup_key' for
* suffix truncation on leaf pages (caller passes NULL for
* non-leaf pages).
- * + inserts the tuple.
+ * + inserts the new tuple (could be from split posting list).
* + if the page was split, pops the parent stack, and finds the
* right place to insert the new child pointer (by walking
* right using information stored in the parent stack).
@@ -918,7 +1016,8 @@ _bt_stepright(Relation rel, BTInsertState insertstate, BTStack stack)
*
* On entry, we must have the correct buffer in which to do the
* insertion, and the buffer must be pinned and write-locked. On return,
- * we will have dropped both the pin and the lock on the buffer.
+ * we will have dropped both the pin and the lock on the buffer. Caller
+ * should be prepared for us to scribble on 'itup'.
*
* This routine only performs retail tuple insertions. 'itup' should
* always be either a non-highkey leaf item, or a downlink (new high
@@ -936,11 +1035,15 @@ _bt_insertonpg(Relation rel,
BTStack stack,
IndexTuple itup,
OffsetNumber newitemoff,
+ int postingoff,
bool split_only_page)
{
Page page;
BTPageOpaque lpageop;
Size itemsz;
+ IndexTuple oposting;
+ IndexTuple origitup = NULL;
+ IndexTuple nposting = NULL;
page = BufferGetPage(buf);
lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -954,6 +1057,8 @@ _bt_insertonpg(Relation rel,
Assert(P_ISLEAF(lpageop) ||
BTreeTupleGetNAtts(itup, rel) <=
IndexRelationGetNumberOfKeyAttributes(rel));
+ /* retail insertions of posting list tuples are disallowed */
+ Assert(!BTreeTupleIsPosting(itup));
/* The caller should've finished any incomplete splits already. */
if (P_INCOMPLETE_SPLIT(lpageop))
@@ -964,6 +1069,39 @@ _bt_insertonpg(Relation rel,
itemsz = MAXALIGN(itemsz); /* be safe, PageAddItem will do this but we
* need to be consistent */
+ /*
+ * Do we need to split an existing posting list item?
+ */
+ if (postingoff != 0)
+ {
+ ItemId itemid = PageGetItemId(page, newitemoff);
+
+ /*
+ * The new tuple is a duplicate with a heap TID that falls inside the
+ * range of an existing posting list tuple on a leaf page. Prepare to
+ * split an existing posting list by swapping new item's heap TID with
+ * the rightmost heap TID from original posting list, and generating a
+ * new version of the posting list that has new item's heap TID.
+ *
+ * Posting list splits work by modifying the overlapping posting list
+ * as part of the same atomic operation that inserts the "new item".
+ * The space accounting is kept simple, since it does not need to
+ * consider posting list splits at all (this is particularly important
+ * for the case where we also have to split the page). Overwriting
+ * the posting list with its post-split version is treated as an extra
+ * step in either the insert or page split critical section.
+ */
+ Assert(P_ISLEAF(lpageop) && !ItemIdIsDead(itemid));
+ oposting = (IndexTuple) PageGetItem(page, itemid);
+
+ /* save a copy of itup with unchanged TID for xlog record */
+ origitup = CopyIndexTuple(itup);
+ nposting = _bt_swap_posting(itup, oposting, postingoff);
+
+ /* Alter offset so that it goes after existing posting list */
+ newitemoff = OffsetNumberNext(newitemoff);
+ }
+
/*
* Do we need to split the page to fit the item on it?
*
@@ -996,7 +1134,8 @@ _bt_insertonpg(Relation rel,
BlockNumberIsValid(RelationGetTargetBlock(rel))));
/* split the buffer into left and right halves */
- rbuf = _bt_split(rel, itup_key, buf, cbuf, newitemoff, itemsz, itup);
+ rbuf = _bt_split(rel, itup_key, buf, cbuf, newitemoff, itemsz, itup,
+ origitup, nposting, postingoff);
PredicateLockPageSplit(rel,
BufferGetBlockNumber(buf),
BufferGetBlockNumber(rbuf));
@@ -1075,6 +1214,13 @@ _bt_insertonpg(Relation rel,
elog(PANIC, "failed to add new item to block %u in index \"%s\"",
itup_blkno, RelationGetRelationName(rel));
+ /*
+ * Posting list split requires an in-place update of the existing
+ * posting list
+ */
+ if (nposting)
+ memcpy(oposting, nposting, MAXALIGN(IndexTupleSize(nposting)));
+
MarkBufferDirty(buf);
if (BufferIsValid(metabuf))
@@ -1120,8 +1266,19 @@ _bt_insertonpg(Relation rel,
XLogBeginInsert();
XLogRegisterData((char *) &xlrec, SizeOfBtreeInsert);
- if (P_ISLEAF(lpageop))
+ if (P_ISLEAF(lpageop) && postingoff == 0)
+ {
+ /* Simple leaf insert */
xlinfo = XLOG_BTREE_INSERT_LEAF;
+ }
+ else if (postingoff != 0)
+ {
+ /*
+ * Leaf insert with posting list split. Must include
+ * postingoff field before newitem/orignewitem.
+ */
+ xlinfo = XLOG_BTREE_INSERT_POST;
+ }
else
{
/*
@@ -1144,6 +1301,7 @@ _bt_insertonpg(Relation rel,
xlmeta.oldest_btpo_xact = metad->btm_oldest_btpo_xact;
xlmeta.last_cleanup_num_heap_tuples =
metad->btm_last_cleanup_num_heap_tuples;
+ xlmeta.btm_safededup = metad->btm_safededup;
XLogRegisterBuffer(2, metabuf, REGBUF_WILL_INIT | REGBUF_STANDARD);
XLogRegisterBufData(2, (char *) &xlmeta, sizeof(xl_btree_metadata));
@@ -1152,7 +1310,28 @@ _bt_insertonpg(Relation rel,
}
XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
- XLogRegisterBufData(0, (char *) itup, IndexTupleSize(itup));
+
+ /*
+ * We always write newitem to the page, but when there is an
+ * original newitem due to a posting list split then we log the
+ * original item instead. REDO routine must reconstruct the final
+ * newitem at the same time it reconstructs nposting.
+ */
+ if (postingoff == 0)
+ XLogRegisterBufData(0, (char *) itup,
+ IndexTupleSize(itup));
+ else
+ {
+ /*
+ * Must explicitly log postingoff before newitem in case of
+ * posting list split.
+ */
+ uint16 upostingoff = postingoff;
+
+ XLogRegisterBufData(0, (char *) &upostingoff, sizeof(uint16));
+ XLogRegisterBufData(0, (char *) origitup,
+ IndexTupleSize(origitup));
+ }
recptr = XLogInsert(RM_BTREE_ID, xlinfo);
@@ -1194,6 +1373,13 @@ _bt_insertonpg(Relation rel,
_bt_getrootheight(rel) >= BTREE_FASTPATH_MIN_LEVEL)
RelationSetTargetBlock(rel, cachedBlock);
}
+
+ /* be tidy */
+ if (postingoff != 0)
+ {
+ pfree(nposting);
+ pfree(origitup);
+ }
}
/*
@@ -1209,12 +1395,25 @@ _bt_insertonpg(Relation rel,
* This function will clear the INCOMPLETE_SPLIT flag on it, and
* release the buffer.
*
+ * orignewitem, nposting, and postingoff are needed when an insert of
+ * orignewitem results in both a posting list split and a page split.
+ * newitem and nposting are replacements for orignewitem and the
+ * existing posting list on the page respectively. These extra
+ * posting list split details are used here in the same way as they
+ * are used in the more common case where a posting list split does
+ * not coincide with a page split. We need to deal with posting list
+ * splits directly in order to ensure that everything that follows
+ * from the insert of orignewitem is handled as a single atomic
+ * operation (though caller's insert of a new pivot/downlink into
+ * parent page will still be a separate operation).
+ *
* Returns the new right sibling of buf, pinned and write-locked.
* The pin and lock on buf are maintained.
*/
static Buffer
_bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
- OffsetNumber newitemoff, Size newitemsz, IndexTuple newitem)
+ OffsetNumber newitemoff, Size newitemsz, IndexTuple newitem,
+ IndexTuple orignewitem, IndexTuple nposting, uint16 postingoff)
{
Buffer rbuf;
Page origpage;
@@ -1236,12 +1435,23 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
OffsetNumber firstright;
OffsetNumber maxoff;
OffsetNumber i;
+ OffsetNumber replacepostingoff = InvalidOffsetNumber;
bool newitemonleft,
isleaf;
IndexTuple lefthikey;
int indnatts = IndexRelationGetNumberOfAttributes(rel);
int indnkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
+ /*
+ * Determine offset number of existing posting list on page when a split
+ * of a posting list needs to take place as the page is split
+ */
+ if (nposting != NULL)
+ {
+ Assert(itup_key->heapkeyspace);
+ replacepostingoff = OffsetNumberPrev(newitemoff);
+ }
+
/*
* origpage is the original page to be split. leftpage is a temporary
* buffer that receives the left-sibling data, which will be copied back
@@ -1273,6 +1483,13 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
* newitemoff == firstright. In all other cases it's clear which side of
* the split every tuple goes on from context. newitemonleft is usually
* (but not always) redundant information.
+ *
+ * Note: In theory, the split point choice logic should operate against a
+ * version of the page that already replaced the posting list at offset
+ * replacepostingoff with nposting where applicable. We don't bother with
+ * that, though. Both versions of the posting list must be the same size,
+ * and both will have the same base tuple key values, so split point
+ * choice is never affected.
*/
firstright = _bt_findsplitloc(rel, origpage, newitemoff, newitemsz,
newitem, &newitemonleft);
@@ -1340,6 +1557,9 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
itemid = PageGetItemId(origpage, firstright);
itemsz = ItemIdGetLength(itemid);
item = (IndexTuple) PageGetItem(origpage, itemid);
+ /* Behave as if origpage posting list has already been swapped */
+ if (firstright == replacepostingoff)
+ item = nposting;
}
/*
@@ -1373,6 +1593,9 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
Assert(lastleftoff >= P_FIRSTDATAKEY(oopaque));
itemid = PageGetItemId(origpage, lastleftoff);
lastleft = (IndexTuple) PageGetItem(origpage, itemid);
+ /* Behave as if origpage posting list has already been swapped */
+ if (lastleftoff == replacepostingoff)
+ lastleft = nposting;
}
Assert(lastleft != item);
@@ -1480,8 +1703,23 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
itemsz = ItemIdGetLength(itemid);
item = (IndexTuple) PageGetItem(origpage, itemid);
+ /*
+ * did caller pass new replacement posting list tuple due to posting
+ * list split?
+ */
+ if (i == replacepostingoff)
+ {
+ /*
+ * swap origpage posting list with post-posting-list-split version
+ * from caller
+ */
+ Assert(isleaf);
+ Assert(itemsz == MAXALIGN(IndexTupleSize(nposting)));
+ item = nposting;
+ }
+
/* does new item belong before this one? */
- if (i == newitemoff)
+ else if (i == newitemoff)
{
if (newitemonleft)
{
@@ -1650,8 +1888,12 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
XLogRecPtr recptr;
xlrec.level = ropaque->btpo.level;
+ /* See comments below on newitem, orignewitem, and posting lists */
xlrec.firstright = firstright;
xlrec.newitemoff = newitemoff;
+ xlrec.postingoff = 0;
+ if (replacepostingoff < firstright)
+ xlrec.postingoff = postingoff;
XLogBeginInsert();
XLogRegisterData((char *) &xlrec, SizeOfBtreeSplit);
@@ -1670,11 +1912,45 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
* because it's included with all the other items on the right page.)
* Show the new item as belonging to the left page buffer, so that it
* is not stored if XLogInsert decides it needs a full-page image of
- * the left page. We store the offset anyway, though, to support
- * archive compression of these records.
+ * the left page. We always store newitemoff in the record, though.
+ *
+ * The details are sometimes slightly different for page splits that
+ * coincide with a posting list split. If both the replacement
+ * posting list and newitem go on the right page, then we don't need
+ * to log anything extra, just like the simple !newitemonleft
+ * no-posting-split case (postingoff is set to zero in the WAL record,
+ * so recovery doesn't need to process a posting list split at all).
+ * Otherwise, we set postingoff and log orignewitem instead of
+ * newitem, despite having actually inserted newitem. Recovery must
+ * reconstruct nposting and newitem using _bt_swap_posting().
+ *
+ * Note: It's possible that our page split point is the point that
+ * makes the posting list lastleft and newitem firstright. This is
+ * the only case where we log orignewitem despite newitem going on the
+ * right page. If XLogInsert decides that it can omit orignewitem due
+ * to logging a full-page image of the left page, everything still
+ * works out, since recovery only needs to log orignewitem for items
+ * on the left page (just like the regular newitem-logged case).
*/
- if (newitemonleft)
- XLogRegisterBufData(0, (char *) newitem, MAXALIGN(newitemsz));
+ if (newitemonleft || xlrec.postingoff != 0)
+ {
+ if (xlrec.postingoff == 0)
+ {
+ /* Must WAL-log newitem, since it's on left page */
+ Assert(newitemonleft);
+ Assert(orignewitem == NULL && nposting == NULL);
+ XLogRegisterBufData(0, (char *) newitem, MAXALIGN(newitemsz));
+ }
+ else
+ {
+ /* Must WAL-log orignewitem following posting list split */
+ Assert(newitemonleft || firstright == newitemoff);
+ Assert(ItemPointerCompare(&orignewitem->t_tid,
+ &newitem->t_tid) < 0);
+ XLogRegisterBufData(0, (char *) orignewitem,
+ MAXALIGN(IndexTupleSize(orignewitem)));
+ }
+ }
/* Log the left page's new high key */
itemid = PageGetItemId(origpage, P_HIKEY);
@@ -1834,7 +2110,7 @@ _bt_insert_parent(Relation rel,
/* Recursively insert into the parent */
_bt_insertonpg(rel, NULL, pbuf, buf, stack->bts_parent,
- new_item, stack->bts_offset + 1,
+ new_item, stack->bts_offset + 1, 0,
is_only);
/* be tidy */
@@ -2190,6 +2466,7 @@ _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf)
md.fastlevel = metad->btm_level;
md.oldest_btpo_xact = metad->btm_oldest_btpo_xact;
md.last_cleanup_num_heap_tuples = metad->btm_last_cleanup_num_heap_tuples;
+ md.btm_safededup = metad->btm_safededup;
XLogRegisterBufData(2, (char *) &md, sizeof(xl_btree_metadata));
@@ -2303,6 +2580,6 @@ _bt_vacuum_one_page(Relation rel, Buffer buffer, Relation heapRel)
* Note: if we didn't find any LP_DEAD items, then the page's
* BTP_HAS_GARBAGE hint bit is falsely set. We do not bother expending a
* separate write to clear it, however. We will clear it when we split
- * the page.
+ * the page (or when deduplication runs).
*/
}
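
The posting list split described in the _bt_insertonpg() comments above can be illustrated with a small standalone sketch. This is not the patch's _bt_swap_posting(); it is a simplified model that assumes heap TIDs can be represented as plain integers (the hypothetical tid_t type) and that the posting list is a bare sorted array. The function name swap_posting and its signature are invented for illustration only. It shows why the posting list keeps its size: the new TID takes its sorted position inside the list, and the old rightmost TID is handed back so that it can become the TID of the item inserted after the posting list.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a heap TID; the real code uses ItemPointerData. */
typedef uint64_t tid_t;

/*
 * Sketch of a posting list split.  'posting' is a sorted array of 'nposting'
 * TIDs, and 'newtid' belongs at index 'postingoff' (assumed to satisfy
 * 0 < postingoff < nposting, i.e. it falls strictly inside the list's range).
 *
 * The array is updated in place and keeps the same length.  The return value
 * is the displaced rightmost TID, which the caller would use as the TID of
 * the "new item" placed immediately after the posting list.
 */
tid_t
swap_posting(tid_t *posting, int nposting, int postingoff, tid_t newtid)
{
	tid_t		rightmost = posting[nposting - 1];

	assert(postingoff > 0 && postingoff < nposting);
	assert(posting[postingoff - 1] < newtid && newtid < posting[postingoff]);

	/* Shift the tail right by one slot, overwriting the old rightmost TID */
	memmove(&posting[postingoff + 1], &posting[postingoff],
			(nposting - postingoff - 1) * sizeof(tid_t));

	/* The new TID takes its sorted position inside the posting list */
	posting[postingoff] = newtid;

	return rightmost;
}
```

For example, splitting the list {10, 20, 30, 40} with newtid 25 at postingoff 2 leaves {10, 20, 25, 30} in place and returns 40. Because both versions of the list have the same number of TIDs, the space accounting only has to consider the one new item, which appears to be the property the comments rely on.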
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 66c79623cf..3b49eb0762 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -24,6 +24,7 @@
#include "access/nbtree.h"
#include "access/nbtxlog.h"
+#include "access/tableam.h"
#include "access/transam.h"
#include "access/xlog.h"
#include "access/xloginsert.h"
@@ -42,12 +43,18 @@ static bool _bt_lock_branch_parent(Relation rel, BlockNumber child,
BlockNumber *target, BlockNumber *rightsib);
static void _bt_log_reuse_page(Relation rel, BlockNumber blkno,
TransactionId latestRemovedXid);
+static TransactionId _bt_compute_xid_horizon_for_tuples(Relation rel,
+ Relation heapRel,
+ Buffer buf,
+ OffsetNumber *itemnos,
+ int nitems);
/*
* _bt_initmetapage() -- Fill a page buffer with a correct metapage image
*/
void
-_bt_initmetapage(Page page, BlockNumber rootbknum, uint32 level)
+_bt_initmetapage(Page page, BlockNumber rootbknum, uint32 level,
+ bool safededup)
{
BTMetaPageData *metad;
BTPageOpaque metaopaque;
@@ -63,6 +70,7 @@ _bt_initmetapage(Page page, BlockNumber rootbknum, uint32 level)
metad->btm_fastlevel = level;
metad->btm_oldest_btpo_xact = InvalidTransactionId;
metad->btm_last_cleanup_num_heap_tuples = -1.0;
+ metad->btm_safededup = safededup;
metaopaque = (BTPageOpaque) PageGetSpecialPointer(page);
metaopaque->btpo_flags = BTP_META;
@@ -102,6 +110,9 @@ _bt_upgrademetapage(Page page)
metad->btm_version = BTREE_NOVAC_VERSION;
metad->btm_oldest_btpo_xact = InvalidTransactionId;
metad->btm_last_cleanup_num_heap_tuples = -1.0;
+ /* Only a REINDEX can set this field */
+ Assert(!metad->btm_safededup);
+ metad->btm_safededup = false;
/* Adjust pd_lower (see _bt_initmetapage() for details) */
((PageHeader) page)->pd_lower =
@@ -213,6 +224,7 @@ _bt_update_meta_cleanup_info(Relation rel, TransactionId oldestBtpoXact,
md.fastlevel = metad->btm_fastlevel;
md.oldest_btpo_xact = oldestBtpoXact;
md.last_cleanup_num_heap_tuples = numHeapTuples;
+ md.btm_safededup = metad->btm_safededup;
XLogRegisterBufData(0, (char *) &md, sizeof(xl_btree_metadata));
@@ -274,6 +286,8 @@ _bt_getroot(Relation rel, int access)
Assert(metad->btm_magic == BTREE_MAGIC);
Assert(metad->btm_version >= BTREE_MIN_VERSION);
Assert(metad->btm_version <= BTREE_VERSION);
+ Assert(!metad->btm_safededup ||
+ metad->btm_version > BTREE_NOVAC_VERSION);
Assert(metad->btm_root != P_NONE);
rootblkno = metad->btm_fastroot;
@@ -394,6 +408,7 @@ _bt_getroot(Relation rel, int access)
md.fastlevel = 0;
md.oldest_btpo_xact = InvalidTransactionId;
md.last_cleanup_num_heap_tuples = -1.0;
+ md.btm_safededup = metad->btm_safededup;
XLogRegisterBufData(2, (char *) &md, sizeof(xl_btree_metadata));
@@ -618,6 +633,7 @@ _bt_getrootheight(Relation rel)
Assert(metad->btm_magic == BTREE_MAGIC);
Assert(metad->btm_version >= BTREE_MIN_VERSION);
Assert(metad->btm_version <= BTREE_VERSION);
+ Assert(!metad->btm_safededup || metad->btm_version > BTREE_NOVAC_VERSION);
Assert(metad->btm_fastroot != P_NONE);
return metad->btm_fastlevel;
@@ -683,6 +699,56 @@ _bt_heapkeyspace(Relation rel)
return metad->btm_version > BTREE_NOVAC_VERSION;
}
+/*
+ * _bt_safededup() -- can deduplication safely be used by index?
+ *
+ * Uses field from index relation's metapage/cached metapage.
+ */
+bool
+_bt_safededup(Relation rel)
+{
+ BTMetaPageData *metad;
+
+ if (rel->rd_amcache == NULL)
+ {
+ Buffer metabuf;
+
+ metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_READ);
+ metad = _bt_getmeta(rel, metabuf);
+
+ /*
+ * If there's no root page yet, _bt_getroot() doesn't expect a cache
+ * to be made, so just stop here. (XXX perhaps _bt_getroot() should
+ * be changed to allow this case.)
+ *
+ * Note that we rely on the assumption that this field will be zero'ed
+ * on indexes that were pg_upgrade'd.
+ */
+ if (metad->btm_root == P_NONE)
+ {
+ _bt_relbuf(rel, metabuf);
+ return metad->btm_safededup;
+ }
+
+ /* Cache the metapage data for next time */
+ rel->rd_amcache = MemoryContextAlloc(rel->rd_indexcxt,
+ sizeof(BTMetaPageData));
+ memcpy(rel->rd_amcache, metad, sizeof(BTMetaPageData));
+ _bt_relbuf(rel, metabuf);
+ }
+
+ /* Get cached page */
+ metad = (BTMetaPageData *) rel->rd_amcache;
+ /* We shouldn't have cached it if any of these fail */
+ Assert(metad->btm_magic == BTREE_MAGIC);
+ Assert(metad->btm_version >= BTREE_MIN_VERSION);
+ Assert(metad->btm_version <= BTREE_VERSION);
+ Assert(!metad->btm_safededup || metad->btm_version > BTREE_NOVAC_VERSION);
+ Assert(metad->btm_fastroot != P_NONE);
+
+ return metad->btm_safededup;
+}
+
/*
* _bt_checkpage() -- Verify that a freshly-read page looks sane.
*/
@@ -968,27 +1034,73 @@ _bt_page_recyclable(Page page)
* deleting the page it points to.
*
* This routine assumes that the caller has pinned and locked the buffer.
- * Also, the given deletable array *must* be sorted in ascending order.
+ * Also, the given deletable and updateitemnos arrays *must* be sorted in
+ * ascending order.
*
* We record VACUUMs and b-tree deletes differently in WAL. Deletes must
* generate recovery conflicts by accessing the heap inline, whereas VACUUMs
* can rely on the initial heap scan taking care of the problem (pruning would
- * have generated the conflicts needed for hot standby already).
+ * have generated the conflicts needed for hot standby already). Also,
+ * VACUUMs must deal with the case where posting list tuples have some dead
+ * TIDs, and some remaining TIDs that must not be killed.
*/
void
-_bt_delitems_vacuum(Relation rel, Buffer buf, OffsetNumber *deletable,
- int ndeletable)
+_bt_delitems_vacuum(Relation rel, Buffer buf,
+ OffsetNumber *deletable, int ndeletable,
+ OffsetNumber *updateitemnos,
+ IndexTuple *updated, int nupdatable)
{
Page page = BufferGetPage(buf);
BTPageOpaque opaque;
+ Size itemsz;
+ Size updated_sz = 0;
+ char *updated_buf = NULL;
- Assert(ndeletable > 0);
+ Assert(ndeletable > 0 || nupdatable > 0);
+
+ /* XLOG stuff: build a contiguous buffer holding the updated tuples */
+ if (nupdatable > 0 && RelationNeedsWAL(rel))
+ {
+ Size offset = 0;
+
+ for (int i = 0; i < nupdatable; i++)
+ updated_sz += MAXALIGN(IndexTupleSize(updated[i]));
+
+ updated_buf = palloc(updated_sz);
+ for (int i = 0; i < nupdatable; i++)
+ {
+ itemsz = IndexTupleSize(updated[i]);
+ memcpy(updated_buf + offset, (char *) updated[i], itemsz);
+ offset += MAXALIGN(itemsz);
+ }
+ Assert(offset == updated_sz);
+ }
/* No ereport(ERROR) until changes are logged */
START_CRIT_SECTION();
+ /* Handle posting tuple updates */
+ for (int i = 0; i < nupdatable; i++)
+ {
+ /*
+ * Delete the old posting tuple first. This will also clear the
+ * LP_DEAD bit. (It would be correct to leave it set, but we're going
+ * to unset the BTP_HAS_GARBAGE bit anyway.)
+ */
+ PageIndexTupleDelete(page, updateitemnos[i]);
+
+ itemsz = IndexTupleSize(updated[i]);
+ itemsz = MAXALIGN(itemsz);
+
+ /* Add tuple with updated ItemPointers to the page */
+ if (PageAddItem(page, (Item) updated[i], itemsz, updateitemnos[i],
+ false, false) == InvalidOffsetNumber)
+ elog(ERROR, "failed to rewrite posting list item in index while doing vacuum");
+ }
+
/* Fix the page */
- PageIndexMultiDelete(page, deletable, ndeletable);
+ if (ndeletable > 0)
+ PageIndexMultiDelete(page, deletable, ndeletable);
/*
* We can clear the vacuum cycle ID since this page has certainly been
@@ -1015,6 +1127,7 @@ _bt_delitems_vacuum(Relation rel, Buffer buf, OffsetNumber *deletable,
xl_btree_vacuum xlrec_vacuum;
xlrec_vacuum.ndeleted = ndeletable;
+ xlrec_vacuum.nupdated = nupdatable;
XLogBeginInsert();
XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
@@ -1025,8 +1138,22 @@ _bt_delitems_vacuum(Relation rel, Buffer buf, OffsetNumber *deletable,
* is. When XLogInsert stores the whole buffer, the offsets array
* need not be stored too.
*/
- XLogRegisterBufData(0, (char *) deletable, ndeletable *
- sizeof(OffsetNumber));
+ if (ndeletable > 0)
+ XLogRegisterBufData(0, (char *) deletable,
+ ndeletable * sizeof(OffsetNumber));
+
+ /*
+ * Save the offset numbers and the updated tuples themselves. It's
+ * important to restore them in the correct order: updated tuples
+ * must be handled first, and only then the other deleted items.
+ */
+ if (nupdatable > 0)
+ {
+ Assert(updated_buf != NULL);
+ XLogRegisterBufData(0, (char *) updateitemnos,
+ nupdatable * sizeof(OffsetNumber));
+ XLogRegisterBufData(0, updated_buf, updated_sz);
+ }
recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_VACUUM);
@@ -1036,6 +1163,91 @@ _bt_delitems_vacuum(Relation rel, Buffer buf, OffsetNumber *deletable,
END_CRIT_SECTION();
}
+/*
+ * Get the latestRemovedXid from the table entries pointed at by the index
+ * tuples being deleted.
+ *
+ * This is a version of index_compute_xid_horizon_for_tuples() specialized to
+ * nbtree, which can handle posting lists.
+ */
+static TransactionId
+_bt_compute_xid_horizon_for_tuples(Relation rel, Relation heapRel,
+ Buffer buf, OffsetNumber *itemnos,
+ int nitems)
+{
+ ItemPointer htids;
+ TransactionId latestRemovedXid = InvalidTransactionId;
+ Page page = BufferGetPage(buf);
+ int arraynitems;
+ int finalnitems;
+
+ /*
+ * Initial size of array can fit everything when it turns out that there are
+ * no posting lists
+ */
+ arraynitems = nitems;
+ htids = (ItemPointer) palloc(sizeof(ItemPointerData) * arraynitems);
+
+ finalnitems = 0;
+ /* identify what the index tuples about to be deleted point to */
+ for (int i = 0; i < nitems; i++)
+ {
+ ItemId itemid;
+ IndexTuple itup;
+
+ itemid = PageGetItemId(page, itemnos[i]);
+ itup = (IndexTuple) PageGetItem(page, itemid);
+
+ Assert(ItemIdIsDead(itemid));
+
+ if (!BTreeTupleIsPosting(itup))
+ {
+ /* Make sure that we have space for additional heap TID */
+ if (finalnitems + 1 > arraynitems)
+ {
+ arraynitems = arraynitems * 2;
+ htids = (ItemPointer)
+ repalloc(htids, sizeof(ItemPointerData) * arraynitems);
+ }
+
+ Assert(ItemPointerIsValid(&itup->t_tid));
+ ItemPointerCopy(&itup->t_tid, &htids[finalnitems]);
+ finalnitems++;
+ }
+ else
+ {
+ int nposting = BTreeTupleGetNPosting(itup);
+
+ /* Make sure that we have space for additional heap TIDs */
+ if (finalnitems + nposting > arraynitems)
+ {
+ arraynitems = Max(arraynitems * 2, finalnitems + nposting);
+ htids = (ItemPointer)
+ repalloc(htids, sizeof(ItemPointerData) * arraynitems);
+ }
+
+ for (int j = 0; j < nposting; j++)
+ {
+ ItemPointer htid = BTreeTupleGetPostingN(itup, j);
+
+ Assert(ItemPointerIsValid(htid));
+ ItemPointerCopy(htid, &htids[finalnitems]);
+ finalnitems++;
+ }
+ }
+ }
+
+ Assert(finalnitems >= nitems);
+
+ /* determine the actual xid horizon */
+ latestRemovedXid =
+ table_compute_xid_horizon_for_tuples(heapRel, htids, finalnitems);
+
+ pfree(htids);
+
+ return latestRemovedXid;
+}
+
/*
* Delete item(s) from a btree page during single-page cleanup.
*
@@ -1046,7 +1258,8 @@ _bt_delitems_vacuum(Relation rel, Buffer buf, OffsetNumber *deletable,
*
* This is nearly the same as _bt_delitems_vacuum as far as what it does to
* the page, but it needs to generate its own recovery conflicts by accessing
- * the heap. See comments for _bt_delitems_vacuum.
+ * the heap, and doesn't handle updating posting list tuples. See comments
+ * for _bt_delitems_vacuum.
*/
void
_bt_delitems_delete(Relation rel, Buffer buf,
@@ -1062,8 +1275,8 @@ _bt_delitems_delete(Relation rel, Buffer buf,
if (XLogStandbyInfoActive() && RelationNeedsWAL(rel))
latestRemovedXid =
- index_compute_xid_horizon_for_tuples(rel, heapRel, buf,
- itemnos, nitems);
+ _bt_compute_xid_horizon_for_tuples(rel, heapRel, buf,
+ itemnos, nitems);
/* No ereport(ERROR) until changes are logged */
START_CRIT_SECTION();
@@ -2061,6 +2274,7 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, bool *rightsib_empty)
xlmeta.fastlevel = metad->btm_fastlevel;
xlmeta.oldest_btpo_xact = metad->btm_oldest_btpo_xact;
xlmeta.last_cleanup_num_heap_tuples = metad->btm_last_cleanup_num_heap_tuples;
+ xlmeta.btm_safededup = metad->btm_safededup;
XLogRegisterBufData(4, (char *) &xlmeta, sizeof(xl_btree_metadata));
xlinfo = XLOG_BTREE_UNLINK_PAGE_META;
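
The array growth policy used by _bt_compute_xid_horizon_for_tuples() above can be sketched in isolation. The following is a simplified model and not the patch's code: TIDs are assumed to be plain integers (hypothetical tid_t), an index tuple is modelled by the invented sketch_tuple struct (ntids == 1 standing in for a plain, non-posting tuple), and malloc/realloc replace palloc/repalloc. It shows the same idea: start with room for one TID per tuple, and when a posting list does not fit, grow to the larger of double the current capacity and the exact size needed.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for a heap TID; the real code uses ItemPointerData. */
typedef uint64_t tid_t;

/* Hypothetical simplified index tuple: ntids == 1 models a plain tuple. */
typedef struct
{
	int			ntids;
	const tid_t *tids;
} sketch_tuple;

/*
 * Flatten the TIDs of all tuples into one freshly allocated array.
 * Returns NULL on allocation failure; otherwise stores the TID count in
 * *nout and returns the array, which the caller must free().
 */
tid_t *
collect_tids(const sketch_tuple *tuples, int ntuples, int *nout)
{
	int			cap = ntuples > 0 ? ntuples : 1;	/* fits the no-posting case */
	int			n = 0;
	tid_t	   *out = malloc(sizeof(tid_t) * cap);

	if (out == NULL)
		return NULL;

	for (int i = 0; i < ntuples; i++)
	{
		int			need = n + tuples[i].ntids;

		/* Make sure that we have space for this tuple's TIDs */
		if (need > cap)
		{
			int			newcap = (cap * 2 > need) ? cap * 2 : need;
			tid_t	   *tmp = realloc(out, sizeof(tid_t) * newcap);

			if (tmp == NULL)
			{
				free(out);
				return NULL;
			}
			out = tmp;
			cap = newcap;
		}

		for (int j = 0; j < tuples[i].ntids; j++)
			out[n++] = tuples[i].tids[j];
	}

	*nout = n;
	return out;
}
```

With three tuples holding 1, 3, and 1 TIDs, the array starts with capacity 3, grows once to 6 when the posting list is reached, and ends with 5 TIDs in tuple order.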
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index bbc1376b0a..8a67193152 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -95,6 +95,8 @@ static void btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
BTCycleId cycleid, TransactionId *oldestBtpoXact);
static void btvacuumpage(BTVacState *vstate, BlockNumber blkno,
BlockNumber orig_blkno);
+static ItemPointer btreevacuumposting(BTVacState *vstate, IndexTuple itup,
+ int *nremaining);
/*
@@ -158,7 +160,7 @@ btbuildempty(Relation index)
/* Construct metapage. */
metapage = (Page) palloc(BLCKSZ);
- _bt_initmetapage(metapage, P_NONE, 0);
+ _bt_initmetapage(metapage, P_NONE, 0, _bt_opclasses_support_dedup(index));
/*
* Write the page and log it. It might seem that an immediate sync would
@@ -261,8 +263,8 @@ btgettuple(IndexScanDesc scan, ScanDirection dir)
*/
if (so->killedItems == NULL)
so->killedItems = (int *)
- palloc(MaxIndexTuplesPerPage * sizeof(int));
- if (so->numKilled < MaxIndexTuplesPerPage)
+ palloc(MaxBTreeIndexTuplesPerPage * sizeof(int));
+ if (so->numKilled < MaxBTreeIndexTuplesPerPage)
so->killedItems[so->numKilled++] = so->currPos.itemIndex;
}
@@ -1151,8 +1153,17 @@ restart:
}
else if (P_ISLEAF(opaque))
{
+ /* Deletable item state */
OffsetNumber deletable[MaxOffsetNumber];
int ndeletable;
+ int nhtidsdead;
+ int nhtidslive;
+
+ /* Updatable item state (for posting lists) */
+ IndexTuple updated[MaxOffsetNumber];
+ OffsetNumber updatable[MaxOffsetNumber];
+ int nupdatable;
+
OffsetNumber offnum,
minoff,
maxoff;
@@ -1185,6 +1196,10 @@ restart:
* callback function.
*/
ndeletable = 0;
+ nupdatable = 0;
+ /* Maintain stats counters for index tuple versions/heap TIDs */
+ nhtidsdead = 0;
+ nhtidslive = 0;
minoff = P_FIRSTDATAKEY(opaque);
maxoff = PageGetMaxOffsetNumber(page);
if (callback)
@@ -1194,11 +1209,9 @@ restart:
offnum = OffsetNumberNext(offnum))
{
IndexTuple itup;
- ItemPointer htup;
itup = (IndexTuple) PageGetItem(page,
PageGetItemId(page, offnum));
- htup = &(itup->t_tid);
/*
* During Hot Standby we currently assume that it's okay that
@@ -1221,8 +1234,71 @@ restart:
* applies to *any* type of index that marks index tuples as
* killed.
*/
- if (callback(htup, callback_state))
- deletable[ndeletable++] = offnum;
+ if (!BTreeTupleIsPosting(itup))
+ {
+ /* Regular tuple, standard heap TID representation */
+ ItemPointer htid = &(itup->t_tid);
+
+ if (callback(htid, callback_state))
+ {
+ deletable[ndeletable++] = offnum;
+ nhtidsdead++;
+ }
+ else
+ nhtidslive++;
+ }
+ else
+ {
+ ItemPointer newhtids;
+ int nremaining;
+
+ /*
+ * Posting list tuple, a physical tuple that represents
+ * two or more logical tuples, any of which could be an
+ * index row version that must be removed
+ */
+ newhtids = btreevacuumposting(vstate, itup, &nremaining);
+ if (newhtids == NULL)
+ {
+ /*
+ * All TIDs/logical tuples from the posting tuple
+ * remain, so no update or delete required
+ */
+ Assert(nremaining == BTreeTupleGetNPosting(itup));
+ }
+ else if (nremaining > 0)
+ {
+ IndexTuple updatedtuple;
+
+ /*
+ * Form new tuple that contains only remaining TIDs.
+ * Remember this tuple and the offset of the old tuple
+ * for when we update it in place
+ */
+ Assert(nremaining < BTreeTupleGetNPosting(itup));
+ updatedtuple = _bt_form_posting(itup, newhtids,
+ nremaining);
+ updated[nupdatable] = updatedtuple;
+ updatable[nupdatable++] = offnum;
+ nhtidsdead += BTreeTupleGetNPosting(itup) - nremaining;
+ pfree(newhtids);
+ }
+ else
+ {
+ /*
+ * All TIDs/logical tuples from the posting list must
+ * be deleted. We'll delete the physical tuple
+ * completely.
+ */
+ deletable[ndeletable++] = offnum;
+ nhtidsdead += BTreeTupleGetNPosting(itup);
+
+ /* Free empty array of live items */
+ pfree(newhtids);
+ }
+
+ nhtidslive += nremaining;
+ }
}
}
@@ -1230,11 +1306,12 @@ restart:
* Apply any needed deletes. We issue just one _bt_delitems_vacuum()
* call per page, so as to minimize WAL traffic.
*/
- if (ndeletable > 0)
+ if (ndeletable > 0 || nupdatable > 0)
{
- _bt_delitems_vacuum(rel, buf, deletable, ndeletable);
+ _bt_delitems_vacuum(rel, buf, deletable, ndeletable, updatable,
+ updated, nupdatable);
- stats->tuples_removed += ndeletable;
+ stats->tuples_removed += nhtidsdead;
/* must recompute maxoff */
maxoff = PageGetMaxOffsetNumber(page);
}
@@ -1249,6 +1326,7 @@ restart:
* We treat this like a hint-bit update because there's no need to
* WAL-log it.
*/
+ Assert(nhtidsdead == 0);
if (vstate->cycleid != 0 &&
opaque->btpo_cycleid == vstate->cycleid)
{
@@ -1258,15 +1336,16 @@ restart:
}
/*
- * If it's now empty, try to delete; else count the live tuples. We
- * don't delete when recursing, though, to avoid putting entries into
+ * If it's now empty, try to delete; else count the live tuples (live
+ * heap TIDs in posting lists are counted as live tuples). We don't
+ * delete when recursing, though, to avoid putting entries into
* freePages out-of-order (doesn't seem worth any extra code to handle
* the case).
*/
if (minoff > maxoff)
delete_now = (blkno == orig_blkno);
else
- stats->num_index_tuples += maxoff - minoff + 1;
+ stats->num_index_tuples += nhtidslive;
}
if (delete_now)
@@ -1309,6 +1388,68 @@ restart:
}
}
+/*
+ * btreevacuumposting() -- determines which logical tuples must remain when
+ * VACUUMing a posting list tuple.
+ *
+ * Returns new palloc'd array of item pointers needed to build replacement
+ * posting list without the index row versions that are to be deleted.
+ *
+ * Note that returned array is NULL in the common case where there is nothing
+ * to delete in caller's posting list tuple. The number of TIDs that should
+ * remain in the posting list tuple is set for caller in *nremaining. This is
+ * also the size of the returned array (though only when array isn't just
+ * NULL).
+ */
+static ItemPointer
+btreevacuumposting(BTVacState *vstate, IndexTuple itup, int *nremaining)
+{
+ int live = 0;
+ int nitem = BTreeTupleGetNPosting(itup);
+ ItemPointer tmpitems = NULL,
+ items = BTreeTupleGetPosting(itup);
+
+ Assert(BTreeTupleIsPosting(itup));
+
+ /*
+ * Check each tuple in the posting list. Save live tuples into tmpitems,
+ * though try to avoid memory allocation as an optimization.
+ */
+ for (int i = 0; i < nitem; i++)
+ {
+ if (!vstate->callback(items + i, vstate->callback_state))
+ {
+ /*
+ * Live heap TID.
+ *
+ * Only save live TID when we know that we're going to have to
+ * kill at least one TID, and have already allocated memory.
+ */
+ if (tmpitems)
+ tmpitems[live] = items[i];
+ live++;
+ }
+
+ /* Dead heap TID */
+ else if (tmpitems == NULL)
+ {
+ /*
+ * Turns out we need to delete one or more dead heap TIDs, so
+ * start maintaining an array of live TIDs for caller to
+ * reconstruct smaller replacement posting list tuple
+ */
+ tmpitems = palloc(sizeof(ItemPointerData) * nitem);
+
+ /* Copy live heap TIDs from previous loop iterations */
+ if (live > 0)
+ memcpy(tmpitems, items, sizeof(ItemPointerData) * live);
+ }
+ }
+
+ *nremaining = live;
+ return tmpitems;
+}
+
/*
* btcanreturn() -- Check whether btree indexes support index-only scans.
*
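The lazy-allocation approach taken by btreevacuumposting() above can be looked at in isolation. What follows is only a minimal standalone sketch, not code from the patch: Tid, is_dead_fn, filter_live_tids and dead_on_block are hypothetical names, and plain malloc() stands in for palloc(). The point it illustrates is that no memory is allocated unless at least one TID turns out to be dead, and that the live TIDs seen before the first dead one are back-filled with a single memcpy().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for ItemPointerData: a (block, offset) pair */
typedef struct Tid
{
    unsigned        block;
    unsigned short  offset;
} Tid;

/* Hypothetical stand-in for the VACUUM callback: true means "dead" */
typedef bool (*is_dead_fn) (const Tid *tid, void *state);

/*
 * Returns a malloc'd array holding the live TIDs, or NULL when no TID is
 * dead.  *nremaining is always set to the number of live TIDs.
 */
static Tid *
filter_live_tids(const Tid *items, int nitem, is_dead_fn is_dead,
                 void *state, int *nremaining)
{
    Tid        *live_items = NULL;
    int         live = 0;

    assert(nitem >= 0);

    for (int i = 0; i < nitem; i++)
    {
        if (!is_dead(&items[i], state))
        {
            /* Copy only once we know a replacement array is needed */
            if (live_items)
                live_items[live] = items[i];
            live++;
        }
        else if (live_items == NULL)
        {
            /* First dead TID: allocate, then back-fill earlier live TIDs */
            live_items = malloc(sizeof(Tid) * nitem);
            if (live_items == NULL)
                abort();
            if (live > 0)
                memcpy(live_items, items, sizeof(Tid) * live);
        }
    }

    *nremaining = live;
    return live_items;
}

/* Example callback: every TID on the block number in *state is dead */
static bool
dead_on_block(const Tid *tid, void *state)
{
    return tid->block == *(unsigned *) state;
}
```

A caller can tell the three cases apart the same way btvacuumpage() does: a NULL result means nothing to do, a non-NULL result with *nremaining > 0 means the tuple must be rebuilt, and a non-NULL result with *nremaining == 0 means the whole tuple can be deleted.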
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 8e512461a0..c954926f2d 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -26,10 +26,18 @@
static void _bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp);
static OffsetNumber _bt_binsrch(Relation rel, BTScanInsert key, Buffer buf);
+static int _bt_binsrch_posting(BTScanInsert key, Page page,
+ OffsetNumber offnum);
static bool _bt_readpage(IndexScanDesc scan, ScanDirection dir,
OffsetNumber offnum);
static void _bt_saveitem(BTScanOpaque so, int itemIndex,
OffsetNumber offnum, IndexTuple itup);
+static void _bt_setuppostingitems(BTScanOpaque so, int itemIndex,
+ OffsetNumber offnum, ItemPointer heapTid,
+ IndexTuple itup);
+static inline void _bt_savepostingitem(BTScanOpaque so, int itemIndex,
+ OffsetNumber offnum,
+ ItemPointer heapTid);
static bool _bt_steppage(IndexScanDesc scan, ScanDirection dir);
static bool _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir);
static bool _bt_parallel_readpage(IndexScanDesc scan, BlockNumber blkno,
@@ -434,7 +442,10 @@ _bt_binsrch(Relation rel,
* low) makes bounds invalid.
*
* Caller is responsible for invalidating bounds when it modifies the page
- * before calling here a second time.
+ * before calling here a second time, and for dealing with posting list
+ * tuple matches (callers can use insertstate's postingoff field to
+ * determine which existing heap TID will need to be replaced by their
+ * scantid/new heap TID).
*/
OffsetNumber
_bt_binsrch_insert(Relation rel, BTInsertState insertstate)
@@ -453,6 +464,7 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
Assert(P_ISLEAF(opaque));
Assert(!key->nextkey);
+ Assert(insertstate->postingoff == 0);
if (!insertstate->bounds_valid)
{
@@ -509,6 +521,16 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
if (result != 0)
stricthigh = high;
}
+
+ /*
+ * If tuple at offset located by binary search is a posting list whose
+ * TID range overlaps with caller's scantid, perform posting list
+ * binary search to set postingoff for caller. Caller must split the
+ * posting list when postingoff is set. This should happen
+ * infrequently.
+ */
+ if (unlikely(result == 0 && key->scantid != NULL))
+ insertstate->postingoff = _bt_binsrch_posting(key, page, mid);
}
/*
@@ -528,6 +550,68 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
return low;
}
+/*----------
+ * _bt_binsrch_posting() -- posting list binary search.
+ *
+ * Returns offset into posting list where caller's scantid belongs.
+ *----------
+ */
+static int
+_bt_binsrch_posting(BTScanInsert key, Page page, OffsetNumber offnum)
+{
+ IndexTuple itup;
+ ItemId itemid;
+ int low,
+ high,
+ mid,
+ res;
+
+ /*
+ * If this isn't a posting tuple, then the index must be corrupt (if it is
+ * an ordinary non-pivot tuple then there must be an existing tuple with a
+ * heap TID that equals inserter's new heap TID/scantid). Defensively
+ * check that tuple is a posting list tuple whose posting list range
+ * includes caller's scantid.
+ *
+ * (This is also needed because contrib/amcheck's rootdescend option needs
+ * to be able to relocate a non-pivot tuple using _bt_binsrch_insert().)
+ */
+ itemid = PageGetItemId(page, offnum);
+ itup = (IndexTuple) PageGetItem(page, itemid);
+ if (!BTreeTupleIsPosting(itup))
+ return 0;
+
+ /*
+ * In the unlikely event that posting list tuple has LP_DEAD bit set,
+ * signal to caller that it should kill the item and restart its binary
+ * search.
+ */
+ if (ItemIdIsDead(itemid))
+ return -1;
+
+ /* "high" is past end of posting list for loop invariant */
+ low = 0;
+ high = BTreeTupleGetNPosting(itup);
+ Assert(high >= 2);
+
+ while (high > low)
+ {
+ mid = low + ((high - low) / 2);
+ res = ItemPointerCompare(key->scantid,
+ BTreeTupleGetPostingN(itup, mid));
+
+ if (res > 0)
+ low = mid + 1;
+ else if (res < 0)
+ high = mid;
+ else
+ return mid;
+ }
+
+ /* Exact match not found */
+ return low;
+}
+
/*----------
* _bt_compare() -- Compare insertion-type scankey to tuple on a page.
*
@@ -537,9 +621,14 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
* <0 if scankey < tuple at offnum;
* 0 if scankey == tuple at offnum;
* >0 if scankey > tuple at offnum.
- * NULLs in the keys are treated as sortable values. Therefore
- * "equality" does not necessarily mean that the item should be
- * returned to the caller as a matching key!
+ *
+ * NULLs in the keys are treated as sortable values. Therefore
+ * "equality" does not necessarily mean that the item should be returned
+ * to the caller as a matching key. Similarly, an insertion scankey
+ * with its scantid set is treated as equal to a posting tuple whose TID
+ * range overlaps with its scantid. There generally won't be a
+ * matching TID in the posting tuple, which the caller must handle
+ * itself (e.g., by splitting the posting list tuple).
*
* CRUCIAL NOTE: on a non-leaf page, the first data key is assumed to be
* "minus infinity": this routine will always claim it is less than the
@@ -563,6 +652,7 @@ _bt_compare(Relation rel,
ScanKey scankey;
int ncmpkey;
int ntupatts;
+ int32 result;
Assert(_bt_check_natts(rel, key->heapkeyspace, page, offnum));
Assert(key->keysz <= IndexRelationGetNumberOfKeyAttributes(rel));
@@ -597,7 +687,6 @@ _bt_compare(Relation rel,
{
Datum datum;
bool isNull;
- int32 result;
datum = index_getattr(itup, scankey->sk_attno, itupdesc, &isNull);
@@ -713,8 +802,25 @@ _bt_compare(Relation rel,
if (heapTid == NULL)
return 1;
+ /*
+ * scankey must be treated as equal to a posting list tuple if its scantid
+ * value falls within the range of the posting list. In all other cases
+ * there can only be a single heap TID value, which is compared directly
+ * as a simple scalar value.
+ */
Assert(ntupatts >= IndexRelationGetNumberOfKeyAttributes(rel));
- return ItemPointerCompare(key->scantid, heapTid);
+ result = ItemPointerCompare(key->scantid, heapTid);
+ if (result <= 0 || !BTreeTupleIsPosting(itup))
+ return result;
+ else
+ {
+ result = ItemPointerCompare(key->scantid,
+ BTreeTupleGetMaxHeapTID(itup));
+ if (result > 0)
+ return 1;
+ }
+
+ return 0;
}
/*
@@ -1230,6 +1336,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
/* Initialize remaining insertion scan key fields */
inskey.heapkeyspace = _bt_heapkeyspace(rel);
+ inskey.safededup = false; /* unused */
inskey.anynullkeys = false; /* unused */
inskey.nextkey = nextkey;
inskey.pivotsearch = false;
@@ -1451,6 +1558,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
/* initialize tuple workspace to empty */
so->currPos.nextTupleOffset = 0;
+ so->currPos.postingTupleOffset = 0;
/*
* Now that the current page has been made consistent, the macro should be
@@ -1484,9 +1592,31 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
if (_bt_checkkeys(scan, itup, indnatts, dir, &continuescan))
{
- /* tuple passes all scan key conditions, so remember it */
- _bt_saveitem(so, itemIndex, offnum, itup);
- itemIndex++;
+ /* tuple passes all scan key conditions */
+ if (!BTreeTupleIsPosting(itup))
+ {
+ /* Remember it */
+ _bt_saveitem(so, itemIndex, offnum, itup);
+ itemIndex++;
+ }
+ else
+ {
+ /*
+ * Set up state to return posting list, and remember first
+ * "logical" tuple
+ */
+ _bt_setuppostingitems(so, itemIndex, offnum,
+ BTreeTupleGetPostingN(itup, 0),
+ itup);
+ itemIndex++;
+ /* Remember additional logical tuples */
+ for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
+ {
+ _bt_savepostingitem(so, itemIndex, offnum,
+ BTreeTupleGetPostingN(itup, i));
+ itemIndex++;
+ }
+ }
}
/* When !continuescan, there can't be any more matches, so stop */
if (!continuescan)
@@ -1519,7 +1649,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
if (!continuescan)
so->currPos.moreRight = false;
- Assert(itemIndex <= MaxIndexTuplesPerPage);
+ Assert(itemIndex <= MaxBTreeIndexTuplesPerPage);
so->currPos.firstItem = 0;
so->currPos.lastItem = itemIndex - 1;
so->currPos.itemIndex = 0;
@@ -1527,7 +1657,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
else
{
/* load items[] in descending order */
- itemIndex = MaxIndexTuplesPerPage;
+ itemIndex = MaxBTreeIndexTuplesPerPage;
offnum = Min(offnum, maxoff);
@@ -1568,9 +1698,37 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
&continuescan);
if (passes_quals && tuple_alive)
{
- /* tuple passes all scan key conditions, so remember it */
- itemIndex--;
- _bt_saveitem(so, itemIndex, offnum, itup);
+ /* tuple passes all scan key conditions */
+ if (!BTreeTupleIsPosting(itup))
+ {
+ /* Remember it */
+ itemIndex--;
+ _bt_saveitem(so, itemIndex, offnum, itup);
+ }
+ else
+ {
+ int i = BTreeTupleGetNPosting(itup) - 1;
+
+ /*
+ * Set up state to return posting list, and remember last
+ * "logical" tuple (since we'll return it first)
+ */
+ itemIndex--;
+ _bt_setuppostingitems(so, itemIndex, offnum,
+ BTreeTupleGetPostingN(itup, i--),
+ itup);
+
+ /*
+ * Remember additional logical tuples (use desc order to
+ * be consistent with order of entire scan)
+ */
+ for (; i >= 0; i--)
+ {
+ itemIndex--;
+ _bt_savepostingitem(so, itemIndex, offnum,
+ BTreeTupleGetPostingN(itup, i));
+ }
+ }
}
if (!continuescan)
{
@@ -1584,8 +1742,8 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
Assert(itemIndex >= 0);
so->currPos.firstItem = itemIndex;
- so->currPos.lastItem = MaxIndexTuplesPerPage - 1;
- so->currPos.itemIndex = MaxIndexTuplesPerPage - 1;
+ so->currPos.lastItem = MaxBTreeIndexTuplesPerPage - 1;
+ so->currPos.itemIndex = MaxBTreeIndexTuplesPerPage - 1;
}
return (so->currPos.firstItem <= so->currPos.lastItem);
@@ -1598,6 +1756,8 @@ _bt_saveitem(BTScanOpaque so, int itemIndex,
{
BTScanPosItem *currItem = &so->currPos.items[itemIndex];
+ Assert(!BTreeTupleIsPosting(itup));
+
currItem->heapTid = itup->t_tid;
currItem->indexOffset = offnum;
if (so->currTuples)
@@ -1610,6 +1770,64 @@ _bt_saveitem(BTScanOpaque so, int itemIndex,
}
}
+/*
+ * Set up state to save posting items from a single posting list tuple. Saves
+ * the logical tuple that will be returned to scan first in passing.
+ *
+ * Saves an index item into so->currPos.items[itemIndex] for the logical tuple
+ * that is returned to scan first. The second and subsequent heap TIDs for the
+ * posting list should be saved by calling _bt_savepostingitem().
+ */
+static void
+_bt_setuppostingitems(BTScanOpaque so, int itemIndex, OffsetNumber offnum,
+ ItemPointer heapTid, IndexTuple itup)
+{
+ BTScanPosItem *currItem = &so->currPos.items[itemIndex];
+
+ currItem->heapTid = *heapTid;
+ currItem->indexOffset = offnum;
+
+ if (so->currTuples)
+ {
+ /* Save base IndexTuple (truncate posting list) */
+ IndexTuple base;
+ Size itupsz = BTreeTupleGetPostingOffset(itup);
+
+ itupsz = MAXALIGN(itupsz);
+ currItem->tupleOffset = so->currPos.nextTupleOffset;
+ base = (IndexTuple) (so->currTuples + so->currPos.nextTupleOffset);
+ memcpy(base, itup, itupsz);
+ /* Defensively reduce work area index tuple header size */
+ base->t_info &= ~INDEX_SIZE_MASK;
+ base->t_info |= itupsz;
+ so->currPos.nextTupleOffset += itupsz;
+ so->currPos.postingTupleOffset = currItem->tupleOffset;
+ }
+}
+
+/*
+ * Save an index item into so->currPos.items[itemIndex] for posting tuple.
+ *
+ * Assumes that _bt_setuppostingitems() has already been called for current
+ * posting list tuple.
+ */
+static inline void
+_bt_savepostingitem(BTScanOpaque so, int itemIndex, OffsetNumber offnum,
+ ItemPointer heapTid)
+{
+ BTScanPosItem *currItem = &so->currPos.items[itemIndex];
+
+ currItem->heapTid = *heapTid;
+ currItem->indexOffset = offnum;
+
+ /*
+ * Have index-only scans return the same base IndexTuple for every logical
+ * tuple that originates from the same posting list
+ */
+ if (so->currTuples)
+ currItem->tupleOffset = so->currPos.postingTupleOffset;
+}
+
/*
* _bt_steppage() -- Step to next page containing valid data for scan
*
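The two-step lookup that the nbtsearch.c changes above rely on (a range comparison against the whole posting list in _bt_compare(), followed by a binary search inside the list in _bt_binsrch_posting()) can be sketched on its own. This is a simplified illustration under stated assumptions, not the patch's code: Tid, tid_compare, posting_range_compare and posting_binsrch are hypothetical names, and the posting list is assumed to be a plain array of TIDs in ascending order.

```c
#include <assert.h>

/* Hypothetical stand-in for ItemPointerData: a (block, offset) pair */
typedef struct Tid
{
    unsigned        block;
    unsigned short  offset;
} Tid;

/* Three-way comparison, in the spirit of ItemPointerCompare() */
static int
tid_compare(const Tid *a, const Tid *b)
{
    if (a->block != b->block)
        return (a->block < b->block) ? -1 : 1;
    if (a->offset != b->offset)
        return (a->offset < b->offset) ? -1 : 1;
    return 0;
}

/*
 * Compare scantid against a whole posting list: <0 when it sorts before
 * the first TID, >0 when it sorts after the last TID, and 0 when it falls
 * anywhere within the [first, last] range (match or not).
 */
static int
posting_range_compare(const Tid *scantid, const Tid *posting, int nposting)
{
    int         result;

    assert(nposting >= 1);

    result = tid_compare(scantid, &posting[0]);
    if (result <= 0)
        return result;
    if (tid_compare(scantid, &posting[nposting - 1]) > 0)
        return 1;
    return 0;
}

/*
 * Binary search within the posting list.  Returns the index of an exact
 * match, or else the index at which scantid would have to be inserted.
 */
static int
posting_binsrch(const Tid *scantid, const Tid *posting, int nposting)
{
    int         low = 0;
    int         high = nposting;    /* one past the end, loop invariant */

    while (high > low)
    {
        int         mid = low + (high - low) / 2;
        int         res = tid_compare(scantid, &posting[mid]);

        if (res > 0)
            low = mid + 1;
        else if (res < 0)
            high = mid;
        else
            return mid;
    }

    return low;
}
```

For a posting list holding (1,1), (1,5) and (2,3), a scantid of (1,3) compares as equal to the tuple as a whole, and the inner search reports offset 1 as the place where the list would have to be split.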
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 1dd39a9535..b40559d45f 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -243,6 +243,7 @@ typedef struct BTPageState
BlockNumber btps_blkno; /* block # to write this page at */
IndexTuple btps_lowkey; /* page's strict lower bound pivot tuple */
OffsetNumber btps_lastoff; /* last item offset loaded */
+ Size btps_lastextra; /* last item's extra posting list space */
uint32 btps_level; /* tree level (0 = leaf) */
Size btps_full; /* "full" if less than this much free space */
struct BTPageState *btps_next; /* link to parent level, if any */
@@ -277,7 +278,10 @@ static void _bt_slideleft(Page page);
static void _bt_sortaddtup(Page page, Size itemsize,
IndexTuple itup, OffsetNumber itup_off);
static void _bt_buildadd(BTWriteState *wstate, BTPageState *state,
- IndexTuple itup);
+ IndexTuple itup, Size truncextra);
+static void _bt_sort_dedup_finish_pending(BTWriteState *wstate,
+ BTPageState *state,
+ BTDedupState dstate);
static void _bt_uppershutdown(BTWriteState *wstate, BTPageState *state);
static void _bt_load(BTWriteState *wstate,
BTSpool *btspool, BTSpool *btspool2);
@@ -711,6 +715,7 @@ _bt_pagestate(BTWriteState *wstate, uint32 level)
state->btps_lowkey = NULL;
/* initialize lastoff so first item goes into P_FIRSTKEY */
state->btps_lastoff = P_HIKEY;
+ state->btps_lastextra = 0;
state->btps_level = level;
/* set "full" threshold based on level. See notes at head of file. */
if (level > 0)
@@ -789,7 +794,8 @@ _bt_sortaddtup(Page page,
}
/*----------
- * Add an item to a disk page from the sort output.
+ * Add an item to a disk page from the sort output (or add a posting list
+ * item formed from the sort output).
*
* We must be careful to observe the page layout conventions of nbtsearch.c:
* - rightmost pages start data items at P_HIKEY instead of at P_FIRSTKEY.
@@ -821,14 +827,27 @@ _bt_sortaddtup(Page page,
* the truncated high key at offset 1.
*
* 'last' pointer indicates the last offset added to the page.
+ *
+ * 'truncextra' is the size of the posting list in itup, if any. This
+ * information is stashed for the next call here, when we may benefit
+ * from considering the impact of truncating away the posting list on
+ * the page before deciding to finish the page off. Posting lists are
+ * often relatively large, so it is worth going to the trouble of
+ * accounting for the saving from truncating away the posting list of
+ * the tuple that becomes the high key (that may be the only way to
+ * get close to target free space on the page). Note that this is
+ * only used for the soft fillfactor-wise limit, not the critical hard
+ * limit.
*----------
*/
static void
-_bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
+_bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup,
+ Size truncextra)
{
Page npage;
BlockNumber nblkno;
OffsetNumber last_off;
+ Size last_truncextra;
Size pgspc;
Size itupsz;
bool isleaf;
@@ -842,6 +861,8 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
npage = state->btps_page;
nblkno = state->btps_blkno;
last_off = state->btps_lastoff;
+ last_truncextra = state->btps_lastextra;
+ state->btps_lastextra = truncextra;
pgspc = PageGetFreeSpace(npage);
itupsz = IndexTupleSize(itup);
@@ -883,10 +904,10 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
* page. Disregard fillfactor and insert on "full" current page if we
* don't have the minimum number of items yet. (Note that we deliberately
* assume that suffix truncation neither enlarges nor shrinks new high key
- * when applying soft limit.)
+ * when applying soft limit, except when last tuple had a posting list.)
*/
if (pgspc < itupsz + (isleaf ? MAXALIGN(sizeof(ItemPointerData)) : 0) ||
- (pgspc < state->btps_full && last_off > P_FIRSTKEY))
+ (pgspc + last_truncextra < state->btps_full && last_off > P_FIRSTKEY))
{
/*
* Finish off the page and write it out.
@@ -944,11 +965,11 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
* We don't try to bias our choice of split point to make it more
* likely that _bt_truncate() can truncate away more attributes,
* whereas the split point used within _bt_split() is chosen much
- * more delicately. Suffix truncation is mostly useful because it
- * improves space utilization for workloads with random
- * insertions. It doesn't seem worthwhile to add logic for
- * choosing a split point here for a benefit that is bound to be
- * much smaller.
+ * more delicately. On the other hand, non-unique index builds
+ * usually deduplicate, which often results in every "physical"
+ * tuple on the page having distinct key values. When that
+ * happens, _bt_truncate() will never need to include a heap TID
+ * in the new high key.
*
* Overwrite the old item with new truncated high key directly.
* oitup is already located at the physical beginning of tuple
@@ -983,7 +1004,7 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
Assert(BTreeTupleGetNAtts(state->btps_lowkey, wstate->index) == 0 ||
!P_LEFTMOST((BTPageOpaque) PageGetSpecialPointer(opage)));
BTreeInnerTupleSetDownLink(state->btps_lowkey, oblkno);
- _bt_buildadd(wstate, state->btps_next, state->btps_lowkey);
+ _bt_buildadd(wstate, state->btps_next, state->btps_lowkey, 0);
pfree(state->btps_lowkey);
/*
@@ -1045,6 +1066,47 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
state->btps_lastoff = last_off;
}
+/*
+ * Finalize pending posting list tuple, and add it to the index. Final tuple
+ * is based on saved base tuple, and saved list of heap TIDs.
+ *
+ * This is almost like _bt_dedup_finish_pending(), but it adds a new tuple
+ * using _bt_buildadd() and does not maintain the intervals array.
+ */
+static void
+_bt_sort_dedup_finish_pending(BTWriteState *wstate, BTPageState *state,
+ BTDedupState dstate)
+{
+ IndexTuple final;
+ Size truncextra;
+
+ Assert(dstate->nitems > 0);
+ truncextra = 0;
+ if (dstate->nitems == 1)
+ final = dstate->base;
+ else
+ {
+ IndexTuple postingtuple;
+
+ /* form a tuple with a posting list */
+ postingtuple = _bt_form_posting(dstate->base,
+ dstate->htids,
+ dstate->nhtids);
+ final = postingtuple;
+ /* Determine size of posting list */
+ truncextra = IndexTupleSize(final) -
+ BTreeTupleGetPostingOffset(final);
+ }
+
+ _bt_buildadd(wstate, state, final, truncextra);
+
+ if (dstate->nitems > 1)
+ pfree(final);
+ /* Don't maintain dedup_intervals array, or alltupsize */
+ dstate->nhtids = 0;
+ dstate->nitems = 0;
+}
+
/*
* Finish writing out the completed btree.
*/
@@ -1090,7 +1152,7 @@ _bt_uppershutdown(BTWriteState *wstate, BTPageState *state)
Assert(BTreeTupleGetNAtts(s->btps_lowkey, wstate->index) == 0 ||
!P_LEFTMOST(opaque));
BTreeInnerTupleSetDownLink(s->btps_lowkey, blkno);
- _bt_buildadd(wstate, s->btps_next, s->btps_lowkey);
+ _bt_buildadd(wstate, s->btps_next, s->btps_lowkey, 0);
pfree(s->btps_lowkey);
s->btps_lowkey = NULL;
}
@@ -1111,7 +1173,8 @@ _bt_uppershutdown(BTWriteState *wstate, BTPageState *state)
* by filling in a valid magic number in the metapage.
*/
metapage = (Page) palloc(BLCKSZ);
- _bt_initmetapage(metapage, rootblkno, rootlevel);
+ _bt_initmetapage(metapage, rootblkno, rootlevel,
+ wstate->inskey->safededup);
_bt_blwritepage(wstate, metapage, BTREE_METAPAGE);
}
@@ -1132,6 +1195,9 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
keysz = IndexRelationGetNumberOfKeyAttributes(wstate->index);
SortSupport sortKeys;
int64 tuples_done = 0;
+ bool deduplicate;
+
+ deduplicate = wstate->inskey->safededup && BTGetUseDedup(wstate->index);
if (merge)
{
@@ -1228,12 +1294,12 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
if (load1)
{
- _bt_buildadd(wstate, state, itup);
+ _bt_buildadd(wstate, state, itup, 0);
itup = tuplesort_getindextuple(btspool->sortstate, true);
}
else
{
- _bt_buildadd(wstate, state, itup2);
+ _bt_buildadd(wstate, state, itup2, 0);
itup2 = tuplesort_getindextuple(btspool2->sortstate, true);
}
@@ -1243,9 +1309,113 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
}
pfree(sortKeys);
}
+ else if (deduplicate)
+ {
+ /* merge is unnecessary, deduplicate into posting lists */
+ BTDedupState dstate;
+ IndexTuple newbase;
+
+ dstate = (BTDedupState) palloc(sizeof(BTDedupStateData));
+ dstate->maxitemsize = 0; /* set later */
+ dstate->checkingunique = false; /* unused */
+ dstate->skippedbase = InvalidOffsetNumber;
+ dstate->newitem = NULL;
+ /* Metadata about current pending posting list */
+ dstate->htids = NULL;
+ dstate->nhtids = 0;
+ dstate->nitems = 0;
+ dstate->overlap = false;
+ dstate->alltupsize = 0; /* unused */
+ /* Metadata about base tuple of current pending posting list */
+ dstate->base = NULL;
+ dstate->baseoff = InvalidOffsetNumber; /* unused */
+ dstate->basetupsize = 0;
+
+ while ((itup = tuplesort_getindextuple(btspool->sortstate,
+ true)) != NULL)
+ {
+ /* When we see first tuple, create first index page */
+ if (state == NULL)
+ {
+ state = _bt_pagestate(wstate, 0);
+
+ /*
+ * Limit size of posting list tuples to the size of the free
+ * space we want to leave behind on the page, plus space for
+ * final item's line pointer (but make sure that posting list
+ * tuple size won't exceed the generic 1/3 of a page limit).
+ *
+ * This is more conservative than the approach taken in the
+ * retail insert path, but it allows us to get most of the
+ * space savings deduplication provides without noticeably
+ * impacting how much free space is left behind on each leaf
+ * page.
+ */
+ dstate->maxitemsize =
+ Min(BTMaxItemSize(state->btps_page),
+ MAXALIGN_DOWN(state->btps_full) - sizeof(ItemIdData));
+ /* Minimum posting tuple size used here is arbitrary: */
+ dstate->maxitemsize = Max(dstate->maxitemsize, 100);
+ dstate->htids = palloc(dstate->maxitemsize);
+
+ /*
+ * No previous/base tuple, since itup is the first item
+ * returned by the tuplesort -- use itup as base tuple of
+ * first pending posting list for entire index build
+ */
+ newbase = CopyIndexTuple(itup);
+ _bt_dedup_start_pending(dstate, newbase, InvalidOffsetNumber);
+ }
+ else if (_bt_keep_natts_fast(wstate->index, dstate->base,
+ itup) > keysz &&
+ _bt_dedup_save_htid(dstate, itup))
+ {
+ /*
+ * Tuple is equal to base tuple of pending posting list, and
+ * merging itup into pending posting list won't exceed the
+ * maxitemsize limit. Heap TID(s) for itup have been saved in
+ * state. The next iteration will also end up here if it's
+ * possible to merge the next tuple into the same pending
+ * posting list.
+ */
+ }
+ else
+ {
+ /*
+ * Tuple is not equal to pending posting list tuple, or
+ * maxitemsize limit was reached
+ */
+ _bt_sort_dedup_finish_pending(wstate, state, dstate);
+ /* Base tuple is always a copy */
+ pfree(dstate->base);
+
+ /* itup starts new pending posting list */
+ newbase = CopyIndexTuple(itup);
+ _bt_dedup_start_pending(dstate, newbase, InvalidOffsetNumber);
+ }
+
+ /* Report progress */
+ pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
+ ++tuples_done);
+ }
+
+ /*
+ * Handle the last item (there must be a last item when the tuplesort
+ * returned one or more tuples)
+ */
+ if (state)
+ {
+ _bt_sort_dedup_finish_pending(wstate, state, dstate);
+ /* Base tuple is always a copy */
+ pfree(dstate->base);
+ pfree(dstate->htids);
+ }
+
+ pfree(dstate);
+ }
else
{
- /* merge is unnecessary */
+ /* merging and deduplication are both unnecessary */
while ((itup = tuplesort_getindextuple(btspool->sortstate,
true)) != NULL)
{
@@ -1253,7 +1423,7 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
if (state == NULL)
state = _bt_pagestate(wstate, 0);
- _bt_buildadd(wstate, state, itup);
+ _bt_buildadd(wstate, state, itup, 0);
/* Report progress */
pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
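The control flow of the deduplicating branch of _bt_load() above (start a pending posting list, keep merging while the key matches and the size limit allows, otherwise finish the pending list and start a new one) can be sketched without any of the nbtree machinery. This is a rough model under stated assumptions, not the patch's code: Entry and dedup_sorted are hypothetical names, keys are plain integers, and the limit is a count of TIDs per group rather than the byte-based maxitemsize that the patch uses.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a sorted index tuple: key plus heap TID */
typedef struct Entry
{
    int         key;
    unsigned    tid;
} Entry;

/*
 * Walk entries that are already sorted by (key, tid), merging runs of
 * equal keys into groups of at most max_tids entries.  The size of each
 * finished group is stored in group_sizes[], which must have room for
 * nentries elements.  Returns the number of groups produced.
 */
static size_t
dedup_sorted(const Entry *entries, size_t nentries, size_t max_tids,
             size_t *group_sizes)
{
    size_t      ngroups = 0;
    size_t      pending = 0;    /* entries in the pending group */
    int         base_key = 0;   /* key of the pending group's base entry */

    assert(max_tids >= 1);

    for (size_t i = 0; i < nentries; i++)
    {
        if (pending > 0 && entries[i].key == base_key && pending < max_tids)
        {
            /* Same key and still under the limit: merge into pending */
            pending++;
            continue;
        }

        /* Key changed or limit reached: finish the pending group */
        if (pending > 0)
            group_sizes[ngroups++] = pending;

        /* This entry becomes the base of a new pending group */
        base_key = entries[i].key;
        pending = 1;
    }

    /* Handle the last pending group, if there was any input at all */
    if (pending > 0)
        group_sizes[ngroups++] = pending;

    return ngroups;
}
```

With keys 1,1,1,1,2,3,3 and a limit of three TIDs per group, the sketch produces groups of size 3, 1, 1 and 2. A group of size 1 corresponds to the case in _bt_sort_dedup_finish_pending() where the base tuple is added as-is instead of forming a posting list.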
diff --git a/src/backend/access/nbtree/nbtsplitloc.c b/src/backend/access/nbtree/nbtsplitloc.c
index 29167f1ef5..ffec42e78a 100644
--- a/src/backend/access/nbtree/nbtsplitloc.c
+++ b/src/backend/access/nbtree/nbtsplitloc.c
@@ -51,6 +51,7 @@ typedef struct
Size newitemsz; /* size of newitem (includes line pointer) */
bool is_leaf; /* T if splitting a leaf page */
bool is_rightmost; /* T if splitting rightmost page on level */
+ bool is_deduped; /* T if posting list truncation expected */
OffsetNumber newitemoff; /* where the new item is to be inserted */
int leftspace; /* space available for items on left page */
int rightspace; /* space available for items on right page */
@@ -177,12 +178,16 @@ _bt_findsplitloc(Relation rel,
state.newitemsz = newitemsz;
state.is_leaf = P_ISLEAF(opaque);
state.is_rightmost = P_RIGHTMOST(opaque);
+ state.is_deduped = state.is_leaf && BTGetUseDedup(rel);
state.leftspace = leftspace;
state.rightspace = rightspace;
state.olddataitemstotal = olddataitemstotal;
state.minfirstrightsz = SIZE_MAX;
state.newitemoff = newitemoff;
+ /* newitem cannot be a posting list item */
+ Assert(!BTreeTupleIsPosting(newitem));
+
/*
* maxsplits should never exceed maxoff because there will be at most as
* many candidate split points as there are points _between_ tuples, once
@@ -459,6 +464,7 @@ _bt_recsplitloc(FindSplitData *state,
int16 leftfree,
rightfree;
Size firstrightitemsz;
+ Size postingsz = 0;
bool newitemisfirstonright;
/* Is the new item going to be the first item on the right page? */
@@ -468,8 +474,31 @@ _bt_recsplitloc(FindSplitData *state,
if (newitemisfirstonright)
firstrightitemsz = state->newitemsz;
else
+ {
firstrightitemsz = firstoldonrightsz;
+ /*
+ * Calculate suffix truncation space saving when firstright is a
+ * posting list tuple.
+ *
+ * Individual posting lists often take up a significant fraction of
+ * all space on a page. Failing to consider that the new high key
+ * won't need to store the posting list a second time really matters.
+ */
+ if (state->is_leaf && state->is_deduped)
+ {
+ ItemId itemid;
+ IndexTuple newhighkey;
+
+ itemid = PageGetItemId(state->page, firstoldonright);
+ newhighkey = (IndexTuple) PageGetItem(state->page, itemid);
+
+ if (BTreeTupleIsPosting(newhighkey))
+ postingsz = IndexTupleSize(newhighkey) -
+ BTreeTupleGetPostingOffset(newhighkey);
+ }
+ }
+
/* Account for all the old tuples */
leftfree = state->leftspace - olddataitemstoleft;
rightfree = state->rightspace -
@@ -492,9 +521,11 @@ _bt_recsplitloc(FindSplitData *state,
* adding a heap TID to the left half's new high key when splitting at the
* leaf level. In practice the new high key will often be smaller and
* will rarely be larger, but conservatively assume the worst case.
+ * Truncation always truncates away any posting list that appears in the
+ * first right tuple, though, so it's safe to subtract that overhead.
*/
if (state->is_leaf)
- leftfree -= (int16) (firstrightitemsz +
+ leftfree -= (int16) ((firstrightitemsz - postingsz) +
MAXALIGN(sizeof(ItemPointerData)));
else
leftfree -= (int16) firstrightitemsz;
@@ -691,7 +722,8 @@ _bt_afternewitemoff(FindSplitData *state, OffsetNumber maxoff,
itemid = PageGetItemId(state->page, OffsetNumberPrev(state->newitemoff));
tup = (IndexTuple) PageGetItem(state->page, itemid);
/* Do cheaper test first */
- if (!_bt_adjacenthtid(&tup->t_tid, &state->newitem->t_tid))
+ if (BTreeTupleIsPosting(tup) ||
+ !_bt_adjacenthtid(&tup->t_tid, &state->newitem->t_tid))
return false;
/* Check same conditions as rightmost item case, too */
keepnatts = _bt_keep_natts_fast(state->rel, tup, state->newitem);
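
To make the _bt_recsplitloc accounting change above easier to follow, here is a small standalone sketch of the arithmetic. The helper name left_free_after_highkey and the constant MAXALIGNED_TID_SIZE are hypothetical and not part of nbtree; the value 8 assumes MAXALIGN(sizeof(ItemPointerData)) on a typical 64-bit build.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: MAXALIGN(sizeof(ItemPointerData)) == 8 on common 64-bit builds */
#define MAXALIGNED_TID_SIZE 8

/*
 * Hypothetical helper (not nbtree code): free space remaining on the left
 * page once the new high key has been charged against it.
 *
 * postingsz is the size of the posting list portion of the first right
 * tuple, or 0 for a plain tuple.  The posting list is never copied into the
 * high key, so it can be subtracted before adding the worst-case heap TID.
 */
static int
left_free_after_highkey(int leftfree, size_t firstrightitemsz,
                        size_t postingsz, int is_leaf)
{
    assert(postingsz <= firstrightitemsz);

    if (is_leaf)
        return leftfree - (int) ((firstrightitemsz - postingsz) +
                                 MAXALIGNED_TID_SIZE);

    /* Internal pages: the high key is charged at its full size */
    return leftfree - (int) firstrightitemsz;
}
```

With a 400 byte first right tuple of which 376 bytes are posting list, the left page is charged only 32 bytes rather than 408, which is why ignoring postingsz would make many otherwise legal split points look infeasible.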
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index ee972a1465..cb6a5b9335 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -20,6 +20,7 @@
#include "access/nbtree.h"
#include "access/reloptions.h"
#include "access/relscan.h"
+#include "catalog/catalog.h"
#include "commands/progress.h"
#include "lib/qunique.h"
#include "miscadmin.h"
@@ -98,8 +99,6 @@ _bt_mkscankey(Relation rel, IndexTuple itup)
indoption = rel->rd_indoption;
tupnatts = itup ? BTreeTupleGetNAtts(itup, rel) : 0;
- Assert(tupnatts <= IndexRelationGetNumberOfAttributes(rel));
-
/*
* We'll execute search using scan key constructed on key columns.
* Truncated attributes and non-key attributes are omitted from the final
@@ -108,12 +107,25 @@ _bt_mkscankey(Relation rel, IndexTuple itup)
key = palloc(offsetof(BTScanInsertData, scankeys) +
sizeof(ScanKeyData) * indnkeyatts);
key->heapkeyspace = itup == NULL || _bt_heapkeyspace(rel);
+ key->safededup = itup == NULL ? _bt_opclasses_support_dedup(rel) :
+ _bt_safededup(rel);
key->anynullkeys = false; /* initial assumption */
key->nextkey = false;
key->pivotsearch = false;
+ key->scantid = NULL;
key->keysz = Min(indnkeyatts, tupnatts);
- key->scantid = key->heapkeyspace && itup ?
- BTreeTupleGetHeapTID(itup) : NULL;
+
+ Assert(tupnatts <= IndexRelationGetNumberOfAttributes(rel));
+ Assert(!itup || !BTreeTupleIsPosting(itup) || key->heapkeyspace);
+
+ /*
+ * When caller passes a tuple with a heap TID, use it to set scantid. Note
+ * that this handles posting list tuples by setting scantid to the lowest
+ * heap TID in the posting list.
+ */
+ if (itup && key->heapkeyspace)
+ key->scantid = BTreeTupleGetHeapTID(itup);
+
skey = key->scankeys;
for (i = 0; i < indnkeyatts; i++)
{
@@ -1373,6 +1385,7 @@ _bt_checkkeys(IndexScanDesc scan, IndexTuple tuple, int tupnatts,
* attribute passes the qual.
*/
Assert(ScanDirectionIsForward(dir));
+ Assert(BTreeTupleIsPivot(tuple));
continue;
}
@@ -1534,6 +1547,7 @@ _bt_check_rowcompare(ScanKey skey, IndexTuple tuple, int tupnatts,
* attribute passes the qual.
*/
Assert(ScanDirectionIsForward(dir));
+ Assert(BTreeTupleIsPivot(tuple));
cmpresult = 0;
if (subkey->sk_flags & SK_ROW_END)
break;
@@ -1773,10 +1787,35 @@ _bt_killitems(IndexScanDesc scan)
{
ItemId iid = PageGetItemId(page, offnum);
IndexTuple ituple = (IndexTuple) PageGetItem(page, iid);
+ bool killtuple = false;
- if (ItemPointerEquals(&ituple->t_tid, &kitem->heapTid))
+ if (BTreeTupleIsPosting(ituple))
{
- /* found the item */
+ int pi = i + 1;
+ int nposting = BTreeTupleGetNPosting(ituple);
+ int j;
+
+ for (j = 0; j < nposting; j++)
+ {
+ ItemPointer item = BTreeTupleGetPostingN(ituple, j);
+
+ if (!ItemPointerEquals(item, &kitem->heapTid))
+ break; /* out of posting list loop */
+
+ /* Read-ahead to later kitems */
+ if (pi < numKilled)
+ kitem = &so->currPos.items[so->killedItems[pi++]];
+ }
+
+ if (j == nposting)
+ killtuple = true;
+ }
+ else if (ItemPointerEquals(&ituple->t_tid, &kitem->heapTid))
+ killtuple = true;
+
+ if (killtuple)
+ {
+ /* found the item/all posting list items */
ItemIdMarkDead(iid);
killedsomething = true;
break; /* out of inner search loop */
@@ -2017,7 +2056,9 @@ btoptions(Datum reloptions, bool validate)
static const relopt_parse_elt tab[] = {
{"fillfactor", RELOPT_TYPE_INT, offsetof(BTOptions, fillfactor)},
{"vacuum_cleanup_index_scale_factor", RELOPT_TYPE_REAL,
- offsetof(BTOptions, vacuum_cleanup_index_scale_factor)}
+ offsetof(BTOptions, vacuum_cleanup_index_scale_factor)},
+ {"deduplication", RELOPT_TYPE_BOOL,
+ offsetof(BTOptions, deduplication)}
};
@@ -2138,6 +2179,24 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
pivot = index_truncate_tuple(itupdesc, firstright, keepnatts);
+ if (BTreeTupleIsPosting(firstright))
+ {
+ BTreeTupleClearBtIsPosting(pivot);
+ BTreeTupleSetNAtts(pivot, keepnatts);
+ if (keepnatts == natts)
+ {
+ /*
+ * index_truncate_tuple() just returned a copy of the
+ * original, so make sure that the size of the new pivot tuple
+ * doesn't have posting list overhead
+ */
+ pivot->t_info &= ~INDEX_SIZE_MASK;
+ pivot->t_info |= MAXALIGN(BTreeTupleGetPostingOffset(firstright));
+ }
+ }
+
+ Assert(!BTreeTupleIsPosting(pivot));
+
/*
* If there is a distinguishing key attribute within new pivot tuple,
* there is no need to add an explicit heap TID attribute
@@ -2154,6 +2213,8 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
* attribute to the new pivot tuple.
*/
Assert(natts != nkeyatts);
+ Assert(!BTreeTupleIsPosting(lastleft) &&
+ !BTreeTupleIsPosting(firstright));
newsize = IndexTupleSize(pivot) + MAXALIGN(sizeof(ItemPointerData));
tidpivot = palloc0(newsize);
memcpy(tidpivot, pivot, IndexTupleSize(pivot));
@@ -2161,6 +2222,24 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
pfree(pivot);
pivot = tidpivot;
}
+ else if (BTreeTupleIsPosting(firstright))
+ {
+ /*
+ * No truncation was possible, since key attributes are all equal. We
+ * can always truncate away a posting list, though.
+ *
+ * It's necessary to add a heap TID attribute to the new pivot tuple.
+ */
+ newsize = MAXALIGN(BTreeTupleGetPostingOffset(firstright)) +
+ MAXALIGN(sizeof(ItemPointerData));
+ pivot = palloc0(newsize);
+ memcpy(pivot, firstright, BTreeTupleGetPostingOffset(firstright));
+
+ pivot->t_info &= ~INDEX_SIZE_MASK;
+ pivot->t_info |= newsize;
+ BTreeTupleClearBtIsPosting(pivot);
+ BTreeTupleSetAltHeapTID(pivot);
+ }
else
{
/*
@@ -2186,6 +2265,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
* nbtree (e.g., there is no pg_attribute entry).
*/
Assert(itup_key->heapkeyspace);
+ Assert(!BTreeTupleIsPosting(pivot));
pivot->t_info &= ~INDEX_SIZE_MASK;
pivot->t_info |= newsize;
@@ -2198,7 +2278,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
*/
pivotheaptid = (ItemPointer) ((char *) pivot + newsize -
sizeof(ItemPointerData));
- ItemPointerCopy(&lastleft->t_tid, pivotheaptid);
+ ItemPointerCopy(BTreeTupleGetMaxHeapTID(lastleft), pivotheaptid);
/*
* Lehman and Yao require that the downlink to the right page, which is to
@@ -2209,9 +2289,12 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
* tiebreaker.
*/
#ifndef DEBUG_NO_TRUNCATE
- Assert(ItemPointerCompare(&lastleft->t_tid, &firstright->t_tid) < 0);
- Assert(ItemPointerCompare(pivotheaptid, &lastleft->t_tid) >= 0);
- Assert(ItemPointerCompare(pivotheaptid, &firstright->t_tid) < 0);
+ Assert(ItemPointerCompare(BTreeTupleGetMaxHeapTID(lastleft),
+ BTreeTupleGetHeapTID(firstright)) < 0);
+ Assert(ItemPointerCompare(pivotheaptid,
+ BTreeTupleGetHeapTID(lastleft)) >= 0);
+ Assert(ItemPointerCompare(pivotheaptid,
+ BTreeTupleGetHeapTID(firstright)) < 0);
#else
/*
@@ -2224,7 +2307,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
* attribute values along with lastleft's heap TID value when lastleft's
* TID happens to be greater than firstright's TID.
*/
- ItemPointerCopy(&firstright->t_tid, pivotheaptid);
+ ItemPointerCopy(BTreeTupleGetHeapTID(firstright), pivotheaptid);
/*
* Pivot heap TID should never be fully equal to firstright. Note that
@@ -2233,7 +2316,8 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
*/
ItemPointerSetOffsetNumber(pivotheaptid,
OffsetNumberPrev(ItemPointerGetOffsetNumber(pivotheaptid)));
- Assert(ItemPointerCompare(pivotheaptid, &firstright->t_tid) < 0);
+ Assert(ItemPointerCompare(pivotheaptid,
+ BTreeTupleGetHeapTID(firstright)) < 0);
#endif
BTreeTupleSetNAtts(pivot, nkeyatts);
@@ -2314,13 +2398,16 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
* The approach taken here usually provides the same answer as _bt_keep_natts
* will (for the same pair of tuples from a heapkeyspace index), since the
* majority of btree opclasses can never indicate that two datums are equal
- * unless they're bitwise equal after detoasting.
+ * unless they're bitwise equal after detoasting. When an index is considered
+ * deduplication-safe by _bt_opclasses_support_dedup, this routine is
+ * guaranteed to give the same result as _bt_keep_natts would.
*
- * These issues must be acceptable to callers, typically because they're only
- * concerned about making suffix truncation as effective as possible without
- * leaving excessive amounts of free space on either side of page split.
- * Callers can rely on the fact that attributes considered equal here are
- * definitely also equal according to _bt_keep_natts.
+ * Suffix truncation callers can rely on the fact that attributes considered
+ * equal here are definitely also equal according to _bt_keep_natts, even when
+ * the index uses an opclass or collation that is not deduplication-safe.
+ * This weaker guarantee is good enough for these callers, since false
+ * negatives generally only have the effect of making leaf page splits use a
+ * more balanced split point.
*/
int
_bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
@@ -2398,22 +2485,30 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
tupnatts = BTreeTupleGetNAtts(itup, rel);
+ /* !heapkeyspace indexes do not support deduplication */
+ if (!heapkeyspace && BTreeTupleIsPosting(itup))
+ return false;
+
+ /* INCLUDE indexes do not support deduplication */
+ if (natts != nkeyatts && BTreeTupleIsPosting(itup))
+ return false;
+
if (P_ISLEAF(opaque))
{
if (offnum >= P_FIRSTDATAKEY(opaque))
{
/*
- * Non-pivot tuples currently never use alternative heap TID
- * representation -- even those within heapkeyspace indexes
+ * Non-pivot tuple should never be explicitly marked as a pivot
+ * tuple
*/
- if ((itup->t_info & INDEX_ALT_TID_MASK) != 0)
+ if (BTreeTupleIsPivot(itup))
return false;
/*
* Leaf tuples that are not the page high key (non-pivot tuples)
* should never be truncated. (Note that tupnatts must have been
- * inferred, rather than coming from an explicit on-disk
- * representation.)
+ * inferred, even with a posting list tuple, because only pivot
+ * tuples store tupnatts directly.)
*/
return tupnatts == natts;
}
@@ -2457,12 +2552,12 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
* non-zero, or when there is no explicit representation and the
* tuple is evidently not a pre-pg_upgrade tuple.
*
- * Prior to v11, downlinks always had P_HIKEY as their offset. Use
- * that to decide if the tuple is a pre-v11 tuple.
+ * Prior to v11, downlinks always had P_HIKEY as their offset.
+ * Accept that as an alternative indication of a valid
+ * !heapkeyspace negative infinity tuple.
*/
return tupnatts == 0 ||
- ((itup->t_info & INDEX_ALT_TID_MASK) == 0 &&
- ItemPointerGetOffsetNumber(&(itup->t_tid)) == P_HIKEY);
+ ItemPointerGetOffsetNumber(&(itup->t_tid)) == P_HIKEY;
}
else
{
@@ -2488,7 +2583,11 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
* heapkeyspace index pivot tuples, regardless of whether or not there are
* non-key attributes.
*/
- if ((itup->t_info & INDEX_ALT_TID_MASK) == 0)
+ if (!BTreeTupleIsPivot(itup))
+ return false;
+
+ /* Pivot tuple should not use posting list representation (redundant) */
+ if (BTreeTupleIsPosting(itup))
return false;
/*
@@ -2558,11 +2657,54 @@ _bt_check_third_page(Relation rel, Relation heap, bool needheaptidspace,
BTMaxItemSizeNoHeapTid(page),
RelationGetRelationName(rel)),
errdetail("Index row references tuple (%u,%u) in relation \"%s\".",
- ItemPointerGetBlockNumber(&newtup->t_tid),
- ItemPointerGetOffsetNumber(&newtup->t_tid),
+ ItemPointerGetBlockNumber(BTreeTupleGetHeapTID(newtup)),
+ ItemPointerGetOffsetNumber(BTreeTupleGetHeapTID(newtup)),
RelationGetRelationName(heap)),
errhint("Values larger than 1/3 of a buffer page cannot be indexed.\n"
"Consider a function index of an MD5 hash of the value, "
"or use full text indexing."),
errtableconstraint(heap, RelationGetRelationName(rel))));
}
+
+/*
+ * Is it safe to perform deduplication for an index, given the opclasses and
+ * collations used?
+ *
+ * Returned value is stored in index metapage during index builds. Function
+ * does not account for incompatibilities caused by index being on an earlier
+ * nbtree version.
+ */
+bool
+_bt_opclasses_support_dedup(Relation index)
+{
+ /* INCLUDE indexes don't support deduplication */
+ if (IndexRelationGetNumberOfAttributes(index) !=
+ IndexRelationGetNumberOfKeyAttributes(index))
+ return false;
+
+ /*
+ * There is no reason why deduplication cannot be used with system catalog
+ * indexes. However, we deem it generally unsafe because it's not clear
+ * how it could be disabled. (ALTER INDEX is not supported with system
+ * catalog indexes, so users have no way to set the "deduplication" storage
+ * parameter.)
+ */
+ if (IsCatalogRelation(index))
+ return false;
+
+ for (int i = 0; i < IndexRelationGetNumberOfKeyAttributes(index); i++)
+ {
+ Oid opfamily = index->rd_opfamily[i];
+ Oid collation = index->rd_indcollation[i];
+
+ /* TODO add adequate check of opclasses and collations */
+ elog(DEBUG4, "index %s column %d opfamily OID %u collation OID %u",
+ RelationGetRelationName(index), i, opfamily, collation);
+
+ /* NUMERIC btree opfamily OID is 1988 */
+ if (opfamily == 1988)
+ return false;
+ }
+
+ return true;
+}
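
The posting list rule added to _bt_killitems in this file's diff (mark the item LP_DEAD only when every heap TID in the posting list was killed) can be sketched in isolation as follows. This is a simplified model, not the patch's code: Tid stands in for ItemPointerData, and posting_all_killed is a hypothetical name. It assumes, as the patch's read-ahead loop does, that the killed items appear in the same order as the TIDs in the posting list.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for ItemPointerData: heap block number plus offset number */
typedef struct
{
    uint32_t block;
    uint16_t offset;
} Tid;

static bool
tid_equal(Tid a, Tid b)
{
    return a.block == b.block && a.offset == b.offset;
}

/*
 * Hypothetical helper: return true only when every TID in posting[] matches
 * a consecutive run of killed[] beginning at index start.  A single missing
 * TID means the whole posting list tuple has to stay alive.
 */
static bool
posting_all_killed(const Tid *posting, int nposting,
                   const Tid *killed, int nkilled, int start)
{
    int k = start;

    assert(nposting > 0 && start >= 0);

    for (int j = 0; j < nposting; j++)
    {
        /* Ran out of killed items, or this TID was not killed */
        if (k >= nkilled || !tid_equal(posting[j], killed[k]))
            return false;
        k++;
    }

    return true;
}
```

The all-or-nothing outcome follows from LP_DEAD being a property of the line pointer: there is one bit per index tuple, so a partly dead posting list cannot be expressed this way and is left for VACUUM.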
diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index 72a601bb22..191ab63a9b 100644
--- a/src/backend/access/nbtree/nbtxlog.c
+++ b/src/backend/access/nbtree/nbtxlog.c
@@ -22,6 +22,9 @@
#include "access/xlogutils.h"
#include "miscadmin.h"
#include "storage/procarray.h"
+#include "utils/memutils.h"
+
+static MemoryContext opCtx; /* working memory for operations */
/*
* _bt_restore_page -- re-enter all the index tuples on a page
@@ -111,6 +114,7 @@ _bt_restore_meta(XLogReaderState *record, uint8 block_id)
Assert(md->btm_version >= BTREE_NOVAC_VERSION);
md->btm_oldest_btpo_xact = xlrec->oldest_btpo_xact;
md->btm_last_cleanup_num_heap_tuples = xlrec->last_cleanup_num_heap_tuples;
+ md->btm_safededup = xlrec->btm_safededup;
pageop = (BTPageOpaque) PageGetSpecialPointer(metapg);
pageop->btpo_flags = BTP_META;
@@ -156,7 +160,8 @@ _bt_clear_incomplete_split(XLogReaderState *record, uint8 block_id)
}
static void
-btree_xlog_insert(bool isleaf, bool ismeta, XLogReaderState *record)
+btree_xlog_insert(bool isleaf, bool ismeta, bool posting,
+ XLogReaderState *record)
{
XLogRecPtr lsn = record->EndRecPtr;
xl_btree_insert *xlrec = (xl_btree_insert *) XLogRecGetData(record);
@@ -181,9 +186,52 @@ btree_xlog_insert(bool isleaf, bool ismeta, XLogReaderState *record)
page = BufferGetPage(buffer);
- if (PageAddItem(page, (Item) datapos, datalen, xlrec->offnum,
- false, false) == InvalidOffsetNumber)
- elog(PANIC, "btree_xlog_insert: failed to add item");
+ if (likely(!posting))
+ {
+ /* Simple retail insertion */
+ if (PageAddItem(page, (Item) datapos, datalen, xlrec->offnum,
+ false, false) == InvalidOffsetNumber)
+ elog(PANIC, "btree_xlog_insert: failed to add item");
+ }
+ else
+ {
+ ItemId itemid;
+ IndexTuple oposting,
+ newitem,
+ nposting;
+ uint16 postingoff;
+
+ /*
+ * A posting list split occurred during leaf page insertion. WAL
+ * record data will start with an offset number representing the
+ * point in an existing posting list that a split occurs at.
+ *
+ * Use _bt_swap_posting() to repeat posting list split steps from
+ * primary. Note that newitem from WAL record is 'orignewitem',
+ * not the final version of newitem that is actually inserted on
+ * page.
+ */
+ postingoff = *((uint16 *) datapos);
+ datapos += sizeof(uint16);
+ datalen -= sizeof(uint16);
+
+ itemid = PageGetItemId(page, OffsetNumberPrev(xlrec->offnum));
+ oposting = (IndexTuple) PageGetItem(page, itemid);
+
+ /* newitem must be mutable copy for _bt_swap_posting() */
+ Assert(isleaf && postingoff > 0);
+ newitem = CopyIndexTuple((IndexTuple) datapos);
+ nposting = _bt_swap_posting(newitem, oposting, postingoff);
+
+ /* Replace existing posting list with post-split version */
+ memcpy(oposting, nposting, MAXALIGN(IndexTupleSize(nposting)));
+
+ /* insert "final" new item (not orignewitem from WAL stream) */
+ Assert(IndexTupleSize(newitem) == datalen);
+ if (PageAddItem(page, (Item) newitem, datalen, xlrec->offnum,
+ false, false) == InvalidOffsetNumber)
+ elog(PANIC, "btree_xlog_insert: failed to add posting split new item");
+ }
PageSetLSN(page, lsn);
MarkBufferDirty(buffer);
@@ -265,20 +313,38 @@ btree_xlog_split(bool onleft, XLogReaderState *record)
BTPageOpaque lopaque = (BTPageOpaque) PageGetSpecialPointer(lpage);
OffsetNumber off;
IndexTuple newitem = NULL,
- left_hikey = NULL;
+ left_hikey = NULL,
+ nposting = NULL;
Size newitemsz = 0,
left_hikeysz = 0;
Page newlpage;
- OffsetNumber leftoff;
+ OffsetNumber leftoff,
+ replacepostingoff = InvalidOffsetNumber;
datapos = XLogRecGetBlockData(record, 0, &datalen);
- if (onleft)
+ if (onleft || xlrec->postingoff != 0)
{
newitem = (IndexTuple) datapos;
newitemsz = MAXALIGN(IndexTupleSize(newitem));
datapos += newitemsz;
datalen -= newitemsz;
+
+ if (xlrec->postingoff != 0)
+ {
+ ItemId itemid;
+ IndexTuple oposting;
+
+ /* Posting list must be at offset number before new item's */
+ replacepostingoff = OffsetNumberPrev(xlrec->newitemoff);
+
+ /* newitem must be mutable copy for _bt_swap_posting() */
+ newitem = CopyIndexTuple(newitem);
+ itemid = PageGetItemId(lpage, replacepostingoff);
+ oposting = (IndexTuple) PageGetItem(lpage, itemid);
+ nposting = _bt_swap_posting(newitem, oposting,
+ xlrec->postingoff);
+ }
}
/* Extract left hikey and its size (assuming 16-bit alignment) */
@@ -304,8 +370,20 @@ btree_xlog_split(bool onleft, XLogReaderState *record)
Size itemsz;
IndexTuple item;
+ /* Add replacement posting list when required */
+ if (off == replacepostingoff)
+ {
+ Assert(onleft || xlrec->firstright == xlrec->newitemoff);
+ if (PageAddItem(newlpage, (Item) nposting,
+ MAXALIGN(IndexTupleSize(nposting)), leftoff,
+ false, false) == InvalidOffsetNumber)
+ elog(ERROR, "failed to add new posting list item to left page after split");
+ leftoff = OffsetNumberNext(leftoff);
+ continue;
+ }
+
/* add the new item if it was inserted on left page */
- if (onleft && off == xlrec->newitemoff)
+ else if (onleft && off == xlrec->newitemoff)
{
if (PageAddItem(newlpage, (Item) newitem, newitemsz, leftoff,
false, false) == InvalidOffsetNumber)
@@ -379,6 +457,84 @@ btree_xlog_split(bool onleft, XLogReaderState *record)
}
}
+static void
+btree_xlog_dedup(XLogReaderState *record)
+{
+ XLogRecPtr lsn = record->EndRecPtr;
+ Buffer buf;
+ xl_btree_dedup *xlrec = (xl_btree_dedup *) XLogRecGetData(record);
+
+ if (XLogReadBufferForRedo(record, 0, &buf) == BLK_NEEDS_REDO)
+ {
+ /*
+ * Replay deduplication of a single interval: merge the tuples at
+ * offsets [baseoff, baseoff + nitems) into one posting list tuple,
+ * visiting them in item number order.
+ */
+ Page page = (Page) BufferGetPage(buf);
+ OffsetNumber offnum;
+ BTDedupState state;
+
+ state = (BTDedupState) palloc(sizeof(BTDedupStateData));
+
+ state->maxitemsize = BTMaxItemSize(page);
+ state->checkingunique = false; /* unused */
+ state->skippedbase = InvalidOffsetNumber;
+ state->newitem = NULL;
+ /* Metadata about current pending posting list */
+ state->htids = NULL;
+ state->nhtids = 0;
+ state->nitems = 0;
+ state->alltupsize = 0;
+ state->overlap = false;
+ /* Metadata about base tuple of current pending posting list */
+ state->base = NULL;
+ state->baseoff = InvalidOffsetNumber;
+ state->basetupsize = 0;
+
+ /* Conservatively size array */
+ state->htids = palloc(state->maxitemsize);
+
+ /*
+ * Iterate over tuples on the page belonging to the interval to
+ * deduplicate them into a posting list.
+ */
+ for (offnum = xlrec->baseoff;
+ offnum < xlrec->baseoff + xlrec->nitems;
+ offnum = OffsetNumberNext(offnum))
+ {
+ ItemId itemid = PageGetItemId(page, offnum);
+ IndexTuple itup = (IndexTuple) PageGetItem(page, itemid);
+
+ Assert(!ItemIdIsDead(itemid));
+
+ if (offnum == xlrec->baseoff)
+ {
+ /*
+ * No previous/base tuple for first data item -- use first
+ * data item as base tuple of first pending posting list
+ */
+ _bt_dedup_start_pending(state, itup, offnum);
+ }
+ else
+ {
+ /* Heap TID(s) for itup will be saved in state */
+ if (!_bt_dedup_save_htid(state, itup))
+ elog(ERROR, "could not add heap tid to pending posting list");
+ }
+ }
+
+ Assert(state->nitems == xlrec->nitems);
+ /* Handle the last item */
+ _bt_dedup_finish_pending(buf, state, false);
+
+ PageSetLSN(page, lsn);
+ MarkBufferDirty(buf);
+ }
+
+ if (BufferIsValid(buf))
+ UnlockReleaseBuffer(buf);
+}
+
static void
btree_xlog_vacuum(XLogReaderState *record)
{
@@ -395,7 +551,38 @@ btree_xlog_vacuum(XLogReaderState *record)
page = (Page) BufferGetPage(buffer);
- PageIndexMultiDelete(page, (OffsetNumber *) ptr, xlrec->ndeleted);
+ /*
+ * Must update posting list tuples before deleting whole items, since
+ * offset numbers are based on original page contents
+ */
+ if (xlrec->nupdated > 0)
+ {
+ OffsetNumber *updatedoffsets;
+ IndexTuple updated;
+ Size itemsz;
+
+ updatedoffsets = (OffsetNumber *)
+ (ptr + xlrec->ndeleted * sizeof(OffsetNumber));
+ updated = (IndexTuple) ((char *) updatedoffsets +
+ xlrec->nupdated * sizeof(OffsetNumber));
+
+ /* Handle posting tuples */
+ for (int i = 0; i < xlrec->nupdated; i++)
+ {
+ PageIndexTupleDelete(page, updatedoffsets[i]);
+
+ itemsz = MAXALIGN(IndexTupleSize(updated));
+
+ if (PageAddItem(page, (Item) updated, itemsz, updatedoffsets[i],
+ false, false) == InvalidOffsetNumber)
+ elog(PANIC, "failed to add updated posting list item");
+
+ updated = (IndexTuple) ((char *) updated + itemsz);
+ }
+ }
+
+ if (xlrec->ndeleted)
+ PageIndexMultiDelete(page, (OffsetNumber *) ptr, xlrec->ndeleted);
/*
* Mark the page as not containing any LP_DEAD items --- see comments
@@ -729,17 +916,22 @@ void
btree_redo(XLogReaderState *record)
{
uint8 info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+ MemoryContext oldCtx;
+ oldCtx = MemoryContextSwitchTo(opCtx);
switch (info)
{
case XLOG_BTREE_INSERT_LEAF:
- btree_xlog_insert(true, false, record);
+ btree_xlog_insert(true, false, false, record);
break;
case XLOG_BTREE_INSERT_UPPER:
- btree_xlog_insert(false, false, record);
+ btree_xlog_insert(false, false, false, record);
break;
case XLOG_BTREE_INSERT_META:
- btree_xlog_insert(false, true, record);
+ btree_xlog_insert(false, true, false, record);
+ break;
+ case XLOG_BTREE_INSERT_POST:
+ btree_xlog_insert(true, false, true, record);
break;
case XLOG_BTREE_SPLIT_L:
btree_xlog_split(true, record);
@@ -747,6 +939,9 @@ btree_redo(XLogReaderState *record)
case XLOG_BTREE_SPLIT_R:
btree_xlog_split(false, record);
break;
+ case XLOG_BTREE_DEDUP_PAGE:
+ btree_xlog_dedup(record);
+ break;
case XLOG_BTREE_VACUUM:
btree_xlog_vacuum(record);
break;
@@ -772,6 +967,23 @@ btree_redo(XLogReaderState *record)
default:
elog(PANIC, "btree_redo: unknown op code %u", info);
}
+ MemoryContextSwitchTo(oldCtx);
+ MemoryContextReset(opCtx);
+}
+
+void
+btree_xlog_startup(void)
+{
+ opCtx = AllocSetContextCreate(CurrentMemoryContext,
+ "Btree recovery temporary context",
+ ALLOCSET_DEFAULT_SIZES);
+}
+
+void
+btree_xlog_cleanup(void)
+{
+ MemoryContextDelete(opCtx);
+ opCtx = NULL;
}
/*
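
The redo code above relies on _bt_swap_posting() to repeat a posting list split, and its body is not part of this section. The sketch below is therefore an assumption about its effect, inferred from the comments in btree_xlog_insert (the posting list is replaced in place at the same size, and the tuple finally inserted differs from orignewitem). Tid and swap_posting are hypothetical names working on a bare TID array rather than on index tuples.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Stand-in for ItemPointerData */
typedef struct
{
    uint32_t block;
    uint16_t offset;
} Tid;

/*
 * Hypothetical model of a posting list split.  tids[0..ntids-1] is sorted,
 * and *newtid belongs at position postingoff.  The array keeps its length:
 * entries from postingoff onward shift right by one, the incoming TID fills
 * the gap, and the old maximum TID is handed back through *newtid so that it
 * can be inserted as a separate tuple right after the posting list.
 */
static void
swap_posting(Tid *tids, int ntids, int postingoff, Tid *newtid)
{
    Tid oldmax;

    /* Offset 0 would mean the new TID sorts before the whole list */
    assert(postingoff > 0 && postingoff < ntids);

    oldmax = tids[ntids - 1];
    memmove(&tids[postingoff + 1], &tids[postingoff],
            (size_t) (ntids - postingoff - 1) * sizeof(Tid));
    tids[postingoff] = *newtid;
    *newtid = oldmax;
}
```

Under this model the WAL record only needs the original new item and postingoff. Replay can derive both the rewritten posting list and the final new item from the page contents, which is what the XLOG_BTREE_INSERT_POST path appears to do.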
diff --git a/src/backend/access/rmgrdesc/nbtdesc.c b/src/backend/access/rmgrdesc/nbtdesc.c
index 497f8dc77e..23e951aa9e 100644
--- a/src/backend/access/rmgrdesc/nbtdesc.c
+++ b/src/backend/access/rmgrdesc/nbtdesc.c
@@ -27,6 +27,7 @@ btree_desc(StringInfo buf, XLogReaderState *record)
case XLOG_BTREE_INSERT_LEAF:
case XLOG_BTREE_INSERT_UPPER:
case XLOG_BTREE_INSERT_META:
+ case XLOG_BTREE_INSERT_POST:
{
xl_btree_insert *xlrec = (xl_btree_insert *) rec;
@@ -38,15 +39,27 @@ btree_desc(StringInfo buf, XLogReaderState *record)
{
xl_btree_split *xlrec = (xl_btree_split *) rec;
- appendStringInfo(buf, "level %u, firstright %d, newitemoff %d",
- xlrec->level, xlrec->firstright, xlrec->newitemoff);
+ appendStringInfo(buf, "level %u, firstright %d, newitemoff %d, postingoff %d",
+ xlrec->level,
+ xlrec->firstright,
+ xlrec->newitemoff,
+ xlrec->postingoff);
+ break;
+ }
+ case XLOG_BTREE_DEDUP_PAGE:
+ {
+ xl_btree_dedup *xlrec = (xl_btree_dedup *) rec;
+
+ appendStringInfo(buf, "baseoff %u; nitems %u",
+ xlrec->baseoff, xlrec->nitems);
break;
}
case XLOG_BTREE_VACUUM:
{
xl_btree_vacuum *xlrec = (xl_btree_vacuum *) rec;
- appendStringInfo(buf, "ndeleted %u", xlrec->ndeleted);
+ appendStringInfo(buf, "ndeleted %u; nupdated %u",
+ xlrec->ndeleted, xlrec->nupdated);
break;
}
case XLOG_BTREE_DELETE:
@@ -130,6 +143,12 @@ btree_identify(uint8 info)
case XLOG_BTREE_SPLIT_R:
id = "SPLIT_R";
break;
+ case XLOG_BTREE_DEDUP_PAGE:
+ id = "DEDUPLICATE";
+ break;
+ case XLOG_BTREE_INSERT_POST:
+ id = "INSERT_POST";
+ break;
case XLOG_BTREE_VACUUM:
id = "VACUUM";
break;
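
The VACUUM record now reports both ndeleted and nupdated, and btree_xlog_vacuum walks its payload as three consecutive regions: deleted offsets, updated offsets, then the updated posting list tuples. A minimal sketch of that layout, using hypothetical names (VacuumPayloadLayout, vacuum_payload_layout) and assuming only that OffsetNumber is a 16-bit integer as in PostgreSQL:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint16_t OffsetNumber;  /* 16 bits wide, as in PostgreSQL */

/* Byte positions of the regions that follow the deleted-offsets array */
typedef struct
{
    size_t updatedoffsets_start;    /* first updated offset number */
    size_t updatedtuples_start;     /* first updated posting list tuple */
} VacuumPayloadLayout;

/*
 * Hypothetical helper: compute where each region begins, relative to the
 * start of the payload (the "ptr" variable in btree_xlog_vacuum).
 */
static VacuumPayloadLayout
vacuum_payload_layout(uint16_t ndeleted, uint16_t nupdated)
{
    VacuumPayloadLayout layout;

    layout.updatedoffsets_start = (size_t) ndeleted * sizeof(OffsetNumber);
    layout.updatedtuples_start = layout.updatedoffsets_start +
        (size_t) nupdated * sizeof(OffsetNumber);

    return layout;
}
```

The updated tuples themselves are variable length, so replay steps through them by MAXALIGN(IndexTupleSize(...)) rather than by a fixed stride.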
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index ba4edde71a..6b5d36de57 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -28,6 +28,7 @@
#include "access/commit_ts.h"
#include "access/gin.h"
+#include "access/nbtree.h"
#include "access/rmgr.h"
#include "access/tableam.h"
#include "access/transam.h"
@@ -363,6 +364,23 @@ static const struct config_enum_entry backslash_quote_options[] = {
{NULL, 0, false}
};
+/*
+ * Although only "on", "off", and "nonunique" are documented, we accept all
+ * the likely variants of "on" and "off".
+ */
+static const struct config_enum_entry btree_deduplication_options[] = {
+ {"off", DEDUP_OFF, false},
+ {"on", DEDUP_ON, false},
+ {"nonunique", DEDUP_NONUNIQUE, false},
+ {"false", DEDUP_OFF, true},
+ {"true", DEDUP_ON, true},
+ {"no", DEDUP_OFF, true},
+ {"yes", DEDUP_ON, true},
+ {"0", DEDUP_OFF, true},
+ {"1", DEDUP_ON, true},
+ {NULL, 0, false}
+};
+
/*
* Although only "on", "off", and "partition" are documented, we
* accept all the likely variants of "on" and "off".
@@ -4271,6 +4289,16 @@ static struct config_enum ConfigureNamesEnum[] =
NULL, NULL, NULL
},
+ {
+ {"btree_deduplication", PGC_USERSET, CLIENT_CONN_STATEMENT,
+ gettext_noop("Enables B-tree index deduplication optimization."),
+ NULL
+ },
+ &btree_deduplication,
+ DEDUP_NONUNIQUE, btree_deduplication_options,
+ NULL, NULL, NULL
+ },
+
{
{"bytea_output", PGC_USERSET, CLIENT_CONN_STATEMENT,
gettext_noop("Sets the output format for bytea."),
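
For readers unfamiliar with enum GUCs, the btree_deduplication_options table added in guc.c maps each accepted spelling to one of three modes, with the undocumented synonyms flagged as hidden. The standalone sketch below mimics that lookup. EnumEntry, DedupMode and lookup_dedup_option are hypothetical stand-ins, the numeric values of the DEDUP_* constants are assumed, and matching is exact here whereas the real GUC machinery compares case-insensitively.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Assumed ordering; the real values live in a header not shown here */
typedef enum
{
    DEDUP_OFF,
    DEDUP_ON,
    DEDUP_NONUNIQUE
} DedupMode;

/* Simplified stand-in for struct config_enum_entry */
typedef struct
{
    const char *name;
    int         val;
    int         hidden;     /* accepted, but not listed in documentation */
} EnumEntry;

static const EnumEntry dedup_options[] = {
    {"off", DEDUP_OFF, 0},
    {"on", DEDUP_ON, 0},
    {"nonunique", DEDUP_NONUNIQUE, 0},
    {"false", DEDUP_OFF, 1},
    {"true", DEDUP_ON, 1},
    {"no", DEDUP_OFF, 1},
    {"yes", DEDUP_ON, 1},
    {"0", DEDUP_OFF, 1},
    {"1", DEDUP_ON, 1},
    {NULL, 0, 0}
};

/* Return 1 and set *val when value is an accepted spelling, else 0 */
static int
lookup_dedup_option(const char *value, int *val)
{
    assert(value != NULL && val != NULL);

    for (const EnumEntry *e = dedup_options; e->name != NULL; e++)
    {
        if (strcmp(e->name, value) == 0)
        {
            *val = e->val;
            return 1;
        }
    }

    return 0;
}
```

Note that "nonunique" has no boolean synonym, so it is the only mode that has to be spelled out in full.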
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 46a06ffacd..0b8aa56b3a 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -650,6 +650,7 @@
#vacuum_cleanup_index_scale_factor = 0.1 # fraction of total number of tuples
# before index cleanup, 0 always performs
# index cleanup
+#btree_deduplication = 'nonunique' # off, on, or nonunique
#bytea_output = 'hex' # hex, escape
#xmlbinary = 'base64'
#xmloption = 'content'
diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c
index df26826993..7e55c0ff90 100644
--- a/src/bin/psql/tab-complete.c
+++ b/src/bin/psql/tab-complete.c
@@ -1677,14 +1677,14 @@ psql_completion(const char *text, int start, int end)
/* ALTER INDEX <foo> SET|RESET ( */
else if (Matches("ALTER", "INDEX", MatchAny, "RESET", "("))
COMPLETE_WITH("fillfactor",
- "vacuum_cleanup_index_scale_factor", /* BTREE */
+ "vacuum_cleanup_index_scale_factor", "deduplication", /* BTREE */
"fastupdate", "gin_pending_list_limit", /* GIN */
"buffering", /* GiST */
"pages_per_range", "autosummarize" /* BRIN */
);
else if (Matches("ALTER", "INDEX", MatchAny, "SET", "("))
COMPLETE_WITH("fillfactor =",
- "vacuum_cleanup_index_scale_factor =", /* BTREE */
+ "vacuum_cleanup_index_scale_factor =", "deduplication =", /* BTREE */
"fastupdate =", "gin_pending_list_limit =", /* GIN */
"buffering =", /* GiST */
"pages_per_range =", "autosummarize =" /* BRIN */
diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index 3542545de5..8b1223a817 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -145,6 +145,7 @@ static void bt_tuple_present_callback(Relation index, ItemPointer tid,
bool tupleIsAlive, void *checkstate);
static IndexTuple bt_normalize_tuple(BtreeCheckState *state,
IndexTuple itup);
+static inline IndexTuple bt_posting_logical_tuple(IndexTuple itup, int n);
static bool bt_rootdescend(BtreeCheckState *state, IndexTuple itup);
static inline bool offset_is_negative_infinity(BTPageOpaque opaque,
OffsetNumber offset);
@@ -419,12 +420,13 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
/*
* Size Bloom filter based on estimated number of tuples in index,
* while conservatively assuming that each block must contain at least
- * MaxIndexTuplesPerPage / 5 non-pivot tuples. (Non-leaf pages cannot
- * contain non-pivot tuples. That's okay because they generally make
- * up no more than about 1% of all pages in the index.)
+ * MaxBTreeIndexTuplesPerPage / 3 "logical" tuples. heapallindexed
+ * verification fingerprints posting list heap TIDs as plain non-pivot
+ * tuples, complete with index keys. This allows its heap scan to
+ * behave as if posting lists do not exist.
*/
total_pages = RelationGetNumberOfBlocks(rel);
- total_elems = Max(total_pages * (MaxIndexTuplesPerPage / 5),
+ total_elems = Max(total_pages * (MaxBTreeIndexTuplesPerPage / 3),
(int64) state->rel->rd_rel->reltuples);
/* Random seed relies on backend srandom() call to avoid repetition */
seed = random();
@@ -924,6 +926,7 @@ bt_target_page_check(BtreeCheckState *state)
size_t tupsize;
BTScanInsert skey;
bool lowersizelimit;
+ ItemPointer scantid;
CHECK_FOR_INTERRUPTS();
@@ -994,29 +997,72 @@ bt_target_page_check(BtreeCheckState *state)
/*
* Readonly callers may optionally verify that non-pivot tuples can
- * each be found by an independent search that starts from the root
+ * each be found by an independent search that starts from the root.
+ * Note that we deliberately don't do individual searches for each
+ * "logical" posting list tuple, since the posting list itself is
+ * validated by other checks.
*/
if (state->rootdescend && P_ISLEAF(topaque) &&
!bt_rootdescend(state, itup))
{
+ ItemPointer tid = BTreeTupleGetHeapTID(itup);
char *itid,
*htid;
itid = psprintf("(%u,%u)", state->targetblock, offset);
- htid = psprintf("(%u,%u)",
- ItemPointerGetBlockNumber(&(itup->t_tid)),
- ItemPointerGetOffsetNumber(&(itup->t_tid)));
+ htid = psprintf("(%u,%u)", ItemPointerGetBlockNumber(tid),
+ ItemPointerGetOffsetNumber(tid));
ereport(ERROR,
(errcode(ERRCODE_INDEX_CORRUPTED),
errmsg("could not find tuple using search from root page in index \"%s\"",
RelationGetRelationName(state->rel)),
- errdetail_internal("Index tid=%s points to heap tid=%s page lsn=%X/%X.",
+ errdetail_internal("Index tid=%s min heap tid=%s page lsn=%X/%X.",
itid, htid,
(uint32) (state->targetlsn >> 32),
(uint32) state->targetlsn)));
}
+ /*
+ * If tuple is actually a posting list, make sure posting list TIDs
+ * are in order.
+ */
+ if (BTreeTupleIsPosting(itup))
+ {
+ ItemPointerData last;
+ ItemPointer current;
+
+ ItemPointerCopy(BTreeTupleGetHeapTID(itup), &last);
+
+ for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
+ {
+
+ current = BTreeTupleGetPostingN(itup, i);
+
+ if (ItemPointerCompare(current, &last) <= 0)
+ {
+ char *itid,
+ *htid;
+
+ itid = psprintf("(%u,%u)", state->targetblock, offset);
+ htid = psprintf("(%u,%u)",
+ ItemPointerGetBlockNumberNoCheck(current),
+ ItemPointerGetOffsetNumberNoCheck(current));
+
+ ereport(ERROR,
+ (errcode(ERRCODE_INDEX_CORRUPTED),
+ errmsg("posting list heap TIDs out of order in index \"%s\"",
+ RelationGetRelationName(state->rel)),
+ errdetail_internal("Index tid=%s min heap tid=%s page lsn=%X/%X.",
+ itid, htid,
+ (uint32) (state->targetlsn >> 32),
+ (uint32) state->targetlsn)));
+ }
+
+ ItemPointerCopy(current, &last);
+ }
+ }
+
/* Build insertion scankey for current page offset */
skey = bt_mkscankey_pivotsearch(state->rel, itup);
@@ -1074,12 +1120,32 @@ bt_target_page_check(BtreeCheckState *state)
{
IndexTuple norm;
- norm = bt_normalize_tuple(state, itup);
- bloom_add_element(state->filter, (unsigned char *) norm,
- IndexTupleSize(norm));
- /* Be tidy */
- if (norm != itup)
- pfree(norm);
+ if (BTreeTupleIsPosting(itup))
+ {
+ /* Fingerprint all elements as distinct "logical" tuples */
+ for (int i = 0; i < BTreeTupleGetNPosting(itup); i++)
+ {
+ IndexTuple logtuple;
+
+ logtuple = bt_posting_logical_tuple(itup, i);
+ norm = bt_normalize_tuple(state, logtuple);
+ bloom_add_element(state->filter, (unsigned char *) norm,
+ IndexTupleSize(norm));
+ /* Be tidy */
+ if (norm != logtuple)
+ pfree(norm);
+ pfree(logtuple);
+ }
+ }
+ else
+ {
+ norm = bt_normalize_tuple(state, itup);
+ bloom_add_element(state->filter, (unsigned char *) norm,
+ IndexTupleSize(norm));
+ /* Be tidy */
+ if (norm != itup)
+ pfree(norm);
+ }
}
/*
@@ -1087,7 +1153,8 @@ bt_target_page_check(BtreeCheckState *state)
*
* If there is a high key (if this is not the rightmost page on its
* entire level), check that high key actually is upper bound on all
- * page items.
+ * page items. If this is a posting list tuple, we'll need to set
+ * scantid to be highest TID in posting list.
*
* We prefer to check all items against high key rather than checking
* just the last and trusting that the operator class obeys the
@@ -1127,6 +1194,9 @@ bt_target_page_check(BtreeCheckState *state)
* tuple. (See also: "Notes About Data Representation" in the nbtree
* README.)
*/
+ scantid = skey->scantid;
+ if (state->heapkeyspace && !BTreeTupleIsPivot(itup))
+ skey->scantid = BTreeTupleGetMaxHeapTID(itup);
if (!P_RIGHTMOST(topaque) &&
!(P_ISLEAF(topaque) ? invariant_leq_offset(state, skey, P_HIKEY) :
invariant_l_offset(state, skey, P_HIKEY)))
@@ -1150,6 +1220,7 @@ bt_target_page_check(BtreeCheckState *state)
(uint32) (state->targetlsn >> 32),
(uint32) state->targetlsn)));
}
+ skey->scantid = scantid;
/*
* * Item order check *
@@ -1160,15 +1231,17 @@ bt_target_page_check(BtreeCheckState *state)
if (OffsetNumberNext(offset) <= max &&
!invariant_l_offset(state, skey, OffsetNumberNext(offset)))
{
+ ItemPointer tid;
char *itid,
*htid,
*nitid,
*nhtid;
itid = psprintf("(%u,%u)", state->targetblock, offset);
+ tid = BTreeTupleGetHeapTID(itup);
htid = psprintf("(%u,%u)",
- ItemPointerGetBlockNumberNoCheck(&(itup->t_tid)),
- ItemPointerGetOffsetNumberNoCheck(&(itup->t_tid)));
+ ItemPointerGetBlockNumberNoCheck(tid),
+ ItemPointerGetOffsetNumberNoCheck(tid));
nitid = psprintf("(%u,%u)", state->targetblock,
OffsetNumberNext(offset));
@@ -1177,9 +1250,11 @@ bt_target_page_check(BtreeCheckState *state)
state->target,
OffsetNumberNext(offset));
itup = (IndexTuple) PageGetItem(state->target, itemid);
+
+ tid = BTreeTupleGetHeapTID(itup);
nhtid = psprintf("(%u,%u)",
- ItemPointerGetBlockNumberNoCheck(&(itup->t_tid)),
- ItemPointerGetOffsetNumberNoCheck(&(itup->t_tid)));
+ ItemPointerGetBlockNumberNoCheck(tid),
+ ItemPointerGetOffsetNumberNoCheck(tid));
ereport(ERROR,
(errcode(ERRCODE_INDEX_CORRUPTED),
@@ -1189,10 +1264,10 @@ bt_target_page_check(BtreeCheckState *state)
"higher index tid=%s (points to %s tid=%s) "
"page lsn=%X/%X.",
itid,
- P_ISLEAF(topaque) ? "heap" : "index",
+ P_ISLEAF(topaque) ? "min heap" : "index",
htid,
nitid,
- P_ISLEAF(topaque) ? "heap" : "index",
+ P_ISLEAF(topaque) ? "min heap" : "index",
nhtid,
(uint32) (state->targetlsn >> 32),
(uint32) state->targetlsn)));
@@ -1953,10 +2028,10 @@ bt_tuple_present_callback(Relation index, ItemPointer tid, Datum *values,
* verification. In particular, it won't try to normalize opclass-equal
* datums with potentially distinct representations (e.g., btree/numeric_ops
* index datums will not get their display scale normalized-away here).
- * Normalization may need to be expanded to handle more cases in the future,
- * though. For example, it's possible that non-pivot tuples could in the
- * future have alternative logically equivalent representations due to using
- * the INDEX_ALT_TID_MASK bit to implement intelligent deduplication.
+ * Caller does normalization for non-pivot tuples that have a posting list,
+ * since dummy CREATE INDEX callback code generates new tuples with the same
+ * normalized representation. Deduplication is performed opportunistically,
+ * and in general there is no guarantee about how or when it will be applied.
*/
static IndexTuple
bt_normalize_tuple(BtreeCheckState *state, IndexTuple itup)
@@ -1969,6 +2044,9 @@ bt_normalize_tuple(BtreeCheckState *state, IndexTuple itup)
IndexTuple reformed;
int i;
+ /* Caller should only pass "logical" non-pivot tuples here */
+ Assert(!BTreeTupleIsPosting(itup) && !BTreeTupleIsPivot(itup));
+
/* Easy case: It's immediately clear that tuple has no varlena datums */
if (!IndexTupleHasVarwidths(itup))
return itup;
@@ -2031,6 +2109,30 @@ bt_normalize_tuple(BtreeCheckState *state, IndexTuple itup)
return reformed;
}
+/*
+ * Produce palloc()'d "logical" tuple for nth posting list entry.
+ *
+ * In general, deduplication is not supposed to change the logical contents of
+ * an index. Multiple logical index tuples are folded together into one
+ * physical posting list index tuple when convenient.
+ *
+ * heapallindexed verification must normalize-away this variation in
+ * representation by converting posting list tuples into two or more "logical"
+ * tuples. Each logical tuple must be fingerprinted separately -- there must
+ * be one logical tuple for each corresponding Bloom filter probe during the
+ * heap scan.
+ *
+ * Note: Caller needs to call bt_normalize_tuple() with returned tuple.
+ */
+static inline IndexTuple
+bt_posting_logical_tuple(IndexTuple itup, int n)
+{
+ Assert(BTreeTupleIsPosting(itup));
+
+ /* Returns non-posting-list tuple */
+ return _bt_form_posting(itup, BTreeTupleGetPostingN(itup, n), 1);
+}
+
/*
* Search for itup in index, starting from fast root page. itup must be a
* non-pivot tuple. This is only supported with heapkeyspace indexes, since
@@ -2087,6 +2189,7 @@ bt_rootdescend(BtreeCheckState *state, IndexTuple itup)
insertstate.itup = itup;
insertstate.itemsz = MAXALIGN(IndexTupleSize(itup));
insertstate.itup_key = key;
+ insertstate.postingoff = 0;
insertstate.bounds_valid = false;
insertstate.buf = lbuf;
@@ -2094,7 +2197,9 @@ bt_rootdescend(BtreeCheckState *state, IndexTuple itup)
offnum = _bt_binsrch_insert(state->rel, &insertstate);
/* Compare first >= matching item on leaf page, if any */
page = BufferGetPage(lbuf);
+ /* Should match on first heap TID when tuple has a posting list */
if (offnum <= PageGetMaxOffsetNumber(page) &&
+ insertstate.postingoff <= 0 &&
_bt_compare(state->rel, key, page, offnum) == 0)
exists = true;
_bt_relbuf(state->rel, lbuf);
@@ -2548,26 +2653,25 @@ PageGetItemIdCareful(BtreeCheckState *state, BlockNumber block, Page page,
}
/*
- * BTreeTupleGetHeapTID() wrapper that lets caller enforce that a heap TID must
- * be present in cases where that is mandatory.
- *
- * This doesn't add much as of BTREE_VERSION 4, since the INDEX_ALT_TID_MASK
- * bit is effectively a proxy for whether or not the tuple is a pivot tuple.
- * It may become more useful in the future, when non-pivot tuples support their
- * own alternative INDEX_ALT_TID_MASK representation.
+ * BTreeTupleGetHeapTID() wrapper that enforces that a heap TID is present in
+ * cases where that is mandatory (i.e. for non-pivot tuples).
*/
static inline ItemPointer
BTreeTupleGetHeapTIDCareful(BtreeCheckState *state, IndexTuple itup,
bool nonpivot)
{
- ItemPointer result = BTreeTupleGetHeapTID(itup);
- BlockNumber targetblock = state->targetblock;
+ Assert(state->heapkeyspace);
- if (result == NULL && nonpivot)
+ /*
+ * Make sure that tuple type (pivot vs non-pivot) matches caller's
+ * expectation
+ */
+ if (BTreeTupleIsPivot(itup) == nonpivot)
ereport(ERROR,
(errcode(ERRCODE_INDEX_CORRUPTED),
errmsg("block %u or its right sibling block or child block in index \"%s\" contains non-pivot tuple that lacks a heap TID",
- targetblock, RelationGetRelationName(state->rel))));
+ state->targetblock,
+ RelationGetRelationName(state->rel))));
- return result;
+ return BTreeTupleGetHeapTID(itup);
}
diff --git a/doc/src/sgml/btree.sgml b/doc/src/sgml/btree.sgml
index 5881ea5dd6..13d9b2ff96 100644
--- a/doc/src/sgml/btree.sgml
+++ b/doc/src/sgml/btree.sgml
@@ -433,11 +433,130 @@ returns bool
<sect1 id="btree-implementation">
<title>Implementation</title>
+ <para>
+ Internally, a B-tree index consists of a tree structure with leaf
+ pages. Each leaf page contains tuples that point to table entries
+ using a heap item pointer. Each tuple's key is considered unique
+ internally, since the item pointer is treated as part of the key.
+ </para>
+ <para>
+ An introduction to the btree index implementation can be found in
+ <filename>src/backend/access/nbtree/README</filename>.
+ </para>
+
+ <sect2 id="btree-deduplication">
+ <title>Deduplication</title>
<para>
- An introduction to the btree index implementation can be found in
- <filename>src/backend/access/nbtree/README</filename>.
+ B-Tree supports <firstterm>deduplication</firstterm>. Existing
+ leaf page tuples with fully equal keys (equal prior to the heap
+ item pointer) are merged together into a single <quote>posting
+ list</quote> tuple. The keys appear only once in this
+ representation. A simple array of heap item pointers follows.
+ Posting lists are formed <quote>lazily</quote>, when a new item is
+ inserted that cannot fit on an existing leaf page. The immediate
+ goal of the deduplication process is to at least free enough space
+ to fit the new item; otherwise a leaf page split occurs, which
+ allocates a new leaf page. The <firstterm>key space</firstterm>
+ covered by the original leaf page is shared between the original page
+ and its new right sibling page.
+ </para>
+ <para>
+ Deduplication can greatly increase index space efficiency with data
+ sets where each distinct key appears at least a few times on
+ average. It can also reduce the cost of subsequent index scans,
+ especially when many leaf pages must be accessed. For example, an
+ index on a simple <type>integer</type> column that uses
+ deduplication will have a storage size that is only about 65% of an
+ equivalent unoptimized index when each distinct
+ <type>integer</type> value appears three times. If each distinct
+ <type>integer</type> value appears six times, the storage overhead
+ can be as low as 50% of baseline. With hundreds of duplicates per
+ distinct value (or with larger <quote>base</quote> key values) a
+ storage size of about <emphasis>one third</emphasis> of the
+ unoptimized case is expected. There is often a direct benefit for
+ queries, as well as an indirect benefit due to reduced I/O during
+ routine vacuuming.
+ </para>
+ <para>
+ Cases that don't benefit due to having no duplicate values will
+ incur a small performance penalty with mixed read-write workloads.
+ There is no performance penalty with read-only workloads, since
+ reading from posting lists is at least as efficient as reading the
+ standard index tuple representation.
+ </para>
+ </sect2>
+
+ <sect2 id="btree-deduplication-configure">
+ <title>Configuring Deduplication</title>
+
+ <para>
+ The <xref linkend="guc-btree-deduplication"/> configuration
+ parameter controls deduplication. By default, deduplication is
+ only used with non-unique indexes. The
+ <literal>deduplication</literal> storage parameter can be used to
+ override the configuration parameter for individual indexes. See
+ <xref linkend="sql-createindex-storage-parameters"/> from the
+ <command>CREATE INDEX</command> documentation for details.
+ </para>
+ </sect2>
+
+ <sect2 id="btree-deduplication-unique">
+ <title>Unique indexes and Deduplication</title>
+
+ <para>
+ Unique indexes can also use deduplication, despite the fact that
+ unique indexes do not <emphasis>logically</emphasis> contain
+ duplicates; implementation-level <emphasis>physical</emphasis>
+ duplicates may still be present. Unique indexes that are prone to
+ becoming bloated due to a short term burst in updates are good
+ candidates. <command>VACUUM</command> will eventually remove dead
+ versions of tuples from unique indexes, but it may not be possible
+ for it to do so before some number of <quote>unnecessary</quote>
+ page splits have taken place. Deduplication can prevent these page
+ splits from happening. Note that page splits can only be reversed
+ by <command>VACUUM</command> when the page is
+ <emphasis>completely</emphasis> empty, which isn't expected in this
+ scenario.
+ </para>
+ <para>
+ In other cases, deduplication can be effective with unique indexes
+ just because of the presence of many <literal>NULL</literal> values
+ in the unique index. The influence of <xref
+ linkend="guc-vacuum-cleanup-index-scale-factor"/> must also be
+ considered.
+ </para>
+ <para>
+ For more information about automatic and manual vacuuming, see
+ <xref linkend="routine-vacuuming"/>. Note that the heap-only tuple
+ (<acronym>HOT</acronym>) optimization can also prevent page splits
+ caused only by versioned tuples rather than by insertions of new
+ values.
</para>
+ </sect2>
+
+ <sect2 id="btree-deduplication-restrictions">
+ <title>Restrictions</title>
+
+ <para>
+ Deduplication can only be used with indexes that use B-Tree
+ operator classes that were declared <literal>BITWISE</literal>. In
+ practice almost all datatypes support deduplication, though
+ <type>numeric</type> is a notable exception (the <quote>display
+ scale</quote> feature makes it impossible to enable deduplication
+ without losing useful information about equal <type>numeric</type>
+ datums). Deduplication is not supported with nondeterministic
+ collations, nor is it supported with <literal>INCLUDE</literal>
+ indexes.
+ </para>
+ <para>
+ Note that a multicolumn index is only considered to have duplicates
+ when there are index entries that repeat entire
+ <emphasis>combinations</emphasis> of values (the values stored in
+ each and every column must be equal).
+ </para>
+
+ </sect2>
</sect1>
</chapter>
diff --git a/doc/src/sgml/charset.sgml b/doc/src/sgml/charset.sgml
index 55669b5cad..9f371d3e3a 100644
--- a/doc/src/sgml/charset.sgml
+++ b/doc/src/sgml/charset.sgml
@@ -928,10 +928,11 @@ CREATE COLLATION ignore_accents (provider = icu, locale = 'und-u-ks-level1-kc-tr
nondeterministic collations give a more <quote>correct</quote> behavior,
especially when considering the full power of Unicode and its many
special cases, they also have some drawbacks. Foremost, their use leads
- to a performance penalty. Also, certain operations are not possible with
- nondeterministic collations, such as pattern matching operations.
- Therefore, they should be used only in cases where they are specifically
- wanted.
+ to a performance penalty. Note, in particular, that B-tree cannot use
+ deduplication with indexes that use a nondeterministic collation. Also,
+ certain operations are not possible with nondeterministic collations,
+ such as pattern matching operations. Therefore, they should be used
+ only in cases where they are specifically wanted.
</para>
</sect3>
</sect2>
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index d4d1fe45cc..6f89e4a51f 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -8000,6 +8000,39 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
</listitem>
</varlistentry>
+ <varlistentry id="guc-btree-deduplication" xreflabel="btree_deduplication">
+ <term><varname>btree_deduplication</varname> (<type>enum</type>)
+ <indexterm>
+ <primary><varname>btree_deduplication</varname></primary>
+ <secondary>configuration parameter</secondary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Controls the use of deduplication within B-Tree indexes.
+ Deduplication is an optimization that reduces the storage size
+ of indexes by storing equal index keys only once. See <xref
+ linkend="btree-deduplication"/> for more information.
+ </para>
+
+ <para>
+ Valid values are <literal>off</literal>, which disables
+ deduplication, <literal>on</literal>, and
+ <literal>nonunique</literal>. When
+ <varname>btree_deduplication</varname> is set to
+ <literal>nonunique</literal>, the default, deduplication is
+ only used for non-unique B-Tree indexes.
+ </para>
+
+ <para>
+ This setting can be overridden for individual B-Tree indexes
+ by changing index storage parameters. See <xref
+ linkend="sql-createindex-storage-parameters"/> from the
+ <command>CREATE INDEX</command> documentation for details.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-bytea-output" xreflabel="bytea_output">
<term><varname>bytea_output</varname> (<type>enum</type>)
<indexterm>
diff --git a/doc/src/sgml/maintenance.sgml b/doc/src/sgml/maintenance.sgml
index ec8bdcd7a4..695aa9123d 100644
--- a/doc/src/sgml/maintenance.sgml
+++ b/doc/src/sgml/maintenance.sgml
@@ -887,6 +887,14 @@ analyze threshold = analyze base threshold + analyze scale factor * number of tu
might be worthwhile to reindex periodically just to improve access speed.
</para>
+ <tip>
+ <para>
+ Enabling B-tree deduplication in unique indexes can be an effective
+ way to control index bloat in extreme cases. See <xref
+ linkend="btree-deduplication-unique"/> for details.
+ </para>
+ </tip>
+
<para>
<xref linkend="sql-reindex"/> can be used safely and easily in all cases.
This command requires an <literal>ACCESS EXCLUSIVE</literal> lock by
diff --git a/doc/src/sgml/ref/create_index.sgml b/doc/src/sgml/ref/create_index.sgml
index 629a31ef79..abc7db4820 100644
--- a/doc/src/sgml/ref/create_index.sgml
+++ b/doc/src/sgml/ref/create_index.sgml
@@ -166,6 +166,8 @@ CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ [ IF NOT EXISTS ] <replaceable class=
maximum size allowed for the index type, data insertion will fail.
In any case, non-key columns duplicate data from the index's table
and bloat the size of the index, thus potentially slowing searches.
+ Moreover, B-tree deduplication is never used with indexes that
+ have a non-key column.
</para>
<para>
@@ -388,10 +390,39 @@ CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ [ IF NOT EXISTS ] <replaceable class=
</variablelist>
<para>
- B-tree indexes additionally accept this parameter:
+ B-tree indexes also accept these parameters:
</para>
<variablelist>
+ <varlistentry id="index-reloption-deduplication" xreflabel="deduplication">
+ <term><literal>deduplication</literal>
+ <indexterm>
+ <primary><varname>deduplication</varname></primary>
+ <secondary>storage parameter</secondary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Per-index value for <xref linkend="guc-btree-deduplication"/>.
+ Controls usage of the B-tree deduplication technique described
+ in <xref linkend="btree-deduplication"/>. Set to
+ <literal>ON</literal> or <literal>OFF</literal> to override the GUC.
+ (Alternative spellings of <literal>ON</literal> and
+ <literal>OFF</literal> are allowed as described in <xref
+ linkend="config-setting"/>.)
+ </para>
+
+ <note>
+ <para>
+ Turning <literal>deduplication</literal> off via <command>ALTER
+ INDEX</command> prevents future insertions from triggering
+ deduplication, but does not in itself make existing posting list
+ tuples use the standard tuple representation.
+ </para>
+ </note>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="index-reloption-vacuum-cleanup-index-scale-factor" xreflabel="vacuum_cleanup_index_scale_factor">
<term><literal>vacuum_cleanup_index_scale_factor</literal>
<indexterm>
@@ -446,9 +477,7 @@ CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ [ IF NOT EXISTS ] <replaceable class=
This setting controls usage of the fast update technique described in
<xref linkend="gin-fast-update"/>. It is a Boolean parameter:
<literal>ON</literal> enables fast update, <literal>OFF</literal> disables it.
- (Alternative spellings of <literal>ON</literal> and <literal>OFF</literal> are
- allowed as described in <xref linkend="config-setting"/>.) The
- default is <literal>ON</literal>.
+ The default is <literal>ON</literal>.
</para>
<note>
@@ -831,6 +860,13 @@ CREATE UNIQUE INDEX title_idx ON films (title) WITH (fillfactor = 70);
</programlisting>
</para>
+ <para>
+ To create a unique index with deduplication enabled:
+<programlisting>
+CREATE UNIQUE INDEX title_idx ON films (title) WITH (deduplication = on);
+</programlisting>
+ </para>
+
<para>
To create a <acronym>GIN</acronym> index with fast updates disabled:
<programlisting>
diff --git a/doc/src/sgml/ref/reindex.sgml b/doc/src/sgml/ref/reindex.sgml
index 10881ab03a..c9a5349019 100644
--- a/doc/src/sgml/ref/reindex.sgml
+++ b/doc/src/sgml/ref/reindex.sgml
@@ -58,8 +58,9 @@ REINDEX [ ( VERBOSE ) ] { INDEX | TABLE | SCHEMA | DATABASE | SYSTEM } [ CONCURR
<listitem>
<para>
- You have altered a storage parameter (such as fillfactor)
- for an index, and wish to ensure that the change has taken full effect.
+ You have altered a storage parameter (such as fillfactor or
+ deduplication) for an index, and wish to ensure that the change has
+ taken full effect.
</para>
</listitem>
diff --git a/src/test/regress/expected/btree_index.out b/src/test/regress/expected/btree_index.out
index f567117a46..53bcd1f30a 100644
--- a/src/test/regress/expected/btree_index.out
+++ b/src/test/regress/expected/btree_index.out
@@ -266,6 +266,22 @@ select * from btree_bpchar where f1::bpchar like 'foo%';
fool
(2 rows)
+--
+-- Test deduplication within a unique index
+--
+CREATE TABLE dedup_unique_test_table (a int) WITH (autovacuum_enabled=false);
+CREATE UNIQUE INDEX dedup_unique ON dedup_unique_test_table (a) WITH (deduplication=on);
+CREATE UNIQUE INDEX plain_unique ON dedup_unique_test_table (a) WITH (deduplication=off);
+-- Generate enough garbage tuples in index to ensure that even the unique index
+-- with deduplication enabled has to check multiple leaf pages during unique
+-- checking (at least with a BLCKSZ of 8192 or less)
+DO $$
+BEGIN
+ FOR r IN 1..1350 LOOP
+ DELETE FROM dedup_unique_test_table;
+ INSERT INTO dedup_unique_test_table SELECT 1;
+ END LOOP;
+END$$;
--
-- Test B-tree fast path (cache rightmost leaf page) optimization.
--
diff --git a/src/test/regress/sql/btree_index.sql b/src/test/regress/sql/btree_index.sql
index 558dcae0ec..f008a5a55f 100644
--- a/src/test/regress/sql/btree_index.sql
+++ b/src/test/regress/sql/btree_index.sql
@@ -103,6 +103,23 @@ explain (costs off)
select * from btree_bpchar where f1::bpchar like 'foo%';
select * from btree_bpchar where f1::bpchar like 'foo%';
+--
+-- Test deduplication within a unique index
+--
+CREATE TABLE dedup_unique_test_table (a int) WITH (autovacuum_enabled=false);
+CREATE UNIQUE INDEX dedup_unique ON dedup_unique_test_table (a) WITH (deduplication=on);
+CREATE UNIQUE INDEX plain_unique ON dedup_unique_test_table (a) WITH (deduplication=off);
+-- Generate enough garbage tuples in index to ensure that even the unique index
+-- with deduplication enabled has to check multiple leaf pages during unique
+-- checking (at least with a BLCKSZ of 8192 or less)
+DO $$
+BEGIN
+ FOR r IN 1..1350 LOOP
+ DELETE FROM dedup_unique_test_table;
+ INSERT INTO dedup_unique_test_table SELECT 1;
+ END LOOP;
+END$$;
+
--
-- Test B-tree fast path (cache rightmost leaf page) optimization.
--
--
2.17.1
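
As a reading aid for the amcheck changes in the patch above, here is a minimal standalone sketch of the posting list ordering check that bt_target_page_check() gains. It is not the patch's code: DemoTid, demo_tid_compare() and demo_posting_first_disorder() are hypothetical stand-ins for ItemPointerData, ItemPointerCompare() and the new loop, chosen so that the sketch compiles without any PostgreSQL headers. It assumes, as the patch does, that heap TIDs in a posting list must be strictly ascending, so an equal neighbour counts as corruption too.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for ItemPointerData: a (block, offset) heap TID */
typedef struct DemoTid
{
	uint32_t	block;
	uint16_t	offset;
} DemoTid;

/* Three-way comparison: block number first, then offset within the block */
static int
demo_tid_compare(const DemoTid *a, const DemoTid *b)
{
	if (a->block != b->block)
		return (a->block < b->block) ? -1 : 1;
	if (a->offset != b->offset)
		return (a->offset < b->offset) ? -1 : 1;
	return 0;
}

/*
 * Return the array index of the first TID that is not strictly greater than
 * its predecessor, or -1 when the whole list is in strictly ascending order.
 * The "<= 0" test mirrors the patch: duplicates are rejected as well.
 */
static int
demo_posting_first_disorder(const DemoTid *tids, int ntids)
{
	assert(tids != NULL || ntids == 0);

	for (int i = 1; i < ntids; i++)
	{
		if (demo_tid_compare(&tids[i], &tids[i - 1]) <= 0)
			return i;
	}

	return -1;
}
```

In the real check the equivalent of a non-negative return value becomes an ereport(ERROR) with ERRCODE_INDEX_CORRUPTED; the sketch returns the position instead so that a caller can decide how to report it.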
Attachment: v25-0001-Remove-dead-pin-scan-code-from-nbtree-VACUUM.patch (application/x-patch)
From a5c2da1fb4c9b528bc2ea5563cc74b65a5fcc8c5 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Wed, 20 Nov 2019 16:21:47 -0800
Subject: [PATCH v25 1/4] Remove dead "pin scan" code from nbtree VACUUM.
Finish off the work of commit 3e4b7d87 by completely removing the "pin
scan" code previously used by nbtree VACUUM:
* Don't track lastBlockVacuumed within nbtree.c VACUUM code anymore.
* Remove the lastBlockVacuumed field from xl_btree_vacuum WAL records
(nbtree leaf page VACUUM records).
* Remove the unnecessary extra call to _bt_delitems_vacuum() made
against the last block. This occurred when VACUUM didn't have index
tuples to kill on the final block in the index, based on the assumption
that a final "pin scan" was still needed. (Clearly a final pin scan
can never take place here, since the entire pin scan mechanism was
totally disabled by commit 3e4b7d87.)
Also, add a new ndeleted metadata field to xl_btree_vacuum, to replace
the unneeded lastBlockVacuumed field. This isn't really needed either,
since we could continue to infer the array length in nbtxlog.c by using
the overall record length. However, it will become useful when the
upcoming deduplication patch needs to add an "items updated" field to go
alongside it (besides, it doesn't seem like a good idea to leave the
xl_btree_vacuum struct without any fields; the C standard says that
that's undefined).
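
To illustrate the point about the explicit count, here is a rough standalone sketch of a counted record of this shape. It is only an approximation under stated assumptions: demo_vacuum_record, DemoOffsetNumber, demo_vacuum_pack() and demo_vacuum_count() are hypothetical names, and the sketch packs the header and the offset array into one flat buffer, whereas the real code registers them separately with XLogRegisterData() and XLogRegisterBufData().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for OffsetNumber */
typedef uint16_t DemoOffsetNumber;

/* Hypothetical analogue of xl_btree_vacuum after this patch */
typedef struct demo_vacuum_record
{
	uint32_t	ndeleted;

	/* DELETED TARGET OFFSET NUMBERS FOLLOW */
} demo_vacuum_record;

#define DEMO_SIZE_OF_VACUUM \
	(offsetof(demo_vacuum_record, ndeleted) + sizeof(uint32_t))

/*
 * Write header plus offsets into buf.  Returns the number of bytes used, or
 * 0 if buf is too small.  As in the patch, an empty array is not allowed.
 */
static size_t
demo_vacuum_pack(unsigned char *buf, size_t buflen,
				 const DemoOffsetNumber *deletable, uint32_t ndeletable)
{
	demo_vacuum_record hdr;
	size_t		need;

	assert(ndeletable > 0);

	need = DEMO_SIZE_OF_VACUUM + (size_t) ndeletable * sizeof(DemoOffsetNumber);
	if (need > buflen)
		return 0;

	hdr.ndeleted = ndeletable;
	memcpy(buf, &hdr, DEMO_SIZE_OF_VACUUM);
	memcpy(buf + DEMO_SIZE_OF_VACUUM, deletable,
		   (size_t) ndeletable * sizeof(DemoOffsetNumber));

	return need;
}

/*
 * Read back the item count.  The count inferred from the record length must
 * agree with the explicit ndeleted field; return 0 when it does not.
 */
static uint32_t
demo_vacuum_count(const unsigned char *buf, size_t len)
{
	demo_vacuum_record hdr;

	if (len < DEMO_SIZE_OF_VACUUM)
		return 0;

	memcpy(&hdr, buf, DEMO_SIZE_OF_VACUUM);
	if ((len - DEMO_SIZE_OF_VACUUM) / sizeof(DemoOffsetNumber) != hdr.ndeleted)
		return 0;

	return hdr.ndeleted;
}
```

With a single array the explicit count is redundant, as noted above. It stops being redundant once a second variable-length array (the planned "items updated" data) follows the first, because the record length alone can no longer say where one array ends and the next begins.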
Discussion: https://postgr.es/m/CAH2-Wzn2pSqEOcBDAA40CnO82oEy-EOpE2bNh_XL_cfFoA86jw@mail.gmail.com
---
src/include/access/nbtree.h | 3 +-
src/include/access/nbtxlog.h | 25 ++-----
src/backend/access/nbtree/nbtpage.c | 35 +++++-----
src/backend/access/nbtree/nbtree.c | 74 ++-------------------
src/backend/access/nbtree/nbtxlog.c | 95 +--------------------------
src/backend/access/rmgrdesc/nbtdesc.c | 3 +-
6 files changed, 28 insertions(+), 207 deletions(-)
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 18a2a3e71c..9833cc10bd 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -779,8 +779,7 @@ extern bool _bt_page_recyclable(Page page);
extern void _bt_delitems_delete(Relation rel, Buffer buf,
OffsetNumber *itemnos, int nitems, Relation heapRel);
extern void _bt_delitems_vacuum(Relation rel, Buffer buf,
- OffsetNumber *itemnos, int nitems,
- BlockNumber lastBlockVacuumed);
+ OffsetNumber *deletable, int ndeletable);
extern int _bt_pagedel(Relation rel, Buffer buf);
/*
diff --git a/src/include/access/nbtxlog.h b/src/include/access/nbtxlog.h
index 91b9ee00cf..71435a13b3 100644
--- a/src/include/access/nbtxlog.h
+++ b/src/include/access/nbtxlog.h
@@ -150,32 +150,17 @@ typedef struct xl_btree_reuse_page
* The WAL record can represent deletion of any number of index tuples on a
* single index page when executed by VACUUM.
*
- * For MVCC scans, lastBlockVacuumed will be set to InvalidBlockNumber.
- * For a non-MVCC index scans there is an additional correctness requirement
- * for applying these changes during recovery, which is that we must do one
- * of these two things for every block in the index:
- * * lock the block for cleanup and apply any required changes
- * * EnsureBlockUnpinned()
- * The purpose of this is to ensure that no index scans started before we
- * finish scanning the index are still running by the time we begin to remove
- * heap tuples.
- *
- * Any changes to any one block are registered on just one WAL record. All
- * blocks that we need to run EnsureBlockUnpinned() are listed as a block range
- * starting from the last block vacuumed through until this one. Individual
- * block numbers aren't given.
- *
- * Note that the *last* WAL record in any vacuum of an index is allowed to
- * have a zero length array of offsets. Earlier records must have at least one.
+ * Note that the WAL record in any vacuum of an index must have at least one
+ * item to delete.
*/
typedef struct xl_btree_vacuum
{
- BlockNumber lastBlockVacuumed;
+ uint32 ndeleted;
- /* TARGET OFFSET NUMBERS FOLLOW */
+ /* DELETED TARGET OFFSET NUMBERS FOLLOW */
} xl_btree_vacuum;
-#define SizeOfBtreeVacuum (offsetof(xl_btree_vacuum, lastBlockVacuumed) + sizeof(BlockNumber))
+#define SizeOfBtreeVacuum (offsetof(xl_btree_vacuum, ndeleted) + sizeof(uint32))
/*
* This is what we need to know about marking an empty branch for deletion.
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 268f869a36..66c79623cf 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -968,32 +968,27 @@ _bt_page_recyclable(Page page)
* deleting the page it points to.
*
* This routine assumes that the caller has pinned and locked the buffer.
- * Also, the given itemnos *must* appear in increasing order in the array.
+ * Also, the given deletable array *must* be sorted in ascending order.
*
- * We record VACUUMs and b-tree deletes differently in WAL. InHotStandby
- * we need to be able to pin all of the blocks in the btree in physical
- * order when replaying the effects of a VACUUM, just as we do for the
- * original VACUUM itself. lastBlockVacuumed allows us to tell whether an
- * intermediate range of blocks has had no changes at all by VACUUM,
- * and so must be scanned anyway during replay. We always write a WAL record
- * for the last block in the index, whether or not it contained any items
- * to be removed. This allows us to scan right up to end of index to
- * ensure correct locking.
+ * We record VACUUMs and b-tree deletes differently in WAL. Deletes must
+ * generate recovery conflicts by accessing the heap inline, whereas VACUUMs
+ * can rely on the initial heap scan taking care of the problem (pruning would
+ * have generated the conflicts needed for hot standby already).
*/
void
-_bt_delitems_vacuum(Relation rel, Buffer buf,
- OffsetNumber *itemnos, int nitems,
- BlockNumber lastBlockVacuumed)
+_bt_delitems_vacuum(Relation rel, Buffer buf, OffsetNumber *deletable,
+ int ndeletable)
{
Page page = BufferGetPage(buf);
BTPageOpaque opaque;
+ Assert(ndeletable > 0);
+
/* No ereport(ERROR) until changes are logged */
START_CRIT_SECTION();
/* Fix the page */
- if (nitems > 0)
- PageIndexMultiDelete(page, itemnos, nitems);
+ PageIndexMultiDelete(page, deletable, ndeletable);
/*
* We can clear the vacuum cycle ID since this page has certainly been
@@ -1019,7 +1014,7 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
XLogRecPtr recptr;
xl_btree_vacuum xlrec_vacuum;
- xlrec_vacuum.lastBlockVacuumed = lastBlockVacuumed;
+ xlrec_vacuum.ndeleted = ndeletable;
XLogBeginInsert();
XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
@@ -1030,8 +1025,8 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
* is. When XLogInsert stores the whole buffer, the offsets array
* need not be stored too.
*/
- if (nitems > 0)
- XLogRegisterBufData(0, (char *) itemnos, nitems * sizeof(OffsetNumber));
+ XLogRegisterBufData(0, (char *) deletable, ndeletable *
+ sizeof(OffsetNumber));
recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_VACUUM);
@@ -1050,8 +1045,8 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
* Also, the given itemnos *must* appear in increasing order in the array.
*
* This is nearly the same as _bt_delitems_vacuum as far as what it does to
- * the page, but the WAL logging considerations are quite different. See
- * comments for _bt_delitems_vacuum.
+ * the page, but it needs to generate its own recovery conflicts by accessing
+ * the heap. See comments for _bt_delitems_vacuum.
*/
void
_bt_delitems_delete(Relation rel, Buffer buf,
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index c67235ab80..bbc1376b0a 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -46,8 +46,6 @@ typedef struct
IndexBulkDeleteCallback callback;
void *callback_state;
BTCycleId cycleid;
- BlockNumber lastBlockVacuumed; /* highest blkno actually vacuumed */
- BlockNumber lastBlockLocked; /* highest blkno we've cleanup-locked */
BlockNumber totFreePages; /* true total # of free pages */
TransactionId oldestBtpoXact;
MemoryContext pagedelcontext;
@@ -978,8 +976,6 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
vstate.callback = callback;
vstate.callback_state = callback_state;
vstate.cycleid = cycleid;
- vstate.lastBlockVacuumed = BTREE_METAPAGE; /* Initialise at first block */
- vstate.lastBlockLocked = BTREE_METAPAGE;
vstate.totFreePages = 0;
vstate.oldestBtpoXact = InvalidTransactionId;
@@ -1040,39 +1036,6 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
}
}
- /*
- * Check to see if we need to issue one final WAL record for this index,
- * which may be needed for correctness on a hot standby node when non-MVCC
- * index scans could take place.
- *
- * If the WAL is replayed in hot standby, the replay process needs to get
- * cleanup locks on all index leaf pages, just as we've been doing here.
- * However, we won't issue any WAL records about pages that have no items
- * to be deleted. For pages between pages we've vacuumed, the replay code
- * will take locks under the direction of the lastBlockVacuumed fields in
- * the XLOG_BTREE_VACUUM WAL records. To cover pages after the last one
- * we vacuum, we need to issue a dummy XLOG_BTREE_VACUUM WAL record
- * against the last leaf page in the index, if that one wasn't vacuumed.
- */
- if (XLogStandbyInfoActive() &&
- vstate.lastBlockVacuumed < vstate.lastBlockLocked)
- {
- Buffer buf;
-
- /*
- * The page should be valid, but we can't use _bt_getbuf() because we
- * want to use a nondefault buffer access strategy. Since we aren't
- * going to delete any items, getting cleanup lock again is probably
- * overkill, but for consistency do that anyway.
- */
- buf = ReadBufferExtended(rel, MAIN_FORKNUM, vstate.lastBlockLocked,
- RBM_NORMAL, info->strategy);
- LockBufferForCleanup(buf);
- _bt_checkpage(rel, buf);
- _bt_delitems_vacuum(rel, buf, NULL, 0, vstate.lastBlockVacuumed);
- _bt_relbuf(rel, buf);
- }
-
MemoryContextDelete(vstate.pagedelcontext);
/*
@@ -1203,13 +1166,6 @@ restart:
LockBuffer(buf, BUFFER_LOCK_UNLOCK);
LockBufferForCleanup(buf);
- /*
- * Remember highest leaf page number we've taken cleanup lock on; see
- * notes in btvacuumscan
- */
- if (blkno > vstate->lastBlockLocked)
- vstate->lastBlockLocked = blkno;
-
/*
* Check whether we need to recurse back to earlier pages. What we
* are concerned about is a page split that happened since we started
@@ -1245,9 +1201,9 @@ restart:
htup = &(itup->t_tid);
/*
- * During Hot Standby we currently assume that
- * XLOG_BTREE_VACUUM records do not produce conflicts. That is
- * only true as long as the callback function depends only
+ * During Hot Standby we currently assume that it's okay that
+ * XLOG_BTREE_VACUUM records do not produce conflicts. This is
+ * only safe as long as the callback function depends only
* upon whether the index tuple refers to heap tuples removed
* in the initial heap scan. When vacuum starts it derives a
* value of OldestXmin. Backends taking later snapshots could
@@ -1276,29 +1232,7 @@ restart:
*/
if (ndeletable > 0)
{
- /*
- * Notice that the issued XLOG_BTREE_VACUUM WAL record includes
- * all information to the replay code to allow it to get a cleanup
- * lock on all pages between the previous lastBlockVacuumed and
- * this page. This ensures that WAL replay locks all leaf pages at
- * some point, which is important should non-MVCC scans be
- * requested. This is currently unused on standby, but we record
- * it anyway, so that the WAL contains the required information.
- *
- * Since we can visit leaf pages out-of-order when recursing,
- * replay might end up locking such pages an extra time, but it
- * doesn't seem worth the amount of bookkeeping it'd take to avoid
- * that.
- */
- _bt_delitems_vacuum(rel, buf, deletable, ndeletable,
- vstate->lastBlockVacuumed);
-
- /*
- * Remember highest leaf page number we've issued a
- * XLOG_BTREE_VACUUM WAL record for.
- */
- if (blkno > vstate->lastBlockVacuumed)
- vstate->lastBlockVacuumed = blkno;
+ _bt_delitems_vacuum(rel, buf, deletable, ndeletable);
stats->tuples_removed += ndeletable;
/* must recompute maxoff */
diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index 44f6283950..72a601bb22 100644
--- a/src/backend/access/nbtree/nbtxlog.c
+++ b/src/backend/access/nbtree/nbtxlog.c
@@ -386,107 +386,16 @@ btree_xlog_vacuum(XLogReaderState *record)
Buffer buffer;
Page page;
BTPageOpaque opaque;
-#ifdef UNUSED
xl_btree_vacuum *xlrec = (xl_btree_vacuum *) XLogRecGetData(record);
- /*
- * This section of code is thought to be no longer needed, after analysis
- * of the calling paths. It is retained to allow the code to be reinstated
- * if a flaw is revealed in that thinking.
- *
- * If we are running non-MVCC scans using this index we need to do some
- * additional work to ensure correctness, which is known as a "pin scan"
- * described in more detail in next paragraphs. We used to do the extra
- * work in all cases, whereas we now avoid that work in most cases. If
- * lastBlockVacuumed is set to InvalidBlockNumber then we skip the
- * additional work required for the pin scan.
- *
- * Avoiding this extra work is important since it requires us to touch
- * every page in the index, so is an O(N) operation. Worse, it is an
- * operation performed in the foreground during redo, so it delays
- * replication directly.
- *
- * If queries might be active then we need to ensure every leaf page is
- * unpinned between the lastBlockVacuumed and the current block, if there
- * are any. This prevents replay of the VACUUM from reaching the stage of
- * removing heap tuples while there could still be indexscans "in flight"
- * to those particular tuples for those scans which could be confused by
- * finding new tuples at the old TID locations (see nbtree/README).
- *
- * It might be worth checking if there are actually any backends running;
- * if not, we could just skip this.
- *
- * Since VACUUM can visit leaf pages out-of-order, it might issue records
- * with lastBlockVacuumed >= block; that's not an error, it just means
- * nothing to do now.
- *
- * Note: since we touch all pages in the range, we will lock non-leaf
- * pages, and also any empty (all-zero) pages that may be in the index. It
- * doesn't seem worth the complexity to avoid that. But it's important
- * that HotStandbyActiveInReplay() will not return true if the database
- * isn't yet consistent; so we need not fear reading still-corrupt blocks
- * here during crash recovery.
- */
- if (HotStandbyActiveInReplay() && BlockNumberIsValid(xlrec->lastBlockVacuumed))
- {
- RelFileNode thisrnode;
- BlockNumber thisblkno;
- BlockNumber blkno;
-
- XLogRecGetBlockTag(record, 0, &thisrnode, NULL, &thisblkno);
-
- for (blkno = xlrec->lastBlockVacuumed + 1; blkno < thisblkno; blkno++)
- {
- /*
- * We use RBM_NORMAL_NO_LOG mode because it's not an error
- * condition to see all-zero pages. The original btvacuumpage
- * scan would have skipped over all-zero pages, noting them in FSM
- * but not bothering to initialize them just yet; so we mustn't
- * throw an error here. (We could skip acquiring the cleanup lock
- * if PageIsNew, but it's probably not worth the cycles to test.)
- *
- * XXX we don't actually need to read the block, we just need to
- * confirm it is unpinned. If we had a special call into the
- * buffer manager we could optimise this so that if the block is
- * not in shared_buffers we confirm it as unpinned. Optimizing
- * this is now moot, since in most cases we avoid the scan.
- */
- buffer = XLogReadBufferExtended(thisrnode, MAIN_FORKNUM, blkno,
- RBM_NORMAL_NO_LOG);
- if (BufferIsValid(buffer))
- {
- LockBufferForCleanup(buffer);
- UnlockReleaseBuffer(buffer);
- }
- }
- }
-#endif
-
- /*
- * Like in btvacuumpage(), we need to take a cleanup lock on every leaf
- * page. See nbtree/README for details.
- */
if (XLogReadBufferForRedoExtended(record, 0, RBM_NORMAL, true, &buffer)
== BLK_NEEDS_REDO)
{
- char *ptr;
- Size len;
-
- ptr = XLogRecGetBlockData(record, 0, &len);
+ char *ptr = XLogRecGetBlockData(record, 0, NULL);
page = (Page) BufferGetPage(buffer);
- if (len > 0)
- {
- OffsetNumber *unused;
- OffsetNumber *unend;
-
- unused = (OffsetNumber *) ptr;
- unend = (OffsetNumber *) ((char *) ptr + len);
-
- if ((unend - unused) > 0)
- PageIndexMultiDelete(page, unused, unend - unused);
- }
+ PageIndexMultiDelete(page, (OffsetNumber *) ptr, xlrec->ndeleted);
/*
* Mark the page as not containing any LP_DEAD items --- see comments
diff --git a/src/backend/access/rmgrdesc/nbtdesc.c b/src/backend/access/rmgrdesc/nbtdesc.c
index 4ee6d04a68..497f8dc77e 100644
--- a/src/backend/access/rmgrdesc/nbtdesc.c
+++ b/src/backend/access/rmgrdesc/nbtdesc.c
@@ -46,8 +46,7 @@ btree_desc(StringInfo buf, XLogReaderState *record)
{
xl_btree_vacuum *xlrec = (xl_btree_vacuum *) rec;
- appendStringInfo(buf, "lastBlockVacuumed %u",
- xlrec->lastBlockVacuumed);
+ appendStringInfo(buf, "ndeleted %u", xlrec->ndeleted);
break;
}
case XLOG_BTREE_DELETE:
--
2.17.1
On Tue, Nov 12, 2019 at 3:21 PM Peter Geoghegan <pg@bowt.ie> wrote:
* Decided to go back to turning deduplication on by default with
non-unique indexes, and off by default using unique indexes.
The unique index stuff was regressed enough with INSERT-heavy
workloads that I was put off, despite my initial enthusiasm for
enabling deduplication everywhere.
I have changed my mind about this again. I now think that it would
make sense to treat deduplication within unique indexes as a separate
feature that cannot be disabled by the GUC at all (though we'd
probably still respect the storage parameter for debugging purposes).
I have found that fixing the WAL record size issue has helped remove
what looked like a performance penalty for deduplication (but was
actually just a general regression). Also, I have found a way of
selectively applying deduplication within unique indexes that seems to
have no downside, and considerable upside.
The new criteria/heuristic for unique indexes is very simple: If a
unique index has an existing item that is a duplicate on the incoming
item at the point that we might have to split the page, then apply
deduplication. Otherwise (when the incoming item has no duplicates),
don't apply deduplication at all -- just accept that we'll have to
split the page. We already cache the bounds of our initial binary
search in insert state, so we can reuse that information within
_bt_findinsertloc() when considering deduplication in unique indexes.
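The decision rule described above can be sketched in C. This is only an illustration under stated assumptions: LeafPage, has_duplicate() and should_deduplicate_unique() are hypothetical names invented for the example, keys are plain integers in a sorted array, and "page is full" stands in for "we might have to split the page". It is not the patch's code, which reuses the cached binary search bounds inside _bt_findinsertloc().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for a leaf page: sorted integer keys. */
typedef struct LeafPage
{
    const int  *keys;       /* keys in ascending order */
    size_t      nkeys;      /* number of keys currently on the page */
    size_t      capacity;   /* keys that fit before a split is needed */
} LeafPage;

/* Binary search: does the page already hold an item equal to newkey? */
static bool
has_duplicate(const LeafPage *page, int newkey)
{
    size_t      low = 0;
    size_t      high = page->nkeys;

    while (low < high)
    {
        size_t      mid = low + (high - low) / 2;

        if (page->keys[mid] < newkey)
            low = mid + 1;
        else
            high = mid;
    }
    return low < page->nkeys && page->keys[low] == newkey;
}

/*
 * Unique-index rule: deduplicate only when the page would otherwise
 * split AND the incoming key already has a duplicate on the page.
 */
static bool
should_deduplicate_unique(const LeafPage *page, int newkey)
{
    assert(page != NULL);

    if (page->nkeys < page->capacity)
        return false;       /* item fits; no split is pending */

    return has_duplicate(page, newkey);
}
```

With a full page holding {10, 20, 30, 40}, an incoming 30 triggers deduplication, while an incoming 35 does not and the page is simply split.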
This heuristic makes sense because deduplication within unique indexes
should only target leaf pages that cannot possibly receive new values.
In many cases, the only reason why almost all primary key leaf pages
can ever split is because of non-HOT updates whose new HOT chain needs
a new, equal entry in the primary key. This is the case with your
standard identity column/serial primary key, for example (only the
rightmost page will have a page split due to the insertion of new
logical rows -- every other variety of page split must be due to
new physical tuples/versions). I imagine that it is possible for a
leaf page to be a "mixture" of these two basic/general tendencies,
but not for long. It really doesn't matter if we occasionally fail to
delay a page split where that was possible, nor does it matter if we
occasionally apply deduplication when that won't delay a split for
very long -- pretty soon the page will split anyway. A split ought to
separate the parts of the keyspace that exhibit each tendency. In
general, we're only interested in delaying page splits in unique
indexes *indefinitely*, since in effect that will prevent them
*entirely*. (So the goal is *significantly* different to our general
goal for deduplication -- it's about buying time for VACUUM to run or
whatever, rather than buying space.)
This heuristic noticeably reduces bloat in the TPC-C "old order"
table's PK, since that is the only unique index that is really
affected by non-HOT UPDATEs (i.e. the UPDATE queries that touch that
table happen to not be HOT-safe in general, which is not the case for
any other table). It doesn't regress anything else from TPC-C, since
there really isn't a benefit for other tables. More importantly, the
working/draft version of the patch will often avoid a huge amount of
bloat in a pgbench-style workload that has an extra index on the
pgbench_accounts table, to prevent HOT updates. The accounts primary
key (pgbench_accounts_pkey) hardly grows at all with the patch, but
grows 2x on master.
This 2x space saving seems to occur reliably, unless there is a lot of
contention on individual *pages*, in which case the bloat can be
delayed but not prevented. We get that 2x space saving with either
uniformly distributed random updates on pgbench_accounts (i.e. the
pgbench default), or with a skewed distribution that hashes the PRNG's
value. Hashing like this simulates a workload where the skew
isn't concentrated in one part of the key space (i.e. there is skew,
but very popular values are scattered throughout the index evenly,
rather than being concentrated together in just a few leaf pages).
Can anyone think of an adversarial case that we may not do so well on
with the new "only deduplicate within unique indexes when new item
already has a duplicate" strategy? I'm having difficulty identifying
some kind of worst case.
--
Peter Geoghegan
On Tue, Dec 3, 2019 at 12:13 PM Peter Geoghegan <pg@bowt.ie> wrote:
The new criteria/heuristic for unique indexes is very simple: If a
unique index has an existing item that is a duplicate on the incoming
item at the point that we might have to split the page, then apply
deduplication. Otherwise (when the incoming item has no duplicates),
don't apply deduplication at all -- just accept that we'll have to
split the page.
the working/draft version of the patch will often avoid a huge amount of
bloat in a pgbench-style workload that has an extra index on the
pgbench_accounts table, to prevent HOT updates. The accounts primary
key (pgbench_accounts_pkey) hardly grows at all with the patch, but
grows 2x on master.
I have numbers from my benchmark against my working copy of the patch,
with this enhanced design for unique index deduplication.
With an extra index on pgbench_accounts's abalance column (that is
configured to not use deduplication for the test), and with the aid
variable (i.e. UPDATEs on pgbench_accounts) configured to use skew, I
have a variant of the standard pgbench TPC-B like benchmark. The
pgbench script I used was as follows:
\set r random_gaussian(1, 100000 * :scale, 4.0)
\set aid abs(hash(:r)) % (100000 * :scale)
\set bid random(1, 1 * :scale)
\set tid random(1, 10 * :scale)
\set delta random(-5000, 5000)
BEGIN;
UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;
SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;
UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;
INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES
(:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);
END;
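The `\set aid` line in the script hashes a skewed draw so that popular values end up scattered across the key space rather than clustered on a few neighbouring leaf pages. A minimal sketch of that idea in C follows. Assumptions: mix64() is a stand-in mixer (the splitmix64 finalizer) and is NOT the hash function pgbench uses, and scatter_aid() is a hypothetical helper named for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in 64-bit mixer (splitmix64 finalizer); NOT pgbench's hash(). */
static uint64_t
mix64(uint64_t x)
{
    x ^= x >> 30;
    x *= UINT64_C(0xbf58476d1ce4e5b9);
    x ^= x >> 27;
    x *= UINT64_C(0x94d049bb133111eb);
    x ^= x >> 31;
    return x;
}

/*
 * Map a skewed draw r onto a slot in [0, naccounts).  Draws that are
 * numerically close (the hot centre of a Gaussian) are sent to
 * unrelated slots, so the hot keys are spread over many leaf pages.
 */
static uint64_t
scatter_aid(uint64_t r, uint64_t naccounts)
{
    assert(naccounts > 0);
    return mix64(r) % naccounts;
}
```

The skew itself is preserved, because the same draw always maps to the same slot; only the placement of the hot keys in the index changes.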
Results from interlaced 2 hour runs at pgbench scale 5,000 are as
follows (shown in reverse chronological order):
master_2_run_16.out: "tps = 7263.948703 (including connections establishing)"
patch_2_run_16.out: "tps = 7505.358148 (including connections establishing)"
master_1_run_32.out: "tps = 9998.868764 (including connections establishing)"
patch_1_run_32.out: "tps = 9781.798606 (including connections establishing)"
master_1_run_16.out: "tps = 8812.269270 (including connections establishing)"
patch_1_run_16.out: "tps = 9455.476883 (including connections establishing)"
The patch comes out ahead in the first 2 hour run, with later runs
looking like a more even match. I think that each run didn't last long
enough to even out the effects of autovacuum, but this is really about
index size rather than overall throughput, so it's not that important.
(I need to get a large server to do further performance validation
work, rather than just running overnight benchmarks on my main work
machine like this.)
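For reading the tps figures above, the relative change of each patch run against its paired master run can be computed as below. This is just a worked-arithmetic sketch; tps_change_pct() is a hypothetical helper, not part of pgbench or the patch.

```c
#include <assert.h>

/* Relative throughput change of patch versus master, in percent. */
static double
tps_change_pct(double master_tps, double patch_tps)
{
    assert(master_tps > 0.0);
    return (patch_tps - master_tps) / master_tps * 100.0;
}
```

Applied to the reported numbers, the master_1_run_16/patch_1_run_16 pair works out to about +7.3%, the master_1_run_32/patch_1_run_32 pair to about -2.2%, and the master_2_run_16/patch_2_run_16 pair to about +3.3%.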
The primary key index (pgbench_accounts_pkey) starts out at 10.45 GiB
in size, and ends at 12.695 GiB in size with the patch. With master,
it also starts out at 10.45 GiB, but finishes at 19.392 GiB.
Clearly this is a significant difference -- the index is only ~65% of
its master-branch size with the patch. See attached tar archive with
logs, and pg_buffercache output after each run. (The extra index on
pgbench_accounts.abalance is pretty much the same size for
patch/master, since deduplication was disabled for the patch runs.)
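The size comparison can be checked with a few lines of C. This is a sketch of the arithmetic only; index_growth_gib() and size_ratio() are hypothetical helper names, and the inputs are the GiB figures quoted above.

```c
#include <assert.h>

/* Growth of an index over the run, in GiB. */
static double
index_growth_gib(double start_gib, double end_gib)
{
    return end_gib - start_gib;
}

/* Final patched size as a fraction of the final master size. */
static double
size_ratio(double patch_end_gib, double master_end_gib)
{
    assert(master_end_gib > 0.0);
    return patch_end_gib / master_end_gib;
}
```

With the figures above, the patched index grows by 2.245 GiB against 8.942 GiB on master, and the final size ratio is 12.695 / 19.392, roughly 0.655, which is where the ~65% comes from.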
And, as I said, I believe that we can make this unique index
deduplication stuff an internal thing that isn't even documented
(maybe a passing reference is appropriate when talking about general
deduplication).
--
Peter Geoghegan
Attachments:
overnight-benchmark.tar.gz (application/x-gzip)